Let's say you have multiple devices, such as disks, and the one you're interested in is misbehaving and reading or writing too slowly. Won't the total disk activity also be low? Similarly, if disk traffic is reported as high on a lightly loaded system, won't this also jump out at you simply by looking at the total disk activity? The same can be said of networks and most other subsystems for which there is summary data: simply looking at the totals will often alert you that something is not right. The key thing to keep in mind is that you are looking at totals.
CPU monitoring can be a little trickier, as CPU loads are reported as averages rather than totals, and as the number of cores increases so does the divisor of the calculation. In most cases when a system has excessive load, it will affect all CPUs and so will be very visible even as an average, but in some cases it won't. What if you have a 2 core system and see a CPU load of 45% when you're expecting a much lighter load? Looking at individual CPUs you may see one running at near zero load and the second at 90%. Your only clue was that the 45% load was unexpected, and so you looked closer. But what if you had a heavily loaded CPU on a 48 core system? A single CPU at 90% raises the average by less than 2%, so you'd never even realize it. In other words, just pay attention.
Stated slightly differently, summary data is often a starting point that helps identify potential trouble areas; from there you can determine whether you need to dig deeper.
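The averaging effect described above can be worked through in a few lines of Python. This is only an illustrative sketch: the function name `average_load` and the per-CPU figures are made up for the example and are not anything collectl itself produces.

```python
def average_load(per_cpu_loads):
    """Return the mean utilization (in percent) across all CPUs."""
    return sum(per_cpu_loads) / len(per_cpu_loads)


# Hypothetical per-CPU utilization: one CPU at 90%, all others idle.
two_core = [0.0, 90.0]
many_core = [90.0] + [0.0] * 47  # 48 cores in total

for label, loads in (("2 cores", two_core), ("48 cores", many_core)):
    # The average shrinks as the core count grows; the busiest CPU does not.
    print(f"{label}: average {average_load(loads):.1f}%, busiest {max(loads):.0f}%")
```

On 2 cores the busy CPU shows up as a 45% average, while on 48 cores the same CPU contributes under 2%, which is why the per-CPU view is the one to check when a single core is suspected.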
So why brief data?
If you have ever tried to look at multiple lines of different text and identify what was changing over time, you should already know the answer: it's really difficult! For example, here's what collectl might show for CPU, Disk and Network data:
collectl.pl --verbose

### RECORD 1 >>> poker <<< (1314712401.002) (Tue Aug 30 09:53:21 2011) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# User  Nice   Sys  Wait   IRQ  Soft Steal  Idle  CPUs  Intr Ctxsw  Proc  RunQ   Run  Avg1  Avg5 Avg15
     0     0     0     0     0     0     0   100     4  1120   192     0   363     0  0.00  0.00  0.00

# DISK SUMMARY (/sec)
#KBRead RMerged   Reads  SizeKB KBWrite WMerged  Writes  SizeKB
      0       0       0       0       0       0       0       0

# NETWORK SUMMARY (/sec)
# KBIn  PktIn SizeIn  MultI   CmpI  ErrsI  KBOut PktOut  SizeO   CmpO  ErrsO
     0      1     60      0      0      0      0      0      0      0      0

### RECORD 2 >>> poker <<< (1314712402.002) (Tue Aug 30 09:53:22 2011) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# User  Nice   Sys  Wait   IRQ  Soft Steal  Idle  CPUs  Intr Ctxsw  Proc  RunQ   Run  Avg1  Avg5 Avg15
     0     0     0     0     0     0     0    99     4  1111   200     0   363     0  0.00  0.00  0.00

# DISK SUMMARY (/sec)
#KBRead RMerged   Reads  SizeKB KBWrite WMerged  Writes  SizeKB
      0       0       0       0     256      59       5      51

# NETWORK SUMMARY (/sec)
# KBIn  PktIn SizeIn  MultI   CmpI  ErrsI  KBOut PktOut  SizeO   CmpO  ErrsO
     0      2     60      0      0      0      0      3    328      0      0
Now consider that in many cases network errors, disk merges, or even the percentage of time the CPU spent processing interrupts, while important, may not be what you need to see when trying to identify anomalous behavior. That's where brief mode comes in. It picks out the few nuggets of information that tell us whether things are functioning as expected and displays them all on the same line, which makes change easier to spot. In fact, during the following run I did a ping -f. See how easy it is to spot the network burst?
collectl

#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0  1124    203      0      0    240      4      0      0      0       0
   0   0  1105    253      0      0     12      2      0      1      0       1
   0   0  1123    206      0      0      0      0      0      3      0       2
   2   1  6051   8584      0      0      0      0    173   2099    297    2860
   3   2  7828  11270      0      0      0      0    222   2770    411    3936
   0   0  1115    204      0      0     92      5      0      5      1       5
   0   0  1121    198      0      0      0      0      0      1      0       1
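Because every sample sits on one line, brief output is also easy to scan with a script. The following is a minimal sketch, not part of collectl: `parse_brief` and `find_bursts` are hypothetical helpers, the `factor` and `floor` thresholds are arbitrary illustrative values, and the parser assumes every field is a plain integer, as in the sample above.

```python
from statistics import median

# The brief-mode sample from the run above, pasted in as text.
BRIEF_OUTPUT = """\
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0  1124    203      0      0    240      4      0      0      0       0
   0   0  1105    253      0      0     12      2      0      1      0       1
   0   0  1123    206      0      0      0      0      0      3      0       2
   2   1  6051   8584      0      0      0      0    173   2099    297    2860
   3   2  7828  11270      0      0      0      0    222   2770    411    3936
   0   0  1115    204      0      0     92      5      0      5      1       5
   0   0  1121    198      0      0      0      0      0      1      0       1
"""

COLUMNS = ["cpu", "sys", "inter", "ctxsw", "KBRead", "Reads",
           "KBWrit", "Writes", "KBIn", "PktIn", "KBOut", "PktOut"]


def parse_brief(text):
    """Turn brief-mode lines into a list of dicts, skipping '#' header lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: all fields are plain integers, as in this sample.
        values = [int(field) for field in line.split()]
        rows.append(dict(zip(COLUMNS, values)))
    return rows


def find_bursts(rows, column, factor=10, floor=100):
    """Return indexes of samples well above the median value of a column."""
    baseline = median(row[column] for row in rows)
    threshold = max(floor, factor * baseline)
    return [i for i, row in enumerate(rows) if row[column] > threshold]


rows = parse_brief(BRIEF_OUTPUT)
for i in find_bursts(rows, "PktIn"):
    print(f"sample {i}: PktIn={rows[i]['PktIn']} PktOut={rows[i]['PktOut']}")
```

Run against this sample, the two samples taken during the ping -f are the only ones flagged. A median baseline is used here simply because a short burst does not move it much; any other baseline would work as long as it reflects what is normal for the system.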
In summary, just keep in mind that there is no single recipe for how to monitor a system, what format to display the output in, or how to drill deeper. However, as you become more familiar with the types of data and the collectl formats, you will be able to make better use of collectl.
updated Sept 19, 2011