However, there is a problem that is important to understand and has been seen in the past. A device with the wrong firmware level would, under some conditions, cause a long delay in the middle of the collection interval. Some samples were collected close to the starting time of that interval, while all those that followed the delay were actually collected much later than the reported timestamp indicates.
Consider the following example, in which we're looking at raw data collected for two subsystems, call them XXX and YYY. Let's also assume that the counters we're monitoring are increasing at a steady rate of 100 units/sec. In this example, during the 10:00:01 interval there was a 10 second hang while collecting the YYY sample. The XXX sample was correctly recorded, but by the time the YYY sample was collected, 1000 units had accumulated. As we move to the next interval, which was delayed by 10 seconds, the sample for XXX has accumulated 1000 units while the sample for YYY has accumulated only 100.
TYPE        XXX    YYY
10:00:00    100    100
10:00:01    200   1100
10:00:11   1200   1200
10:00:12   1300   1300

The problem here is that when reporting the two rates at 10:00:01, we'll see a rate of 1000 units/sec for YYY because, based on the timestamps, that interval appears to be only 1 second long. Conversely, the rate reported for that same subsystem at 10:00:11 will be 10 units/sec because that interval is reported as 10 seconds long. Also note that for this interval the counter for XXX has been incremented correctly and the resultant rates are reported correctly, because its sampling occurred before the delay. If one were to move the timestamp to the end of the interval, it would fix the problem with YYY but simply move the problem to XXX.
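To make the arithmetic concrete, here is a minimal sketch (not collectl's actual code) that computes rates the same way, as the counter delta divided by the apparent interval length, and reproduces the spurious spike and dip for YYY while XXX stays correct:

# Sketch only: rate = counter-delta / timestamp-delta, using the data above.
samples = [
    # (reported timestamp in seconds, XXX counter, YYY counter)
    (0,   100,  100),   # 10:00:00
    (1,   200, 1100),   # 10:00:01 - YYY was actually sampled ~10s late
    (11, 1200, 1200),   # 10:00:11 - next interval delayed by the hang
    (12, 1300, 1300),   # 10:00:12
]

for (t0, xxx0, yyy0), (t1, xxx1, yyy1) in zip(samples, samples[1:]):
    elapsed = t1 - t0                      # apparent interval length
    print(f"interval ending +{t1:2d}s: "
          f"XXX {(xxx1 - xxx0) / elapsed:6.1f}/sec  "
          f"YYY {(yyy1 - yyy0) / elapsed:6.1f}/sec")

# interval ending + 1s: XXX  100.0/sec  YYY 1000.0/sec   <- spurious spike
# interval ending +11s: XXX  100.0/sec  YYY   10.0/sec   <- spurious dip
# interval ending +12s: XXX  100.0/sec  YYY  100.0/sec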
It IS important to understand that this is only a problem if the delay occurs during the data collection itself. If a system delay causes all data collection to start late but, once started, it runs as expected, and this has been seen to be the typical case, the intervals may be longer but the counters will have increased proportionally and the resulting rates will be consistent. For example, an interval that appears to be 11 seconds long in which a counter grows by 1100 units still reports the correct 100 units/sec.
The only real answer to this problem would be to timestamp individual samples. However, it is also felt that this problem is rare enough not to be of serious concern, and that changing the timestamping methodology would cause more problems than it solves.
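For illustration only, here is a sketch of what per-sample timestamping might look like; this is not something collectl does, and it assumes the delayed YYY sample was actually taken roughly 10 seconds into the hung interval, which is what its 1000-unit jump implies at 100 units/sec:

# Hypothetical sketch: each YYY sample carries the time it was actually taken,
# so the rate works out to the true 100 units/sec for every interval.
yyy_samples = [
    (0,   100),   # taken at 10:00:00
    (10, 1100),   # intended for 10:00:01 but actually taken after the hang
    (11, 1200),   # taken at 10:00:11
    (12, 1300),   # taken at 10:00:12
]

for (t0, v0), (t1, v1) in zip(yyy_samples, yyy_samples[1:]):
    print(f"YYY rate: {(v1 - v0) / (t1 - t0):.1f}/sec")   # 100.0 each time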
One other thing to consider is that when selecting to report only non-zero values, one might occasionally be surprised to see values of 0 being reported. This will occur if there is a non-zero value that is then normalized to 0.
If you think you might need to see these close-to-0 values, you should include -on, which tells collectl not to normalize its output before reporting it.
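As a rough illustration (not collectl's actual code), dividing a small raw delta by the interval length and displaying it as an integer can round a genuinely non-zero value down to 0:

# Sketch: why a non-zero raw counter delta can print as 0 once normalized.
interval = 10          # seconds between samples
raw_delta = 3          # counter increased by 3 during the interval (non-zero)

rate = raw_delta / interval          # normalized to units/sec -> 0.3
print(f"{rate:4.0f}")                # integer-style display prints "   0"
print(f"{raw_delta:4d}")             # un-normalized value prints "   3"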
updated Feb 21, 2011