The following are descriptions of just some of collectl's many features and are
not intended as a tutorial on how and when to apply them:
Fine Grained, Non-Drifting Monitoring
If the Time::HiRes Perl module is installed, which you can verify with the
command collectl -v, you will be able to run with non-integral
sampling intervals. Whether integral or fractional, sample times will align
very closely to the whole second and will not drift the way they do with
just about every other tool. You will also be able to use -om to
have all times reported in msec.
Low Overhead
Collectl uses very little CPU. In fact, it has been measured to use <0.1% when run
as a daemon with the default sampling intervals of 60 seconds for process and slab
data and 10 seconds for everything else. The overhead can increase on systems that
have dozens of disks or run hundreds of processes, so you may want to measure its
effect on your own system if you are concerned.
Summary vs Detail Data
You can report aggregated performance numbers for many devices such as CPUs, disks,
interconnects such as Infiniband or Quadrics, networks and even the
Lustre file system.
However, you can also report on individual devices if you want to see how the aggregate
load is being generated. Be sure to see the documentation
for more details, particularly the examples and tutorials in the Getting Started guide.
Brief vs Verbose Format
It is often more useful to see less data but for more devices, and collectl recognizes
this by providing brief format as the default interactive display format. This
allows you to see what a number of subsystems are doing on a single line, making it
much easier to spot inconsistencies in the data by scanning a column of numbers. If
you want more detail and are willing to look at multiple lines per sample, verbose
format is what you want. In fact, if you record collectl data you can play
it back in both formats, first brief to look for problems and then again in
verbose to see more of what is happening. The same technique applies to
summary and detail data as well.
Although you can also display interactive output in plot format, this is really intended
for its namesake, plotting. By generating output (or simply playing back recorded data)
in this format, you can then feed the resultant files into plotting tools that recognize
delimiter-separated fields, such as gnuplot,
Excel or even OpenOffice.
The default field separator is a space; if you need something different, you can change
it via the --sep switch.
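Plot-format files are straightforward to consume from scripts. Below is a minimal sketch of a reader for this style of output, assuming a space-separated file whose last '#'-prefixed line names the columns; the column names shown in the sample are hypothetical, and the exact header layout varies by subsystem, so check your own files.

```python
def read_plot_file(text, sep=" "):
    """Parse collectl-style plot output: '#'-prefixed header lines naming
    the columns, followed by one sample per line. Returns a list of dicts
    mapping column name to the (string) value for each sample."""
    rows = []
    header = None
    for line in text.strip().splitlines():
        if line.startswith("#"):
            # The last header line before the data names the columns.
            header = line.lstrip("# ").split(sep)
            continue
        if header is None:
            continue  # data before any header; nothing to map it to
        rows.append(dict(zip(header, line.split(sep))))
    return rows

# Hypothetical two-sample excerpt for illustration only.
sample = """#Date Time [CPU]User% [CPU]Sys%
20240101 00:00:10 12 3
20240101 00:00:20 14 2"""
rows = read_plot_file(sample)
print(rows[0]["[CPU]User%"])  # → 12
```

From here the rows can be handed to any plotting library; passing a different `sep` mirrors what the --sep switch changes on the collectl side.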
Aligned Monitoring Intervals
If you've installed Time::HiRes and are using an integral sampling interval,
by default collectl will align its sampling on integral second boundaries.
In interactive mode samples will be taken as close as possible to
the nearest second and when run as a daemon, samples will align to the
top of the minute. In the latter case this means that if you're running on a cluster
with synchronized clocks, all instances of collectl will collect their
samples within a few msec of each other, making it much easier to correlate
events across the cluster.
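The no-drift behavior comes from sleeping until the next absolute interval boundary rather than sleeping for a fixed delay, so scheduling jitter never accumulates. A minimal sketch of the idea (not collectl's actual Perl implementation):

```python
import time

def aligned_sampling(interval, n_samples):
    """Take n_samples aligned to multiples of `interval` seconds since the
    epoch. Each sleep targets the next absolute boundary, so any wakeup
    jitter affects only that one sample and does not drift over time."""
    timestamps = []
    # First boundary strictly after "now".
    next_tick = (time.time() // interval + 1) * interval
    for _ in range(n_samples):
        delay = next_tick - time.time()
        if delay > 0:
            time.sleep(delay)
        timestamps.append(time.time())  # take the sample here
        next_tick += interval           # advance by an absolute step
    return timestamps
```

Had the loop used `time.sleep(interval)` instead, each iteration's overhead would be added on top of the interval and the sample times would slide later and later, which is exactly the drift the text describes in other tools.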
Process and Slab Monitoring
Both of these are higher-overhead activities, but collectl provides a secondary
interval so they can be gathered at a lower frequency. In addition, one can
specify a number of filters to select processes by pid, parent, owner, or even name,
and can also request that process threads be included. As for slabs, here too one can
filter by name, and during interactive display one can request that only slabs whose
values have changed be displayed, significantly reducing the output and enhancing its
readability.
Collectl also has the ability to display process and slab data in a way similar to the
top and slabtop commands, each with a number of sorting options, including the
ability to sort processes by the top I/O users and slabs by the changes in allocated
memory, neither of which is currently available in other utilities.
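Sorting by "top I/O users" means ranking processes by how much I/O they did between two samples, not by their lifetime totals. A small sketch of that ranking step, using made-up pids and counter values (collectl itself reads the real per-process counters):

```python
def top_io(prev, curr, n=3):
    """Rank processes by I/O performed between two samples.
    prev and curr map pid -> cumulative I/O bytes at each sample time;
    returns the n largest (pid, delta) pairs, busiest first."""
    deltas = {pid: curr[pid] - prev.get(pid, 0) for pid in curr}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical cumulative counters from two consecutive samples.
prev = {101: 1_000, 102: 4_000}
curr = {101: 9_000, 102: 4_500, 103: 2_000}
print(top_io(prev, curr, 2))  # → [(101, 8000), (103, 2000)]
```

Note that pid 103, absent from the first sample, is still ranked: a newly started process gets credited with everything it has done so far.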
New to the 2.4.0 release is the monitoring of process I/O statistics; see
the documentation for more details.
Interrupt Reporting by CPU
New to the 2.5.0 release, you can now report interrupts at the CPU level and even
examine how they change in more detail at the individual interrupt level.
Socket Support
Rather than display its output on the terminal or write it to a file, collectl can
send its data over a socket as well, making it possible to integrate it with other
monitoring tools.
Exportable data formats
If you don't like the format of the data collectl presents, feel free to write your
own using --export. Several exporters come with collectl, for writing
S-Expressions and List Expressions, and even for exporting UDP data to ganglia.
In the case of the first two, the data can also be sent over the TCP socket interface.
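A consumer of the list-expression feed can stay very simple. The sketch below assumes the exported payload is a series of plain "name value" lines (the metric names shown are hypothetical); verify the exact format your collectl version emits before relying on it.

```python
def parse_lexpr(payload):
    """Parse list-expression style output: one 'name value' pair per line.
    Returns a dict mapping metric names to float values; lines that do not
    fit the pattern are skipped rather than treated as errors."""
    metrics = {}
    for line in payload.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # header, blank, or otherwise malformed line
        name, value = parts
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # value was not numeric; ignore it
    return metrics

# Hypothetical payload for illustration only.
sample = """cputotals.user 12
cputotals.sys 3
nettotals.kbin 1841.2"""
print(parse_lexpr(sample)["nettotals.kbin"])  # → 1841.2
```

The same parser works whether the payload arrived over the TCP socket interface or was read from a file, since it only cares about the line format.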
Tested at scale within the HP Public Cloud
Running on all servers at a monitoring frequency of 1 second, collectl has been validated to
run efficiently in some of the most demanding environments.
IPMI monitoring for fans and temperature sensors
The main reason this is experimental is that vendors all report IPMI sensor data
differently, even within their own product lines. This is an attempt to table-drive
the parsing of the data in such a way that collectl can rationalize its display.
API for importing additional data
If you've been a collectl user but wanted to import some additional data,
this is the way to go. This API provides
seamless, full-function integration into collectl! Your data can be displayed in brief, verbose
and detail formats. It can be written to plot files, accessed from multiple systems at the same time
with colmux, and even sent over a socket to external clients, all while appearing as a core
part of collectl.
Top-anything across a cluster
With the inclusion of colmux, you can now run collectl on multiple machines in a cluster at
one time and be presented with an integrated view from all nodes, sorted by the column
of your choice.