For the most part, hardware configurations are static and so is collectl. When collectl starts, it discovers the disks, networks and CPUs, and that configuration doesn't change until you shut down, reconfigure and reboot. Clouds and virtual machines have turned this whole notion on its head, something that was bound to happen anyway. But it wasn't until several years ago, when collectl started to be used in OpenStack environments, that this design restriction became significant and needed to change. Since 2012 collectl has had the notion of dynamic Disks and Networks embedded in its core; dynamic CPUs were added even earlier. Collectl has been heavily tested in OpenStack clouds and is as rock solid as ever.
But what about cloud-specific subsystems? Once collectl was able to deal with dynamic devices, additional capabilities were added to deal with new cloud-specific subsystems as well. While not everything in an OpenStack cloud can be monitored, one has to start somewhere, and collectl has chosen to focus on the following:
Nova
In the case of Nova, almost all the data one needs to report on what a VM is doing is already being collected. Specifically, a VM uses a CPU, a Disk and a Network, so if one can tell which host resources these correspond to, one can associate their instance data with the VM and report something that looks like standard collectl process data. This is what the vmsum export module does, producing output like this:
# PROCESS SUMMARY (counters are /sec)
# PID  THRD S   VSZ   RSS CP  SysT  UsrT Pct  N AccumTim DskI DskO NetI NetO Instance
15622     1 S    5G  562M  8  0.00  0.00   0  1 07:26.72    0    0    0    0 0094eed9
32738     4 S    6G  632M  5  0.01  0.00   0  2 01:11:25    0    0    0    0 0093c0ef
36432     2 S    4G  944M  4  0.32  0.41  73  1 13:24:35    0   16  445  445 009570b9
36841     1 S    4G  935M  7  0.24  0.32  56  1 12:31:27    0    0  445  445 009570bb
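While the plugins do the real work internally, one can get a feel for the host-side mapping by hand. The following one-liner is purely illustrative, not collectl's actual code, and assumes, as is typical for nova with libvirt, that each guest's qemu process carries the instance UUID in a -uuid argument on its command line:

# print each qemu pid alongside the instance UUID nova passed at launch
ps -eo pid,args | awk '/[q]emu/ {for (i = 1; i < NF; i++) if ($i == "-uuid") print $1, $(i+1)}'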
The second thing one needs to figure out is how to map a network MAC address to an actual network name, and for that a second plugin, this time an import module called vnet, has been developed. It doesn't generate any output as other import modules do, but rather loads the data structures vmsum needs to find each network's virtual device.
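For illustration only, since vnet does this lookup internally, the raw device-to-MAC mapping is visible in sysfs and can be listed with a couple of lines of shell:

# show every network device on the host next to its MAC address; a tap
# device's MAC can then be matched against the one assigned to a VM's NIC
for dev in /sys/class/net/*; do
    printf '%-16s %s\n' "${dev##*/}" "$(cat "$dev/address" 2>/dev/null)"
done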
Finally, the disk stats come from the process data, but those are only available when collectl is run as root, so the ultimate command one needs to run to see the above output is the following, noting you will be warned to use sudo if you are not root.
sudo collectl --import vnet --export vmsum
Swift
Getting swift data is slightly more complicated because swift doesn't report statistics in an easy-to-use form. Rather, its standard mechanism is statsd, which requires a statsd listener. Further, since swift can only send to one listener, you can't have multiple consumers of the data, which is why statstee was developed. It is based on the philosophy of the unix tee command in that it sits between the source and destination and records the data locally, in this case to a file that looks like a /proc data structure.
For example, when you install and run statstee, it creates a file like the following, which is updated with rolling counters every tenth of a second (as long as something has changed). This means anyone can read that file as often as they choose and simply report the differences between samples as rates, just like collectl already does for all the other data it reports. A sketch of that calculation follows the sample.
cat /var/log/swift/swift-stats
V1.0 1425398070.323784
# errs pass fail
accaudt 0 2784 0
# errs cfail cdel cremain cposs_remain ofail odel oremain oposs_remain
accreap 0 0 0 0 0 0 0 0 0
# diff diff_cap nochg hasmat rsync rem_merge attmpt fail remov succ
accrepl 0 0 167004 0 0 0 167004 0 0 167004
# put get post del head repl errs
accsrvr 153 56960 0 0 57398 175140 0
# errs pass fail
conaudt 0 4770 0
# diff diff_cap nochg hasmat rsync rem_merge attmpt fail remov succ
conrepl 74 0 551811 0 0 0 1306955 0 0 551885
# put get post del head repl errs
consrvr 16884 104 0 7203 630 616300 11
# skip fail sync del put
consync 0 0 0 0 0
# succ fail no_chg
conupdt 43 0 130683
# quar errs
objaudt 0 0
# obj errs
objexpr 0 0
# part_del part_upd suff_hashes suff_sync
objrepl 0 73646775 7514 0
# put get post del head repl errs quar async_pend putcount puttime
objsrvr 16771 3819 0 7189 243 1615248 0 0 49 16614 17031.689711
# errs quar succ fail unlk
objupdt 0 0 49 0 49
# put get post del head copy opt bad_meth errs handoff handoff_all timout discon status
prxyacc 0 0 0 0 716 0 0 0 0 0 0 0 0 204:716
prxycon 37 195 0 19 1051 0 0 0 0 0 0 0 0 200:195 201:4 202:33 204:1059 409:11
prxyobj 12560 8155 0 7099 533 0 0 0 0 0 0 0 0 200:8681 201:12560 204:7099 404:7
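To make that concrete, here is a minimal sketch, not part of collectl, that turns one of the rolling counters into a rate by differencing two reads of the file; the field position is taken from the objsrvr line in the sample above:

# report the object server's PUT rate over a 10 second window
f=/var/log/swift/swift-stats
p1=$(awk '/^objsrvr/{print $2}' "$f")
sleep 10
p2=$(awk '/^objsrvr/{print $2}' "$f")
echo "object server PUTs/sec: $(( (p2 - p1) / 10 ))"

The statsd import module described below does this sort of differencing for every counter in the file.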
collectl --import statsd,h
usage: statsd, switches...
    d=mask  debug mask, see header for details
    h       print this help text
    f file  reads stats from specified file
    r       include return codes with proxy stats
    s       server: a, c, o and/or p
    t       data type to report, noting from the following
            that not all servers report all types
              t  name        servers
              a  auditor     acc con obj
              x  expirer     obj
              p  reaper      acc
              r  replicator  acc con obj
              s  server      acc con obj
              y  sync        con
              u  updater     con obj
    p       proxies require their own service type
              a  account service
              c  container service
              o  object service
    v       show version and default settings
    xx      2 char specific types built from -s, -t and -p
NOTE - setting s, t or p to * selects everything
collectl --import statsd,os,cr
waiting for 1 second sample...

#                       Container                                                         Object
#<----------------------Replicator----------------------><-----------------------Server----------------------->
# Diff  DCap Nochg  Hasm Rsync RMerg Atmpt  Fail Remov  Succ   Put   Get  Post  Dele  Head  Repl  Errs  Quar  Asyn PutTime
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0   0.000
     0     0     0    17     0     0     0    60     0     0     0     0     0     0     0     0     0     0     0   0.000
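In this example the 2-character types described in the help text select the object server (os) and container replicator (cr) statistics, which is why the output is limited to the container Replicator and object Server columns.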
Neutron
There is currently one big issue with neutron: whenever you create a new VM you get a new tap device, so over time the number of these devices continues to grow. If you are generating device-specific files, you will see that the number of networks collectl is tracking includes ALL networks that have ever existed since collectl started. This in turn means the columns in the net file will grow without bound. If you have a fixed set of VMs that isn't changing this may be ok, but in most cases it won't be. While there is currently no good solution for how to deal with this, collectl does have a new option for --netopts, specifically o, which tells collectl that whenever there is a change in the network configuration it should drop any unused networks from the current list. This means you'll end up breaking the column alignment in the detail file, but at least it won't grow uncontrollably. Since the names of the networks ARE retained in the line items, you can still see what's happening, but you won't be able to get a consistent view with colplot.
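For example, a logging invocation using this option might look like the following; the subsystem and output-directory choices here are just illustrative:

# collect network detail data in plot format, dropping unused tap
# devices from the list whenever the network configuration changes
collectl -sN --netopts o -P -f /var/log/collectl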
A second situation is the sheer volume of virtual network devices one can have in a large cluster. Trying to collect data for them on the neutron nodes themselves can easily involve monitoring thousands of devices, which may start to consume more CPU cycles than you wish to spend. If this becomes the case, consider using --rawnetfilt, which tells collectl to not even collect data on the specified network(s).
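As a sketch, assuming the filter takes a pattern like collectl's other device filters and that the virtual devices follow the common tap/qvo/qvb naming conventions (adjust the pattern for your environment):

# skip data collection entirely for neutron's per-VM virtual devices
collectl -sN --rawnetfilt 'tap|qvo|qvb' -P -f /var/log/collectl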
updated March 9, 2015