Is Your Kernel Reading /proc Too Slowly?

Introduction

The 2.6.32 release of the kernel introduced a regression that causes /proc to be read significantly more slowly, and with significantly more overhead, on systems with high core counts. In fact, this overhead has been measured at over a factor of 50 when reading /proc/stat on a system with 8 sockets and 48 cores.

So why should you care? This probably won't have any impact on any of your running applications - but do you ever run top? iostat? sar? or any other monitoring tools? If you're reading this you probably run collectl. Most monitoring tools are fairly lightweight, and for a good reason - if you're trying to measure something you don't want the tool's overhead to get in the way. Unfortunately, with this regression it now will!

A bug has been logged with Red Hat - see bugzilla 761293 - and we are also working with a kernel developer on patch testing. Results look promising, so a fix should eventually make it into a newer kernel release, but there will be a period during which kernels have higher /proc read overhead. Also remember that this only happens on systems with high core counts, not on most systems.

The Analysis

As an example, whenever a monitoring change is made to collectl I measure its overhead, just to make sure it's being done efficiently. The way I do this is to simulate a day's worth of sampling at an interval of 0 seconds and time it. In this case I had recently added numa monitoring and initially found this problem reading /sys/devices/system/node, but with more testing found that it appears elsewhere as well.

In the following example, you can see monitoring CPU data takes about 3 seconds to read almost 9K samples and write them to a file on a 2-socket/dual-core system. Very efficient!

time collectl -sc -i0 -c 8640 -f/tmp
real    0m2.879s
user    0m1.908s
sys     0m0.913s
Next I ran the same command on an 8-socket/48-core system - look at the difference. Note that the overhead was so high, and the run took so long, that I only took 1/10th the number of samples (I get impatient). This system is running Fedora 14, which uses a 2.6.35 kernel. Even with a tenth of the samples this run takes over 5 times as long as the previous one; normalized to a full day's 8640 samples that works out to roughly 168 seconds, or over 50 times the load:
time collectl -sc -i0 -c 864 -f/tmp
real    0m16.783s
user    0m3.003s
sys     0m13.523s

How can you tell if your system has this problem?

Before you panic, there are 2 things to keep in mind:

Since a simple uname command will tell you your kernel version, you might think that's all it takes, but nothing is ever that simple: most vendors patch their kernels, so you can't always be sure what code is actually running.

One way to tell for sure is to run the simple test below, which uses strace to time a read of /proc/stat (which seems to be the most heavily affected file) and shows how much time is spent in the actual read.

The following is on my 2-socket/dual-core system:

strace -c cat /proc/stat>/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000251         251         1           execve
  0.00    0.000000           0         3           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         4           open
  0.00    0.000000           0         5           close
  0.00    0.000000           0         5           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000251                    37         1 total
while the following is on the 8-socket/48-core system:
strace -c cat /proc/stat >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.014997        4999         3           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0        20        16 open
  0.00    0.000000           0         6           close
  0.00    0.000000           0        12        10 stat
  0.00    0.000000           0         5           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         4           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.014997                    66        27 total
As you can see, the differences are dramatic! On my 4-core machine virtually no time is spent doing the read, while on the 48-core machine almost 15 msec is spent reading - and that's from reading /proc/stat just once! Also remember - monitoring tools typically read a lot of different /proc structures. Perhaps now you can better appreciate how significant this problem really is.
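
If you'd rather not use strace, another quick check is to time a batch of reads directly. Here is a minimal sketch in Python (my own illustration, not part of collectl; the sample count is arbitrary) that reads /proc/stat repeatedly and reports the average cost per read:

#!/usr/bin/env python
# Minimal sketch: time repeated reads of /proc/stat and report the
# average cost per read. On an unaffected system this should be a
# small fraction of a millisecond; on an affected high-core-count
# system it will be dramatically higher.
import time

SAMPLES = 1000  # arbitrary; just enough reads to get a stable average

start = time.time()
for _ in range(SAMPLES):
    with open('/proc/stat') as f:
        f.read()
elapsed = time.time() - start

print("read /proc/stat %d times in %.3f sec (%.3f msec/read)" %
      (SAMPLES, elapsed, elapsed * 1000.0 / SAMPLES))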

What is collectl doing about this?

While there is nothing collectl can do to reduce this overhead, starting with the next release it will include code that times the reading of /proc on newer kernels with core counts of 32 or more and warns users if their systems exhibit this problem.
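
For illustration only, here is a rough sketch in Python of that kind of check - count the CPUs, time a single read of /proc/stat, and warn if it looks slow. This is not collectl's actual code, and the 5-msec threshold is purely an example:

#!/usr/bin/env python
# Illustrative sketch only - not collectl's actual implementation.
# Count the CPUs, time one read of /proc/stat, and warn if the read
# looks slow on a high-core-count system. The 32-core cutoff comes
# from the text above; the 5 msec threshold is an arbitrary example.
import time

def count_cpus():
    # Count the per-cpu "cpuN" lines in /proc/stat.
    with open('/proc/stat') as f:
        return sum(1 for line in f
                   if line.startswith('cpu') and line[3:4].isdigit())

cpus = count_cpus()

start = time.time()
with open('/proc/stat') as f:
    f.read()
msec = (time.time() - start) * 1000.0

if cpus >= 32 and msec > 5.0:
    print("warning: reading /proc/stat took %.1f msec on a %d-core system; "
          "expect high monitoring overhead" % (msec, cpus))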

updated Dec 13, 2011