Processes in D state

There is little you can do with a process in D state (uninterruptible sleep). Such a process is usually waiting on I/O, but not necessarily. It cannot be killed or interrupted: signals, including SIGKILL, are not acted on until the process leaves the D state, which it may do on its own after some time.

To troubleshoot this, there are three things to examine:

  1. wait channel of the process (WCHAN, the kernel routine in which the thread is sleeping)
  2. strace and ltrace the process and cross-reference those outputs
  3. echo w > /proc/sysrq-trigger (as root; dumps a list of all tasks in D state, each with its kernel stack trace, to the kernel log, which you can read with dmesg and which syslog typically also writes to /var/log/messages)

For the first step, you can run one of the following:

$ ps axl
$ ps axl | awk '$10 ~ /D/'
$ ps -eo pid,stat,wchan:30,cmd
$ ps -C httpd -l

Note that the wchan column sometimes shows ‘-’ even for a process in D state.
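
The same information can also be read straight from /proc, which helps when you want to script the check. The following is only a minimal sketch: list_d_state is a hypothetical helper name, and its optional directory argument exists only so the function can be pointed at a copy of /proc. It assumes a Linux /proc with per-process status and wchan files.

```shell
#!/bin/sh
# list_d_state [PROC_DIR]
# Prints "pid<TAB>name<TAB>wchan" for every process whose State is D.
# Only thread-group leaders are checked; threads under task/ are ignored.
list_d_state() {
    root="${1:-/proc}"
    for dir in "$root"/[0-9]*; do
        # Processes may exit while we iterate, so skip unreadable entries.
        [ -r "$dir/status" ] || continue
        state=$(awk '/^State:/ { print $2; exit }' "$dir/status" 2>/dev/null)
        [ "$state" = "D" ] || continue
        name=$(awk '/^Name:/ { print $2; exit }' "$dir/status" 2>/dev/null)
        # wchan may be missing or empty; fall back to "-" as ps does.
        wchan=$(cat "$dir/wchan" 2>/dev/null)
        printf '%s\t%s\t%s\n' "${dir##*/}" "$name" "${wchan:--}"
    done
}
```

Calling list_d_state with no argument scans the live /proc. Where the kernel provides it, reading /proc/<pid>/stack as root additionally shows the kernel stack of a listed process.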

Next, run strace -c against a healthy process and then against the misbehaving one, and compare the results. Also strace the misbehaving process directly to see whether it hangs in a particular call:

$ strace -c -p 24626 # (CTRL+C to interrupt)
$ strace -s 512 -f -o strace.out -p 24626
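
Comparing two strace -c summaries by eye gets tedious, so here is a rough sketch of how the comparison could be scripted. It assumes each summary was saved with strace -c -o FILE -p PID and has the usual layout, with "% time" as the first column and the syscall name as the last. The function name compare_strace_summaries and the file names are hypothetical.

```shell
#!/bin/sh
# compare_strace_summaries GOOD_FILE BAD_FILE
# Prints "syscall<TAB>good_pct<TAB>bad_pct", one line per syscall seen in
# either summary; "-" means the syscall is absent from that summary.
compare_strace_summaries() {
    awk '
        # Data rows start with a numeric "% time" value; skip the total row.
        $1 ~ /^[0-9.]+$/ && $NF != "total" {
            if (FILENAME == ARGV[1]) good[$NF] = $1
            else                     bad[$NF] = $1
            seen[$NF] = 1
        }
        END {
            for (s in seen)
                printf "%s\t%s\t%s\n", s,
                    ((s in good) ? good[s] : "-"),
                    ((s in bad)  ? bad[s]  : "-")
        }
    ' "$1" "$2" | sort
}
```

For example, compare_strace_summaries good.txt bad.txt prints one row per syscall, so a call that dominates only the misbehaving process stands out immediately.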

Do the same with ltrace to compare the library calls made by the healthy and the misbehaving process:

$ ltrace -c -p 24626
$ ltrace -s 512 -f -o ltrace.out -p 24626
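
The third item in the list at the top, the SysRq dump, can be sketched as follows. This assumes you are root and that the kernel was built with SysRq support; trigger_blocked_task_dump is a hypothetical helper name, and its optional path argument is there only so the function can be tried against an ordinary file.

```shell
#!/bin/sh
# trigger_blocked_task_dump [TRIGGER_FILE]
# Writes "w" to the SysRq trigger, asking the kernel to log all blocked
# (D state) tasks. Returns 1 if the trigger file is not writable.
trigger_blocked_task_dump() {
    trigger="${1:-/proc/sysrq-trigger}"
    if [ ! -w "$trigger" ]; then
        echo "cannot write to $trigger (are you root?)" >&2
        return 1
    fi
    echo w > "$trigger"
}
```

A typical invocation would be trigger_blocked_task_dump && dmesg | tail -n 100, after which the stack traces of the blocked tasks appear at the end of the kernel log.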

I’ve seen this on web servers handling a high number of requests per second: the load average shoots up to extreme values (even a couple of thousand) and all the httpd processes enter D state. The load average is misleading here. For historical reasons, processes in D state count toward the load even though they are hung and use no CPU. Everything looks good in top (the CPU is idle most of the time, disk I/O wait is fine, and the system is responsive), yet the system appears to have entered a deadlock, a condition in which processes wait for resources in a circular chain. There are some tools for deadlock analysis, such as user-space lockdep (http://lwn.net/Articles/536363/).
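
To see how much of the load comes from D state rather than from runnable processes, you can count both and set the result next to /proc/loadavg. This is a rough sketch: count_rd_states is a hypothetical helper name, and the counts are a snapshot while the load average is a moving average, so the numbers will not match exactly.

```shell
#!/bin/sh
# count_rd_states
# Reads one ps STAT value per line on stdin and prints "R=<n> D=<n>",
# using only the first character of each value (the process state).
count_rd_states() {
    awk '
        { count[substr($1, 1, 1)]++ }
        END { printf "R=%d D=%d\n", count["R"], count["D"] }
    '
}
```

On a live system this could be run as ps -eo stat= | count_rd_states, followed by cat /proc/loadavg. A high load average alongside a large D count and a small R count matches the situation described above.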
