Page cache and page writeback

Today I was dealing with an interesting case that required some knowledge of Linux kernel memory management, specifically page I/O. A customer was complaining that whenever they copied an 8 GB file with scp to their Oracle Enterprise Linux 5.4 instance, the box would soon become unresponsive and they could no longer ssh into it. The only information I had was the output of vmstat and iostat. Luckily, that was enough to work out what was happening.

Here’s the vmstat excerpt:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 4 0 5810460 67552 8915472 0 0 28 628 457 147 2 0 81 18
0 4 0 5810460 67552 8915472 0 0 0 8448 601 111 0 0 59 41
0 4 0 5810460 67552 8915472 0 0 0 4096 301 110 0 0 68 32

Straight away, I spotted two things:

  • there were 4 processes in uninterruptible sleep (the b column)
  • the page cache held more than 8 GB (the cache column, in kB)

Next, I launched the Oracle AMI myself to check its default virtual memory settings:

$ grep . /proc/sys/vm/{dirty_background_ratio,dirty_ratio,dirty_writeback_centisecs}
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_ratio:40
/proc/sys/vm/dirty_writeback_centisecs:500

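With all this information, we can conclude the following. The c3.2xlarge EC2 instance in question has 16 GB of RAM, and the customer was writing an 8 GB file. The VM subsystem starts background writeback once dirty pages exceed 10% of memory, about 1.6 GB (dirty_background_ratio); in addition, pdflush wakes up every 5 seconds (dirty_writeback_centisecs) to flush pages that have been dirty for longer than dirty_expire_centisecs. If the disk cannot keep up with scp, the dirty pages eventually reach dirty_ratio (40%, about 6.4 GB), and when that happens, every process that writes to the page cache is blocked until enough pages have been written back!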
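The threshold arithmetic above can be reproduced with a small shell sketch. It assumes the ratios are applied to the total memory reported in /proc/meminfo; the kernel actually applies them to its own estimate of dirtyable memory, so the results are rough upper bounds. The function names are hypothetical helpers, not standard tools.

```shell
#!/bin/sh
# Estimate the dirty page thresholds (in kB) from memory size and the two ratios.
# Assumption: ratios are applied to total memory; the kernel really uses its own
# "dirtyable memory" figure, so treat the output as a rough upper bound.
dirty_thresholds_kb() {
  mem_kb=$1     # total memory in kB
  bg_ratio=$2   # vm.dirty_background_ratio (percent)
  ratio=$3      # vm.dirty_ratio (percent)
  echo "background=$((mem_kb * bg_ratio / 100)) blocking=$((mem_kb * ratio / 100))"
}

# Same calculation, using the live values of the running system.
current_dirty_thresholds_kb() {
  mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  dirty_thresholds_kb "$mem_kb" \
    "$(cat /proc/sys/vm/dirty_background_ratio)" \
    "$(cat /proc/sys/vm/dirty_ratio)"
}
```

For example, dirty_thresholds_kb 16000000 10 40 prints background=1600000 blocking=6400000, which are the 1.6 GB and 6.4 GB figures above; current_dirty_thresholds_kb does the same with the instance's own values.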

At that point I would expect to see the dirty page flush threads ([pdflush] in the process list) in D state, sleeping in blk_congestion_wait, because heavy writeback to a single block device congests its request queue.

The sshd daemon very likely gets blocked at this point as well, which would explain why the customer could no longer ssh in.
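To check that expectation on a live system, the tasks stuck in D state and the kernel function each one is sleeping in can be read straight from /proc. This is a sketch with a hypothetical function name; it relies on /proc/<pid>/status and /proc/<pid>/wchan, and wchan may be missing or show 0 depending on kernel configuration and permissions. The optional argument is the proc root, so the function can also be pointed at a copied tree.

```shell
#!/bin/sh
# List tasks in uninterruptible sleep (state D) as "PID NAME WCHAN".
list_dstate() {
  root=${1:-/proc}
  for dir in "$root"/[0-9]*; do
    [ -r "$dir/status" ] || continue
    # "State:<tab>D (disk sleep)" -> "D"; the task may exit while we read it.
    state=$(awk '/^State:/ {print $2}' "$dir/status" 2>/dev/null)
    [ "$state" = "D" ] || continue
    name=$(sed -n 's/^Name:[[:space:]]*//p' "$dir/status" 2>/dev/null)
    # wchan holds the kernel function the task sleeps in, e.g. blk_congestion_wait.
    wchan=$(cat "$dir/wchan" 2>/dev/null)
    echo "${dir##*/} $name ${wchan:-?}"
  done
}
```

Where procps is available, ps -eo pid,stat,wchan,comm gives similar information for every process.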

Once the dirty pages have been written back to disk, which takes a while, the box should regain interactivity. If it does not, the instance may be hitting a kernel bug or a deadlock (the Oracle Enterprise Linux 5.4 AMI uses an older kernel).
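To watch the backlog drain, the Dirty and Writeback counters in /proc/meminfo show how much data is still waiting to be written and how much is being written right now. A minimal sketch with hypothetical helper names; the 10 MB threshold and 5-second polling interval in wait_for_writeback are arbitrary example values.

```shell
#!/bin/sh
# Print the Dirty and Writeback counters (kB) from a meminfo file.
# The anchored patterns skip similar lines such as "WritebackTmp:".
dirty_writeback_kb() {
  awk '/^Dirty:/ {d=$2} /^Writeback:/ {w=$2} END {print "dirty=" d " writeback=" w}' \
    "${1:-/proc/meminfo}"
}

# Poll until the dirty backlog drops below a limit in kB (default 10 MB).
wait_for_writeback() {
  limit_kb=${1:-10240}
  while :; do
    dirty=$(awk '/^Dirty:/ {print $2}' /proc/meminfo)
    [ "$dirty" -le "$limit_kb" ] && break
    echo "still dirty: ${dirty} kB"
    sleep 5
  done
}
```

Run from a console that still responds, a steadily shrinking Dirty value means the box is recovering; a value that stays flat while processes remain in D state points more towards the kernel problem mentioned above.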

As the simplest workaround, I recommended copying the file in the reverse direction, that is, running scp on the instance to pull the file, and using nice and ionice to lower the priority of that scp process and potentially reduce the load.
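A sketch of the nice/ionice part, using hypothetical helper names. ionice priorities are only honoured by the CFQ I/O scheduler, so the first function shows which scheduler a disk is using. The second runs a command at nice 19 and at the lowest best-effort I/O priority (class 2, level 7), which does not need root. Keep in mind that on kernels of this era buffered writes reach the disk via pdflush rather than from scp itself, so ionice may have less effect than hoped.

```shell
#!/bin/sh
# Print the active I/O scheduler of a block device: the entry shown in
# [brackets] in /sys/block/<dev>/queue/scheduler, e.g. "noop deadline [cfq]".
# The optional second argument replaces /sys/block (useful for testing).
active_scheduler() {
  sed -n 's/.*\[\(.*\)\].*/\1/p' "${2:-/sys/block}/$1/queue/scheduler"
}

# Run a command with the lowest CPU priority (nice 19) and the lowest
# best-effort I/O priority (ionice class 2, level 7).
run_low_priority() {
  nice -n 19 ionice -c2 -n7 "$@"
}
```

For example, active_scheduler xvdf should print cfq if ionice will be honoured on that device, and run_low_priority scp user@source-host:/path/to/bigfile /data/ pulls the file at low priority; the device name, host and paths are placeholders.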

The alternative is to increase dirty_ratio.
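For that second option, dirty_ratio can be changed on the running system with sysctl -w vm.dirty_ratio=50 (as root) and made persistent with a line in /etc/sysctl.conf, loaded with sysctl -p. The value 50 is purely illustrative, not one the customer tested, and some older kernels may cap the effective ratio internally, so it is worth confirming the behaviour on the actual kernel.

```
# /etc/sysctl.conf (illustrative value)
vm.dirty_ratio = 50
```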

Good explanations of how this works can be found in the following sources:

  • Linux Kernel Development, Third Edition
  • http://www.westnet.com/~gsmith/content/linux-pdflush.htm
