TCP/IP offloading and per-packet optimization

The overhead of network stack processing in high-speed networks (1/10 GbE) is significant, to the point where it becomes the bottleneck: the CPU simply cannot keep up with the I/O. The technology that moves the entire TCP/IP processing from the host CPU onto the NIC itself is called a TCP offload engine (TOE). If the NIC does not support TOE, there are alternatives in which the OS offloads some of these operations to the NIC.

Rule of thumb:

  • 1 Hz of CPU is required to send and receive 1 bit/s
  • 5 Gbit/s requires 5 GHz (2 x 2.5 GHz cores)

5 GHz of CPU power solely for network processing is a lot. With packet size capped at 1500 bytes (MTU), a 10 GbE network link running at full speed will be transferring over 800,000 packets per second.
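
The arithmetic behind that figure (ignoring Ethernet framing overhead, which lowers it only slightly):

  10 Gbit/s ÷ (1500 bytes × 8 bits/byte) ≈ 833,000 packets/s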

So per-packet overhead is important. On 10 GbE local networks we can use jumbo frames (9000-byte MTU) to reduce the packet rate and improve performance. On the Internet, of course, we cannot rely on jumbo frames, but we can still make use of offloading optimizations.

Sending data (LSO)

Large segment offload (LSO) works by queuing up large buffers and letting the NIC split them into separate packets. With some intelligence in the NIC, the host CPU can hand over up to 64 KB of data in a single transmit request, and the NIC breaks that data down into smaller, MTU-sized segments.
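
From the application's point of view nothing changes: it hands one large buffer to a TCP socket and lets the kernel and the NIC worry about segmentation. A minimal sketch of the sending side (the 192.0.2.10:9000 destination and the 64 KB buffer are purely illustrative):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    static char buf[64 * 1024];
    memset(buf, 'x', sizeof(buf));

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst = { 0 };
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);                      /* example port */
    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr); /* example host */

    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("connect");
        return 1;
    }

    /* One 64 KB buffer handed to the kernel in a single call.  With
     * TSO/GSO enabled it is split into MTU-sized frames in the NIC or
     * just above the driver, not packet by packet up in the stack. */
    ssize_t sent = send(fd, buf, sizeof(buf), 0);
    printf("handed %zd bytes to the kernel in one send() call\n", sent);

    close(fd);
    return 0;
}

With TSO/GSO off the same send() still works; the only difference is where the segmentation work gets done.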

The two implementations of this idea in Linux are TSO and GSO.

TCP segmentation offloading (TSO)

If we can’t use a larger MTU, we can go for the next-best thing: pretend that we’re using a larger MTU.

With a TSO-capable adapter, the kernel can prepare much larger packets (64 KB, say) for outgoing data; the adapter then re-segments the data into smaller packets as it hits the wire. TSO effectively increases the local MTU to 64 KB and cuts the kernel’s per-packet overhead by a large factor.

TSO is well supported in Linux; for systems engaged mainly in sending data, it is sufficient to drive a 10 GbE link at full speed.

If TSO is disabled, the kernel performs TCP segmentation on the CPU.

Generic segmentation offloading (GSO)

The kernel has a generic segmentation offload mechanism which is not limited to TCP.

It turns out that much of the performance benefit remains even if the feature is emulated in software just before the packet is handed to the driver. GSO, however, only helps with data transmission, not reception.

Receiving data (LRO)

Large receive offload (LRO)

Incoming packets are merged at reception time so that the OS sees far fewer of them. This merging can be done either in the driver or in the hardware; even LRO emulation in the driver has performance benefits.

In Linux, it is generally used in conjunction with the New API (NAPI), which also reduces the number of interrupts. The Linux kernel supports LRO for TCP in software only.

LRO can break things, for example when packets have to be forwarded or bridged, because the merge it performs is lossy: information needed to reconstruct the original packets exactly is thrown away.

Generic receive offload (GRO)

GRO is not limited to TCP/IP, as LRO is, and the criteria for which packets can be merged are greatly restricted, so that the merged packets can later be resegmented without losing information.

Linux

Check offloading:

$ ethtool -k eth0 
Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: off

Set offloading:

$ ethtool -K eth0 tso on
$ ethtool -K eth0 gso on
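
The same switches can also be flipped programmatically through the SIOCETHTOOL ioctl, which is what ethtool uses under the hood. A minimal C sketch that reads and then enables TSO on eth0 (the interface name is just an example; the legacy ETHTOOL_GTSO/ETHTOOL_STSO commands are used here for brevity):

#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct ifreq ifr;
    struct ethtool_value eval;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for the ioctl */
    if (fd < 0) { perror("socket"); return 1; }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* example interface name */
    ifr.ifr_data = (void *)&eval;

    /* Query the current TSO state (same information as "ethtool -k eth0"). */
    eval.cmd = ETHTOOL_GTSO;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("ETHTOOL_GTSO"); return 1; }
    printf("tcp-segmentation-offload: %s\n", eval.data ? "on" : "off");

    /* Turn it on (equivalent to "ethtool -K eth0 tso on"); requires root. */
    eval.cmd = ETHTOOL_STSO;
    eval.data = 1;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
        perror("ETHTOOL_STSO");

    close(fd);
    return 0;
}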

Sources

  • http://en.wikipedia.org/wiki/TCP_offload_engine
  • http://en.wikipedia.org/wiki/Large_receive_offload
  • http://en.wikipedia.org/wiki/Large_segment_offload
  • http://lwn.net/Articles/358910/
  • http://www-archive.xenproject.org/files/summit_3/rdd-tso-xen.pdf
