MTU

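MTU (maximum transmission unit) is the largest packet, in bytes, that can be sent over a network link in a single frame without fragmentation. It is a property of the network interface (NIC) and of the link-layer protocol; for Ethernet the default is 1500 bytes.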

IP fragmentation and reassembly

IP determines which local interface the datagram will be sent over and looks up that interface’s MTU. If the datagram is larger than the MTU, it must be fragmented before it is transmitted.

Fragmentation can take place at the original sending host and at any intermediate router along the end-to-end path. Fragmentation is undesirable: if one fragment is lost, the entire datagram is lost. Fragmentation and reassembly also take time and resources.

Sending hosts and routers fragment; only the destination host reassembles. Routers never reassemble.
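To make the fragmentation step concrete, here is a minimal sketch of how an IPv4 datagram could be cut to fit an MTU. The function name fragment and its tuple layout are my own for illustration, and it assumes a 20-byte IPv4 header with no options.

```python
IP_HEADER = 20  # bytes; assumes an IPv4 header without options


def fragment(total_length, mtu):
    """Split an IPv4 datagram of total_length bytes into fragments that fit mtu.

    Returns a list of (offset_in_8_byte_units, fragment_length, more_fragments).
    """
    if mtu < IP_HEADER + 8:
        raise ValueError("MTU too small to carry any fragment payload")
    if total_length <= mtu:
        return [(0, total_length, False)]  # fits as-is, no fragmentation

    payload = total_length - IP_HEADER
    # Every fragment except the last carries a multiple of 8 payload bytes,
    # because the fragment offset field counts 8-byte units.
    per_fragment = (mtu - IP_HEADER) // 8 * 8

    fragments = []
    sent = 0
    while sent < payload:
        size = min(per_fragment, payload - sent)
        more = sent + size < payload  # "more fragments" flag
        fragments.append((sent // 8, size + IP_HEADER, more))
        sent += size
    return fragments


if __name__ == "__main__":
    for offset, length, more in fragment(4000, 1500):
        print(f"offset={offset:4d} length={length:5d} MF={int(more)}")
```

With these assumptions, a 4000-byte datagram on a 1500-byte link becomes three fragments of 1500, 1500 and 1040 bytes, each carrying its own IP header.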

Path MTU discovery (PMTUD)

When two hosts communicate across multiple networks, each link can have a different MTU. The smallest MTU among all the links on the network path is called the path MTU. The path MTU between two hosts is not necessarily constant: it can change over time as routes change, and it can differ between the two directions when routes are asymmetric.

Senders can discover the lowest-MTU link through which a packet must pass by setting the packet’s “do not fragment” flag. If the packet reaches an intermediate router that cannot forward the packet without fragmenting it, the router returns an ICMP error message to the sender. The ICMP packet includes the MTU of the network that’s demanding smaller packets, and this MTU then becomes the governing packet size for communication with that destination.
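The discovery loop described above can be sketched as a small simulation. This is not real network code: link_mtus is a made-up list of per-link MTUs ordered from sender to receiver, and the "ICMP error" is simply the MTU of the first link that is too small for the current packet size.

```python
def discover_path_mtu(link_mtus):
    """Simulate PMTUD over a list of per-link MTUs (sender to receiver).

    Returns (path_mtu, probe_sizes), where probe_sizes lists every packet
    size the sender tried with the DF flag set.
    """
    size = link_mtus[0]  # start with the MTU of the sender's own link
    probes = []
    while True:
        probes.append(size)
        # The first link too small for the packet drops it and reports
        # its own MTU back to the sender (the simulated ICMP error).
        blocking = next((mtu for mtu in link_mtus if mtu < size), None)
        if blocking is None:
            return size, probes  # packet made it end to end
        size = blocking  # retry with the MTU the router reported


if __name__ == "__main__":
    path_mtu, probes = discover_path_mtu([9000, 1500, 1492, 1500])
    print("probes:", probes, "path MTU:", path_mtu)
```

In this hypothetical path the sender needs three attempts (9000, 1500, 1492) before a packet gets through, and ends up with the smallest link MTU as the path MTU.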

TCP performs path MTU discovery automatically. To avoid fragmentation at the IP layer, a host sets its TCP MSS (maximum segment size) to the largest IP datagram it can handle minus the IP and TCP header sizes (e.g. 1500 – 20 – 20 = 1460).
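The same arithmetic as a tiny helper, assuming IPv4 and TCP headers of 20 bytes each with no options (the function name tcp_mss is only for illustration):

```python
IPV4_HEADER = 20  # bytes, no IP options assumed
TCP_HEADER = 20   # bytes, no TCP options assumed


def tcp_mss(mtu, ip_header=IPV4_HEADER, tcp_header=TCP_HEADER):
    """Return the TCP MSS that keeps a full segment within the given MTU."""
    return mtu - ip_header - tcp_header


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu} -> MSS {tcp_mss(mtu)}")
```

A standard 1500-byte MTU gives an MSS of 1460, and a 9000-byte jumbo-frame MTU gives 8960.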

UDP applications should take extra care with fragmentation. If a UDP datagram plus its headers exceeds the link’s MTU, the IP datagram is split into multiple fragments. This can hurt performance, because if any fragment is lost, the entire datagram is lost.
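To get a feel for the cost, here is a rough estimate of how fragmentation amplifies loss. It assumes that fragments are lost independently and with the same probability, which real networks do not guarantee, and the 1% loss rate in the example is an arbitrary illustrative value. Header sizes are 20 bytes for IPv4 (no options) and 8 bytes for UDP.

```python
import math

IPV4_HEADER = 20  # bytes, no IP options assumed
UDP_HEADER = 8    # bytes


def datagram_loss(udp_payload, mtu, fragment_loss):
    """Return (fragment_count, probability that the whole datagram is lost).

    Assumes each fragment is lost independently with probability fragment_loss.
    """
    ip_payload = udp_payload + UDP_HEADER
    # Payload per fragment is rounded down to a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    count = math.ceil(ip_payload / per_fragment)
    # The datagram survives only if every fragment survives.
    return count, 1 - (1 - fragment_loss) ** count


if __name__ == "__main__":
    for payload in (1472, 8000):
        count, loss = datagram_loss(payload, 1500, 0.01)
        print(f"{payload} bytes: {count} fragment(s), loss {loss:.2%}")
```

Under these assumptions a 1472-byte payload fits in one packet and is lost about 1% of the time, while an 8000-byte payload needs six fragments and is lost nearly 6% of the time.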

Testing in Linux

The ping utility can set the don’t fragment flag (-M do) and specify an arbitrary payload size (-s):

$ ping -M do -s 1600 -c 5 www.google.com
PING www.google.com (74.125.24.106) 1600(1628) bytes of data.
From mars.local (192.168.1.102) icmp_seq=1 Frag needed and DF set (mtu = 1500)
From mars.local (192.168.1.102) icmp_seq=1 Frag needed and DF set (mtu = 1500)
From mars.local (192.168.1.102) icmp_seq=1 Frag needed and DF set (mtu = 1500)
From mars.local (192.168.1.102) icmp_seq=1 Frag needed and DF set (mtu = 1500)
From mars.local (192.168.1.102) icmp_seq=1 Frag needed and DF set (mtu = 1500)

This shows that my interface has an MTU of 1500, so any larger packet would need to be fragmented.
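The 1600(1628) in the ping output is the payload size followed by the full IP packet size. A small sketch of that arithmetic, assuming a 20-byte IPv4 header without options and the 8-byte ICMP echo header (the function names are illustrative only):

```python
IPV4_HEADER = 20  # bytes, no IP options assumed
ICMP_HEADER = 8   # bytes, ICMP echo request header


def ping_packet_size(payload):
    """Total IP packet size for a ping with the given -s payload size."""
    return payload + ICMP_HEADER + IPV4_HEADER


def max_ping_payload(mtu):
    """Largest -s value that still fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER


if __name__ == "__main__":
    print(ping_packet_size(1600))   # matches the 1628 shown by ping
    print(max_ping_payload(1500))   # largest payload for a 1500-byte MTU
```

So on a 1500-byte MTU, the largest payload that should pass with the don’t fragment flag set is 1472 bytes.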

For path MTU discovery, you can use tracepath:

$ sudo tracepath -n www.upc.ie/80
 1:  192.168.1.102                                         0.107ms pmtu 1500
 1:  192.168.1.1                                           1.474ms 
 1:  192.168.1.1                                           3.294ms 
 2:  no reply
 3:  188.141.126.1                                        62.164ms asymm  4 
 4:  84.116.238.50                                        70.985ms asymm 14 
 5:  84.116.134.74                                        86.328ms asymm 13 
 6:  84.116.137.74                                        69.460ms asymm 12 
 7:  84.116.130.6                                         61.852ms asymm 11 
 8:  84.116.134.230                                      139.676ms asymm  7 
 9:  no reply
10:  84.116.138.45                                        77.588ms asymm  8 
11:  195.34.135.118                                      102.762ms asymm  9 
12:  no reply
[...]
31:  no reply
     Too many hops: pmtu 1500
     Resume: pmtu 1500

If any hop beyond my local network had a lower MTU, the Resume line would report that value as the lowest MTU on the probed path.

Linux performs PMTUD by default (it sends packets with the DF flag set), but this can be disabled:

$ cat /proc/sys/net/ipv4/ip_no_pmtu_disc
0
$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_no_pmtu_disc

Example

I recently supported a customer who couldn’t wget from a specific CDN. wget would open the TCP connection, send an HTTP GET request, hang at “HTTP request sent, awaiting response…”, and eventually time out.

The packet capture showed that the hosts had negotiated a TCP MSS of 8960 bytes, since both ends had a jumbo-frame MTU (9000 bytes) configured on their NICs. That’s not good for two hosts talking over the Internet. Packets larger than the usual Internet MTU (1500 bytes) needed to be fragmented on the path, but the don’t fragment flag was set, so no response arrived from the remote host and wget hung. Lowering our MTU to 1500 bytes was the solution, since the hosts then negotiated a TCP MSS of 1460.
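The reasoning behind the fix can be checked with a few lines. This sketch assumes that the MSS used on the connection is effectively the smaller of the two values the hosts advertise, that headers are 40 bytes (IPv4 plus TCP, no options), and that the path MTU is 1500 bytes. The function name check_segments is made up for illustration.

```python
HEADERS = 40  # bytes: IPv4 (20) + TCP (20), no options assumed


def check_segments(local_mtu, remote_mtu, path_mtu):
    """Return (effective_mss, too_big_for_path) for a TCP connection.

    Assumes each host advertises MSS = MTU - 40 and that the smaller
    advertised value ends up limiting the segment size.
    """
    mss = min(local_mtu, remote_mtu) - HEADERS
    full_packet = mss + HEADERS  # size of a full segment on the wire
    return mss, full_packet > path_mtu


if __name__ == "__main__":
    print(check_segments(9000, 9000, 1500))  # both ends on jumbo frames
    print(check_segments(1500, 9000, 1500))  # after lowering our MTU
```

With both ends at 9000 the full-size segments exceed the path MTU, whereas lowering just one end to 1500 brings the MSS down to 1460 and every segment fits.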
