
Performance Tuning on Linux — Ethernet

Tune Ethernet Performance

Capacity Planning

You may be considering an upgrade to 10 Gbps Ethernet, but ask two questions first: Do the motherboard and its system bus have the speed to fill a 10 Gbps network? And, since you are probably moving file system data, can your disk and file system I/O keep up?

The switch backplane needs bandwidth equal to 2 times the number of ports times the speed. A switch with 20 ports running at 10 Gbps full-duplex needs a 20×2×10 = 400 Gbps bus backplane to be non-blocking.

Bonding, or Not

You can bond multiple Ethernet adapters together and potentially multiply your throughput. However...

Kernel documentation on bonding

You have a choice of several bonding algorithms; see here for details. However, all of them send every packet of a given flow between two endpoints through the same interface. Bonding will not increase the throughput between a pair of hosts, as each host will use just one physical interface to send every frame to the other host.

Bonding only helps if your data flow topology looks more like a star or a fairly complete mesh: for example, a file server with several concurrent clients.

If you have the appropriate data flow patterns, then bonding multiple 1 Gbps adapters can make sense given the price of upgrading to 10 Gbps hardware.

Bonding multiple 10 Gbps adapters gets expensive quickly, and the CPU is already awfully busy working to keep just the one 10 Gbps link filled.
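
For illustration, here is a minimal sketch of building a bond with the iproute2 tools, assuming two spare interfaces named enp0s2 and enp0s3 and a switch already configured for 802.3ad link aggregation; see the kernel documentation above for the per-mode details:

# ip link add bond0 type bond mode 802.3ad
# ip link set enp0s2 down
# ip link set enp0s3 down
# ip link set enp0s2 master bond0
# ip link set enp0s3 master bond0
# ip link set bond0 up
# cat /proc/net/bonding/bond0

Remember the caveat above: a single client-server pair will still use just one of the physical links.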

Measure Recent and Current Utilization

Get link-layer statistics for all interfaces:

# ip -s link
[...]
1: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 00:11:95:1e:8e:b6 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    8028989029 31573824 0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    3272273796 15088848 0       0       0       0
[...] 
Field     Meaning of non-zero values
errors    Poorly or incorrectly negotiated mode and speed, or a damaged network cable.
dropped   Possibly due to iptables or other filtering rules, but more likely due to a lack of network buffer memory.
overrun   The number of times the network interface ran out of buffer space.
carrier   A damaged or poorly connected network cable, or switch problems.
collsns   The number of collisions, which should always be zero on a switched LAN. A non-zero value indicates problems negotiating appropriate duplex mode. A small number that never grows means it happened when the interface came up but hasn't happened since.
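
To keep an eye on those counters as they change, rather than taking a single snapshot of every interface, restrict the output to one interface and repeat it. The interface name enp0s2 is just an example, and a second -s asks for a more detailed error breakdown:

# ip -s link show enp0s2
# ip -s -s link show enp0s2
# watch -n 1 ip -s link show enp0s2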

Enable Jumbo Frames

The example of protocol overhead came up back on the introductory page. Every Ethernet frame needs a standard header. Due to design decisions made years ago, the default MTU or maximum Ethernet payload length is 1500 bytes. Capping the frame length limits latency, the time a host has to wait before it can transmit its own frame, but that limit of 1500 bytes of payload plus header and CRC was specified when Ethernet ran at 10 Mbps. At 1,000 times the speed, frames could be much longer and still impose far lower latency. The CPU would also be interrupted less often.

A jumbo frame is an Ethernet frame with more than 1500 bytes of payload. A 9000-byte MTU reduces the protocol overhead and the rate of CPU interrupts by a factor of six. Most modern Ethernet equipment can support frames of up to 9,216 bytes, but verify that every device on the LAN supports your desired jumbo frame size before making any changes.

Set this interactively with the ip command:

# ip link set enp0s2 mtu 9000 

Make the setting persistent by adding MTU=9000 to /etc/sysconfig/network-scripts/ifcfg-enp0s2.
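
To verify that jumbo frames actually pass end to end, one sketch of a test is a ping with fragmentation prohibited. The payload size is the MTU minus 28 bytes of IP and ICMP headers, and 10.1.1.1 stands in for another jumbo-capable host on the same LAN:

# ping -c 3 -M do -s 8972 10.1.1.1

Silence, or an error about the message being too long, indicates that something along the path does not handle the larger frames.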

Performance Tuning With ethtool

Be aware that interface names have changed: they are no longer eth0, eth1, and so on, but names that encode the physical location, like enp0s2. Details on network interface names can be found here.

Not every option will be supported on a given network interface, and even if its chipset supports something, it's possible that the current Linux driver doesn't. So, don't expect all of the following to work on your system:

Using ethtool

Get current settings including speed and duplex mode and whether a link beat signal is detected, get driver information, and get statistics:

# ethtool enp0s2
# ethtool -i enp0s2
# ethtool -S enp0s2 

For most of these, the lower-case option queries and displays the current settings, while the corresponding upper-case option changes them. For help: ethtool -h

Interrupt Coalescing

Several packets in a rapid sequence can be coalesced into one interrupt passed up to the CPU, providing more CPU time for application processing.

# ethtool -c enp0s2 
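
The upper-case -C changes the coalescing parameters. Which of them a driver supports varies widely, so treat the following values purely as an illustration of the syntax rather than a recommendation:

# ethtool -C enp0s2 rx-usecs 100 rx-frames 64
# ethtool -C enp0s2 adaptive-rx on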

Ring Buffer

The ring buffer is another name for the driver queue. Get the maximum receive and transmit buffer lengths and their current settings. The RX and TX figures report the number of frames the buffer can hold; the buffer actually contains pointers to frame data structures. Change the settings to the maximums to optimize for throughput, at the possible cost of added latency. On a busy system the CPU has fewer opportunities to add packets to the queue, which increases the likelihood that the hardware will drain the buffer before more packets can be queued; a deeper ring gives the CPU more slack.

# ethtool -g enp0s2
Ring parameters for enp0s2:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             512
RX Mini:        0
RX Jumbo:       0
TX:             512
# ethtool -G enp0s2 rx 4096 tx 4096
# ethtool -g enp0s2
Ring parameters for enp0s2:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096 

Beware: This is appropriate for servers on high-speed LANs, not for personal systems with lower-speed connections. Let's say you have 256 packets of buffer. At 1,500 bytes each, that's 384,000 bytes or 3,072,000 bits. Over a 1 Mbps WLAN or ISP connection, that's over 3 seconds of latency. It would be six times worse with 9,000-byte jumbo frames.

Flow Control

Turn on flow control, allowing the host and the switch to pace their transmission based on current receive capability at the other end. This will reduce packet loss and it may provide a significant improvement on high-speed networks.

# ethtool -A enp0s2 rx on
# ethtool -A enp0s2 tx on 
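
The lower-case -a queries the pause parameters, so you can confirm the change and see whether the pause settings were autonegotiated. The exact output varies by driver, but it looks something like:

# ethtool -a enp0s2
Pause parameters for enp0s2:
Autonegotiate:  on
RX:             on
TX:             on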

Processing Offload

Offload all possible processing from kernel software into hardware.

# ethtool -k enp0s2
# ethtool -K enp0s2 tx-checksum-ipv4 on
# ethtool -K enp0s2 tx-checksum-ipv6 on 
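
Then confirm which features the driver actually accepted. Some are reported as [fixed] and cannot be changed; filtering the lower-case -k listing shows the checksum-related ones:

# ethtool -k enp0s2 | grep -i checksum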

Segmentation Offload: TSO, USO, LSO, GSO

It may be possible to offload TCP segmentation. The kernel hands a large segment, maybe 64 kbytes, to the NIC, which has enough intelligence to use a template from the kernel's TCP/IP stack to segment the data and add the TCP, UDP, IP, and Ethernet headers. This appears under multiple names: TSO or TCP Segmentation Offload, USO or UDP Segmentation Offload, LSO or Large Segment Offload, and GSO or Generic Segmentation Offload. You query these features with ethtool -k and change them with ethtool -K. Beware: segmentation offload should improve performance across a high-speed LAN, but it is likely to hurt performance across a multi-hop WAN path. For more details see:
Description of LSO
Linux Foundation on GSO
Performance Case Studies
Issues with LSO
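
As a sketch of the syntax, the lower-case -k shows the current segmentation offload state and the upper-case -K toggles features by name. Whether tso and gso can be changed depends on the NIC and driver; substitute off on a host whose traffic mostly crosses a multi-hop WAN path:

# ethtool -k enp0s2 | grep segmentation
# ethtool -K enp0s2 tso on gso on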

Bufferbloat

There has been a general trend toward larger and larger buffers at many places in protocol stacks across the Internet, hurting both latency and throughput. As Vint Cerf put it, "The key issue we've been talking about is that all this excessive buffering ends up breaking many of the timeout mechanisms built into our network protocols." See these articles for background:
"Bufferbloat: Dark Buffers in the Internet"
CACM Vol 55 No 1 pp 57-65
"BufferBloat: What's Wrong With the Internet?"
CACM Vol 55 No 2 pp 40-47
"Controlling Queue Delay"
ACM Queue Vol 10 No 5

The Byte Queue Limits or BQL algorithm has been added to deal with this. Kernel data structures under /sys/class/net/interface/queues/tx-0/byte_queue_limits/ control it. It is self-tuning and you probably don't want to mess with it, but to improve latency on low-speed networks you could put a smaller value in limit_max.
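
For example, to see what BQL has settled on for the first transmit queue of an interface, and to cap it on a low-speed link. The interface name enp0s2 and the 3000-byte cap are only illustrative; by default limit_max is a very large number:

# cat /sys/class/net/enp0s2/queues/tx-0/byte_queue_limits/limit
# cat /sys/class/net/enp0s2/queues/tx-0/byte_queue_limits/limit_max
# echo 3000 > /sys/class/net/enp0s2/queues/tx-0/byte_queue_limits/limit_max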

Making ethtool Settings Permanent

You could simply put the appropriate sequence of ethtool commands into /etc/rc.d/rc.local. That is my preferred method: the actual commands appear there in order, and I can insert a comment before each one to explain what it is doing and why.
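
A sketch of that approach, reusing the interface name and values from the examples on this page; on systemd-based distributions, remember to make rc.local executable so it actually runs at boot:

# cat /etc/rc.d/rc.local
#!/bin/sh
# Enlarge the ring buffers to favor throughput on the 10 Gbps interface.
ethtool -G enp0s2 rx 4096 tx 4096
# Turn on Ethernet flow control in both directions.
ethtool -A enp0s2 rx on tx on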

With the boot scripts you find on Red Hat, you could instead list them in the interface configuration file. It is read as a shell script, so we can build the string a piece at a time instead of writing one monster line.

# cat /etc/sysconfig/network-scripts/ifcfg-enp0s2
DEVICE=enp0s2
BOOTPROTO=static
IPADDR=10.1.1.100
NETMASK=255.255.255.0
ETHTOOL_OPTS="-s enp0s2 speed 1000 duplex full autoneg off
ETHTOOL_OPTS="$ETHTOOL_OPTS ; -K enp0s2 tx off rx off"
ETHTOOL_OPTS="$ETHTOOL_OPTS ; -G enp0s2 rx 4096 tx 4096" 

Kernel Parameters for Core Networking

The "core" networking parameters for the kernel are accessible in /proc/sys/net/core/*.

The input packet queue length is set in /proc/sys/net/core/netdev_max_backlog, and that queue length applies per CPU core. Increase it to reduce the number of packets dropped under high inbound network loads. Output queue lengths are set individually for each interface with the ip command.

First, do the math, and realize that this is a tradeoff: you increase total throughput at the cost of added latency. Let's say you have one 1000 Mbps interface, you are using the default 1500-byte maximum packet size, and you decide that 0.1 second of latency would be acceptable:

1000 Mbps = 125,000,000 bytes/s
125,000,000 bytes/second / 1500 bytes/frame = 83,333 frames/second
83,333 frames/second × 0.1 second = 8,333 frames
maximum queue of 8,333 1500-byte frames

# echo 8333 > /proc/sys/net/core/netdev_max_backlog
# ip link set enp0s2 txqueuelen 8333 

If you have upgraded to 10 Gbps and 9000-byte jumbo frames:

10,000 Mbps = 1,250,000,000 bytes/s
1,250,000,000 bytes/second / 9000 bytes/frame = 138,888 frames/second
138,888 frames/second × 0.1 second = 13,888 frames
maximum queue of 13,888 9000-byte frames

# echo 13888 > /proc/sys/net/core/netdev_max_backlog
# ip link set enp0s2 txqueuelen 13888 

However, as discussed here, it seems that the interface-specific value is only used as a default queue length for some of the available queue disciplines (or QDiscs) that can be manipulated with Linux Advanced Routing and Traffic Control and the tc command.

Making Kernel Parameter Tuning Permanent

The kernel value netdev_max_backlog could be set in a file like /etc/sysctl.d/02-netIO.conf, while the ip link command(s) would go in /etc/rc.d/rc.local.
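
For example, assuming the 1 Gbps calculation above, the sysctl.d file needs just one setting; running sysctl --system (or rebooting) then applies it:

# cat /etc/sysctl.d/02-netIO.conf
# Allow roughly 0.1 second of input queue at 1 Gbps with 1500-byte frames.
net.core.netdev_max_backlog = 8333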

On Red Hat derived systems you can instead create a script, /sbin/ifup-local, which is executed as each interface is brought up.
Positive: it runs automatically, with no need to re-run rc.local by hand.
Negative: it is specific to Red Hat.
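
A minimal sketch of such a script, reusing values from the examples above. The Red Hat network scripts pass the interface name as the first argument, and the script must be executable:

# cat /sbin/ifup-local
#!/bin/sh
# Called by the Red Hat ifup scripts with the interface name in $1.
if [ "$1" = "enp0s2" ]; then
    ethtool -G enp0s2 rx 4096 tx 4096
    ip link set enp0s2 txqueuelen 8333
fi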

Any parameters to pass to the module at load time go into a file /etc/modprobe.d/*.conf, based on the output of modinfo for that module and a careful reading of /usr/src/linux/Documentation/networking/drivername.txt and/or the source code. Some parameter values will be Boolean 0 versus 1, some will be numbers, some will be strings.

# cat /etc/modprobe.d/01-ethernet.conf
options dl2k jumbo=1 mtu=9000 tx_flow=1 rx_flow=1 rx_coalesce=100 tx_coalesce=40 

And next...

For more see this Linux Journal article: Queueing in the Linux Network Stack

There is nothing about IP to tune for performance, although some security improvements are discussed here. The next step up the protocol stack is to tune TCP performance.
