Performance Tuning on Linux — Ethernet
Tune Ethernet Performance
Capacity Planning
You are considering upgrading to 10 Gbps Ethernet, but do the motherboard and its system bus have the speed to fill a 10 Gbps network? You are probably transferring file system data, so can your disk and file system I/O keep up?
The switch backplane needs bandwidth equal to 2 times the number of ports times the speed. A switch with 20 ports running at 10 Gbps full-duplex needs a 20×2×10 = 400 Gbps bus backplane to be non-blocking.
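The backplane rule above is simple enough to check with shell arithmetic. This is a minimal sketch; the function name backplane_gbps is made up for illustration.

```shell
#!/bin/sh
# Non-blocking backplane bandwidth in Gbps:
# ports x 2 (full duplex) x per-port speed in Gbps.
backplane_gbps() {
    ports=$1
    port_speed_gbps=$2
    echo $(( ports * 2 * port_speed_gbps ))
}

backplane_gbps 20 10    # the 20-port 10 Gbps switch from the text: 400
```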
Bonding, or Not
You can bond multiple Ethernet adapters together and potentially multiply your throughput. However...
Kernel documentation on bonding
You have a choice of several bonding algorithms; see the kernel documentation for details. However, apart from round-robin mode (balance-rr), which stripes packets across interfaces at the risk of out-of-order delivery, they send all packets of a flow of data between two endpoints through the same interface. In those modes bonding will not increase the throughput between a pair of hosts, as each host will use just one physical interface to send every frame to the other host.
Bonding is only helpful if your data flow topology looks more like a star or a fairly complete mesh. For example, a file server with several concurrent clients.
If you have the appropriate data flow patterns, then bonding multiple 1 Gbps adapters makes sense compared to the price of upgrading to 10 Gbps hardware.
Bonding multiple 10 Gbps adapters gets expensive quickly, and the CPU is already awfully busy working to keep just the one 10 Gbps link filled.
Measure Recent and Current Utilization
Get link-layer statistics for all interfaces:
# ip -s link
[...]
1: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 00:11:95:1e:8e:b6 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    8028989029 31573824 0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    3272273796 15088848 0       0       0       0
[...]
Field | Meaning of Non-Zero Values |
errors | Poorly or incorrectly negotiated mode and speed, or damaged network cable. |
dropped | Possibly due to iptables or other filtering rules, more likely due to lack of network buffer memory. |
overrun | Number of times the network interface ran out of buffer space. |
carrier | Damaged or poorly connected network cable, or switch problems. |
collsns | Number of collisions, which should always be zero on a switched LAN. Non-zero indicates problems negotiating appropriate duplex mode. A small number that never grows means it happened when the interface came up but hasn't happened since. |
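The same counters are also exposed as files under /sys/class/net/INTERFACE/statistics/, which makes them easy to check from a script. The following is a rough sketch, not a definitive tool. It assumes the usual mapping of the overrun, carrier, and collsns columns to the rx_over_errors, tx_carrier_errors, and collisions files. The function name and the SYSFS_ROOT override are hypothetical, added only so the function can be tried against a copy of the tree.

```shell
#!/bin/sh
# Print every link-layer counter that is non-zero for one interface.
# SYSFS_ROOT is a hypothetical override for experiments; it defaults to /sys.
nonzero_counters() {
    iface=$1
    stats_dir="${SYSFS_ROOT:-/sys}/class/net/$iface/statistics"
    for counter in rx_errors tx_errors rx_dropped tx_dropped \
                   rx_over_errors tx_carrier_errors collisions; do
        file="$stats_dir/$counter"
        # Skip counters this kernel or driver does not expose.
        [ -r "$file" ] || continue
        value=$(cat "$file")
        if [ "$value" -ne 0 ]; then
            echo "$counter=$value"
        fi
    done
}

# Example: nonzero_counters enp0s1
```

Run it twice a few minutes apart. A value that keeps growing deserves attention, while one that stays constant probably dates from when the interface came up.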
Enable Jumbo Frames
The example of protocol overhead came up back on the introductory page. Every Ethernet frame needs a standard header. Due to design optimization decisions made years ago, the default MTU or maximum Ethernet frame length allows for 1500 bytes of payload. The maximum frame length limits latency, the amount of time a host has to wait to transmit its packet. But that limit of 1500 bytes of payload plus header and CRC was specified when Ethernet ran at 10 Mbps. With 1,000 times the speed, the frames could be much longer in bytes and still have far lower latency. Also, the CPU would be interrupted less often.
A jumbo frame is an Ethernet frame with more than 1500 bytes of payload. A 9000-byte MTU reduces the protocol overhead and CPU interrupts by a factor of six. Much modern Ethernet equipment can support frames up to 9,216 bytes, but make sure to verify that every device on the LAN supports your desired jumbo frame size before making any changes.
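The factor of six comes straight from the frame count. As a small sketch, the hypothetical function frames_needed below counts the frames required to move a given amount of data, treating the whole MTU as payload the way the text does and ignoring the IP and TCP headers carried inside it.

```shell
#!/bin/sh
# Frames needed to carry payload_bytes at a given MTU,
# rounding up for the final partial frame.
frames_needed() {
    payload_bytes=$1
    mtu=$2
    echo $(( (payload_bytes + mtu - 1) / mtu ))
}

frames_needed 9000000 1500    # 6000 frames
frames_needed 9000000 9000    # 1000 frames, one sixth as many
```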
Set this interactively with the ip command:
# ip link set enp0s2 mtu 9000
Make the setting persistent by adding MTU=9000 to /etc/sysconfig/network-scripts/ifcfg-enp0s2.
Performance Tuning With ethtool
Be aware that interface names have changed. They are no longer eth0, eth1, and so on, but names that encode physical location like enp0s2. Details on network interface names can be found here.
Not every option will be supported on a given network interface, and even if its chipset supports something, it's possible that the current Linux driver doesn't. So, don't expect all of the following to work on your system:
Using ethtool
Get current settings including speed and duplex mode and whether a link beat signal is detected, get driver information, and get statistics:
# ethtool enp0s2
# ethtool -i enp0s2
# ethtool -S enp0s2
Interrupt Coalesce
Several packets in a rapid sequence can be coalesced into one interrupt passed up to the CPU, providing more CPU time for application processing.
# ethtool -c enp0s2
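The command above only displays the current coalescing settings. Changing them uses the capital -C option. The sketch below wraps that in a hypothetical function. rx-usecs is a standard ethtool coalescing parameter, but whether your driver accepts it, and what value suits your traffic, are assumptions to test on your own hardware.

```shell
#!/bin/sh
# Ask the NIC to wait up to the given number of microseconds after a
# frame arrives before raising a receive interrupt.
set_rx_coalesce() {
    iface=$1
    usecs=$2
    ethtool -C "$iface" rx-usecs "$usecs"
}

# Example, with an illustrative value only:
#   set_rx_coalesce enp0s2 100
#   ethtool -c enp0s2
```

Larger values mean fewer interrupts but more delay before each packet is processed.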
Ring Buffer
The ring buffer is another name for the driver queue.
Get the maximum receive and transmit buffer lengths and their current settings. RX and TX report the number of frames in the buffer; the buffer contains pointers to frame data structures. Change the settings to the maximum to optimize for throughput while possibly increasing latency. On a busy system the CPU will have fewer opportunities to add packets to the queue, increasing the likelihood that the hardware will drain the buffer before more packets can be queued.
# ethtool -g enp0s2
Ring parameters for enp0s2:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             512
RX Mini:        0
RX Jumbo:       0
TX:             512
# ethtool -G enp0s2 rx 4096 tx 4096
# ethtool -g enp0s2
Ring parameters for enp0s2:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Beware: This is appropriate for servers on high-speed LANs, not for personal systems with lower-speed connections. Let's say you have 256 packets of buffer. At 1,500 bytes each that's 384,000 bytes or 3,072,000 bits. Over a 1 Mbps WLAN or ISP connection, that's over 3 seconds of latency. It would be six times worse with 9,000 byte jumbo frames.
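The arithmetic in that warning can be repeated for your own link speed. This is a minimal sketch with a hypothetical function name. It gives the worst-case time to drain a full buffer in whole milliseconds and needs a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# Worst-case time in milliseconds to drain a full buffer:
# packets x bytes per packet x 8 bits, divided by the link speed in bits/s.
queue_drain_ms() {
    packets=$1
    bytes_per_packet=$2
    link_bps=$3
    echo $(( packets * bytes_per_packet * 8 * 1000 / link_bps ))
}

queue_drain_ms 256 1500 1000000       # 1 Mbps, standard frames: 3072 ms
queue_drain_ms 256 9000 1000000       # 1 Mbps, jumbo frames: 18432 ms
queue_drain_ms 256 1500 1000000000    # 1 Gbps, standard frames: 3 ms
```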
Flow Control
Turn on flow control, allowing the host and the switch to pace their transmission based on current receive capability at the other end. This will reduce packet loss and it may provide a significant improvement on high-speed networks.
# ethtool -A enp0s2 rx on
# ethtool -A enp0s2 tx on
Processing Offload
Offload all possible processing from kernel software into hardware.
# ethtool -k enp0s2
# ethtool -K enp0s2 tx-checksum-ipv4 on
# ethtool -K enp0s2 tx-checksum-ipv6 on
Segmentation Offload: TSO, USO, LSO, GSO
It may be possible to offload TCP segmentation.
The kernel gives a large segment, maybe 64 kbytes,
to the NIC.
There is some intelligence in the NIC to use a template
from the kernel's TCP/IP stack to segment the data and
add the TCP, UDP, IP, and Ethernet headers.
This appears under multiple names:
TSO or TCP Segmentation Offload,
USO or UDP Segmentation Offload,
LSO or Large Segment Offload,
GSO or Generic Segmentation Offload.
It would be enabled with ethtool -K, while ethtool -k shows the current settings.
Beware: segmentation offload should improve performance
across a high speed LAN, but it is likely to hurt
performance across a multi-hop WAN path.
For more details see:
Description of LSO
Linux Foundation on GSO
Performance Case Studies
Issues with LSO
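As a rough sketch of how segmentation offload might be inspected and switched, the two hypothetical functions below wrap ethtool. The short names tso and gso are accepted by ethtool -K, but which offloads a given NIC and driver actually support is an assumption to verify on your system.

```shell
#!/bin/sh
# Show only the segmentation-related lines of the offload settings.
show_segmentation_offload() {
    ethtool -k "$1" | grep -i segmentation
}

# Turn TCP and generic segmentation offload on or off together.
set_segmentation_offload() {
    iface=$1
    state=$2    # "on" or "off"
    ethtool -K "$iface" tso "$state" gso "$state"
}

# Example: disable it on a host that mostly talks across a WAN path.
#   set_segmentation_offload enp0s2 off
#   show_segmentation_offload enp0s2
```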
Bufferbloat
There has been a general trend toward larger and larger
buffers at many places in protocol stacks across the Internet,
hurting both latency and throughput.
As Vint Cerf put it,
"The key issue we've been talking about is that all this
excessive buffering ends up breaking many of the
timeout mechanisms built into our network protocols."
See these articles for background:
"Bufferbloat: Dark Buffers in the Internet"
CACM Vol 55 No 1 pp 57-65
"BufferBloat: What's Wrong With the Internet?"
CACM Vol 55 No 2 pp 40-47
"Controlling Queue Delay"
ACM Queue Vol 10 No 5
The Byte Queue Limits or BQL algorithm has been added to deal with this. Kernel data structures in /sys/class/net/interface/queues/tx-0/byte_queue_limits/ control this. It's self-tuning and you probably don't want to mess with it, but to improve latency on low-speed nets you might put a smaller value in limit_max.
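Before changing anything it helps to see what BQL has settled on. The sketch below prints the files found in that directory for one transmit queue. The function name and the SYSFS_ROOT override are hypothetical, and the set of files may differ between kernel versions.

```shell
#!/bin/sh
# Print the BQL values for one transmit queue of an interface.
# SYSFS_ROOT is a hypothetical override for experiments; it defaults to /sys.
show_bql() {
    iface=$1
    queue=${2:-tx-0}
    dir="${SYSFS_ROOT:-/sys}/class/net/$iface/queues/$queue/byte_queue_limits"
    for name in limit limit_min limit_max hold_time inflight; do
        [ -r "$dir/$name" ] && echo "$name=$(cat "$dir/$name")"
    done
    return 0
}

# Example: show_bql enp0s2
# To cap the queue on a slow link, write a smaller byte count as root.
# The number here is purely illustrative:
#   echo 3000 > /sys/class/net/enp0s2/queues/tx-0/byte_queue_limits/limit_max
```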
Making ethtool Settings Permanent
You could simply put the appropriate sequence of ethtool commands into /etc/rc.d/rc.local. That is my preferred method: the actual commands appear there in order and I can insert comments before each one to explain what it's doing and why.
With the boot scripts you find on Red Hat, you could list them in the interface configuration file. It's used as a shell script, so we can build the string a piece at a time instead of one monster line.
# cat /etc/sysconfig/network-scripts/ifcfg-enp0s2
DEVICE=enp0s2
BOOTPROTO=static
IPADDR=10.1.1.100
NETMASK=255.255.255.0
ETHTOOL_OPTS="-s enp0s2 speed 1000 duplex full autoneg off"
ETHTOOL_OPTS="$ETHTOOL_OPTS ; -K enp0s2 tx off rx off"
ETHTOOL_OPTS="$ETHTOOL_OPTS ; -G enp0s2 rx 4096 tx 4096"
Kernel Parameters for Core Networking
The "core" networking parameters for the kernel are accessible in /proc/sys/net/core/*. A single input packet queue length is set in /proc/sys/net/core/netdev_max_backlog. That queue length is per core. Increase it to reduce the number of packets dropped during high inbound network loads. Output queues are set individually for each interface with the ip command.
First, do the math and consider that this will be a tradeoff increasing total throughput while increasing latency. Let's say you have one 1000 Mbps interface, you are using the default 1500 byte maximum packet size, and you decide that 0.1 second of latency would be acceptable:
1000 Mbps = 125,000,000 bytes/s
125,000,000 bytes/second / 1500 bytes/frame = 83,333 frames/second
83,333 frames/second × 0.1 second = 8,333 frames
maximum queue of 8,333 1500-byte frames
# echo 8333 > /proc/sys/net/core/netdev_max_backlog
# ip link set enp0s2 txqueuelen 8333
If you have upgraded to 10 Gbps and 9000-byte jumbo frames:
10,000 Mbps = 1,250,000,000 bytes/s
1,250,000,000 bytes/second / 9000 bytes/frame = 138,888 frames/second
138,888 frames/second × 0.1 second = 13,888 frames
maximum queue of 13,888 9000-byte frames
# echo 13888 > /proc/sys/net/core/netdev_max_backlog
# ip link set enp0s2 txqueuelen 13888
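Both calculations follow the same three steps, so they can be wrapped in a small function to try other speeds, frame sizes, and latency budgets. This is a sketch with a hypothetical function name, using integer arithmetic that truncates at each step the same way the worked figures do.

```shell
#!/bin/sh
# Queue length in frames for a link speed in Mbps, a frame size in bytes,
# and an acceptable latency in milliseconds.
queue_frames() {
    link_mbps=$1
    frame_bytes=$2
    latency_ms=$3
    bytes_per_second=$(( link_mbps * 1000000 / 8 ))
    frames_per_second=$(( bytes_per_second / frame_bytes ))
    echo $(( frames_per_second * latency_ms / 1000 ))
}

queue_frames 1000 1500 100     # 1 Gbps, 1500-byte frames, 0.1 s: 8333
queue_frames 10000 9000 100    # 10 Gbps, 9000-byte frames, 0.1 s: 13888
```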
However, as discussed here, it seems that the interface-specific value is only used as a default queue length for some of the available queue disciplines (or QDiscs) that can be manipulated with Linux Advanced Routing and Traffic Control and the tc command.
Making Kernel Parameter Tuning Permanent
The kernel value netdev_max_backlog could be set in a file like /etc/sysctl.d/02-netIO.conf, while the ip link command(s) would go in /etc/rc.d/rc.local.
On Red Hat derived systems you can create a script /sbin/ifup-local which will be executed as interfaces are brought up.
Positive: Runs automatically and no need to re-run rc.local.
Negative: Specific to Red Hat.
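A minimal sketch of what such an /sbin/ifup-local might look like follows. It assumes the Red Hat network scripts pass the interface name as the first argument. The interface name and the values are taken from the examples on this page and are placeholders for your own, and tune_interface is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch of /sbin/ifup-local: apply tuning to selected interfaces only.
tune_interface() {
    case "$1" in
        enp0s2)
            # Larger rings and transmit queue for the high-speed LAN interface.
            ethtool -G "$1" rx 4096 tx 4096
            ip link set "$1" txqueuelen 8333
            ;;
        *)
            # Leave every other interface at its defaults.
            ;;
    esac
}

# Only act when called with an interface name, as the network scripts do.
if [ -n "${1:-}" ]; then
    tune_interface "$1"
fi
```

The file must be executable, and keeping one case branch per interface leaves room for a comment explaining each setting.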
Any parameters to pass to the module at load time go into a file /etc/modprobe.d/*.conf, based on the output of modinfo for that module and a careful reading of /usr/src/linux/Documentation/networking/drivername.txt and/or the source code. Some parameter values will be Boolean 0 versus 1, some will be numbers, some will be strings.
# cat /etc/modprobe.d/01-ethernet.conf
options dl2k jumbo=1 mtu=9000 tx_flow=1 rx_flow=1 rx_coalesce=100 tx_coalesce=40
And next...
For more see this Linux Journal article: Queueing in the Linux Network Stack
There is nothing about IP to tune for performance, although some security improvements are discussed here. The next step up the protocol stack is to tune TCP performance.