Tune Ethernet Performance
You may be considering an upgrade to 10 Gbps Ethernet, but first consider: do the motherboard and its system bus have the speed to fill a 10 Gbps network? You are probably communicating file system data, so can your disk and file system I/O keep up?
The switch backplane needs bandwidth equal to 2 times the number of ports times the speed. A switch with 20 ports running at 10 Gbps full-duplex needs a 20×2×10 = 400 Gbps bus backplane to be non-blocking.
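That sizing rule is simple arithmetic; a minimal sketch using the example's numbers as parameters:

```shell
#!/bin/sh
# Non-blocking switch backplane requirement:
# every port may simultaneously send and receive at line rate,
# so the backplane needs ports * 2 * per-port speed.
ports=20
speed_gbps=10
backplane=$(( ports * 2 * speed_gbps ))
echo "Non-blocking backplane: ${backplane} Gbps"
```

Running it prints `Non-blocking backplane: 400 Gbps`, matching the example above.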
Bonding, or Not
You can bond multiple Ethernet adapters together and potentially multiply your throughput. However...
The bonding algorithm uses the XOR of the source and destination MAC address to select the sending interface. So, bonding will not increase the throughput between a pair of hosts as each will use just one physical interface to send every frame to the other host. Bonding is only helpful if your data flow topology looks more like a star or a fairly complete mesh.
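The interface selection can be sketched in shell arithmetic. This is a simplification of the bonding driver's default layer2 transmit hash, which XORs the MAC addresses and takes the result modulo the number of slave interfaces; the MAC address bytes here are made-up examples:

```shell
#!/bin/sh
# Simplified layer2 bonding hash: XOR the low bytes of the
# source and destination MAC addresses, modulo the slave count.
# For a fixed host pair the inputs never change, so the same
# slave interface carries every frame between them.
src_last=0xb6    # hypothetical source MAC ending in b6
dst_last=0x1f    # hypothetical destination MAC ending in 1f
slaves=2
index=$(( (src_last ^ dst_last) % slaves ))
echo "All frames to this destination use slave ${index}"
```

Change either MAC address and a different slave may be chosen, which is why bonding helps star or mesh traffic patterns but not a single host pair.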
If you have the appropriate data flow patterns, then bonding multiple 1 Gbps adapters makes sense compared to the price of upgrading to 10 Gbps hardware.
Bonding multiple 10 Gbps adapters gets expensive quickly, and the CPU is already awfully busy working to keep just the one 10 Gbps link filled.
Measure Recent and Current Utilization
Get link-layer statistics for all interfaces:
# ip -s link
[...]
1: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 00:11:95:1e:8e:b6 brd ff:ff:ff:ff:ff:ff
    RX: bytes   packets   errors  dropped  overrun  mcast
    8028989029  31573824  0       0        0        0
    TX: bytes   packets   errors  dropped  carrier  collsns
    3272273796  15088848  0       0        0        0
[...]
| Field | Meaning of Non-Zero Values |
|---|---|
| errors | Poorly or incorrectly negotiated mode and speed, or damaged network cable. |
| dropped | Packets dropped, possibly due to a lack of buffer memory in the kernel. |
| overrun | Number of times the network interface ran out of buffer space. |
| carrier | Damaged or poorly connected network cable, or switch problems. |
| collsns | Number of collisions, which should always be zero on a switched LAN. A non-zero value indicates problems negotiating appropriate duplex mode. A small number that never grows means it happened when the interface came up but hasn't happened since. |
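A quick way to watch just the error-related counters is to filter the output with awk. A sketch; the sample output is embedded here so the script is self-contained, but on a live system you would pipe `ip -s link show enp0s1` into the same awk program:

```shell
#!/bin/sh
# Pull the error-related counters out of `ip -s link` style output.
# Sample output is embedded below; on a real system replace the
# sample function with:  ip -s link show enp0s1
sample() {
cat <<'EOF'
1: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN
    link/ether 00:11:95:1e:8e:b6 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped  overrun  mcast
    8028989029 31573824 0       0        0        0
    TX: bytes  packets  errors  dropped  carrier  collsns
    3272273796 15088848 0       0        0        0
EOF
}
# The line after each "RX:"/"TX:" header holds the numbers;
# fields 3 and 4 of that line are errors and dropped.
sample | awk '/RX: bytes/ { getline; print "RX errors=" $3 " dropped=" $4 }
              /TX: bytes/ { getline; print "TX errors=" $3 " dropped=" $4 }'
```

Run periodically (or under `watch`), this makes a slowly growing error counter much easier to spot than rereading the full output.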
Enable Jumbo Frames
The example of protocol overhead came up back on the introductory page. Every Ethernet frame needs a standard header. Due to design optimization decisions made years ago, the default MTU, or maximum Ethernet frame length, allows for 1500 bytes of payload. The maximum frame length limits latency, the amount of time a host has to wait to transmit its frame, but that limit of 1500 bytes of payload plus header and CRC was specified when Ethernet ran at 10 Mbps. With 1,000 times the speed, frames could be much longer and still have far lower latency. The CPU would also be interrupted less often.
A jumbo frame is an Ethernet frame with more than 1500 bytes of payload. A 9000-byte MTU reduces the protocol overhead and CPU interrupts by a factor of six. Much modern Ethernet equipment can support frames up to 9,216 bytes, but make sure to verify that every device on the LAN supports your desired jumbo frame size before making any changes.
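The factor-of-six saving is just the ratio of payload sizes. A sketch of the arithmetic, with one way to probe whether a path actually passes 9000-byte frames shown in the comments (the address 10.1.1.1 is a placeholder, and 8972 is 9000 minus the 20-byte IP and 8-byte ICMP headers):

```shell
#!/bin/sh
# Every frame costs one header, one CRC, and one interrupt,
# so frames needed per unit of payload measures the overhead.
std=1500
jumbo=9000
payload=90000
echo "standard MTU: $(( payload / std )) frames, jumbo MTU: $(( payload / jumbo )) frames"
# To verify a path passes jumbo frames without fragmentation,
# ping with the don't-fragment flag and a jumbo-sized payload
# (10.1.1.1 is a placeholder address):
#   ping -M do -s 8972 10.1.1.1
```

The first line prints 60 frames versus 10, the factor of six mentioned above. If the ping fails with a "message too long" error, some device on the path does not support the jumbo MTU.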
Set this interactively with the ip command:
# ip link set enp0s2 mtu 9000
Make the setting persistent by adding it to the interface configuration file, for example MTU=9000 in a Red Hat style ifcfg file.
Performance Tuning With ethtool
Be aware that interface names have changed: it's no longer eth0, eth1, and so on, but names that encode physical location, like enp0s2. Details on the predictable network interface naming scheme can be found in the systemd documentation.
Not every option will be supported on a given network interface, and even if its chipset supports something, it's possible that the current Linux driver doesn't. So don't expect all of the following to work on your system:
Get current settings including speed and duplex mode and whether a link beat signal is detected, get driver information, and get statistics:
# ethtool enp0s2
# ethtool -i enp0s2
# ethtool -S enp0s2
Several packets in a rapid sequence can be coalesced into one interrupt passed up to the CPU, providing more CPU time for application processing.
# ethtool -c enp0s2
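At 1 Gbps a full-sized frame arrives roughly every 12 microseconds, so a saturated link without coalescing could mean tens of thousands of interrupts per second. A sketch of the arithmetic; the `ethtool -C` parameter in the comment is a common one, but support and naming are hardware- and driver-dependent:

```shell
#!/bin/sh
# Time on the wire for one frame = bits per frame / bits per second.
# 1500 bytes * 8 = 12,000 bits; at 1 Gbps (1,000 bits per microsecond)
# that is 12 microseconds per frame.
frame_us=$(( 1500 * 8 / 1000 ))
echo "one 1500-byte frame every ${frame_us} us"
# Coalescing interrupts to, say, 100 us batches several frames
# into one interrupt:
echo "frames per interrupt at 100 us: $(( 100 / frame_us ))"
# Setting it, if the driver supports it, might look like:
#   ethtool -C enp0s2 rx-usecs 100
```

This prints 12 us per frame and about 8 frames per interrupt, an eightfold reduction in interrupt load at the cost of up to 100 us of added receive latency.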
The ring buffer is another name for the driver queue.
Get the maximum receive and transmit buffer lengths and
their current settings. RX and TX report the number of
frames in the buffer; the buffer contains pointers to
frame data structures.
Change the settings to the maximum to optimize for throughput,
at the possible cost of increased latency.
On a busy system the CPU will have fewer opportunities to
add packets to the queue, increasing the likelihood that
the hardware will drain the buffer before more packets
can be queued.
# ethtool -g enp0s2
Ring parameters for enp0s2:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             512
RX Mini:        0
RX Jumbo:       0
TX:             512
# ethtool -G enp0s2 rx 4096 tx 4096
# ethtool -g enp0s2
Ring parameters for enp0s2:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Beware: This is appropriate for servers on high-speed LANs, not for personal systems with lower-speed connections. Let's say you have 256 packets of buffer. At 1,500 bytes each that's 384,000 bytes or 3,072,000 bits. Over a 1 Mbps WLAN or ISP connection, that's over 3 seconds of latency. It would be six times worse with 9,000-byte jumbo frames.
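That warning can be checked with the same arithmetic for any ring size, MTU, and link speed; a minimal sketch:

```shell
#!/bin/sh
# Worst-case queuing delay of a full ring buffer:
# (frames * bytes per frame * 8 bits) / link speed in bits per second.
frames=256
mtu=1500
link_bps=1000000        # 1 Mbps uplink
bits=$(( frames * mtu * 8 ))
echo "buffered bits: ${bits}"
echo "drain time at 1 Mbps: $(( bits / link_bps )) seconds"
```

This reproduces the 3,072,000 bits and 3 seconds above; rerun with frames=4096 or mtu=9000 to see how quickly the latency grows.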
Turn on flow control, allowing the host and the switch to pace their transmission based on current receive capability at the other end. This will reduce packet loss and it may provide a significant improvement on high-speed networks.
# ethtool -A enp0s2 rx on
# ethtool -A enp0s2 tx on
Offload all possible processing from kernel software into hardware.
# ethtool -k enp0s2
# ethtool -K enp0s2 tx-checksum-ipv4 on
# ethtool -K enp0s2 tx-checksum-ipv6 on
Segmentation Offload: TSO, USO, LSO, GSO
It may be possible to offload TCP segmentation.
The kernel gives a large segment, maybe 64 kbytes,
to the NIC.
There is some intelligence in the NIC to use a template
from the kernel's TCP/IP stack to segment the data and
add the TCP, UDP, IP, and Ethernet headers.
This appears under multiple names:
TSO or TCP Segmentation Offload,
USO or UDP Segmentation Offload,
LSO or Large Segment Offload,
GSO or Generic Segmentation Offload.
It would be done with ethtool -K and the corresponding feature flags.
Beware: segmentation offload should improve performance
across a high speed LAN, but it is likely to hurt
performance across a multi-hop WAN path.
For more details see:
Description of LSO
Linux Foundation on GSO
Performance Case Studies
Issues with LSO
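The saving can be quantified: one large offloaded buffer replaces dozens of segments that the kernel would otherwise have to build individually. A sketch of the arithmetic; 1448 bytes is a typical TCP MSS for a 1500-byte MTU with the timestamp option, and the ethtool feature names in the comment vary by driver:

```shell
#!/bin/sh
# With segmentation offload the kernel hands the NIC one large
# buffer (commonly up to 64 KB) and the NIC slices it into
# MSS-sized TCP segments.
gso_bytes=65536
mss=1448       # typical MSS: 1500 - 40 bytes IP/TCP - 12 bytes timestamp option
segments=$(( (gso_bytes + mss - 1) / mss ))   # ceiling division
echo "one offloaded buffer becomes ${segments} wire segments"
# Enabling it, where the driver supports it, might look like:
#   ethtool -K enp0s2 tso on gso on
```

So each 64 KB handoff saves the kernel the per-segment work of building 46 headers, one reason the win appears on a fast LAN where the CPU is the bottleneck.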
There has been a general trend toward larger and larger
buffers at many places in protocol stacks across the Internet,
hurting both latency and throughput.
As Vint Cerf put it,
"The key issue we've been talking about is that all this
excessive buffering ends up breaking many of the
timeout mechanisms built into our network protocols."
See these articles for background:
"Bufferbloat: Dark Buffers in the Internet"
CACM Vol 55 No 1 pp 57-65 "BufferBloat: What's Wrong With the Internet?"
CACM Vol 55 No 2 pp 40-47 "Controlling Queue Delay"
ACM Queue Vol 10 No 5
The Byte Queue Limits or BQL algorithm has been added
to deal with this.
Kernel data structures under /sys/class/net/*/queues/tx-*/byte_queue_limits/ control it.
It's self-tuning, so you probably don't want to mess with this,
but to improve latency on low-speed networks you might put a
smaller value in the limit_max file there.
Making ethtool Settings Permanent
You could simply put the appropriate sequence of ethtool commands
in a boot-time script such as rc.local.
That is my preferred method: the actual commands appear there
in order, and I can insert comments before each one to explain
what it's doing and why.
With the boot scripts you find on Red Hat, you could instead list them in the interface configuration file. It's used as a shell script, so we can build the string a piece at a time instead of one monster line.
# cat /etc/sysconfig/network-scripts/ifcfg-enp0s2
DEVICE=enp0s2
BOOTPROTO=static
IPADDR=10.1.1.100
NETMASK=255.255.255.0
ETHTOOL_OPTS="-s enp0s2 speed 1000 duplex full autoneg off"
ETHTOOL_OPTS="$ETHTOOL_OPTS ; -K enp0s2 tx off rx off"
ETHTOOL_OPTS="$ETHTOOL_OPTS ; -G enp0s2 rx 4096 tx 4096"
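The boot-script alternative might look like the following sketch. The interface name and values are examples carried over from above, and on systemd-based distributions the rc-local service must be enabled for this file to run:

```shell
#!/bin/sh
# /etc/rc.d/rc.local -- runs once at boot, after networking is up.
# Each command gets a comment explaining why it is here.

# Autonegotiation misbehaves with this particular switch port:
ethtool -s enp0s2 speed 1000 duplex full autoneg off

# This NIC's checksum offload is unreliable under load:
ethtool -K enp0s2 tx off rx off

# Optimize for throughput, accepting some added latency:
ethtool -G enp0s2 rx 4096 tx 4096
```

The advantage over the ifcfg string is exactly what the text above describes: the commands read in order, with room for comments.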
Kernel Parameters for Core Networking
The "core" networking parameters for the kernel
are accessible in
A single input packet queue length is set in
That queue length is per core.
Increase it to reduce the number of packets dropped
during high inbound network loads.
Output queues are set individually for each interface with the ip link command.
First, do the math and consider that this will be a tradeoff increasing total throughput while increasing latency. Let's say you have one 1000 Mbps interface, you are using the default 1500 byte maximum packet size, and you decide that 0.1 second of latency would be acceptable:
1000 Mbps = 125,000,000 bytes/s
125,000,000 bytes/second / 1500 bytes/frame = 83,333 frames/second
83,333 frames/second × 0.1 second = 8,333 frames
maximum queue of 8,333 1500-byte frames
# echo 8333 > /proc/sys/net/core/netdev_max_backlog
# ip link set enp0s2 qlen 8333
If you have upgraded to 10 Gbps and 9000-byte jumbo frames:
10,000 Mbps = 1,250,000,000 bytes/s
1,250,000,000 bytes/second / 9000 bytes/frame = 138,888 frames/second
138,888 frames/second × 0.1 second = 13,888 frames
maximum queue of 13,888 9000-byte frames
# echo 13888 > /proc/sys/net/core/netdev_max_backlog
# ip link set enp0s2 qlen 13888
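The two worked examples follow one formula, which can be scripted. A sketch with the 0.1-second latency budget baked in; change the divisor to use a different budget:

```shell
#!/bin/sh
# Queue length = (link bytes/second / bytes per frame) * latency budget.
# With a 0.1 s budget that is a final division by 10.
backlog() {
    mbps=$1
    mtu=$2
    echo $(( mbps * 1000000 / 8 / mtu / 10 ))
}
echo "1 Gbps, 1500-byte frames:  $(backlog 1000 1500)"
echo "10 Gbps, 9000-byte frames: $(backlog 10000 9000)"
```

This reproduces the 8,333 and 13,888 figures above, and makes it easy to redo the math after any change of link speed or MTU.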
However, as discussed in the Linux Advanced Routing and Traffic Control documentation, it seems that the interface-specific value is only used as a default queue length for some of the available queue disciplines (or QDiscs) that can be attached to an interface.
Making Kernel Parameter Tuning Permanent
The kernel value netdev_max_backlog could be set in a file like /etc/sysctl.conf. The ip link command(s) would go in a boot-time script such as rc.local.
On Red Hat derived systems you can create a script
/sbin/ifup-local which will be executed
as interfaces are brought up.
Positive: It runs automatically, with no need to re-run anything by hand after a reboot.
Negative: It's specific to Red Hat.
Any parameters to pass to the module at load time go into
/etc/modprobe.d/*.conf, based on the
modinfo output for that module and a
careful reading of its documentation
and/or the source code.
Some parameter values will be 0/1 Boolean, some will be
numbers, some will be strings, ...
# cat /etc/modprobe.d/01-ethernet.conf
options dl2k jumbo=1 mtu=9000 tx_flow=1 rx_flow=1 rx_coalesce=100 tx_coalesce=40
For more see this Linux Journal article: Queueing in the Linux Network Stack
There is nothing about IP to tune for performance, although some security improvements are discussed here. The next step up the protocol stack is to tune TCP performance.