Performance Tuning on Linux — Disk I/O
Tune Disk I/O
Disks are block devices and we can
access related kernel data structures through
Sysfs.
We can use the kernel data structures under /sys
to select and tune I/O queuing algorithms for the block
devices.
"Block" and "character" are misleading names for device types.
The important distinction is that unbuffered "character"
devices provide direct access to the device,
while buffered "block" devices are accessed through
a buffer which can greatly improve performance.
Contradicting their common names,
"character" devices can enforce block boundaries, and you
can read or write "block" devices in blocks of any size
including one byte at a time.
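You can see which type a device node is from the first character of ls -l output: "b" for block, "c" for character. For example (the device names, numbers, and dates will vary by system):
# ls -l /dev/sda /dev/zero
brw-rw----. 1 root disk 8, 0 Aug  1 09:15 /dev/sda
crw-rw-rw-. 1 root root 1, 5 Aug  1 09:15 /dev/zero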
The operating system uses RAM for both write buffer and read cache. The idea is that data to be stored is written into the write buffer, where it can be sorted or grouped. A mechanical disk needs its data sorted into order so a sequence of write operations can happen as the arm moves in one direction across the platter, instead of seeking back and forth and greatly increasing overall time. If the storage system is RAID, then the data should be grouped by RAID stripe so that one stripe can be written in one operation.
As for reading, recently-read file system data is stored in RAM. If the blocks are "clean", unmodified since the last read, then the data can be read directly from cache RAM instead of accessing the much slower mechanical disks. Reading is made more efficient by appropriate decisions about which blocks to store and which to discard.
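You can see how much RAM the kernel is currently devoting to buffering and caching with free; the figures below are purely illustrative:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        2.1G        9.8G        250M        3.7G         13G
Swap:          2.0G          0B        2.0G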
Disk Queuing Algorithms
Pending I/O events are scheduled or sorted by a queuing algorithm, also called an elevator because analogous algorithms can be used to schedule elevators most efficiently. There is no single best algorithm; the choice depends somewhat on your hardware and more on your workload.
Tuning is done by disk, not by partition, so if your first disk has partitions containing /, /boot, and /boot/efi, all three file systems must be handled the same way. Since things under /boot are needed only infrequently after booting, if ever, consider your use of your root partition to select an algorithm for all of /dev/sda.
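A quick way to see which partitions share a physical disk is lsblk; this output is illustrative for the layout described above:
# lsblk /dev/sda
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0  500G  0 disk
├─sda1   8:1    0  500M  0 part /boot/efi
├─sda2   8:2    0    1G  0 part /boot
└─sda3   8:3    0  498G  0 part /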
This foreshadows the coming file system discussion where
we want to limit the I/O per physical device.
Tuning is done with the kernel object /sys/block/sd*/queue/scheduler. You can read its current contents with cat or similar. The output lists all queuing algorithms supported by the kernel. The one currently in use is surrounded by square brackets.
# grep . /sys/block/sd*/queue/scheduler
/sys/block/sda/queue/scheduler:noop deadline [cfq]
/sys/block/sdb/queue/scheduler:noop deadline [cfq]
/sys/block/sdc/queue/scheduler:noop deadline [cfq]
/sys/block/sdd/queue/scheduler:noop deadline [cfq]
You can modify the kernel object contents and change the algorithm with echo.
# cat /sys/block/sdd/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/sdd/queue/scheduler
# cat /sys/block/sdd/queue/scheduler
noop [deadline] cfq
The shortest explanation: Use Deadline for interactive systems and NOOP for unattended computation. But read on for details on why, and other parameters to tune.
Deadline Scheduler
The deadline algorithm attempts to limit the maximum latency and keep the humans happy. Every I/O request is assigned its own deadline and it should be completed before that timer expires.
Two queues are maintained per device, one sorted by sector and the other by deadline. As long as no deadlines are expiring, the I/O requests are done in sector order to minimize head motion and provide best throughput.
Reasons to use the deadline scheduler include:
1: People use your system interactively. Your work load is dominated by interactive applications, either users who otherwise may complain of sluggish performance or databases with many I/O operations.
2: Read operations happen significantly more often than write operations, as applications are more likely to block waiting to read data.
3: Your storage hardware is a SAN (Storage Area Network) or RAID array with deep I/O buffers.
Red Hat uses deadline by default for non-SATA disks starting at RHEL 7. IBM System z uses deadline by default for all disks.
CFQ Scheduler
The CFQ or Completely Fair Queuing algorithm
first divides processes into the three classes of Real Time,
Best Effort, and Idle.
Real Time processes are served before Best Effort processes,
which in turn are served before Idle processes.
Within each class, the kernel attempts to give every thread
the same number of time slices.
Processes are assigned to the Best Effort class by default; you can change the I/O priority for a process with ionice.
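For example, ionice can report or change a process's I/O scheduling class and priority; the PID and command name here are just placeholders:
# ionice -p 2004
best-effort: prio 4
# ionice -c 1 -n 0 -p 2004     # move PID 2004 into the Real Time class at highest priority
# ionice -c 3 nightly-backup   # start a (hypothetical) command in the Idle class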
The kernel uses recent I/O patterns to anticipate whether
an application will issue more requests in the near future,
and if more I/O is anticipated, the kernel will wait
even though other processes have pending I/O.
CFQ can improve throughput at the cost of worse latency. Users are sensitive to latency and will not like the result when their applications are bound by CFQ.
Reasons to use the CFQ scheduler:
1: People do not use your system interactively, at least not much. Throughput is more important than latency, but latency is still important enough that you don't want to use NOOP.
2: You are not using XFS. According to xfs.org, the CFQ scheduler defeats much of the parallelization in XFS.
Red Hat uses this by default for SATA disks starting at RHEL 7. And they use XFS by default...
NOOP Scheduler
The NOOP scheduler does nothing to change the order or priority; it simply handles the requests in the order they were submitted.
This can provide the best throughput, especially on storage subsystems that provide their own queuing such as solid-state drives, intelligent RAID controllers with their own buffer and cache, and Storage Area Networks.
This usually makes for the worst latency, so it would be a poor choice for interactive use.
Reasons to use the noop scheduler include:
1: Throughput is your dominant concern, you don't care about latency. Users don't use the system interactively.
2: Your work load is CPU-bound: most of the time is spent waiting for the CPU to finish something, and I/O events are relatively small and widely spaced.
Both of those suggest that you are doing high-throughput unattended jobs such as data mining, scientific high-performance computing, or rendering.
Tuning the Schedulers
Recall (or learn now) that Sysfs is the hierarchy under /sys, which maps internal kernel constructs to a file system so that:
- Directories represent kernel objects,
- Files represent attributes of those objects, and
- Symbolic links represent relationships (usually identity) between objects
Different files (attributes) appear in the queue/iosched subdirectory (object) when you change the content (setting) of the queue/scheduler file (attribute).
It's easier to look at than to explain.
The directories for the disks themselves contain the same files and subdirectories, including the file queue/scheduler and the subdirectory queue/iosched/:
# ls -F /sys/block/sdb/
alignment_offset   discard_alignment   holders/   removable   stat
bdi@               events              inflight   ro          subsystem@
capability         events_async        power/     sdb1/       trace/
dev                events_poll_msecs   queue/     size        uevent
device@            ext_range           range      slaves/
# ls -F /sys/block/sdb/queue
add_random            max_hw_sectors_kb        optimal_io_size
discard_granularity   max_integrity_segments   physical_block_size
discard_max_bytes     max_sectors_kb           read_ahead_kb
discard_zeroes_data   max_segment_size         rotational
hw_sector_size        max_segments             rq_affinity
iosched/              minimum_io_size          scheduler
iostats               nomerges                 write_same_max_bytes
logical_block_size    nr_request
Let's assign three different schedulers and see what tunable parameters appear in their queue/iosched subdirectories:
# echo cfq > /sys/block/sdb/queue/scheduler
# echo deadline > /sys/block/sdc/queue/scheduler
# echo noop > /sys/block/sdd/queue/scheduler
# ls -F /sys/block/sd[bcd]/queue/iosched/
/sys/block/sdb/queue/iosched/:
back_seek_max       fifo_expire_sync   quantum          slice_idle
back_seek_penalty   group_idle         slice_async      slice_sync
fifo_expire_async   low_latency        slice_async_rq   target_latency

/sys/block/sdc/queue/iosched/:
fifo_batch   front_merges   read_expire   write_expire   writes_starved

/sys/block/sdd/queue/iosched/:
So we see that the cfq scheduler has twelve readable and tunable parameters, the deadline scheduler has five, and the noop scheduler has none (which makes sense as it's the not-scheduler).
Tuning The CFQ Scheduler
Remember that CFQ is for mostly or entirely non-interactive work where latency is a lower concern: you care some about latency, but your main concern is throughput.
Attribute | Meaning and suggested tuning
fifo_expire_async | Number of milliseconds an asynchronous request (buffered write) can remain unserviced. If lower buffered write latency is needed, either decrease from the default of 250 msec or consider switching to the deadline scheduler.
fifo_expire_sync | Number of milliseconds a synchronous request (read, or O_DIRECT unbuffered write) can remain unserviced. If lower read latency is needed, either decrease from the default of 125 msec or consider switching to the deadline scheduler.
low_latency | 0=disabled: latency is ignored, each process gets a full time slice. 1=enabled: favor fairness over throughput, enforcing a maximum wait time of 300 milliseconds for each process issuing I/O requests for a device. Select this if using CFQ with applications that require it, such as real-time media streaming.
quantum | Number of I/O requests sent to a device at one time, limiting the queue depth. Increase this to improve throughput on storage hardware with its own deep I/O buffer, such as SAN and RAID, at the cost of increased latency.
slice_idle | Length of time in milliseconds that cfq will idle while waiting for further requests. Set to 0 for solid-state drives or for external RAID with its own cache. Leave at the default of 8 milliseconds for internal non-RAID storage to reduce seek operations.
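As a sketch of how these might be applied on a host whose storage is an external RAID array with its own cache (the values are illustrative, not a universal recommendation):
# Do not idle waiting for further requests, and allow a deeper queue to the array
echo 0 > /sys/block/sdb/queue/iosched/slice_idle
echo 16 > /sys/block/sdb/queue/iosched/quantum
# Verify the new settings
grep . /sys/block/sdb/queue/iosched/slice_idle /sys/block/sdb/queue/iosched/quantum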
Tuning The Deadline Scheduler
Remember that this is for interactive work where latency above about 100 milliseconds will really bother your users. Throughput would be nice, but we must keep the latency down.
Attribute | Meaning and suggested tuning
fifo_batch | Number of read or write operations to issue in one batch. Lower values may further reduce latency. Higher values can increase throughput on rotating mechanical disks, but at the cost of worse latency. You selected the deadline scheduler to limit latency, so you probably don't want to increase this, at least not by very much.
read_expire | Number of milliseconds within which a read request should be served. Reduce this from the default of 500 to 100 on a system with interactive users.
write_expire | Number of milliseconds within which a write request should be served. Leave at the default of 5000 and let write operations be done asynchronously in the background, unless your specialized application uses many synchronous writes.
writes_starved | Number of read batches that can be processed before handling a write batch. Increase this from the default of 2 to give higher priority to read operations.
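You can read all the current deadline settings at once; the values shown below are the usual kernel defaults:
# grep . /sys/block/sdc/queue/iosched/*
/sys/block/sdc/queue/iosched/fifo_batch:16
/sys/block/sdc/queue/iosched/front_merges:1
/sys/block/sdc/queue/iosched/read_expire:500
/sys/block/sdc/queue/iosched/write_expire:5000
/sys/block/sdc/queue/iosched/writes_starved:2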
Tuning The NOOP Scheduler
Remember that this is for entirely non-interactive work where throughput is all that matters. Data mining, high-performance computing and rendering, and CPU-bound systems with fast storage.
The whole point is that NOOP isn't really a scheduler: I/O requests are handled strictly first come, first served. All we can tune are some block layer parameters in /sys/block/sd*/queue/*, which could also be tuned for the other schedulers, so...
Tuning General Block I/O Parameters
These are in /sys/block/sd*/queue/.
Attribute | Meaning and suggested tuning
max_sectors_kb | Maximum allowed size of an I/O request in kilobytes, which must be within these bounds: minimum value = max(1, logical_block_size/1024), maximum value = max_hw_sectors_kb.
nr_requests | Maximum number of read and write requests that can be queued at one time before the next process requesting a read or write is put to sleep. The default value of 128 means 128 read requests and 128 write requests can be queued at once. Larger values may increase throughput for workloads writing many small files; smaller values increase throughput with larger I/O operations. You could decrease this for latency-sensitive applications, but then you shouldn't be using NOOP if latency matters.
optimal_io_size | If non-zero, the storage device has reported its own optimal I/O size. If you are developing your own applications, make their I/O requests in multiples of this size if possible.
read_ahead_kb | Number of kilobytes the kernel will read ahead during a sequential read operation. 128 kbytes by default; if the disk is used with LVM, the device mapper may benefit from a higher value. If your workload does a lot of large streaming reads, larger values may improve performance.
rotational | Should be 0 (no) for solid-state disks, but some do not correctly report their status to the kernel. If incorrectly set to 1 for an SSD, set it to 0 to disable unneeded scheduler logic meant to reduce the number of seeks.
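For example, to check whether the kernel believes a disk is rotating, and to correct that for an SSD which mis-reports itself (sdd here is just an example device):
# cat /sys/block/sdd/queue/rotational
1
# echo 0 > /sys/block/sdd/queue/rotational
# cat /sys/block/sdd/queue/rotational
0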
Automatically Tuning the Schedulers
Sysfs is an in-memory file system; everything goes back to the defaults at the next boot. You could add settings to /etc/rc.d/rc.local:
... preceding lines omitted ...

## Added for disk tuning on this read-heavy interactive system
for DISK in sda sdb sdc sdd
do
    # Select the deadline scheduler first
    echo deadline > /sys/block/${DISK}/queue/scheduler
    # Now set deadline scheduler parameters
    echo 100 > /sys/block/${DISK}/queue/iosched/read_expire
    echo 4 > /sys/block/${DISK}/queue/iosched/writes_starved
done
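On systemd-based distributions such as RHEL 7, rc.local is run at boot only if the script is executable, so you may also need:
# chmod u+x /etc/rc.d/rc.local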
Tune Virtual Memory Management to Improve I/O Performance
This work is done in Procfs, under /proc and specifically in /proc/sys/vm/*. You can experiment interactively with echo and sysctl.
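For example, to read and then temporarily change one setting (the value 10 is only for experimentation):
# sysctl vm.dirty_ratio
vm.dirty_ratio = 20
# sysctl -w vm.dirty_ratio=10
vm.dirty_ratio = 10
# echo 10 > /proc/sys/vm/dirty_ratio    # equivalent to the sysctl -w command above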
When you have decided on a set of tuning parameters, create a new file under /etc/sysctl.d/ and enter your settings there. Leave the file /etc/sysctl.conf with the distribution's defaults; the files you add override those settings. The new file must be named *.conf, and the recommendation is that its name be two digits, a dash, a name, and then the required .conf. So, something like:
# ls /etc/sysctl*
/etc/sysctl.conf

/etc/sysctl.d:
01-diskIO.conf  02-netIO.conf
# cat /etc/sysctl.d/01-diskIO.conf
vm.dirty_ratio = 6
vm.dirty_background_ratio = 3
vm.vfs_cache_pressure = 50
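The settings are applied automatically at the next boot. To load them immediately, ask sysctl to re-read your file, or all configuration files:
# sysctl -p /etc/sysctl.d/01-diskIO.conf
vm.dirty_ratio = 6
vm.dirty_background_ratio = 3
vm.vfs_cache_pressure = 50
# sysctl --system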
Now, for the virtual memory data structures we might constructively manipulate:
Attribute | Meaning and suggested tuning
dirty_ratio | "Dirty" memory is memory waiting to be written to disk. dirty_ratio is the point at which a process generating dirty data starts writing it out itself, expressed as a percentage of the total free and reclaimable memory pages. The default of 20 is reasonable. Increase to 40 to improve throughput, or decrease to 5 to 10 to improve latency, even lower on systems with a lot of memory.
dirty_background_ratio | Similar, but this is the point at which the kernel's background flusher threads start writing out dirty data, expressed as a percentage of the total free and reclaimable memory pages. Set this lower than dirty_ratio; dirty_ratio/2 makes sense and is what the kernel does by default. Of the two, dirty_ratio has the greater effect, so tune dirty_ratio for performance and then set dirty_background_ratio to half that value.
overcommit_memory | Allows for poorly designed programs which malloc() huge amounts of memory "just in case" but never really use it. Leave this at the default of 0, heuristic overcommit, unless you really need to change it.
vfs_cache_pressure | This sets the "pressure", or the importance the kernel places on reclaiming memory used for caching directory and inode objects. The default of 100, or relatively "fair", is appropriate for compute servers. Set it lower than 100 for file servers on which the cache should be a priority. Set it higher, maybe 500 to 1000, for interactive systems.
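To see how much dirty data is currently waiting to be written out, and how much is actively being written back, watch /proc/meminfo; the numbers below are illustrative:
$ grep -E '^(Dirty|Writeback):' /proc/meminfo
Dirty:             18244 kB
Writeback:             0 kB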
There is further information in the Red Hat Enterprise Linux Performance Tuning Guide.
Also see /usr/src/linux/Documentation/sysctl/vm.txt.
Measuring Disk I/O
Once you have selected and created file systems, as discussed on the next page, you have a choice of tools for testing file system I/O.
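One widely available choice is iostat from the sysstat package, which reports per-device throughput and utilization; here it prints extended statistics in megabytes every 5 seconds:
# iostat -dxm 5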
And next...
The next step is to select appropriate file systems and their creation and use options.
If you're looking for information on the Anticipatory I/O scheduler, you are using some old references. It was removed from the 2.6.33 kernel.