Performance Tuning on Linux — Disk I/O
Tune Disk I/O
Disks are block devices and we can
access related kernel data structures through
Sysfs.
We can use the kernel data structures under /sys
to select and tune I/O queuing algorithms for the block
devices.
"Block" and "character" are misleading names for device types.
The important distinction is that unbuffered "character"
devices provide direct access to the device,
while buffered "block" devices are accessed through
a buffer which can greatly improve performance.
Contradicting their common names,
"character" devices can enforce block boundaries, and you
can read or write "block" devices in blocks of any size
including one byte at a time.
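You can see which type a device node is from the first character of ls -l output: "b" for block, "c" for character. For example (the device names, numbers, and dates will vary by system):
# ls -l /dev/sda /dev/zero
brw-rw----. 1 root disk 8, 0 Aug  1 09:15 /dev/sda
crw-rw-rw-. 1 root root 1, 5 Aug  1 09:15 /dev/zero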
The operating system uses RAM for both write buffer and read cache. The idea is that data to be stored is written into the write buffer, where it can be sorted or grouped. A mechanical disk needs its data sorted into order so a sequence of write operations can happen as the arm moves in one direction across the platter, instead of seeking back and forth and greatly increasing overall time. If the storage system is RAID, then the data should be grouped by RAID stripe so that one stripe can be written in one operation.
As for reading, recently-read file system data is stored in RAM. If the blocks are "clean", unmodified since the last read, then the data can be read directly from cache RAM instead of accessing the much slower mechanical disks. Reading is made more efficient by appropriate decisions about which blocks to store and which to discard.
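You can see how much RAM the kernel is currently devoting to buffering and caching with free; the figures below are purely illustrative:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        2.1G        9.8G        250M        3.7G         13G
Swap:          2.0G          0B        2.0G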
Disk Queuing Algorithms
Pending I/O events are scheduled or sorted by a queuing algorithm, also called an elevator because analogous algorithms can be used to schedule elevators most efficiently. There is no single best algorithm; the choice depends somewhat on your hardware and more on your workload.
Tuning is done by disk, not by partition, so if your first disk has partitions containing /, /boot, and /boot/efi, all three file systems must be handled the same way. Since things under /boot are needed only infrequently after booting, if ever, consider your use of your root partition to select an algorithm for all of /dev/sda.
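A quick way to see which partitions share a physical disk is lsblk; this output is illustrative for the layout described above:
# lsblk /dev/sda
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0  500G  0 disk
├─sda1   8:1    0  500M  0 part /boot/efi
├─sda2   8:2    0    1G  0 part /boot
└─sda3   8:3    0  498G  0 part /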
This foreshadows the coming file system discussion where
we want to limit the I/O per physical device.
Tuning is done with the kernel object /sys/block/sd*/queue/scheduler. You can read its current contents with cat or similar. The output lists all queuing algorithms supported by the kernel. The one currently in use is surrounded by square brackets.
# grep . /sys/block/sd*/queue/scheduler
/sys/block/sda/queue/scheduler:noop deadline [cfq]
/sys/block/sdb/queue/scheduler:noop deadline [cfq]
/sys/block/sdc/queue/scheduler:noop deadline [cfq]
/sys/block/sdd/queue/scheduler:noop deadline [cfq]
You can modify the kernel object contents and change the algorithm with echo.
# cat /sys/block/sdd/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/sdd/queue/scheduler
# cat /sys/block/sdd/queue/scheduler
noop [deadline] cfq
The shortest explanation: Use Deadline for interactive systems and NOOP for unattended computation. But read on for details on why, and other parameters to tune.
Deadline Scheduler
The deadline algorithm attempts to limit the maximum latency and keep the humans happy. Every I/O request is assigned its own deadline and it should be completed before that timer expires.
Two queues are maintained per device, one sorted by sector and the other by deadline. As long as no deadlines are expiring, the I/O requests are done in sector order to minimize head motion and provide best throughput.
Reasons to use the deadline scheduler include:
1: People use your system interactively. Your work load is dominated by interactive applications, either users who otherwise may complain of sluggish performance or databases with many I/O operations.
2: Read operations happen significantly more often than write operations, as applications are more likely to block waiting to read data.
3: Your storage hardware is a SAN (Storage Area Network) or RAID array with deep I/O buffers.
Red Hat uses deadline by default for non-SATA disks starting at RHEL 7. IBM System z uses deadline by default for all disks.
CFQ Scheduler
The CFQ or Completely Fair Queuing algorithm
first divides processes into the three classes of Real Time,
Best Effort, and Idle.
Real Time processes are served before Best Effort processes,
which in turn are served before Idle processes.
Within each class, the kernel attempts to give every thread
the same number of time slices.
Processes are assigned to the Best Effort class by default; you can change the I/O priority for a process with ionice.
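For example, ionice can report or change a process's I/O scheduling class and priority; the PID and command name here are just placeholders:
# ionice -p 2004
best-effort: prio 4
# ionice -c 1 -n 0 -p 2004     # move PID 2004 into the Real Time class at highest priority
# ionice -c 3 nightly-backup   # start a (hypothetical) command in the Idle class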
The kernel uses recent I/O patterns to anticipate whether
an application will issue more requests in the near future,
and if more I/O is anticipated, the kernel will wait
even though other processes have pending I/O.
CFQ can improve throughput at the cost of worse latency. Users are sensitive to latency and will not like the result when their applications are bound by CFQ.
Reasons to use the CFQ scheduler:
1: People do not use your system interactively, at least not much. Throughput is more important than latency, but latency is still important enough that you don't want to use NOOP.
2: You are not using XFS. According to xfs.org, the CFQ scheduler defeats much of the parallelization in XFS.
Red Hat uses this by default for SATA disks starting at RHEL 7. And they use XFS by default...
NOOP Scheduler
The NOOP scheduler does nothing to change the order or priority; it simply handles the requests in the order they were submitted.
This can provide the best throughput, especially on storage subsystems that provide their own queuing such as solid-state drives, intelligent RAID controllers with their own buffer and cache, and Storage Area Networks.
This usually makes for the worst latency, so it would be a poor choice for interactive use.
Reasons to use the noop scheduler include:
1: Throughput is your dominant concern, you don't care about latency. Users don't use the system interactively.
2: Your work load is CPU-bound: most of the time is spent waiting for the CPU to finish something, and I/O events are relatively small and widely spaced.
Both of those suggest that you are doing high-throughput unattended jobs such as data mining, scientific high-performance computing, or rendering.
Tuning the Schedulers
Recall (or learn now) that Sysfs is the hierarchy under /sys, which maps internal kernel constructs to a file system so that:
- Directories represent kernel objects,
- Files represent attributes of those objects, and
- Symbolic links represent relationships (usually identity) between objects
Different files (attributes) appear in the queue/iosched subdirectory (object) when you change the content (setting) of the queue/scheduler file (attribute).
It's easier to look at than to explain.
The directories for the disks themselves contain the same files and subdirectories, including the file queue/scheduler and the subdirectory queue/iosched/:
# ls -F /sys/block/sdb/
alignment_offset   discard_alignment   holders/   removable   stat
bdi@               events              inflight   ro          subsystem@
capability         events_async        power/     sdb1/       trace/
dev                events_poll_msecs   queue/     size        uevent
device@            ext_range           range      slaves/
# ls -F /sys/block/sdb/queue
add_random            max_hw_sectors_kb        optimal_io_size
discard_granularity   max_integrity_segments   physical_block_size
discard_max_bytes     max_sectors_kb           read_ahead_kb
discard_zeroes_data   max_segment_size         rotational
hw_sector_size        max_segments             rq_affinity
iosched/              minimum_io_size          scheduler
iostats               nomerges                 write_same_max_bytes
logical_block_size    nr_request
Let's assign three different schedulers and see what tunable parameters appear in their queue/iosched subdirectories:
# echo cfq > /sys/block/sdb/queue/scheduler
# echo deadline > /sys/block/sdc/queue/scheduler
# echo noop > /sys/block/sdd/queue/scheduler
# ls -F /sys/block/sd[bcd]/queue/iosched/
/sys/block/sdb/queue/iosched/:
back_seek_max       fifo_expire_sync   quantum          slice_idle
back_seek_penalty   group_idle         slice_async      slice_sync
fifo_expire_async   low_latency        slice_async_rq   target_latency

/sys/block/sdc/queue/iosched/:
fifo_batch   front_merges   read_expire   write_expire   writes_starved

/sys/block/sdd/queue/iosched/:
So we see that the cfq scheduler has twelve readable and tunable parameters, the deadline scheduler has five, and the noop scheduler has none (which makes sense as it's the not-scheduler).
Tuning The CFQ Scheduler
Remember that CFQ is for mostly or entirely non-interactive work where latency is a lower concern: you care some about latency, but your main concern is throughput.
Attribute | Meaning and suggested tuning
fifo_expire_async | Number of milliseconds an asynchronous request (buffered write) can remain unserviced. If lower buffered write latency is needed, either decrease from the default of 250 msec or consider switching to the deadline scheduler.
fifo_expire_sync | Number of milliseconds a synchronous request (read, or O_DIRECT unbuffered write) can remain unserviced. If lower read latency is needed, either decrease from the default of 125 msec or consider switching to the deadline scheduler.
low_latency | 0=disabled: latency is ignored, each process gets a full time slice. 1=enabled: favor fairness over throughput, enforcing a maximum wait time of 300 milliseconds for each process issuing I/O requests for a device. Select this if using CFQ with applications that require it, such as real-time media streaming.
quantum | Number of I/O requests sent to a device at one time, limiting the queue depth. Increase this to improve throughput on storage hardware with its own deep I/O buffer, such as SAN and RAID, at the cost of increased latency.
slice_idle | Length of time in milliseconds that cfq will idle while waiting for further requests. Set to 0 for solid-state drives or for external RAID with its own cache. Leave at the default of 8 milliseconds for internal non-RAID storage to reduce seek operations.
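As a sketch of how these might be applied on a host whose storage is an external RAID array with its own cache (the values are illustrative, not a universal recommendation):
# Do not idle waiting for further requests, and allow a deeper queue to the array
echo 0 > /sys/block/sdb/queue/iosched/slice_idle
echo 16 > /sys/block/sdb/queue/iosched/quantum
# Verify the new settings
grep . /sys/block/sdb/queue/iosched/slice_idle /sys/block/sdb/queue/iosched/quantum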
Tuning The Deadline Scheduler
Remember that this is for interactive work where latency above about 100 milliseconds will really bother your users. Throughput would be nice, but we must keep the latency down.
Attribute | Meaning and suggested tuning
fifo_batch | Number of read or write operations to issue in one batch. Lower values may further reduce latency. Higher values can increase throughput on rotating mechanical disks, but at the cost of worse latency. You selected the deadline scheduler to limit latency, so you probably don't want to increase this, at least not by very much.
read_expire | Number of milliseconds within which a read request should be served. Reduce this from the default of 500 to 100 on a system with interactive users.
write_expire | Number of milliseconds within which a write request should be served. Leave at the default of 5000 and let write operations be done asynchronously in the background, unless your specialized application uses many synchronous writes.
writes_starved | Number of read batches that can be processed before handling a write batch. Increase this from the default of 2 to give higher priority to read operations.
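You can read all the current deadline settings at once; the values shown below are the usual kernel defaults:
# grep . /sys/block/sdc/queue/iosched/*
/sys/block/sdc/queue/iosched/fifo_batch:16
/sys/block/sdc/queue/iosched/front_merges:1
/sys/block/sdc/queue/iosched/read_expire:500
/sys/block/sdc/queue/iosched/write_expire:5000
/sys/block/sdc/queue/iosched/writes_starved:2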
Tuning The NOOP Scheduler
Remember that this is for entirely non-interactive work where throughput is all that matters. Data mining, high-performance computing and rendering, and CPU-bound systems with fast storage.
The whole point is that NOOP isn't really a scheduler: I/O requests are handled strictly first come, first served. All we can tune are some block layer parameters in /sys/block/sd*/queue/*, which could also be tuned for the other schedulers, so...
Tuning General Block I/O Parameters
These are in /sys/block/sd*/queue/.
Attribute | Meaning and suggested tuning
max_sectors_kb | Maximum allowed size of an I/O request in kilobytes, which must be within these bounds: minimum value = max(1, logical_block_size/1024), maximum value = max_hw_sectors_kb.
nr_requests | Maximum number of read and write requests that can be queued at one time before the next process requesting a read or write is put to sleep. The default value of 128 means 128 read requests and 128 write requests can be queued at once. Larger values may increase throughput for workloads writing many small files; smaller values increase throughput with larger I/O operations. You could decrease this for latency-sensitive applications, but then you shouldn't be using NOOP if latency matters.
optimal_io_size | If non-zero, the storage device has reported its own optimal I/O size. If you are developing your own applications, make their I/O requests in multiples of this size if possible.
read_ahead_kb | Number of kilobytes the kernel will read ahead during a sequential read operation. 128 kbytes by default; if the disk is used with LVM, the device mapper may benefit from a higher value. If your workload does a lot of large streaming reads, larger values may improve performance.
rotational | Should be 0 (no) for solid-state disks, but some do not correctly report their status to the kernel. If incorrectly set to 1 for an SSD, set it to 0 to disable unneeded scheduler logic meant to reduce the number of seeks.
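For example, to check whether the kernel believes a disk is rotating, and to correct that for an SSD which mis-reports itself (sdd here is just an example device):
# cat /sys/block/sdd/queue/rotational
1
# echo 0 > /sys/block/sdd/queue/rotational
# cat /sys/block/sdd/queue/rotational
0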
Automatically Tuning the Schedulers
Sysfs is an in-memory file system; everything goes back to the defaults at the next boot. You could add settings to /etc/rc.d/rc.local:
... preceding lines omitted ...

## Added for disk tuning on this read-heavy interactive system
for DISK in sda sdb sdc sdd
do
    # Select the deadline scheduler first
    echo deadline > /sys/block/${DISK}/queue/scheduler
    # Now set deadline scheduler parameters
    echo 100 > /sys/block/${DISK}/queue/iosched/read_expire
    echo 4 > /sys/block/${DISK}/queue/iosched/writes_starved
done
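On systemd-based distributions such as RHEL 7, rc.local is run at boot only if the script is executable, so you may also need:
# chmod u+x /etc/rc.d/rc.local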
Tune Virtual Memory Management to Improve I/O Performance
This work is done in Procfs, under /proc and specifically in /proc/sys/vm/*. You can experiment interactively with echo and sysctl.
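For example, to read and then temporarily change one setting (the value 10 is only for experimentation):
# sysctl vm.dirty_ratio
vm.dirty_ratio = 20
# sysctl -w vm.dirty_ratio=10
vm.dirty_ratio = 10
# echo 10 > /proc/sys/vm/dirty_ratio    # equivalent to the sysctl -w command above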
When you have decided on a set of tuning parameters, create a new file under /etc/sysctl.d/ and enter your settings there. Leave the file /etc/sysctl.conf with the distribution's defaults; the files you add override those settings. The new file must be named *.conf, and the recommendation is that its name be two digits, a dash, a name, and then the required .conf. So, something like:
# ls /etc/sysctl*
/etc/sysctl.conf

/etc/sysctl.d:
01-diskIO.conf  02-netIO.conf
# cat /etc/sysctl.d/01-diskIO.conf
vm.dirty_ratio = 6
vm.dirty_background_ratio = 3
vm.vfs_cache_pressure = 50
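The settings are applied automatically at the next boot. To load them immediately, ask sysctl to re-read your file, or all configuration files:
# sysctl -p /etc/sysctl.d/01-diskIO.conf
vm.dirty_ratio = 6
vm.dirty_background_ratio = 3
vm.vfs_cache_pressure = 50
# sysctl --system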
Now, for the virtual memory data structures we might constructively manipulate:
Attribute | Meaning and suggested tuning
dirty_ratio | "Dirty" memory is memory waiting to be written to disk. dirty_ratio is the point at which a process generating dirty data starts writing it out itself, expressed as a percentage of the total free and reclaimable memory pages. The default of 20 is reasonable. Increase to 40 to improve throughput, or decrease to 5 to 10 to improve latency, even lower on systems with a lot of memory.
dirty_background_ratio | Similar, but this is the point at which the kernel's background flusher threads start writing out dirty data, expressed as a percentage of the total free and reclaimable memory pages. Set this lower than dirty_ratio; dirty_ratio/2 makes sense and is what the kernel does by default. Of the two, dirty_ratio has the greater effect, so tune dirty_ratio for performance and then set dirty_background_ratio to half that value.
overcommit_memory | Allows for poorly designed programs which malloc() huge amounts of memory "just in case" but never really use it. Leave this at the default of 0, heuristic overcommit, unless you really need to change it.
vfs_cache_pressure | This sets the "pressure", or the importance the kernel places on reclaiming memory used for caching directory and inode objects. The default of 100, or relatively "fair", is appropriate for compute servers. Set it lower than 100 for file servers on which the cache should be a priority. Set it higher, maybe 500 to 1000, for interactive systems.
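To see how much dirty data is currently waiting to be written out, and how much is actively being written back, watch /proc/meminfo; the numbers below are illustrative:
$ grep -E '^(Dirty|Writeback):' /proc/meminfo
Dirty:             18244 kB
Writeback:             0 kB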
There is further information in the Red Hat Enterprise Linux Performance Tuning Guide.
Also see /usr/src/linux/Documentation/sysctl/vm.txt.
Measuring Disk I/O
Once you have selected and created file systems, as discussed on the next page, you have a choice of tools for testing file system I/O.
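One widely available choice is iostat from the sysstat package, which reports per-device throughput and utilization; here it prints extended statistics in megabytes every 5 seconds:
# iostat -dxm 5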
And next...
The next step is to select appropriate file systems and their creation and use options.
If you're looking for information on the Anticipatory I/O scheduler, you are using some old references. It was removed from the 2.6.33 kernel.