Performance Tuning on Linux — File Systems
Create and Tune File Systems
Divide your overall storage hierarchy across multiple file systems to improve performance, support better security choices, and make upgrades and expansions easier. We will see how to do that, properly sizing the volumes to avoid fragmentation. Then choose a file system type (Ext4, XFS, or Btrfs), create it with appropriate parameters, and use its journal appropriately. Finally, mount the file system with options that can improve performance.
The classic practical study is File Layout and File System Performance by Keith Smith and Margo Seltzer at Harvard University. It dates from 1994 and is based on a study of FFS, the BSD Fast File System, but their measurements are still appropriate and their analysis still relevant.
Use Multiple File Systems
Improve performance through balanced I/O and security through finer granularity of file system attributes. Create independent file systems on separate physical volumes.
Use a separate file system for /home. On a file server holding user data, create multiple file systems for /home1, /home2, etc., with home directories symbolically linked from /home, as the number of users and their file I/O increases.
Use a separate file system for /var, or for subdirectories like /var/www on web servers, /var/spool on mail and print servers, and /var/log on centralized log servers. The last of those may make sense even on busy servers logging lots of data locally.
Create one for /opt if you are using Oracle or other software that uses that area, or for any other specialized local data stores you need.
Consider making /tmp a separate file system. You might even make it a RAM-based file system, as simply as making /tmp a symbolic link to /dev/shm, or with an explicit tmpfs mount as sketched below.
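A minimal sketch of the explicit tmpfs approach, as an /etc/fstab entry. This assumes you accept losing everything in /tmp at each reboot, and the size cap shown is an arbitrary example:
tmpfs  /tmp  tmpfs  defaults,noatime,mode=1777,size=2g  0 0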
These separate file systems can also be mounted with different options for security.
Multiple file systems can limit the integrity loss caused by a disk crash. When it is time to restore from backup, both the backups and the restores will have been made more efficient by the multiple independent file systems.
Choose a File System Type
Current appropriate choices are Ext4 (the default in RHEL/CentOS 6), XFS (the default in RHEL/CentOS 7), and Btrfs.
Ext4 is still adequate for many organizations, but XFS brings advantages in scale and performance.
Oracle bought Sun Microsystems so you could buy your Oracle database, the Solaris operating system, and the UltraSPARC platform on which it all runs in a single purchase. But they were soon advocating running Solaris on x86, and soon after that their recommendation was to run Linux on x86_64/AMD64 hardware.
They bought the manufacturer of the Solaris OS and the UltraSPARC platform, but before long they were suggesting a free operating system on someone else's hardware. So why did Oracle buy Sun?
My theory was that Oracle was really buying the ZFS file system, and Solaris and UltraSPARC were just the package it came in.
Now Oracle is supporting the development of the Btrfs file system, which promises to be an alternative to ZFS with similar enterprise scale, performance, and feature set.
When Red Hat Enterprise Linux 7 came out in mid-2014, it used XFS by default and featured Btrfs as a "technology preview". Other distributions have followed suit. Theodore Ts'o, the principal developer of the Ext3 and Ext4 file systems, has stated that while Ext4 brought improvements in performance and feature set, it is a "stop-gap" on the way to Btrfs.
Meanwhile, there is the ZFS on Linux project. Ubuntu 16.04 was released with ZFS support; see the announcement and the support page.
Create File Systems
There are tools to convert one file system into another in place, or at least to convert Ext3 into Ext4, and to convert Ext3 or Ext4 into XFS. However, you should instead back up your data and create a fresh file system of the newer type, to better optimize the metadata layout and take advantage of the newer file system's performance improvements. Backup and restore into a fresh file system also eliminates much of the fragmentation that will have crept in through modifications, deletions, and additions on the existing file system.
Match the file system block size to the block size of the underlying storage, or a multiple thereof. Other than block size, you probably don't want to adjust anything when creating any of these file systems. Maybe add a label and/or UUID, but that's it in most cases.
By default Ext4 uses heuristics to select a block size of 1024, 2048, or 4096 bytes, although it will be 4096 on all file systems over 512 MBytes in size unless overridden by /etc/mke2fs.conf or command-line options.
# mkfs.ext4 -b 4096 /dev/sde1
XFS allows you to specify the block size and defaults to 4096 bytes. Beware that the Linux kernel can only mount an XFS file system whose block size is no larger than the system page size (typically 4096 bytes on x86-64), so a larger block size, as in this example, is only useful in special situations.
# mkfs.xfs -b size=8192 /dev/sde1
With Btrfs, the analogous parameter is the node size or tree block size. The default is the system page size or 16 kbytes (16384 bytes), whichever is larger. Specify it if you want something different.
# mkfs.btrfs -n 65536 /dev/sde1
If your storage is RAID, create file systems with matching block sizes so that aligned full-stripe writes can be written directly to the disk. This is where it gets complicated. See the manual page for mkfs.ext4 for an example of calculating the correct stride and stripe_width. On XFS, it's sunit, the stripe unit size expressed as a number of 512-byte sectors.
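As a sketch of the arithmetic, assume a hypothetical RAID 5 array /dev/md0 with a 64 KiB chunk size and four data disks. With 4096-byte file system blocks, the Ext4 stride is 65536 / 4096 = 16 blocks, and the stripe width is 16 x 4 = 64 blocks. The mkfs.xfs command takes the stripe geometry more directly, as a stripe unit size and a number of data disks:
# mkfs.ext4 -b 4096 -E stride=16,stripe_width=64 /dev/md0
# mkfs.xfs -d su=64k,sw=4 /dev/md0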
Limit Fragmentation
When you start with a nearly empty file system, new files can be stored in an optimal way. An individual file can be stored as a contiguous series of extents (or allocated blocks), and when you install a package or otherwise extract an archive the component files will probably be stored close to each other. Later access to multiple files from that package or archive can be done with minimal head motion.
The problem is that as you delete or truncate existing files, new "holes" of free space appear. Free space is becoming fragmented.
When you append data to an existing file, it is often the case that the following blocks are already in use so a new extent must be created somewhere else within the file system. Individual files are becoming fragmented.
As you add new files, there is no longer space to store them at physical locations close to the other files in the same directories. Files are being scattered across the volume.
These problems become worse as the file system ages, and especially so as it becomes more full.
Preëmptive Techniques for Limiting Fragmentation
Many of the modern file systems preallocate longer extents to files that are being appended to. Ext4, XFS, and Btrfs do delayed allocation or allocate-on-flush, which holds data in memory until it must be flushed. This batches block and extent allocations into larger groups, reducing both fragmentation and CPU interruptions.
Btrfs includes what it calls automatic on-line defragmentation, although it's really fragmentation avoidance. Mount the file system with the autodefrag option, and it will detect small random writes into files and queue them for the defrag process. This works well for small files. However, it hurts performance with workloads like large databases or virtual machine disk images, as those situations have many small writes within one large file.
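If you want to experiment with it, a minimal sketch (the device and mount point here are assumptions, not from any particular setup):
# mount -o autodefrag /dev/sde1 /data
Add autodefrag to the options field of the file system's /etc/fstab entry once you are satisfied with the results.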
Careful design of applications can help. A common example is a BitTorrent client, which gradually fills in a large file one small piece at a time in random order, or a media processing program that takes a long time to create a large file sequentially but knows its eventual size in advance. The fallocate() or posix_fallocate() system call can preallocate the file to its expected size, minimizing its fragmentation.
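The same mechanism is available from the shell through the fallocate(1) utility from util-linux. As a sketch, with a hypothetical 2 GB output file whose final size is known in advance:
$ fallocate -l 2G video-output.mkv
The application can then write into the preallocated extents in any order without scattering the file across the volume.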
Retroactive Techniques for Reducing Fragmentation
Many file systems have defragmentation tools to reorder the extents of files, and some of them can reduce file scattering.
The easiest and fastest way to get a less fragmented file system is to back up all your data to another volume. Then create a new file system within the old volume and copy the data back into place. But to measure fragmentation and work to defragment it in place:
Ext4 Fragmentation and Defragmentation
You can measure the fragmentation of free space with e2freefrag. It displays an overview and a histogram of free extent sizes.
# e2freefrag /dev/sdb1
Device: /dev/sdb1
Blocksize: 4096 bytes
Total blocks: 244190390
Free blocks: 92461513 (37.9%)

Min. free extent: 4 KB
Max. free extent: 2064256 KB
Avg. free extent: 91568 KB
Num. free extent: 4039

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :           237           237    0.00%
    8K...   16K-  :           130           286    0.00%
   16K...   32K-  :            77           386    0.00%
   32K...   64K-  :            77           938    0.00%
   64K...  128K-  :           132          3254    0.00%
  128K...  256K-  :           152          6707    0.01%
  256K...  512K-  :           540         51111    0.06%
  512K... 1024K-  :           972        145906    0.16%
    1M...    2M-  :           542        169617    0.18%
    2M...    4M-  :           223        155107    0.17%
    4M...    8M-  :           243        415689    0.45%
    8M...   16M-  :            32         89916    0.10%
   16M...   32M-  :            13         72453    0.08%
   32M...   64M-  :            26        323374    0.35%
   64M...  128M-  :           309       7572979    8.19%
  128M...  256M-  :            78       3249618    3.51%
  256M...  512M-  :            70       6185074    6.69%
  512M... 1024M-  :            40       7921928    8.57%
    1G...    2G-  :           146      66096934   71.49%
Use filefrag to display the number of extents into which a file has been fragmented.
# filefrag CentOS-x86_64.iso
CentOS-x86_64.iso: 28 extents found
Use e4defrag with the -c option for a report of how fragmented the file system is.
# e4defrag -c /home
<Fragmented files>                               now/best     size/ext
1. /home/vmware/RHEL6/RHEL6-1-s007.vmdk           2/1             6 KB
2. /home/vmware/Solaris-10/Solaris-10-s003.vmdk   2/1            32 KB
3. /home/vmware/Solaris-10/Solaris-10-s002.vmdk   2/1            32 KB
4. /home/vmware/Solaris-10/Solaris-10-s004.vmdk   2/1            32 KB
5. /home/vmware/Solaris-10/Solaris-10-s001.vmdk   2/1            32 KB

 Total/best extents                             59433/50512
 Average size per extent                        9950 KB
 Fragmentation score                            0
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (/home) does not need defragmentation.
 Done.
You can use e4defrag without -c if you decide that defragmenting the file system is worth what will be a long wait.
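For example, to defragment everything under /home:
# e4defrag /home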
XFS Fragmentation and Defragmentation
Measure the current level of fragmentation with xfs_db.
# xfs_db -c frag -r /dev/sdb1
Defragment XFS with xfs_fsr.
# xfs_fsr /dev/sdb1
Btrfs Fragmentation and Defragmentation
You use the btrfs command to do almost everything with Btrfs. For example, you can defragment a file system or individual files and directories.
# btrfs filesystem defragment /home
# btrfs filesystem defragment /usr/local/ISOs/*.iso
Choose Appropriate Mount Options
The easiest way to get significant performance improvements is to disable access times. This can make an especially large improvement on workstations, where various components of the graphical desktop are doing all sorts of searching and indexing behind the scenes.
Every file has three timestamps: modify is when its content was last changed, change is when its metadata (e.g., permissions) was last changed, and access is when it was last read.
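You can view all three with the stat command; for example, with a hypothetical file name:
$ stat notes.txt
The output includes Access, Modify, and Change lines showing the three timestamps.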
So, a command like find can cause an enormous amount of writing to a file system, not modifying any files or directories but updating all the access times of the directories.
Use the mount options noatime and nodiratime to disable updating the access times of files and directories, respectively (on Linux, noatime implies nodiratime). This disables something needed for POSIX compliance, but there are few uses for it.
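You can test the effect on a live system before editing /etc/fstab. A sketch, assuming /home is already a separate file system:
# mount -o remount,noatime /home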
There are warnings that disabling access times can confuse some applications; the Mutt text-based e-mail client seems to be the standard example. But Mutt's last stable release was in June 2007, the last development release was in 2013, and there has been nothing but bug fixes since then. The latest entry on its news page is about a bug-fix release in June 2009.
An even less critical warning is that once you disable access times, you can no longer tell which files your users are actually using. Some organizations will inform their users, "Here is a list of files you don't seem to have used within the past year. We would like to move these off the systems to an archive."
Reasonably clever users will simply do the following once in a while to make all their files seem important:
$ find ~ -type f -exec cp {} /dev/null \;
To be safe, use nodiratime to disable directory access times, plus relatime, which updates the access time only when the modify or change time is newer than the current access time. So, something like this:
# cat /etc/fstab
UUID=62dfc4a4-86c2-4ebf-aaa3-442ecc740122  /             ext4  nodiratime,relatime  1 1
LABEL=/boot                                /boot         ext4  nodiratime,relatime  1 2
LABEL=/home                                /home         ext4  nodiratime,relatime  1 2
LABEL=/var                                 /var          ext4  nodiratime,relatime  1 2
/dev/cdrom                                 /media/cdrom  auto  umask=0,users,iocharset=utf8,noauto,ro,exec  0 0
none                                       /proc         proc  defaults  0 0
UUID=12e71ecd-833d-45ea-adfd-1eca8c27d912  swap          swap  defaults  0 0
Leave write barriers in place. Journaling file systems use write barriers to ensure that data has actually been written onto the physical media. It is possible to mount a file system with write barriers disabled (the barrier=0 option on Ext4, nobarrier on XFS and Btrfs), but it is a very bad idea unless you are absolutely certain that your storage subsystem has a perfectly reliable battery-backed power supply. Even if you think it is safe, any performance improvement should be negligible.
Select an Appropriate Journal Mode
Ext4 gives you a choice of journaling modes, which you can select when mounting the file system: as options to the mount command when experimenting, and then as settings recorded in /etc/fstab.
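For example, to experiment on a hypothetical unused test file system:
# mount -o data=journal /dev/sde1 /mnt/test
Ext4 generally refuses to change the data mode on a simple remount, so unmount and mount again to switch modes.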
data=ordered is the default; it journals only the metadata. Data blocks are written to the file system first, then the metadata.
data=writeback may provide some performance improvement, but at a severe security risk on a multi-user system. It journals only the metadata, and data and metadata may be written to the disk in any order. The risk is that if the system crashes while appending to a file, and the metadata happens to have been committed (allocating additional data blocks to the file) before the data was written (overwriting those blocks with new contents), then after journal recovery that file may contain data blocks holding the contents of previously deleted files from any user on that file system.
If you are tempted to use data=writeback for performance on file systems like /tmp, disable the journal instead. The file system must be unmounted when you do this:
# tune2fs -O ^has_journal /dev/sde1
data=journal writes all data twice: first to the journal, and then to the data and metadata blocks. This will slightly improve the integrity of the data blocks, but with a serious degradation in performance.
Given the relatively low probability of losing data blocks during a system crash, most organizations would be better served by verifying once again that their backup and recovery processes are trustworthy, and then using the default data=ordered.
And next...
Now that we have optimized our disk I/O as best we can, let's move on to networking. We will start at the bottom and tune Ethernet performance across the LAN.