
Performance Tuning on Linux — File Systems

Create and Tune File Systems

Divide your overall storage hierarchy across multiple file systems to improve performance, support better security choices, and make upgrades and expansions easier. We will see how to do that, sizing the volumes properly to avoid fragmentation. Then choose a file system type (Ext4, XFS, or Btrfs), create it with appropriate parameters, and select a suitable journal mode. Finally, mount the file system with options that can improve performance.

The classic practical study is File Layout and File System Performance by Keith Smith and Margo Seltzer at Harvard University. It's from 1994 and is based on a study of FFS or the BSD Fast File System, but their measurements are still appropriate and their analysis relevant.

Use Multiple File Systems

Improve performance through balanced I/O and security through finer granularity of file system attributes. Create independent file systems on separate physical volumes.

Use a separate one for /home on a file server holding user data. As the number of users and their file I/O increase, create multiple ones for /home1, /home2, etc., with home directories symbolically linked from /home. Use a separate one for /var or subdirectories like /var/www on web servers, /var/spool on mail and print servers, and /var/log on centralized log servers, the last of those possibly even on busy servers logging lots of data locally. Create one for /opt if you are using Oracle or other software that uses that area, or any other local specialized data stores you need. Consider making /tmp a separate file system, or make it a RAM-based file system, which can be as simple as making /tmp a symbolic link to /dev/shm.

These separate file systems can be mounted with different options for security as described here.

Multiple file systems can limit integrity loss caused by a disk crash. Then it will be time to restore from backup, and both backups and restores will have been made more efficient with multiple independent file systems.

Choose a File System Type

Current appropriate choices are Ext4 (default in RHEL/CentOS 6), XFS (RHEL/CentOS 7), and Btrfs.

Ext4 is still adequate for many organizations, but XFS brings advantages in scale and performance.

Oracle bought Sun Microsystems so you could buy your Oracle database, the Solaris operating system, and the UltraSPARC platform on which it all runs in a single purchase. But they were soon advocating running Solaris on x86, and soon after that their recommendation was to run Linux on x86_64/AMD64 hardware.

They bought the manufacturer of the Solaris OS and the UltraSPARC platform but before long they were suggesting a free operating system on someone else's hardware. So why did Oracle buy Sun?

My theory was that Oracle was buying the ZFS file system, and Solaris and UltraSPARC were the package it came in.

Now Oracle is supporting the development of the Btrfs file system, which is meant to be an alternative to ZFS with a similar promise of true enterprise scale, performance, and feature set.

When Red Hat Enterprise Linux 7 came out in mid 2014, it used XFS by default and featured Btrfs as a "technology preview". Other distributions have followed suit. Theodore Ts'o, the principal developer of the Ext3 and Ext4 file systems, has stated that while Ext4 was an improvement in performance and feature set, it is a "stop-gap" on the way to Btrfs.

Meanwhile there is the ZFS on Linux project. Ubuntu 16.04 was released with ZFS; see the announcement and the support page.

Create File Systems

There are tools to convert one file system into another in place, or at least to convert Ext3 into Ext4 and to convert Ext3 or Ext4 into Btrfs (with btrfs-convert). However, you should back up your data and create a fresh file system of the newer type in order to better optimize metadata layout and take advantage of the newer file system's performance improvements. Backup and restore into a fresh file system also eliminates much of the fragmentation that will have crept in due to modifications, deletions, and additions to the existing file system.

Match the block size of the underlying storage or a multiple thereof. Other than block size, you probably don't want to adjust anything when creating any of these file systems. Maybe add a label and/or UUID, but that's it in most cases.

By default Ext4 uses heuristics to select a block size of 1024, 2048, or 4096 bytes, although it will be 4096 on all file systems over 512 MBytes in size unless overridden by /etc/mke2fs.conf or command-line options.

# mkfs.ext4 -b 4096 /dev/sde1 

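XFS allows you to specify the block size and defaults to 4096-byte blocks. Linux has traditionally been unable to mount an XFS file system whose block size is larger than the kernel's page size, which is 4096 bytes on x86_64, so check your platform before asking for anything larger.

# mkfs.xfs -b size=4096 /dev/sde1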

With Btrfs it's the node size or tree block size. The default is the system page size or 16 kbytes (16384), whichever is larger. Specify a different size with -n if you want.

# mkfs.btrfs -n 65536 /dev/sde1 

If your storage is RAID, create file systems whose geometry matches the array so that aligned full-stripe writes can be written directly to the disks. This is where it gets complicated. See the manual page for mkfs.ext4 for an example of calculating the correct stride and stripe_width. On XFS, the equivalents are sunit (the stripe unit) and swidth (the stripe width), both given as a number of 512-byte sectors.
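The arithmetic behind the Ext4 options is simple enough to sketch in shell. This is only an illustration, not something taken from the manual page: the function name raid_ext4_opts and its three arguments are assumptions made for this example. The stride is the RAID chunk size expressed in file system blocks, and the stripe width is the stride multiplied by the number of data-bearing disks.

```shell
# Hypothetical helper: print mkfs.ext4 extended options for a RAID array.
#   $1 = RAID chunk size in KiB
#   $2 = file system block size in bytes
#   $3 = number of data disks (exclude parity, e.g. 3 for a 4-disk RAID 5)
raid_ext4_opts() {
    chunk_kib=$1
    block_bytes=$2
    data_disks=$3
    # stride: how many file system blocks fit in one chunk on one disk
    stride=$(( chunk_kib * 1024 / block_bytes ))
    # stripe_width: file system blocks in one full stripe across the data disks
    stripe_width=$(( stride * data_disks ))
    echo "stride=${stride},stripe_width=${stripe_width}"
}

# Example: 64 KiB chunks, 4096-byte blocks, 4-disk RAID 5 (3 data disks)
raid_ext4_opts 64 4096 3
```

The printed string is meant to be passed to the -E option, as in mkfs.ext4 -b 4096 -E "$(raid_ext4_opts 64 4096 3)" followed by the array device. Verify the chunk size your array actually uses before relying on the result.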

Limit Fragmentation

When you start with a nearly empty file system, new files can be stored in an optimal way. An individual file can be stored as a contiguous series of extents (or allocated blocks), and when you install a package or otherwise extract an archive the component files will probably be stored close to each other. Later access to multiple files from that package or archive can be done with minimal head motion.

The problem is that as you delete or truncate existing files, new "holes" of free space appear. Free space is becoming fragmented.

When you append data to an existing file, it is often the case that the following blocks are already in use so a new extent must be created somewhere else within the file system. Individual files are becoming fragmented.

As you add new files, there is no longer space to store them at physical locations close to the other files in the same directories. Files are being scattered across the volume.

These problems become worse as the file system ages, and especially so as it becomes more full.

Preëmptive Techniques for Limiting Fragmentation

Many of the modern file systems preallocate longer extents to files that are being appended to. Ext4, XFS, and Btrfs do delayed allocation or allocate-on-flush, which holds data in memory until it must be flushed. This groups block or extent allocation into larger groups, and reduces both fragmentation and CPU interruption.

Btrfs includes what it calls automatic on-line defragmentation, although it's really fragmentation avoidance. Mount the file system with the autodefrag option. It will detect small random writes into files and queue them for the defrag process. This works well for small files. However, it hurts performance with workloads of large databases or virtual machine disk images as those situations have many small writes that are within one large file.

Careful design of applications can help. Common examples are a BitTorrent client, which gradually fills in a large file one small piece at a time in random order, and a media processing program that takes a long time to create a large file sequentially but knows its eventual size in advance. The fallocate() system call or the posix_fallocate() library function can pre-allocate the file to its expected size, minimizing its fragmentation.
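The same preallocation is available from the shell through the fallocate utility in util-linux. Here is a minimal sketch; the function name preallocate and the scratch file are made up for this example, and the call fails on file systems that do not support preallocation.

```shell
# Hypothetical helper: reserve space for a file before writing into it.
#   $1 = path of the file to create or extend
#   $2 = final size, e.g. 700M or 4G (suffixes as accepted by fallocate)
preallocate() {
    # fallocate(1) asks the file system to allocate the blocks up front
    # without writing any data into them.
    fallocate -l "$2" -- "$1"
}

# Example: reserve 8 MiB in a scratch file, then show its size in bytes
scratch=$(mktemp)
preallocate "$scratch" 8M && stat -c %s "$scratch"
rm -f "$scratch"
```

Afterward, filefrag on the preallocated file shows how many extents the file system needed to satisfy the request.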

Retroactive Techniques for Reducing Fragmentation

Many file systems have defragmentation tools to reorder the extents of files, and some of them can reduce file scattering.

The easiest and fastest way to get a less fragmented file system is to back up all your data to another volume. Then create a new file system within the old volume and copy the data back into place. But to measure fragmentation and work to defragment it in place:

Ext4 Fragmentation and Defragmentation

You can measure the fragmentation of free space with e2freefrag. It displays an overview and a histogram of free extent sizes.

# e2freefrag /dev/sdb1          
Device: /dev/sdb1
Blocksize: 4096 bytes
Total blocks: 244190390
Free blocks: 92461513 (37.9%)

Min. free extent: 4 KB 
Max. free extent: 2064256 KB
Avg. free extent: 91568 KB
Num. free extent: 4039

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :           237           237    0.00%
    8K...   16K-  :           130           286    0.00%
   16K...   32K-  :            77           386    0.00%
   32K...   64K-  :            77           938    0.00%
   64K...  128K-  :           132          3254    0.00%
  128K...  256K-  :           152          6707    0.01%
  256K...  512K-  :           540         51111    0.06%
  512K... 1024K-  :           972        145906    0.16%
    1M...    2M-  :           542        169617    0.18%
    2M...    4M-  :           223        155107    0.17%
    4M...    8M-  :           243        415689    0.45%
    8M...   16M-  :            32         89916    0.10%
   16M...   32M-  :            13         72453    0.08%
   32M...   64M-  :            26        323374    0.35%
   64M...  128M-  :           309       7572979    8.19%
  128M...  256M-  :            78       3249618    3.51%
  256M...  512M-  :            70       6185074    6.69%
  512M... 1024M-  :            40       7921928    8.57%
    1G...    2G-  :           146      66096934   71.49% 

Use filefrag to display the number of extents into which a file has been fragmented.

# filefrag CentOS-x86_64.iso 
CentOS-x86_64.iso: 28 extents found 
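To find the worst offenders in a whole directory tree, you can run filefrag over many files and sort the results. The sketch below assumes the one-line-per-file output format shown above. The function name rank_by_extents is invented for this example, and the canned input stands in for real filefrag output.

```shell
# Hypothetical helper: read filefrag output on stdin and list the files
# with more than one extent, most fragmented first.
rank_by_extents() {
    # Each input line looks like "NAME: N extents found" (or "1 extent found").
    # The extent count is the third field from the end; everything before the
    # final ": N extent(s) found" is the file name, which may contain spaces.
    awk 'NF >= 3 && $(NF-2) + 0 > 1 {
             n = $(NF-2)
             name = $0
             sub(/: [0-9]+ extents? found$/, "", name)
             print n "\t" name
         }' | sort -rn
}

# Example with canned input; in real use pipe in something like
#   find /usr/local/ISOs -type f -exec filefrag {} +
printf '%s\n' \
    'a.iso: 28 extents found' \
    'b.iso: 1 extent found' \
    'my file.img: 135 extents found' | rank_by_extents
```

Files with a single extent are dropped from the listing, since there is nothing to gain by defragmenting them.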

Use e4defrag with the -c option for a report of how fragmented the file system is.

# e4defrag -c /home
<Fragmented files>                             now/best       size/ext
1. /home/vmware/RHEL6/RHEL6-1-s007.vmdk          2/1              6 KB
2. /home/vmware/Solaris-10/Solaris-10-s003.vmdk
                                                 2/1             32 KB
3. /home/vmware/Solaris-10/Solaris-10-s002.vmdk
                                                 2/1             32 KB
4. /home/vmware/Solaris-10/Solaris-10-s004.vmdk
                                                 2/1             32 KB
5. /home/vmware/Solaris-10/Solaris-10-s001.vmdk
                                                 2/1             32 KB

 Total/best extents                             59433/50512
 Average size per extent                        9950 KB
 Fragmentation score                            0
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (/home) does not need defragmentation.
 Done. 

You can use e4defrag without -c if you decide it's worth what will be a long wait to defragment the file system.

XFS Fragmentation and Defragmentation

Measure the current level of fragmentation with xfs_db.

# xfs_db -c frag -r /dev/sdb1 

Defragment XFS with xfs_fsr.

# xfs_fsr /dev/sdb1 

Btrfs Fragmentation and Defragmentation

You use the btrfs command to do most everything with Btrfs. For example, you can defragment a file system or individual files and directories.

# btrfs filesystem defragment /home
# btrfs filesystem defragment /usr/local/ISOs/*.iso 

Choose Appropriate Mount Options

The easiest way to get significant performance improvements is to disable access times. This can make an especially large improvement on workstations, where various components of the graphical desktop are doing all sorts of searching and indexing behind the scenes.

Every file has three timestamps: modify is when its content was changed, change is when its metadata (e.g., permissions) was changed, and access is when it was last read. So, a command like find can cause an enormous amount of writing to a file system, not modifying any files or directories but updating all the access times of the directories.
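You can inspect all three timestamps with GNU stat. The following is a small sketch; the function name show_times and the scratch file are assumptions for this example. Whether the access line changes between the two reports depends on the mount options in effect on the file system holding the file.

```shell
# Hypothetical helper: print a file's three timestamps using GNU stat.
show_times() {
    # %x = last access, %y = last modification, %z = last status change
    stat -c 'access: %x
modify: %y
change: %z' -- "$1"
}

# Example: report the timestamps, read the file, then report them again
demo=$(mktemp)
echo "some data" > "$demo"
show_times "$demo"
cat "$demo" > /dev/null
show_times "$demo"
rm -f "$demo"
```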

Use the mount option noatime to disable access time updates for files and directories alike, or nodiratime to disable them for directories only. This gives up behavior that POSIX specifies, but there are few uses for it.

There are warnings that disabling access times can confuse some applications; the Mutt text-based e-mail client seems to be the standard example. But... Mutt's last stable release was in June 2007. The last development release was in 2013 and there has been nothing but bug fixes since then. The latest entry on its news page is about a bug fix release in June 2009.

This worry about Mutt reminds me of the urban legend claiming that railway gauge and therefore Space Shuttle solid-rocket booster diameter were both derived from a long series of standards going back to Roman chariot axle length.

An even less critical warning is that once you disable access times, you can no longer tell which files your users are actually using. Some organizations will inform their users, "Here is a list of files you don't seem to have used within the past year. We would like to move these off the systems to an archive."

Reasonably clever users will simply do the following once in a while to make all their files seem important:

$ find ~ -type f -exec cp {} /dev/null \; 

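To be safe, use nodiratime to disable directory access times plus relatime, which updates a file's access time only when the existing access time is no later than the modify or change time, or is more than a day old. relatime has been the kernel default since Linux 2.6.30. So something like this: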

# cat /etc/fstab
UUID=62dfc4a4-86c2-4ebf-aaa3-442ecc740122 / ext4 nodiratime,relatime 1 1
LABEL=/boot     /boot   ext4    nodiratime,relatime 1 2
LABEL=/home     /home   ext4    nodiratime,relatime 1 2
LABEL=/var      /var    ext4    nodiratime,relatime 1 2
/dev/cdrom /media/cdrom auto umask=0,users,iocharset=utf8,noauto,ro,exec 0 0
none /proc proc defaults 0 0
UUID=12e71ecd-833d-45ea-adfd-1eca8c27d912 swap swap defaults 0 0 
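To confirm which of these options actually took effect, look at the kernel's view in /proc/mounts rather than at /etc/fstab. A minimal sketch follows; the function name atime_policy and its optional second argument (an alternative mounts table, handy for testing) are inventions for this example. It assumes mount points without embedded spaces.

```shell
# Hypothetical helper: report which access-time policy a mount point uses.
#   $1 = mount point, e.g. /home
#   $2 = mounts table to read (defaults to /proc/mounts)
atime_policy() {
    awk -v mp="$1" '
        $2 == mp {
            # With no atime option listed, every read updates the access time.
            policy = "strictatime"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "noatime" || opts[i] == "relatime")
                    policy = opts[i]
            dirs = ($4 ~ /(^|,)nodiratime(,|$)/) ? " nodiratime" : ""
            result = policy dirs    # a later entry for the same mount point wins
        }
        END { if (result != "") print result }
    ' "${2:-/proc/mounts}"
}

# Example: show the policy for the root file system
atime_policy /
```

Nothing is printed if the mount point does not appear in the table.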

Leave write barriers in place.

Journaling file systems use write barriers to ensure that data has been written onto the physical media. It is possible to mount a file system with write barriers disabled (barrier=0 option on Ext4, nobarrier on XFS and Btrfs), but it is a very bad idea unless you are absolutely certain that your storage subsystem has a perfectly reliable battery-backed power supply. Even if you think it is safe, any performance improvement should be negligible.

Select an Appropriate Journal Mode

Ext4 gives you a choice of journaling modes which you can select when mounting the file system, as options to the mount command when experimenting and then as settings recorded in /etc/fstab.

data=ordered is the default. It journals only the metadata. Data blocks are written to the file system first, then metadata.

data=writeback may provide some performance improvement, but at a severe security risk on a multi-user system. It journals only the metadata, and data and metadata may then be written to the disk in any order. Suppose the system crashes while a file is being appended to, and the metadata (allocating additional data blocks to the file) was committed before the data (overwriting those blocks with new contents) was written. After journal recovery, that file may contain blocks holding the contents of previously deleted files belonging to any user on that file system. If you are tempted to use data=writeback for performance on file systems like /tmp, disable the journal instead, with the file system unmounted:

# tune2fs -O ^has_journal /dev/sde1

data=journal writes all data twice: first to the journal and then to the data and metadata blocks. This will slightly improve the integrity of the data blocks, but with a serious degradation in performance.

Given the relatively low probability of losing data blocks during a system crash, most organizations would be better served by verifying again that their backup and recovery processes are trustworthy, and then using the default data=ordered.

And next...

Now that we have optimized our disk I/O as best we can, let's move on to networking. We will start at the bottom and tune Ethernet performance across the LAN.
