
Exploring the Linux Kernel

How the Kernel Boots

I have more details on another page, but the short version of Linux booting is as follows:

  1. The firmware on the motherboard runs a power-on self-test and finds the boot loader.
    • On older PC platforms, the BIOS found the Master Boot Record (MBR), the first 512-byte block of the first storage device. That MBR points to a boot loader stored elsewhere on the disk.
    • Modern PC motherboards use UEFI firmware. It must find a specially labeled partition called the EFI System Partition or ESP, which must contain a FAT file system. The firmware runs a specified program within that, which in turn finds and runs the boot loader. (An example of peeking at the ESP appears just after this list.)
    • On non-PC hardware (e.g., Alpha, UltraSPARC), the boot loader is a mini-OS and operates somewhat like the UEFI firmware.
  2. The boot loader uncompresses the compressed kernel image and loads it into RAM. More details on the kernel file format appear below.
  3. The boot loader also uncompresses an initial RAM disk image and loads it into memory; the kernel mounts it as its root file system for a while.
  4. The kernel discovers the available hardware, or at least what it needs and knows about initially.
  5. The real root file system is mounted in read-only mode.
  6. The kernel starts /sbin/init and now we're under the control of the boot files on the disk rather than what's coded into the kernel itself.
  7. The root file system is checked and then remounted in read/write mode.
  8. Some kernel modules (also called device drivers) may be loaded, and they may detect more hardware.
  9. A collection of scripts is run to get the system into the desired run state.
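
On a UEFI system you can peek at the ESP yourself. A minimal sketch, assuming the common mount point /boot/efi (your distribution may mount it elsewhere):

$ mount | grep /boot/efi
$ ls /boot/efi/EFI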

That's the simple version.

See my details page for much more.

Where Are The Pieces Installed?

While the kernel components can go anywhere you want to hide them, you will generally find the following files installed within the /boot directory. Replace the string release throughout the following with whatever your kernel release is, the result of running the command uname -r. Depending on your distribution, if you have multiple kernels installed you may see several of each file type with the release as part of their file names, and then one symbolic link pointing to the latest one installed through the distribution's updates.
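
For example (the release string shown here is purely illustrative; yours will differ):

$ uname -r
5.14.0-362.el9.x86_64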

The monolithic kernel

This is the monolithic core of the kernel, the main operating system file. This file is uncompressed and loaded into memory by the boot loader. Typically this is /boot/vmlinuz-release, possibly with several files for different releases plus one symbolic link simply named vmlinuz pointing to the most recent one.

This is one big file with three components on a PC: a small piece of real-mode bootstrap and setup code, a decompression routine, and the compressed kernel image itself.

On other hardware platforms this is simply a gzip'ed kernel. The kernel portion itself is a statically-linked ELF executable.
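
You can ask the file command to describe your kernel image. On a PC expect it to report a bzImage; the details vary by architecture and distribution:

$ file /boot/vmlinuz-$(uname -r)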

Why the funny name and location? Unix traditionally boots from a file named /vmunix or similar. So the Linux developers used /vmlinux.

Then the kernel grew too large for the boot loader to handle it, so it was gzip'ed and the "x" changed to "z".

Then disks grew large enough that the BIOS could not find things beyond the first 1024 cylinders, so a small file system named /boot was created within the first partition on the disk.

Directory containing the loadable modules

These are the dynamically loaded kernel modules, or device drivers. The modules for a release are stored in a hierarchy beneath /lib/modules/release/kernel/.

The files /lib/modules/release/modules.* contain information about the modules. An important one is modules.dep, which describes dependencies between modules. For example, the module to support the printer device /dev/lp0 will require the help of a module supporting the generic parallel port, which in turn may require the help of a module supporting some chipset, and so on.
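
You can also ask modprobe to resolve a dependency chain for you. A hedged example using the printer module mentioned above (whether lp is built as a module depends on your kernel configuration):

$ modprobe --show-depends lp      # prints insmod commands, dependencies first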

The initial RAM disk image

While this could be named anything within /boot, it is typically /boot/initrd-release.img or /boot/initramfs-release.img. It contains enough components to provide the kernel with the device drivers needed to find and read the real root file system on the disks, plus some programs, configuration files, shared libraries, and device-special files to handle the initial steps of detecting and initializing hardware and mounting the root file system.

It is the result of creating a cpio archive and compressing it with gzip.

While you could copy the file to some temporary working area, uncompress and then extract its contents, you can easily read it with this command:

# lsinitrd /boot/initrd-release.img 
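
If you would rather unpack it by hand, here is a sketch of the traditional approach. This assumes a simple gzip-compressed cpio archive; newer images may prepend an uncompressed early-microcode archive, which lsinitrd knows how to skip:

$ mkdir /tmp/initrd-work
$ cd /tmp/initrd-work
$ zcat /boot/initrd-release.img | cpio -idmv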

GRUB boot loader components

You will probably be using version 2 of GRUB, so its components will be in /boot/grub2. Older systems with legacy GRUB will use /boot/grub.

System map (kernel symbol table)

This is stored in /boot/System.map-release. If you care about this, then you will want to see some of the other modules.* files in /lib/modules/release, especially modules.symbols*.
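
For example, to look up the address recorded for one kernel symbol (the symbol name here is just an illustration, it may not exist in your build):

$ grep -w sys_open /boot/System.map-$(uname -r)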

Kernel build configuration

This is supposed to be in /boot/config-release. I have found that the files provided by Red Hat are close but not necessarily exactly what was used to build the kernel they supply.

If you built your kernel with the proper choice, this will be available as the kernel data structure /proc/config.gz. In that case you are asking the kernel itself to provide its internal record of how it was built, meaning you will get the complete truth.

If that apparent file (actually a kernel data structure) isn't there, that feature may have been built as a loadable module. Try loading that module and trying again:

# modprobe ikconfig
# zcat /proc/config.gz | less 

See more on this below.

Kernel headers

These may be available as /boot/kernel.h-release. You may need this header file to compile some C/C++ programs. The kernel source code itself under /usr/src/linux is another possible source for these headers.

How Was My Kernel Built? What Device Drivers Does It Have? What Can It Do?

You need this information. For some reason Red Hat doesn't think you really need to figure this out on your own; I guess you're supposed to call their support line.

There should be a configuration file describing the set of device drivers built into the monolithic kernel and the set built as loadable modules. There will also be some configuration choices made for some of them — for example, for a non-native file system type like NTFS, should it be supported read-only or read-write?

When you build a Linux kernel, the configuration you create to define the build itself ends up as the file /usr/src/linux/.config; see my page on building Linux kernels for more details. If you built your own kernel, you should have kept that file or a copy.

Many distributions give you the file /boot/config-release with the implication that this is the configuration used to build the kernel you got. They might have gotten better about this, but I was misled by Red Hat enough times when working with Linux on the Alpha architecture that I no longer trust their config file to be any more than a fairly close approximation. If it's all you have, understand that it may be close but not completely correct.
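
A quick way to judge how close the distribution's file is to the truth, assuming your running kernel provides /proc/config.gz as described next. This uses bash process substitution; any output lines are differences:

$ diff <(zcat /proc/config.gz) /boot/config-$(uname -r)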

To be confident that you are getting the real kernel configuration, ask the kernel to describe itself. If the kernel was built with the right settings, its build configuration is available as a kernel data structure that you can access as /proc/config.gz. The configuration variables are:
  CONFIG_IKCONFIG=y
  CONFIG_IKCONFIG_PROC=y
They are set when configuring the build under:
  General setup --->
    Kernel .config support
    Enable access to .config through /proc/config.gz
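
To verify these settings on a running kernel (assuming /proc/config.gz is available, possibly after the modprobe ikconfig step shown above):

$ zcat /proc/config.gz | grep IKCONFIG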

To see the list of available device drivers, you could use either of these commands:

$ modprobe -l      # with older module tools; newer kmod removed this option
$ ls -R /lib/modules/$( uname -r )/kernel

That's only somewhat useful, as that just gives you a list of file names. If you installed the source code (and why not?), see the text files in: /usr/src/linux/Documentation/

You can also find information on a specific module this way:

$ modinfo module-name-goes-here 

The information you get is up to the developers of that module. So you might get something very useful, with an explanation, a list of load-time optional parameters and what they mean, and so on. Or you might get a cryptic table of hexadecimal addresses and a list of PCI bus addresses and a reminder that you can always read the C source code and figure it out from there.
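
If all you want is the list of load-time parameters, modinfo can limit its output to just those:

$ modinfo -p module-name-goes-here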

Loading and Unloading Modules

Let's say you just added an Ethernet card but you don't know whether it needs the 8139cp or 8139too driver. Based on what you saw on the card and its chips, or maybe in the output of the lspci -v command, you think it's one of the two. But you don't know which one.

Try loading one of them and examining the end of the kernel ring buffer with this command sequence:

# modprobe 8139cp
# dmesg | tail 

Let's say that you only saw this output, generated by the module announcing itself as it loaded:

8139cp: 10/100 PCI Ethernet driver v1.3 (Mar 22, 2004) 

That doesn't look too promising. So let's unload it, and then load the other:

# rmmod 8139cp
# modprobe 8139too
# dmesg | tail 

Now we see this output at the end of the kernel ring buffer:

8139too Fast Ethernet driver 0.9.28
ACPI: PCI Interrupt Link [APC2] enabled at IRQ 17
8139too 0000:01:09.0: PCI INT A -> Link[APC2] -> GSI 17 (level, high) -> IRQ 17
eth0: RealTek RTL8139 at 0xf8394000, 00:11:95:1e:8e:b6, IRQ 17
eth0:  Identified 8139 chip type 'RTL-8100B/8139D'

Hey, that's it! Now we know which driver to specify in /etc/modprobe.conf or wherever.
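
With the classic /etc/modprobe.conf syntax, recording that choice might look like the following (the interface name is an assumption; newer systems use files under /etc/modprobe.d/ instead, and udev usually binds the driver automatically):

alias eth0 8139too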

What Is My Kernel Doing Right Now?

See the kernel ring buffer with this command:

$ dmesg 

It's a ring buffer, so it only keeps the most recent information. It's a good idea to save a copy as soon as possible after boot time; /var/log/dmesg is an obvious place to store this. If your distribution didn't think of this simple improvement, add it yourself by appending this to the end of your /etc/rc.local file:

echo "Saving the kernel ring buffer in /var/log/dmesg
dmesg > /var/log/dmesg 

See what kernel modules are currently loaded with this command:

$ lsmod 

Let's say you see two lines like the following among all the output:

ext3                  125412  8
nf_conntrack_ftp       12704  1 nf_nat_ftp 

This means that the module nf_conntrack_ftp has been loaded (it's needed to handle FTP connections through a Linux firewall), and that module is needed by another module, nf_nat_ftp, a module used to handle FTP connections through Network Address Translation or NAT.

The module ext3 has been loaded, as it must be to handle the Linux-native Ext3FS file systems. No other module needs ext3, but it could not be unloaded, as the kernel needs it to handle all the file systems currently in use!

The first number after the module name is the size of the module in bytes. The second number is the number of things currently requiring the module. That FTP connection tracking module is needed by one thing, the other module that requires it. The Ext3FS module is needed by 8 things, the number of currently mounted Ext3FS file systems.
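
As an illustration, trying to unload that busy file system module simply fails. The exact error text varies with your version of the module tools, but expect something like:

# rmmod ext3
rmmod: ERROR: Module ext3 is in use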

Kernel Hardware Detection

The /proc file system is really a large collection of kernel data structures presented in a reasonably friendly format. It appears to be a hierarchy of directories and files, which you can explore with cd and ls, and investigate in many cases with cat.

What has my kernel detected about the CPU, memory, and partition table?

$ cat /proc/cpuinfo 
.... details appear here ....
$ cat /proc/meminfo
.... details appear here ....
$ cat /proc/partitions
.... details appear here ....

What devices have been connected to the kernel? Note that the loading of kernel modules may lead to the detection of more hardware and the automatic appearance of more device-special files in /dev.

$ ls /dev 

What devices are on the PCI bus? Let's see that in one line per device, then in moderate detail, then in great detail.

$ lspci 
.... output appears ....
$ lspci -v
.... much more output appears ....
$ lspci -vv
.... more output appears than you probably want to see ....  

What about moderate details on the device at PCI bus address 01:08.0?

$ lspci -v -s 01:08.0 

What USB devices are connected? Let's see that in one line per device, then in moderate detail, then in great detail.

$ lsusb 
.... output appears ....
$ lsusb -v
.... much more output appears ....
$ lsusb -vv
.... more output appears than you probably want to see ....  

What about another way to report on the USB bus?

$ systool -v -b usb 

What about SCSI devices, including USB storage devices that appear as generic SCSI devices?

$ systool -v -b scsi 

Kernel Data Structures and Kernel Tuning

What is the complete current set of kernel data structures, by their name and value?

$ sysctl -a 

The kernel data structure net.ipv4.tcp_fin_timeout is accessible as the file /proc/sys/net/ipv4/tcp_fin_timeout. Instead of that sysctl command, you could have changed to the directory /proc/sys/net/ipv4 and displayed its contents with cat.
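
The mapping is mechanical, dots in the sysctl name become slashes under /proc/sys:

$ echo net.ipv4.tcp_fin_timeout | tr '.' '/'
net/ipv4/tcp_fin_timeout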

You can read the current values of these kernel timers, counters, and other fields. You can also change them! This has the effect of flipping a switch or twisting a knob on the running kernel. Now let's say that your enthusiastic modification of kernel values accidentally puts your running kernel into a bizarre state — this is not at all unlikely if you aren't careful. All you have modified is the running kernel in RAM, the kernel file stored on the disk is unchanged. Reboot with a fresh kernel and you're back to the default state.

Read just one specific kernel data structure:

$ sysctl net.ipv4.tcp_fin_timeout

   -- or --

$ cat /proc/sys/net/ipv4/tcp_fin_timeout 

Change that kernel value to 10 (seconds, in this case); you will need to be root to do this:

# sysctl -w net.ipv4.tcp_fin_timeout=10

   -- or --

# echo "10" > /proc/sys/net/ipv4/tcp_fin_timeout 

Why would you want to mess with kernel values? To tune the running kernel for performance or security. See my page on security tuning suggestions for far more detail.

If you come up with a collection of adjustments that you find useful, you could either put the relevant echo or sysctl -w command sequence into /etc/rc.local or else you could put the relevant lines into /etc/sysctl.conf.
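
A minimal sketch of the /etc/sysctl.conf approach, reusing the example value from above (the setting shown is illustrative; your tuning choices will differ):

# Close FIN-WAIT-2 TCP connections faster than the default
net.ipv4.tcp_fin_timeout = 10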