
How Linux Boots, Run Levels, and Service Control

How Linux Boots

"How does Linux boot?" That's a very important question! Of course you need to understand that for troubleshooting, in order to handle situations where the system doesn't boot, or doesn't do it correctly, or otherwise doesn't get all the needed services started. But you also need to understand this for routine administration. You need to control which services are started, and handle dependencies where one service must be running before a second can be started. The answer to the question is complex, because there are so many choices along the way. The firmware on the platform, which boot loader you are using, which init system you are using, the details depend on these choices and more. Let's start with a simple explanation.

Turn it on, wait a few moments, start doing powerful things.

Maybe that's all you care to know. But maybe you want some details. At its very simplest, Linux is booted by this sequence:

  1. The firmware starts a boot loader.
  2. The boot loader loads the kernel, along with an initial RAM-based disk image.
  3. The kernel mounts the root file system and starts the init program.
  4. init starts everything else.

Going just slightly deeper, we have choices for the firmware and boot loader. On Linux's traditional platform, derived from the IBM PC architecture, the firmware has long been the very limited BIOS. That started transitioning to UEFI a number of years ago. UEFI has been taking over since Microsoft's release of Windows 8 in October 2012: Microsoft won't allow retail sales of Windows 8 computers without UEFI and its support for Secure Boot. The boot loader was once LILO, then GRUB, and now GRUB 2.

The boot loader also tells the kernel how to find and load an initial RAM-based disk image providing device drivers for physical disk controllers as well as some initial scripts.

The init program continues to evolve in capability and complexity, from early BSD-like systems through an SVR4-style init, then Upstart, and now systemd. As you will see below, systemd brings enormous changes to the way a Linux system boots.

This has been just the briefest of overviews. Continue to learn the details!

The following goes through the booting and service start and control steps in the sequence in which they happen, attempting to cover all the possibilities at each step.

Much of what follows will be needed to understand how to migrate from one Linux distribution to another, and how to upgrade across major releases of one distribution.

Kernel Space versus User Space

The Linux kernel runs directly on the hardware, using physical addresses and accessing the hardware on behalf of user processes while enforcing access permissions.

The work you accomplish on the computer is done by your processes, which were created out of system-owned processes when you logged in. All of these processes are descendants of init, one master process started by the kernel early in the boot process. The init process manages the system state by requesting the creation and termination of other processes.

The kernel first detects enough hardware to find and mount the root file system. It then starts the init program, which manages the system state and all subsequent processes. The only process the kernel knows should run is init. Once init has started, the kernel enforces permissions and manages resource utilization (memory, processing priority, etc.), and may prevent or restrict some processes. On the positive side, it's init and its descendants that request the creation of new processes.

The Kernel Boot Loader

Firmware

The system firmware selects some media and attempts to boot the operating system stored there. Selecting, loading, and starting the OS kernel might be done by the firmware itself when it is a small operating system of its own, as in the case of OpenBoot (later called Open Firmware), developed by Sun for the SPARC platform, or the Alpha SRM developed by DEC.

For motherboards with AMD/Intel processors using BIOS or UEFI, the firmware finds a "stub" of the boot loader in a disk partition, which then calls successively more capable components to get to the kernel.

The motherboard has firmware; for AMD/Intel processors it will be BIOS or UEFI. UEFI is a type of firmware, not a type of BIOS. Don't say "This system's BIOS is UEFI"; that's like saying "This orange is an apple."

Firmware Finds the Boot Loader

BIOS

The BIOS firmware selects the bootable device, scanning the attached storage in an order specified by the bus and controller scan order as well as the BIOS configuration. It is looking for a device starting with a 512-byte block which ends with the boot signature 0x55AA. The first 446 bytes of the boot block hold the boot loader, followed by 64 bytes for the partition table, then those final two bytes of the boot signature.

446 bytes of program code doesn't provide much capability! It will be just a boot loader "stub" which can find and start a more capable boot loader within a partition.
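
You can inspect that boot block yourself. A hedged example, assuming the first disk is /dev/sda and you have root privileges; the last two bytes of the dump should be the 55 aa signature:

# dd if=/dev/sda of=/tmp/mbr bs=512 count=1
# hexdump -C /tmp/mbr | tail -3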

UEFI

UEFI firmware initializes the processor, memory, and peripheral hardware such as Ethernet, SATA, video, and other interfaces. An interface can have its own firmware code, sometimes called Option ROM, which initializes that peripheral. UEFI can check those Option ROMs for embedded signatures which can appear on the UEFI's "Allowed" and "Disallowed" lists.

UEFI has a "boot manager", a firmware policy engine configured by a set of NVRAM variables. It must find the EFI System Partition (or ESP). It will look for a GPT or GUID Partition Table with GUID C12A7328-F81F-11D2-BA4B-00A0C93EC93B, the distinctive signature of an EFI System Partition in a GPT device. If it can't find that, it will look for a traditional MBR partition of type 0xEF. The EFI System Partition will usually be the first partition on some disk, but it can be anywhere as UEFI doesn't have the early 1980s limitations of BIOS.

This is why Knoppix media cannot boot under UEFI: it contains only a single ISO-9660 file system, with no EFI System Partition, so you have to reset the UEFI firmware to "legacy BIOS mode" or "BIOS compatibility mode". Other media may have a DOS/MBR partition table and an EFI System Partition. Let's compare Knoppix and Red Hat media, with some output broken into multiple lines for readability:

$ file KNOPPIX_V7.4.2DVD-2014-09-28-EN.iso 
KNOPPIX_V7.4.2DVD-2014-09-28-EN.iso: ISO 9660 CD-ROM filesystem data	\
	'KNOPPIX' (bootable)

% file rhel-server-7.0-x86_64-dvd.iso
rhel-server-7.0-x86_64-dvd.iso: DOS/MBR boot sector;		\
	partition 2 : ID=0xef, start-CHS (0x3ff,254,63),	\
	end-CHS (0x3ff,254,63), startsector 548348, 12572 sectors

% fdisk -l rhel-server-7.0-x86_64-dvd.iso

Disk rhel-server-7.0-x86_64-dvd.iso: 3.5 GiB, 3743416320 bytes, 7311360 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x1c7ff43a

Device                          Boot   Start      End  Blocks  Id System
rhel-server-7.0-x86_64-dvd.iso1 *          0  7311359 3655680   0 Empty
rhel-server-7.0-x86_64-dvd.iso2       548348   560919    6286  ef EFI (FAT-12/16/32)

The EFI System Partition contains a small FAT32 file system (or FAT12 or FAT16 on removable media). Treating the root of that file system as /, the firmware looks inside it for a boot program, typically /EFI/BOOT/BOOTX64.EFI or similar. The EFI System Partition will usually be mounted as /boot/efi after booting, so its content will be accessible.

Let's find our EFI System Partition. The first command lists mounted FAT file systems, one of which could be the ESP. The second prints the GPT, where the ESP appears with the boot flag and the name "EFI System Partition":

# mount -t vfat
/dev/sda1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro)

# parted /dev/sda print
Model: ATA VMware Virtual I (scsi)
Disk /dev/sda: 21.5GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system  Name                  Flags
 1      1049kB  211MB   210MB   fat16        EFI System Partition  boot
 2      211MB   735MB   524MB   xfs
 3      735MB   21.5GB  20.7GB                                     lvm
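
Once the EFI System Partition is mounted at /boot/efi, you can also list the boot programs it holds. The exact names vary by distribution; a hedged example:

# find /boot/efi -name '*.efi'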

UEFI is much more capable than BIOS, but with that capability comes complexity. Exactly which program will UEFI run? Whichever program it has been configured to run. What second program will that first program run, and what configuration files might that second program read? Whatever the first and second programs have been built to do next.

The following walks through an example system which happens to use Red Hat Enterprise Linux 7. It cannot predict exactly what your system has been built to do during the booting process, but you can use this same analysis to figure it out.

Run the following to see what the UEFI firmware is configured to do on your system. In this case, the BootOrder line specifies that the default boot target is the one labeled as "Red Hat Enterprise Linux", but you could enter the UEFI boot manager within the first 5 seconds to choose from the others. See the efibootmgr manual page for how to change the timeout or the order and default choice.

# efibootmgr -v
BootCurrent: 0004
Timeout: 5 seconds
BootOrder: 0004,0000,0001,0003,0002
Boot0000* EFI IDE Hard Drive (IDE 0:0) ACPI(a034ad0,0)PCI(7,1)ATAPI(0,0,0)
Boot0001* EFI SATA CDROM Drive (1.0)   ACPI(a034ad0,0)PCI(11,0)PCI(5,0)03120a00010000000000
Boot0002* EFI Network   ACPI(a0341d0,0)PCI(14,0)MAC(6c626db2f841,0)
Boot0003* EFI Internal Shell (Unsupported option) MM(b,bee94000,bf21efff)
Boot0004* Red Hat Enterprise Linux  HD(spec)File(\EFI\redhat\shim.efi)

The spec string is really some long thing with a UUID, like this:
  HD(1,800,64000,b22ad8fd-3b85-4517-987e-40cba35abd53)
but the point here is the file specification I wanted you to see:
  File(\EFI\redhat\shim.efi)

We can dig deeper with the following command, which shows that shim.efi will call grubx64.efi. In case the expected string is split across lines, also search for other substrings: gru, bx64, 64.efi, and so on. In this case we get lucky and find both grub and g.r.u.b complete on lines. We see g.r.u.b where the data contains null characters between the letters: 0x67 0x00 0x72 0x00 0x75 0x00 0x62 0x00.

# hexdump -C /boot/efi/EFI/redhat/shim.efi | egrep -C 6 -i 'grub|g.r.u.b'
000c5090  42 00 53 00 74 00 61 00  74 00 65 00 0a 00 00 00  |B.S.t.a.t.e.....|
000c50a0  4d 00 6f 00 6b 00 49 00  67 00 6e 00 6f 00 72 00  |M.o.k.I.g.n.o.r.|
000c50b0  65 00 44 00 42 00 00 00  46 00 61 00 69 00 6c 00  |e.D.B...F.a.i.l.|
000c50c0  65 00 64 00 20 00 74 00  6f 00 20 00 73 00 65 00  |e.d. .t.o. .s.e.|
000c50d0  74 00 20 00 4d 00 6f 00  6b 00 49 00 67 00 6e 00  |t. .M.o.k.I.g.n.|
000c50e0  6f 00 72 00 65 00 44 00  42 00 3a 00 20 00 25 00  |o.r.e.D.B.:. .%.|
000c50f0  72 00 0a 00 00 00 5c 00  67 00 72 00 75 00 62 00  |r.....\.g.r.u.b.|
000c5100  78 00 36 00 34 00 2e 00  65 00 66 00 69 00 00 00  |x.6.4...e.f.i...|
000c5110  46 00 61 00 69 00 6c 00  65 00 64 00 20 00 74 00  |F.a.i.l.e.d. .t.|
000c5120  6f 00 20 00 67 00 65 00  74 00 20 00 6c 00 6f 00  |o. .g.e.t. .l.o.|
000c5130  61 00 64 00 20 00 6f 00  70 00 74 00 69 00 6f 00  |a.d. .o.p.t.i.o.|
000c5140  6e 00 73 00 3a 00 20 00  25 00 72 00 0a 00 00 00  |n.s.:. .%.r.....|
000c5150  46 00 61 00 69 00 6c 00  65 00 64 00 20 00 74 00  |F.a.i.l.e.d. .t.|
--
000c51b0  73 00 74 00 61 00 6c 00  6c 00 20 00 73 00 65 00  |s.t.a.l.l. .s.e.|
000c51c0  63 00 75 00 72 00 69 00  74 00 79 00 20 00 70 00  |c.u.r.i.t.y. .p.|
000c51d0  72 00 6f 00 74 00 6f 00  63 00 6f 00 6c 00 00 00  |r.o.t.o.c.o.l...|
000c51e0  42 00 6f 00 6f 00 74 00  69 00 6e 00 67 00 20 00  |B.o.o.t.i.n.g. .|
000c51f0  69 00 6e 00 20 00 69 00  6e 00 73 00 65 00 63 00  |i.n. .i.n.s.e.c.|
000c5200  75 00 72 00 65 00 20 00  6d 00 6f 00 64 00 65 00  |u.r.e. .m.o.d.e.|
000c5210  0a 00 00 00 00 00 00 00  5c 67 72 75 62 78 36 34  |........\grubx64|
000c5220  2e 65 66 69 00 74 66 74  70 3a 2f 2f 00 00 00 00  |.efi.tftp://....|
000c5230  55 00 52 00 4c 00 53 00  20 00 4d 00 55 00 53 00  |U.R.L.S. .M.U.S.|
000c5240  54 00 20 00 53 00 54 00  41 00 52 00 54 00 20 00  |T. .S.T.A.R.T. .|
000c5250  57 00 49 00 54 00 48 00  20 00 74 00 66 00 74 00  |W.I.T.H. .t.f.t.|
000c5260  70 00 3a 00 2f 00 2f 00  0a 00 00 00 00 00 00 00  |p.:././.........|
000c5270  54 00 46 00 54 00 50 00  20 00 53 00 45 00 52 00  |T.F.T.P. .S.E.R.|
--
00144430  73 5f 70 72 69 6e 74 00  58 35 30 39 5f 70 6f 6c  |s_print.X509_pol|
00144440  69 63 79 5f 6c 65 76 65  6c 5f 6e 6f 64 65 5f 63  |icy_level_node_c|
00144450  6f 75 6e 74 00 50 45 4d  5f 77 72 69 74 65 5f 62  |ount.PEM_write_b|
00144460  69 6f 5f 44 53 41 50 72  69 76 61 74 65 4b 65 79  |io_DSAPrivateKey|
00144470  00 58 35 30 39 5f 41 54  54 52 49 42 55 54 45 5f  |.X509_ATTRIBUTE_|
00144480  63 72 65 61 74 65 5f 62  79 5f 4f 42 4a 00 69 6e  |create_by_OBJ.in|
00144490  69 74 5f 67 72 75 62 00  52 53 41 5f 70 72 69 6e  |it_grub.RSA_prin|
001444a0  74 00 58 35 30 39 5f 74  72 75 73 74 5f 63 6c 65  |t.X509_trust_cle|
001444b0  61 72 00 42 49 4f 5f 73  5f 6e 75 6c 6c 00 58 35  |ar.BIO_s_null.X5|
001444c0  30 39 76 33 5f 67 65 74  5f 65 78 74 5f 62 79 5f  |09v3_get_ext_by_|
001444d0  63 72 69 74 69 63 61 6c  00 53 68 61 31 46 69 6e  |critical.Sha1Fin|
001444e0  61 6c 00 44 49 52 45 43  54 4f 52 59 53 54 52 49  |al.DIRECTORYSTRI|
001444f0  4e 47 5f 66 72 65 65 00  69 32 64 5f 58 35 30 39  |NG_free.i2d_X509|
		

We got lucky: all instances of both grub and g.r.u.b appear entirely within 16-byte blocks. None span from one block to the next, where they would be overlooked by this simple search. We can verify that's the case, and find any other instances, by running something like this and dealing with the messy output:

# egrep -a -C 2 'grub|g.r.u.b' /boot/efi/EFI/redhat/shim.efi | cat -A

Or we could open the shim.efi program with the vim editor, using the -R option to specify read-only mode. Then search for the patterns.

The GRUB configuration file is not in /boot/grub2/grub.cfg where I and the GRUB 2 documentation would expect to find it. Instead it's in the same directory as grubx64.efi. How does that program find it? It has been hard-coded to find it there:

# strings /boot/efi/EFI/redhat/grubx64.efi | grep grub.cfg
%s/grub.cfg 

This means that on this system, using the paths after the EFI System Partition is mounted, the sequence would be:

  1. UEFI Firmware calls...
  2. /boot/efi/EFI/redhat/shim.efi which calls...
  3. /boot/efi/EFI/redhat/grubx64.efi, which reads grub.cfg and loads...
  4. /boot/vmlinuz-release

See "UEFI boot: how does that actually work, then?" for more details and a great explanation of what UEFI is and is not.

The Boot Loader Starts

BIOS-MBR

On a BIOS-MBR system, that 446-byte boot loader "stub" was, in the past, LILO. However, LILO relied on physical addresses into the disk and was very sensitive to disk geometry and reconfigurations. You frequently had to rescue a system by booting from other media and then recreating the LILO boot loader block.

GRUB is a much better solution for BIOS-MBR. That 446-byte block is the GRUB "stub", which can be recovered from the first 446 bytes of a file in /boot/grub. In legacy GRUB this is /boot/grub/stage1, in GRUB 2 this is /boot/grub2/i386-pc/boot.img.

Encoded into the stage 1 GRUB loader is a definition of where to find the small /boot file system. This is typically the first partition of the first disk.

The boot loader will need to read the file system in the boot partition. Legacy GRUB used the files /boot/grub/*stage1_5, helper modules for the various file systems that could be used for the /boot file system — e2fs_stage1_5, xfs_stage1_5, and others.

The boot block of a disk is sector 0. For legacy reasons, the first partition of a disk does not begin until sector 63, leaving a gap of 62 sectors. The GRUB 2 component core.img is written into this gap. It plays the role of the legacy GRUB *_stage1_5 modules.

Now that GRUB (either legacy or GRUB 2) can read the /boot file system, it can run the final stage GRUB boot loader to do the complex work. In legacy GRUB this is /boot/grub/stage2, configured by either /boot/grub/grub.conf or /boot/grub/menu.lst. Both of those usually exist, one as a normal file and the other as a symbolic link pointing to it. In GRUB 2 this is /boot/grub2/i386-pc/kernel.img, configured by /boot/grub2/grub.cfg.

UEFI-GPT

On a UEFI-GPT system, the firmware has found the EFI System Partition, which must hold a FAT file system (usually FAT32).

On a typical system, the firmware will load and run whatever it is configured to run, frequently /EFI/BOOT/BOOTX64.EFI. This in turn will load and run whatever it has been built to run next; for a Linux system this will be /EFI/BOOT/GRUB.EFI, a component of the GRUB 2 boot loader. Or, to support Secure Boot, the firmware might first run /EFI/BOOT/SHIM.EFI, which has been signed by the UEFI signing service and which then chain loads the GRUB program.

As mentioned above, run efibootmgr -v to see what the firmware is configured to run.

This GRUB "stub" can then use the core.img component to read whichever file system type was used for the boot partition, and then start the kernel.img program which reads its configuration from /boot/grub2/grub.cfg.

The EFI version of GRUB uses /usr/lib/grub/x86_64-efi/linuxefi.mod* to load the kernel. Notice that the configuration file created by components of the grub2-efi package specifies the kernel and initial RAM disk image (more on that below) with linuxefi and initrdefi rather than linux and initrd as used on non-EFI platforms.
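
On this example system you can confirm that quickly, assuming the same layout as above with grub.cfg alongside grubx64.efi:

# grep -E 'linuxefi|initrdefi' /boot/efi/EFI/redhat/grub.cfg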

Once the kernel has been loaded and the file systems have been mounted, a typical system will mount the EFI System Partition as /boot/efi and you can explore your system. However, some of the Grub components will be outside the file systems and you won't find them in /boot or /boot/efi. On a GPT disk the boot block or master boot record is in sector #0, the GPT header is in sector #1, and the GPT partition entry arrays fill sectors #2-33. The file core.img is written into empty sectors between the end of the partition table and the beginning of the first partition.

The User Gets a Choice

The Grub boot loader can present a menu to the user, typically a choice of various kernels, or the menu can be hidden and require the user to realize that the <Escape> key must be pressed within a few seconds. The menu may time out after a specified number of seconds, or it may wait until the user makes a selection.

The user can also edit the selected boot option, which is useful for maintenance and rescue, for example specifying a boot to single-user mode to repair some problems.

Grub wasn't all that complicated in the beginning, but Grub 2 adds a lot of complexity. Compare typical configuration files to see the difference. The Linux kernel build process can modify the GRUB configuration automatically, but you will probably want to go into the file and make some modifications. Be careful!

The Kernel

On some hardware, the Alpha for example, the kernel file is simply a compressed copy of the compiled vmlinux kernel file.

On Intel/AMD platforms, the kernel file starts with a 512-byte boot block, then a secondary boot loader block, and then the compressed kernel image.

The boot loader does what is needed to extract and decompress the kernel into RAM and then turn control over to it.

Finding the File System

The boot loader, be it GRUB or the mainboard firmware, tells the kernel where to find its primary or root file system. The problem is that that file system might be stored in any of several different complicated ways, many of them requiring kernel modules that can't all be simultaneously compiled into the monolithic kernel core.

For example, the root file system might be on NFS, and storage devices might be connected through iSCSI or AoE or FCoE (that is, SCSI over IP, or ATA over Ethernet, or Fiber Channel over Ethernet), any of that requiring the network interface to be detected, enabled, and used in a certain way.

Or on a logical volume, which means that the underlying physical volumes and volume group must be detected.

Or, the root file system might be encrypted with the LUKS and dm-crypt components, which requires loading those kernel modules and asking the user to enter a passphrase.

It doesn't have to be as complicated as any of those: even ordinary SCSI controllers and SATA interfaces are based on a wide range of chip sets, each requiring one of several different kernel modules.

How can you load a kernel module (or device driver) used to interact with a device, when that module is stored on a disk controlled through the very device we're currently unable to interact with? You can't. The solution is...

Initial RAM Disk Image

The boot loader will tell the kernel how to find an initial RAM disk image stored in the boot file system.

The traditional method has been to build an initrd file. This is a file system containing those kernel modules along with binaries, shared libraries, scripts, device-special files, and everything else needed to get the system to the point it can find and use the real root file system. That resulting file system has been archived into a single file with cpio and then compressed with gzip.

In the initrd design, this image is uncompressed and extracted into /dev/ram, which is then mounted as a file system. It once executed the script /linuxrc in that file system; now it executes the script /init.

In the initramfs design, the dracut tool is used to create the file (which may still be called initrd-release) while including its own framework from /usr/lib/dracut or similar.

The initrd design uses a script, linuxrc, while dracut uses a binary. The goal is to speed this part of the boot process, to quickly detect the root file system's device(s) and transition to the real root file system.

In either case, that /init script will find and mount the real file system, so the real init program can be run.
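
Here is a minimal sketch of what that /init has to accomplish. This is an illustration only, with a hypothetical /dev/sda4 root device; a real dracut-generated script does far more (udev, module loading, LVM and LUKS assembly, error handling):

#!/bin/sh
# Pseudo-filesystems needed before anything else will work.
mount -t proc proc /proc
mount -t sysfs sysfs /sys
mount -t devtmpfs devtmpfs /dev
# The real script parses root= from /proc/cmdline; /dev/sda4 here
# is just a hypothetical root device.
mount -o ro /dev/sda4 /sysroot
# Hand PID 1 over to the real init on the real root file system.
exec switch_root /sysroot /sbin/init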

Exploring Your Initial RAM Disk Image

For a listing, simply try this. Change release to the appropriate value:

# lsinitrd /boot/initrd-release

To extract and explore a copy, do this:

# mkdir /tmp/explore
# cd /tmp/explore
# zcat /boot/initrd-release | cpio -id
# ls -l
# tree | less

You will find that there is a /init file, but it is a shell script. Directories /bin and /sbin contain needed programs, while /lib and /lib64 contain shared libraries. Three devices are in /dev: console, kmsg, and null. Some configuration is in /etc; mount points exist for /proc, /sys, and /sysroot; and a /tmp directory is available.

Finding the Actual Root File System

But out of the multiple file systems available on multiple devices, combinations of devices, and network abstractions, how does the kernel know which is to be mounted first as the real /, the real root file system?

Well, how do you want to accomplish that?

Some root= directive is passed to the kernel by the boot loader. This can be the device name:
root=/dev/sda4
or a label embedded in the file system header:
root=LABEL=/
or the manufacturer and model name and its partition as detected by the kernel and listed in /dev/disk/by-id:
root=/dev/disk/by-id/ata-WDC_WD20EARS-00MVWB0_WD-WMAZA3949022-part1
or the PCI bus address and path through the controller:
root=/dev/disk/by-path/pci-0000:00:02.1-usb-0:1:1.0-scsi-0:0:0:0-part1
or a UUID embedded in the file system header:
root=UUID=12e71ecd-833d-45ea-adfd-1eca8c27d912
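
To see the labels and UUIDs available on your own system (the /dev/sda4 here is just the example device from above):

# blkid /dev/sda4
# lsblk -o NAME,LABEL,UUID,MOUNTPOINT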

Of these choices, the UUID is generally the most robust: device names and bus paths can change as hardware is added, removed, or reordered, while the UUID travels with the file system itself. That is why installers typically write root=UUID=... entries into the boot loader configuration.

Continuing Critiques of the Boot Sequence

This boot sequence is overly complicated. The system passes through three environments with similar functionality before the real system is up: EFI, GRUB, and Dracut. All of them involve loading device drivers. EFI and Dracut have a small shell and scripting, and GRUB has a complex editable menu.

The 3.3 kernel added an EFI boot stub, which lets the UEFI firmware load the kernel directly, without a separate boot loader.

The Kernel Found the Root File System, Now What?

Once the kernel has started and it has discovered enough hardware to mount the root file system, it searches for a master user-space program which will control the state of the operating system itself and also manage the processes running on the operating system, both system services and user processes.

This master process is the init program, or at least we will find that the kernel expects that to be the case. The init program runs the appropriate set of boot scripts, based on its own configuration and that of the collection of boot scripts. It is then the ancestor of all other processes, cleaning up and freeing their resources if their parent process has lost track of them ("reaping zombie processes" in the UNIX parlance).
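
You can see that ancestry with pstree (in the psmisc package on most distributions), which shows init, or systemd on newer systems, as the ancestor of everything else:

$ pstree -p | less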

The kernel and init

The Linux kernel starts the init program. The kernel source code contains a block of code in init/main.c that looks for the init program in the appropriate place, the /sbin directory. If it isn't there, the kernel then tries two other locations before falling back to starting a shell. If it can't even do that, then the boot loader seems to have misinformed the kernel about where the root file system really is. Maybe it specified the wrong device, or maybe the root file system is corrupted or simply missing a vital piece.

static int run_init_process(const char *init_filename)
{
        argv_init[0] = init_filename;
        return do_execve(init_filename,
                (const char __user *const __user *)argv_init,
                (const char __user *const __user *)envp_init);
}

[ ... ]

if (!run_init_process("/sbin/init") ||
    !run_init_process("/etc/init") ||
    !run_init_process("/bin/init") ||
    !run_init_process("/bin/sh"))
        return 0;

panic("No init found.  Try passing init= option to kernel. "
      "See Linux Documentation/init.txt for guidance.");

The kernel does all the kernel-space work: interacting directly with hardware, managing running processes by allocating memory and CPU time, and enforcing access control through ownership and permissions.

init handles the user-space work, at least initially. It runs a number of scripts sometimes called boot scripts as they handle the user-space part of starting the operating system environment, and sometimes called init scripts because they're run by init.

The first of these boot scripts has traditionally been /etc/rc.sysinit. It does basic system initialization, checking the root file system with fsck if required, checking and mounting the other file systems, and loading any needed kernel modules along the way.

Other boot scripts usually start both local and remote authentication interfaces. This includes the local command-line interface with mingetty and possibly a graphical login on non-servers with some X-based display manager; plus remote access with the SSH daemon.

Like all the other boot scripts, those starting the user authentication interfaces run as root, so that user authentication and the subsequent calls to setuid() and setgid() can succeed. User-owned processes then do the work within each session.

Linux kernel, init process, boot scripts, and user processes.

Boot scripts can start services that more safely run with lower privileges. For example, the Apache web server must start as root so it can open TCP port 80. But it then drops privileges through the setuid() and setgid() system calls and continues running as an unprivileged user named apache or httpd or similar.
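
You can see the result on a running web server. A hedged example, assuming the Red Hat style httpd process name (Debian and Ubuntu call it apache2); expect one root-owned parent process and several unprivileged worker processes:

$ ps -o user,pid,comm -C httpd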

The components drawn in red can directly access the hardware: the firmware, the boot loader, and the kernel.

Those drawn in black are owned by root: the init process, the boot scripts, and the daemons they spawn. These should be persistent, running until there is a change in system state, possibly shutting down to halt or reboot.

Those drawn in green are owned by unprivileged users: the user's login session and its processes, and possibly some daemonized non-root services like a web server. User processes run until they finish or are terminated.

Where Are the Scripts?

This can be confusing, as the scripts may be directly in the directory /etc/ or maybe in its subdirectory /etc/rc.d/. This is made more confusing by the presence of symbolic links which mean that — most of the time, anyway — either naming convention works. You frequently find something like this:

$ ls -lFd /etc/init.d /etc/rc*
lrwxrwxrwx  1 root root   11 Feb  3 16:47 /etc/init.d -> rc.d/init.d/
drwxr-xr-x 11 root root 4096 Feb  9 16:51 /etc/rc.d/
lrwxr-xr-x 11 root root   13 Feb  9 16:48 /etc/rc.local -> rc.d/rc.local
lrwxr-xr-x 11 root root   15 Feb  9 16:48 /etc/rc.sysinit -> rc.d/rc.sysinit
lrwxrwxrwx  1 root root   10 Feb  3 16:58 /etc/rc0.d -> rc.d/rc0.d/
lrwxrwxrwx  1 root root   10 Feb  3 16:58 /etc/rc1.d -> rc.d/rc1.d/
lrwxrwxrwx  1 root root   10 Feb  3 16:58 /etc/rc2.d -> rc.d/rc2.d/
lrwxrwxrwx  1 root root   10 Feb  3 16:58 /etc/rc3.d -> rc.d/rc3.d/
lrwxrwxrwx  1 root root   10 Feb  3 16:58 /etc/rc4.d -> rc.d/rc4.d/
lrwxrwxrwx  1 root root   10 Feb  3 16:58 /etc/rc5.d -> rc.d/rc5.d/
lrwxrwxrwx  1 root root   10 Feb  3 16:58 /etc/rc6.d -> rc.d/rc6.d/
lrwxrwxrwx  1 root root   10 Feb  3 16:58 /etc/rcS.d -> rc.d/rcS.d/
$  ls -lF /etc/rc.d/
total 64
drwxr-xr-x 2 root root  4096 Feb  3 21:52 init.d/
-rwxr-xr-x 1 root root   220 Feb  3 16:51 rc.local*
-rwxr-xr-x 1 root root 10707 Feb  3 16:52 rc.sysinit*
drwxr-xr-x 2 root root  4096 Feb  9 16:25 rc0.d/
drwxr-xr-x 2 root root  4096 Feb  9 16:25 rc1.d/
drwxr-xr-x 2 root root  4096 Feb  9 16:25 rc2.d/
drwxr-xr-x 2 root root  4096 Feb  9 16:25 rc3.d/
drwxr-xr-x 2 root root  4096 Feb  9 16:25 rc4.d/
drwxr-xr-x 2 root root  4096 Feb  9 16:25 rc5.d/
drwxr-xr-x 2 root root  4096 Feb  9 16:25 rc6.d/
drwxr-xr-x 2 root root  4096 Feb  9 16:25 rc7.d/
lrwxrwxrwx 1 root root     5 Feb  3 16:47 rcS.d -> rc1.d/

Development of init through several major versions

BSD Unix uses a very simple init scheme: one master boot script (which calls other scripts to start services), configured by one file specifying which services to enable, plus one additional script optionally adding other boot-time tasks.

System V Unix added the concept of run levels, multiple target states for the running system. Each is defined as a collection of started/stopped states for services.

Upstart is a modification of the SysV method. The first thing the administrator notices is that init configuration has changed from a single file to a collection of files. More significantly in the long run, Upstart adds support for dependencies between service components, automatically restarting a crashed service, and the ability for events to trigger starting or stopping services.

systemd is the biggest change yet. A system can boot or otherwise transition to multiple simultaneous targets. Aggressive parallelization yields fast state transitions and a very flat process tree. The dependency support promised in Upstart is delivered in systemd.

We'll look at these one at a time.

BSD-style /etc/rc

The OpenBSD kernel also starts init. Here is a block of code from /usr/src/sys/kern/init_main.c showing that init must be in an obvious place.

/*
 * List of paths to try when searching for "init".
 */
static char *initpaths[] = {
	"/sbin/init",
	"/sbin/oinit",
	"/sbin/init.bak",
	NULL,
};

The BSD init uses a simple boot script configuration that was once used in some Linux distributions such as Slackware. But no mainstream Linux distribution does it this way now.

The BSD style init program brings up the system by running the /etc/rc script. That's it — rc uses a few other scripts, but it's a simple and efficient design.

The configuration script /etc/rc.conf sets a number of standard parameters for available services. You then modify /etc/rc.conf.local to turn on services and adjust their parameters on your system. rc reads rc.conf and then rc.conf.local. For example, rc.conf says not to run the Network Time Protocol daemon or a web server by default; it contains these lines:

ntpd_flags=NO    # for normal use: ""
httpd_flags=NO   # for normal use: ""

But then you might customize your system with these changes, in rc.conf.local, turning on both NTP and HTTP and disabling the chroot() capability of Apache:

ntpd_flags=""
httpd_flags="-u"

The individual services are started by scripts in /etc/rc.d/* called by rc.
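
On reasonably recent OpenBSD releases you can also run those scripts by hand to check or restart a single service; a hedged example, assuming ntpd was enabled as above:

# /etc/rc.d/ntpd check
# /etc/rc.d/ntpd restart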

Then almost at the very end of the master boot script rc, it calls /etc/rc.local. The only things done after that are starting some hardware monitoring daemons, the cron daemon, and possibly the simple X display manager xdm if you asked for that during the installation. This gives you a place to add some customization. My rc.local contains this:

#       $OpenBSD: rc.local,v 1.44 2011/04/22 06:08:14 ajacoutot Exp $

# Site-specific startup actions, daemons, and other things which
# can be done AFTER your system goes into securemode.  For actions
# which should be done BEFORE your system has gone into securemode
# please see /etc/rc.securelevel.
## ADDED BELOW HERE #################################################
echo "Starting KDM"
( sleep 5 ; /usr/local/bin/kdm ) &
echo "Saving kernel ring buffer in /var/log/dmesg"
dmesg > /var/log/dmesg
echo "Starting smartd to monitor drives"
/usr/local/sbin/smartd
echo "Unmuting audio"
audioctl output_muted=0

Reboot the system with shutdown -r or simply reboot.

Halt and turn off the power with shutdown -h or simply halt -p.

SysV-style init

This is more complex than the BSD method. It is based on the concept of numbered run levels. Linux uses the definitions in this table. Solaris and other Unix-family operating systems use something very similar.

Level   Purpose, most Linux                              Purpose, some Linux
0       Shut down and power off                          Shut down and power off
1       Single-user mode                                 Single-user mode
2       Multi-user console login, no networking          Multi-user console login, networking enabled
3       Multi-user console login, networking enabled     Multi-user graphical login, networking enabled
4       not used                                         not used
5       Multi-user graphical login, networking enabled   not used
6       Shut down and reboot                             Shut down and reboot

Red Hat and therefore many other distributions used this SysV-style init roughly from the late 1990s through the late 2000s.

/etc/inittab

The SysV init program reads its configuration file /etc/inittab to see what to do by default and how to do that. CentOS 5 used the inittab shown here.

#
# inittab       This file describes how the INIT process should set up
#               the system in a certain run-level.
#
# Author:       Miquel van Smoorenburg, <miquels@drinkel.nl.mugnet.org>
#               Modified for RHS Linux by Marc Ewing and Donnie Barnes
#

# Default runlevel. The runlevels used by RHS are:
#   0 - halt (Do NOT set initdefault to this)
#   1 - Single user mode
#   2 - Multiuser, without NFS (The same as 3, if you do not have networking)
#   3 - Full multiuser mode
#   4 - unused
#   5 - X11
#   6 - reboot (Do NOT set initdefault to this)
# 
id:5:initdefault:

# System initialization.
si::sysinit:/etc/rc.d/rc.sysinit

l0:0:wait:/etc/rc.d/rc 0
l1:1:wait:/etc/rc.d/rc 1
l2:2:wait:/etc/rc.d/rc 2
l3:3:wait:/etc/rc.d/rc 3
l4:4:wait:/etc/rc.d/rc 4
l5:5:wait:/etc/rc.d/rc 5
l6:6:wait:/etc/rc.d/rc 6

# Trap CTRL-ALT-DELETE
ca::ctrlaltdel:/sbin/shutdown -t3 -r now

# When our UPS tells us power has failed, assume we have a few minutes
# of power left.  Schedule a shutdown for 2 minutes from now.
# This does, of course, assume you have powerd installed and your
# UPS connected and working correctly.  
pf::powerfail:/sbin/shutdown -f -h +2 "Power Failure; System Shutting Down"

# If power was restored before the shutdown kicked in, cancel it.
pr:12345:powerokwait:/sbin/shutdown -c "Power Restored; Shutdown Canceled"

# Run gettys in standard runlevels
1:2345:respawn:/sbin/mingetty tty1
2:2345:respawn:/sbin/mingetty tty2
3:2345:respawn:/sbin/mingetty tty3
4:2345:respawn:/sbin/mingetty tty4
5:2345:respawn:/sbin/mingetty tty5
6:2345:respawn:/sbin/mingetty tty6

# Run xdm in runlevel 5
x:5:respawn:/etc/X11/prefdm -nodaemon

The default run level is 5, graphical desktop. If you built a server, this would be 3 instead.

The first boot script to be run is /etc/rc.d/rc.sysinit. It does a number of initialization tasks. Most importantly, it re-mounts the root file system in read/write mode and finds and checks the other file systems.

Then, to get into run level 5, init handles the inittab lines with 5 in the second field. That means that the second task is to run the script /etc/rc.d/rc with a parameter of 5. More on the details of this in a moment...

If a shutdown had been scheduled because of a power failure and the power has since been restored, that shutdown is canceled.

It then starts six /sbin/mingetty processes, one each on TTY devices tty1 through tty6. The key combinations <Ctrl><Alt><F1> through <Ctrl><Alt><F7> switch you between these six text virtual consoles plus X if it's running.

Finally, it runs the script /etc/X11/prefdm which tries to determine which display manager is probably the preferred one and then starts it.

Along the way it specified that a detected power failure event schedules a shutdown in two minutes, and that the text console keyboard event <Ctrl><Alt><Del> causes an immediate reboot.

Boot script directories and changing run levels

The boot scripts themselves are stored in /etc/rc.d/init.d/ and can be thought of as a collection of available tools.

$ ls /etc/rc.d/rc5.d
K15httpd       S00microcode_ctl    S22messagebus    S80sendmail
K20nfs         S04readahead_early  S25netfs         S85denyhosts
K28amd         S06cpuspeed         S26acpid         S90crond
K50netconsole  S08arptables_jf     S26lm_sensors    S90xfs
K65kadmin      S10network          S26lvm2-monitor  S91freenx-server
K65kprop       S10restorecond      S28autofs        S95anacron
K65krb524      S12syslog           S50hplip         S95atd
K65krb5kdc     S13irqbalance       S55cups          S96readahead_later
K69rpcsvcgssd  S13mcstrans         S55sshd          S98haldaemon
K74nscd        S13portmap          S56rawdevices    S99firewall
K80kdump       S14nfslock          S56xinetd        S99local
K87multipathd  S18rpcidmapd        S58ntpd          S99smartd
K89netplugd    S19rpcgssd          S61clamd 

You can manually stop, start, or restart a service by running its boot script with a parameter of stop or start or restart. Most of the boot scripts also support checking on the service's current state with status, and some support reload to keep it running but re-read its configuration file to change some details of how it's running.

Try running a boot script with no parameter at all. That usually provides a short message explaining that a parameter is needed and then listing the possible parameters.

The directories /etc/rc.d/rc0.d through /etc/rc.d/rc6.d specify what to stop and start to get into the corresponding run levels. Each of those is populated with symbolic links pointing to the actual scripts in /etc/rc.d/init.d/*, with the exception of S99local which points to /etc/rc.d/rc.local.

For example, /etc/rc.d/rc5.d contains S10network, a symbolic link pointing to /etc/rc.d/init.d/network.

The logic is that rc goes through the list of links in the target directory, first stopping (killing) those with link names beginning with "K" in numerical order, and then starting those with link names beginning with "S" in numerical order. The network script sets up IPv4/IPv6 networking, and so it is started in run level 5 before those network services that rely on it. Similarly, when going to run level 0 or 6, those services are stopped before turning off IP networking.

There's more to it than just that — if the system was already running, and if a service is to be in the same state in both the current and target run level, then it isn't stopped. For example, if you booted a system to run level 3, in which networking and network services are started, and then you changed to run level 5, the only thing that will happen is that the graphical display manager will be started. It won't shut down all services and IP networking and then start them back up again.

To change from one run level to another, run the init command with a parameter of the target run level.

Use runlevel to see the previous and current run levels, where N means "none": you booted the system directly into the current run level.
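
For example, to check where you are, drop from the graphical run level back to text-only multi-user mode, and then return:

# runlevel
# init 3
# init 5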

Reboot the system with init 6 or shutdown -r or simply reboot.

Halt and turn off the power with init 0 or shutdown -h or simply halt.

On a text console you can reboot with <Ctrl><Alt><Del>, and on a graphical console that usually brings up a dialog in which both rebooting and shutting down are options. You can also click through the graphical menus to shut down or reboot from graphical mode.

Specifying how to get into a given run level

You could manually create the symbolic links, but you would have to think carefully about what numbers to assign to get everything into the correct order.

Don't do that, use chkconfig.

The chkconfig program is a little confusing because it is configured by specially formatted shell script comments within the boot scripts themselves. Let's look at an example:

$ head /etc/rc.d/init.d/network
#! /bin/bash
#
# network       Bring up/down networking
#
# chkconfig: 2345 10 90
# description: Activates/Deactivates all network interfaces configured to \
#              start at boot time.

This specifies that if you want this service to be used (and you probably do, this sets up basic IP networking!), then it should be started in run levels 2, 3, 4, and 5, started as S10, fairly early. That leaves it to be turned off in run levels 0, 1, and 6, stopped (killed) as K90, fairly late.

Let's experiment with chkconfig:

$ su
password:
# chkconfig --add network
# chkconfig --list network
network         0:off   1:off   2:on    3:on    4:on    5:on    6:off
# ls /etc/rc.d/rc?.d/*network
/etc/rc.d/rc0.d/K90network  /etc/rc.d/rc4.d/S10network
/etc/rc.d/rc1.d/K90network  /etc/rc.d/rc5.d/S10network
/etc/rc.d/rc2.d/S10network  /etc/rc.d/rc6.d/K90network
/etc/rc.d/rc3.d/S10network
# chkconfig network off
# chkconfig --list network
network         0:off   1:off   2:off   3:off   4:off   5:off   6:off
# ls /etc/rc.d/rc?.d/*network
/etc/rc.d/rc0.d/K90network  /etc/rc.d/rc4.d/K90network
/etc/rc.d/rc1.d/K90network  /etc/rc.d/rc5.d/K90network
/etc/rc.d/rc2.d/K90network  /etc/rc.d/rc6.d/K90network
/etc/rc.d/rc3.d/K90network
# chkconfig network on
# chkconfig --list network
network         0:off   1:off   2:on    3:on    4:on    5:on    6:off

We added it to chkconfig management (probably not needed, it was already registered) and then checked its state in various run levels. We also listed the symbolic links showing that it's started as S10 and stopped (killed) as K90.

Then we turned off the service and tested what that did.

Finally, we turned it back on and made sure that worked.

Control now versus in the future

Remember that you can do two very different things, and often you should do both of them:

Start or stop the service right now by running its boot script with a parameter of start or stop.

Have it automatically started (or not) after future reboots by running chkconfig with the service name and on (or off).
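
For example, to have the Apache web server running right now and after every future boot, assuming the Red Hat style paths and service name used earlier:

# /etc/rc.d/init.d/httpd start
# chkconfig httpd on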

Upstart init

Upstart is an event-driven replacement or re-design for init. It was meant to be analogous to the Service Management Facility in Solaris, with services started and stopped by events. These events might be kernel detection of hardware, or they might be caused by other services. This includes the crash of a service automatically leading to its being restarted. It was developed by Canonical for Ubuntu, but it came to be used in many distributions, including RHEL 6 and therefore CentOS and, less directly, Oracle Linux and Scientific Linux.

Upstart is different from SysV init, but the differences are very small for the typical administrator. Instead of a large /etc/inittab specifying several things, now that file has just one line specifying the default target run level, initdefault.

Instead of one configuration file, Upstart uses the collection of significantly-named files in /etc/init/.

/etc/init/rcS.conf specifies how to start the system. It does this in a very familiar way, by running /etc/rc.d/rc.sysinit followed by /etc/rc.d/rc with the single parameter of the target run level. That is, as long as the system wasn't booted into rescue or emergency mode, in which case it runs /sbin/sulogin to make sure it really is the administrator at the keyboard and not someone doing a simple console break-in, and then drops to a shell.

The text consoles are started by /etc/init/start-ttys.conf.

If you go to run level 5, /etc/init/prefdm.conf starts the graphical display manager.

If you passed a serial console parameter such as console=ttyS0 to the kernel at boot time, /etc/init/serial.conf sets up a getty on that serial console line.

If you press <Ctrl><Alt><Del> on a text console, /etc/init/control-alt-delete.conf handles the task of rebooting.

Debian and Ubuntu Have Been A Little Different

Ubuntu has since moved to systemd, but before that it and Debian used different script logic and they configured services with sysv-rc-conf instead of chkconfig.

Instead of /etc/rc.d/rc.sysinit, Debian and Ubuntu run /etc/init.d/rcS. That in turn runs every script /etc/rcS.d/S* in order.

Commands sysv-rc-conf and update-rc.d are used instead of chkconfig. It is probably easiest to see these by example.

See the run levels in which a given service is started:

Debian / Ubuntu
 # sysv-rc-conf --list
 # sysv-rc-conf --list apache

Most other Linux distributions
 # chkconfig --list
 # chkconfig --list httpd

BSD
 # more /etc/rc.conf /etc/rc.conf.local

Solaris
 # svcs
 # svcs | grep httpd

Add/enable one service and delete/disable another after future boots:

Debian / Ubuntu
 # sysv-rc-conf apache on
 # sysv-rc-conf finger off

Most other Linux distributions
 # chkconfig httpd on
 # chkconfig finger off

BSD
 # vi /etc/rc.conf.local

Solaris
 # svcadm enable network/httpd
 # svcadm disable network/finger

For all Linux distributions we have been able to stop, start, restart, and sometimes take other actions simply by running the associated script with an appropriate parameter:

# /etc/init.d/httpd status
# /etc/init.d/httpd restart
# /etc/init.d/named reload
# /etc/init.d/named status

However, all of that and much more changes with...

systemd

This is really different from what has come before. Lennart Poettering, the systemd author, provides a description of the systemd design goals and philosophy and then adds a later comparison of features. Also see the official systemd page at freedesktop.org.

It became standard in Fedora with Fedora 15 in 2011, and was the default in Mageia at least by early 2013. By the end of 2013 a RHEL 7 beta release had appeared and it used systemd. By early 2014, Mark Shuttleworth had announced that Ubuntu would also transition to systemd.

Systemd uses many good ideas from Apple's launchd, introduced with macOS 10.4 and now also part of iOS.

However, systemd has its critics! See the boycott systemd page for a set of critiques, and see The World After Systemd for a project already planning for its demise.

To summarize the design:

Systemd Design Philosophy

Start only what's needed

It doesn't make sense to start the CUPS print service while everything else is trying to start. We're booting now, we'll print later. Start it on demand, when someone wants to print.

Similarly, for hardware-specific services like Bluetooth, only start those services when hardware has been detected and some process requests communication with it.

Start some daemons on demand.

For what you do start, aggressively parallelize it

Traditional SysV init required a long sequence of individual service starts. Several early processes were needed by many other services; the early ones had to fully start and their boot scripts successfully terminate before the later ones could begin.

Notice the traditional use of boot scripts. Shell scripts are very good for rapid development, but they don't run fast. A shell process has to be created to run the script, and then nearly everything the script does requires the creation of further processes. This is made worse by the typical nesting of a boot script calling its helper script, which in turn calls a number of configuration scripts.

Recode boot scripts in C, use binary executables as much as possible.

The CPU is the fastest component in the system; the disks are the slowest. The CPU must sit idle for many potentially useful cycles waiting for disk I/O. And saying "the CPU" is a little old-fashioned: most systems have multiple CPU cores, and we want to use them all in parallel. We want to aggressively parallelize the startup programs, but we don't want to coordinate their actions by monitoring the file system.

Systemd can create sockets, then pass those sockets to daemon processes as they are started. This can be simplified and sped up by creating all needed sockets at once, and then starting all daemon processes at once. Get them all started and let them communicate among themselves as they come up. The sockets are maintained by systemd so if a daemon crashes, systemd restarts that daemon and programs that were communicating with the old daemon are still connected but now to the replacement.

Aggressively parallelize the startup by starting all daemons simultaneously and using sockets for inter-process communication to handle inter-service order dependencies.
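
If your systemctl is recent enough, it can list the sockets systemd is holding open and the service each one will activate:

$ systemctl list-sockets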

There is more to it. Control Groups or cgroups are used to group related processes into a hierarchy of process groups, providing a way to monitor and control all processes of a group, including limiting and isolating their resource usage. When you stop the service, it will stop all the related processes.
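
You can browse that hierarchy with systemd-cgls, and systemctl status shows the control group, and therefore every process, belonging to one service (sshd.service here is just an example, assuming OpenSSH is installed):

$ systemd-cgls
$ systemctl status sshd.service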

Automounting can be used for all file systems other than the root file system, supporting encryption with LUKS, NFS and other network-based storage, LVM and RAID.

When Only-As-Needed Meets Parallelization

Let's say you boot a desktop system and it goes to its default graphical boot target. Log in, and see if you have any of the getty family of programs running:

$ pgrep getty
$ ps axuww | egrep 'PID|getty'
USER       PID %CPU %MEM    VSZ   RSS TTY    STAT START  TIME COMMAND
root     26717  0.0  0.0 108052  2020 pts/0  R+   17:36  0:00 egrep --color=auto PID|getty 

Probably not. But you think that's odd: doesn't the login program use something like mingetty or agetty to handle command-line authentication on a text-only console? Let's check whether those text consoles are really there with Ctrl-Alt-F2 and Ctrl-Alt-F3, then go back to X with Ctrl-Alt-F1 (or F7, or F2, depending on the sequence of events when the system started).

Yes, there were text login prompts waiting on those virtual consoles. Well, no, they weren't waiting; each was only started when you first switched to that virtual console. They're there now:

$ pgrep getty
22698
26727
$ ps axuww | egrep 'PID|getty'
USER    PID %CPU %MEM    VSZ   RSS TTY    STAT START  TIME COMMAND
root  22698  0.0  0.0 110012  1712 tty2   Ss+  16:19  0:00 /sbin/agetty --noclear tty2
root  26727  0.0  0.0 110012  1640 tty3   Ss+  17:37  0:00 /sbin/agetty --noclear tty3
root  26717  0.0  0.0 108052  2020 pts/0  R+   17:36  0:00 egrep --color=auto PID|getty 

See Lennart Poettering's description for more details.

Location and Components

It gets weird here: there no longer is a real /sbin/init program! You must either set up symbolic links, as seen here, or else modify the boot loader to pass this option to the kernel:
init=/lib/systemd/systemd

$ ls -l /usr/sbin/init /usr/bin/systemd /lib/systemd/systemd
-rwxr-xr-x 1 root root 929520 Sep 22 12:26 /lib/systemd/systemd*
lrwxrwxrwx 1 root root     22 Oct  6 01:37 /usr/bin/systemd -> ../lib/systemd/systemd*
lrwxrwxrwx 1 root root     22 Oct  6 01:37 /usr/sbin/init -> ../lib/systemd/systemd* 

Notice how some components are under /usr, part of a general Linux trend of crucial components moving under /usr, making it impractical for that to be a separate file system as it frequently was in UNIX tradition. Beware that modern Linux systems typically have no real /bin, /lib, /lib64, or /sbin; those are all symbolic links pointing to directories under /usr, so /usr must be part of the root file system.

% ls -ld /bin /lib* /sbin
lrwxrwxrwx 1 root root 7 Feb  3 16:45 /bin -> usr/bin/
lrwxrwxrwx 1 root root 7 Feb  3 16:45 /lib -> usr/lib/
lrwxrwxrwx 1 root root 9 Feb  3 16:45 /lib64 -> usr/lib64/
lrwxrwxrwx 1 root root 8 Feb  3 16:45 /sbin -> usr/sbin/

Systemd binaries are located in /lib/systemd/systemd-*, with optional distribution-specific scripts in the same directory.

The interesting parts are the task unit configuration files, all of them under /lib/systemd/system/.

Units

Booting tasks are organized into units: initializing hardware, mounting file systems, creating sockets, and starting services that will daemonize and run in the background. Each of these task units is configured by a simple file holding configuration information; these are sources of information, not scripts to be run. Their syntax is similar to things like kdmrc, the KDE display manager configuration file, and therefore similar to Windows *.ini files. For example, here is the named.service file, specifying when and how to start the BIND DNS service:

[Unit]
Description=Berkeley Internet Name Domain (DNS)
Wants=nss-lookup.target
Before=nss-lookup.target
After=network.target

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/named
Environment=KRB5_KTNAME=/etc/named.keytab
PIDFile=/var/lib/named/var/run/named/named.pid

ExecStartPre=/usr/sbin/setup-named-chroot.sh /var/lib/named on
ExecStartPre=/usr/sbin/named-checkconf -t /var/lib/named -z /etc/named.conf
ExecStart=/usr/sbin/named -u named -t /var/lib/named $OPTIONS

ExecReload=/bin/sh -c '/usr/sbin/rndc reload > /dev/null 2>&1 || /bin/kill -HUP $MAINPID'

ExecStop=/bin/sh -c '/usr/sbin/rndc stop > /dev/null 2>&1 || /bin/kill -TERM $MAINPID'
ExecStopPost=/usr/sbin/setup-named-chroot.sh /var/lib/named off

PrivateTmp=false
TimeoutSec=25

[Install]
WantedBy=multi-user.target 

Unit Types

The file name indicates the type of that unit.

*.mount files specify when and how to mount and unmount file systems; *.automount files are for storage handled by the automounter.

*.service files handle services that in the past were typically handled by scripts in /etc/rc.d/init.d/.

*.socket files create sockets that will be used by the associated service units.

*.path files allow systemd to monitor the specified files and directories through inotify; access in that path causes a service start.

The CUPS printing service provides a simple example. systemd watches for the appearance of a file named /var/spool/cups/d*, which is what happens when you submit a print job.

The interesting difference from the old design is that there is no print service running until you submit a print job. Once started it persists, with both it and systemd monitoring the socket. The first time you submit a print job, systemd starts the service and logs a message: "systemd[1]: Started CUPS Printing Service." When you submit later jobs, typically no new start is needed because the daemon is still running.
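
You can watch this from the systemd side by asking about the path, socket, and service units all at once:

$ systemctl status cups.path cups.socket cups.service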

*.target files define groups of units. These are analogous to the run levels we saw in SysV and Upstart, but you can have arbitrarily many of arbitrary complexity. (Actually that was true with SysV and Upstart but hardly anyone did such a thing.)

$ cd /lib/systemd/system
$ more cups.*
::::::::::::::
cups.path
::::::::::::::
[Unit]
Description=CUPS Printer Service Spool

[Path]
PathExistsGlob=/var/spool/cups/d*

[Install]
WantedBy=multi-user.target
::::::::::::::
cups.service
::::::::::::::
[Unit]
Description=CUPS Printing Service

[Service]
ExecStart=/usr/sbin/cupsd -f
PrivateTmp=true

[Install]
Also=cups.socket cups.path
WantedBy=printer.target
::::::::::::::
cups.socket
::::::::::::::
[Unit]
Description=CUPS Printing Service Sockets

[Socket]
ListenStream=/var/run/cups/cups.sock

[Install]
WantedBy=sockets.target 

You can view the available targets with one command:

$ systemctl --type=target --all
UNIT                   LOAD   ACTIVE   SUB    JOB DESCRIPTION
basic.target           loaded active   active     Basic System
cryptsetup.target      loaded active   active     Encrypted Volumes
emergency.target       loaded inactive dead       Emergency Mode
final.target           loaded inactive dead       Final Step
getty.target           loaded active   active     Login Prompts
graphical.target       loaded active   active     Graphical Interface
local-fs-pre.target    loaded active   active     Local File Systems (Pre)
local-fs.target        loaded active   active     Local File Systems
multi-user.target      loaded active   active     Multi-User
network.target         loaded active   active     Network
nfs.target             loaded active   active     Network File System Client and
nss-lookup.target      loaded active   active     Host and Network Name Lookups
nss-user-lookup.target loaded inactive dead       User and Group Name Lookups
printer.target         loaded active   active     Printer
remote-fs-pre.target   loaded inactive dead       Remote File Systems (Pre)
remote-fs.target       loaded active   active     Remote File Systems
rescue.target          loaded inactive dead       Rescue Mode
rpcbind.target         loaded active   active     RPC Port Mapper
shutdown.target        loaded inactive dead       Shutdown
sockets.target         loaded active   active     Sockets
sound.target           loaded active   active     Sound Card
swap.target            loaded active   active     Swap
sysinit.target         loaded active   active     System Initialization
syslog.target          loaded active   active     Syslog
time-sync.target       loaded active   active     System Time Synchronized
umount.target          loaded inactive dead       Unmount All Filesystems

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
JOB    = Pending job for the unit.

26 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'. 

Directories named unitname.wants (multi-user.target.wants, for example) allow you to manually define dependencies between units. For example, while some network services can handle network interfaces that only appear after the service has started, the Apache web server needs to have networking up and running before it starts.
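
As a hedged sketch of that Apache example (the httpd.service name and the exact paths are assumptions, not taken from any particular distribution), making the web server "want" the network is just a matter of creating a symbolic link in its .wants directory. Note that this only adds the dependency; ordering would still need an After= line in the unit file or a drop-in:

# cd /etc/systemd/system
# mkdir -p httpd.service.wants
# ln -s /lib/systemd/system/network-online.target httpd.service.wants/
# systemctl daemon-reload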

Defining the Default Targets

/lib/systemd/system/default.target defines the default target at boot time. It is usually a symbolic link pointing to multi-user.target for a server or graphical.target for a workstation.

Note that /etc/systemd/system/default.target can also exist and point to a unit file. On my Mageia system, for example, that's a roundabout way of getting to the same target:
/etc/systemd/system/default.target -> /lib/systemd/system/runlevel5.target
/lib/systemd/system/runlevel5.target -> /etc/systemd/system/graphical.target

You can override this default by passing a parameter to the kernel at boot time; systemd will discover it in /proc/cmdline and use the specified target instead. For example:
systemd.unit=runlevel3.target
or:
systemd.unit=rescue.target

Note that the traditional parameters from SysV and Upstart (1, s, S, single, 3, and 5) can still be used. Systemd maps them to the corresponding runlevelX.target definitions.
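
These runlevelN.target files are normally just symbolic links to the real targets; you can see the mapping on your own system with:

$ ls -l /lib/systemd/system/runlevel?.target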

On my desktop system, the special target file default.target is a symbolic link pointing to graphical.target. Leaving out the standard initial comment block, it contains what we see here.

[Unit]
Description=Graphical Interface
Documentation=man:systemd.special(7)
Requires=multi-user.target
After=multi-user.target
Conflicts=rescue.target
Wants=display-manager.service
AllowIsolate=yes

[Install]
Alias=default.target 

Notice that it explicitly requires the multi-user.target unit, and it also pulls in, as "wants", any units linked in the subdirectory default.target.wants, although that directory is empty on my system.

The reason for making unitname.target.wants a directory of symbolic links is that you can easily add and delete the "wants" without modifying the unit definition file itself.
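
For example, enabling the named service shown earlier creates exactly one of those links, because its [Install] section says WantedBy=multi-user.target. The /lib path here matches the layout used elsewhere on this page; some distributions use /usr/lib instead:

# systemctl enable named

...which does roughly the equivalent of:

# ln -s /lib/systemd/system/named.service /etc/systemd/system/multi-user.target.wants/named.service

Disabling the service simply removes the link again.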

Going back one level to the multi-user target, multi-user.target has a requirement for basic.target. Here's the content of multi-user.target:

[Unit]
Description=Multi-User
Documentation=man:systemd.special(7)
Requires=basic.target
Conflicts=rescue.service rescue.target
After=basic.target rescue.service rescue.target
AllowIsolate=yes

[Install]
Alias=default.target 

The multi-user.target.wants directory contains files defining these additional requirements:

dbus.service, getty.target, plymouth-quit-wait.service, plymouth-quit.service, rpcbind.target, systemd-ask-password-wall.path, systemd-logind.service, systemd-user-sessions.service

Chasing it further back, basic.target contains what we see here, a requirement for the sysinit.target target:

[Unit]
Description=Basic System
Documentation=man:systemd.special(7)
Requires=sysinit.target sockets.target
After=sysinit.target sockets.target
RefuseManualStart=yes 

The basic.target.wants directory adds these requirements, restoring the sound service and applying any distribution-specific scripts:

alsa-restore.service, alsa-state.service, fedora-autorelabel-mark.service, fedora-autorelabel.service, fedora-configure.service, fedora-loadmodules.service, mandriva-everytime.service, mandriva-save-dmesg.service

And then sysinit.target contains what we see here:

[Unit]
Description=System Initialization
Documentation=man:systemd.special(7)
Conflicts=emergency.service emergency.target
Wants=local-fs.target swap.target
After=local-fs.target swap.target emergency.service emergency.target
RefuseManualStart=yes 

It has a larger list of added requirements in sysinit.target.wants:

cryptsetup.target, dev-hugepages.mount, dev-mqueue.mount, kmod-static-nodes.service, mandriva-kmsg-loglevel.service, plymouth-read-write.service, plymouth-start.service, proc-sys-fs-binfmt_misc.automount, sys-fs-fuse-connections.mount, sys-kernel-config.mount, sys-kernel-debug.mount, systemd-ask-password-console.path, systemd-binfmt.service, systemd-journal-flush.service, systemd-journald.service, systemd-modules-load.service, systemd-random-seed.service, systemd-sysctl.service, systemd-tmpfiles-setup-dev.service, systemd-tmpfiles-setup.service, systemd-udev-trigger.service, systemd-udevd.service, systemd-update-utmp.service, systemd-vconsole-setup.service

Examining and Controlling System State With systemctl

List all active units (that is, units that systemd has loaded and that should either have run successfully or still be running), showing their current status and paging through the results:

# systemctl list-units

List all target units, showing the collective targets reached in the current system state. This is broader than simply "the current run level" as shown by the runlevel command:

# systemctl list-units --type=target

List just those active units which have failed:

# systemctl --failed

List the units listening on sockets:

# systemctl list-sockets

List all available units, showing whether they are enabled or not:

# systemctl list-unit-files

Display the dependency tree for a service. Service names are something like named.service but they can be abbreviated by leaving off .service.

# systemctl list-dependencies named

Generate the full dependency graph for all services. This will be enormous and not very useful to most people. View it with Chrome or similar.

# systemd-analyze dot | dot -Tsvg > systemd.svg

Start, stop, restart, reload the configuration, and report the status of one or more services. These are like the corresponding /etc/init.d/* boot scripts, with the addition of inter-process communication and automated dependency satisfaction. Use show for far more information on that service.

You will notice that the first time you check the status for a service it will probably take a noticeable amount of time. This is because it is checking the journal, another powerful but complex addition that comes with systemd. More on that below...

# systemctl stop named dhcpd
# systemctl start named dhcpd
# systemctl restart named
# systemctl reload named
# systemctl is-active named
# systemctl status named
# systemctl show named

Disable and enable a service for use in the future. These are like the corresponding chkconfig commands.

# systemctl disable named
# systemctl enable named

Check the current system configuration for the default target run state, then change it to newtarget.

# systemctl get-default
# systemctl set-default newtarget
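
For example, to make a system boot to a text console rather than a graphical login:

# systemctl set-default multi-user.target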

Make major changes in system state:

# systemctl reboot
# systemctl halt
# systemctl poweroff

What About /etc/rc.d/rc.local?

Here's a common question: How do I get /etc/rc.d/rc.local to work under systemd? Maybe you're like me, you have written your own iptables firewall script or some other locally developed programs you want to run at the end of the booting process.

Well, maybe it already works.

See the example /lib/systemd/system/rc-local.service systemd service file here.

A comment refers to /lib/systemd/system-generators/systemd-rc-local-generator, which is one of those fast-running binaries. All I have to do is create an executable script named /etc/rc.d/rc.local, and the next time the system boots, that script is run.

Otherwise, see if you have an rc-local.service unit and enable it if needed.

If you don't have an rc-local.service file, create one similar to what you see here and enable it:

systemctl enable rc-local.service

Maybe you want to tinker a little, use /etc/rc.local directly and leave out the rc.d subdirectory. Or use ConditionPathExists instead of ConditionFileIsExecutable. Have fun!

#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

# This unit gets pulled automatically into multi-user.target by
# systemd-rc-local-generator if /etc/rc.d/rc.local is executable.
[Unit]
Description=/etc/rc.d/rc.local Compatibility
ConditionFileIsExecutable=/etc/rc.d/rc.local
After=network.target

[Service]
Type=forking
ExecStart=/etc/rc.d/rc.local start
TimeoutSec=0
RemainAfterExit=yes
SysVStartPriority=99
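
And here is a minimal sketch of an /etc/rc.d/rc.local to go with it. The firewall script path is only a placeholder for whatever locally developed programs you want to run:

#!/bin/sh
# /etc/rc.d/rc.local -- must be executable for the generator to
# pull rc-local.service into the boot:  chmod 755 /etc/rc.d/rc.local
# Run a locally maintained firewall script, if one is installed:
[ -x /usr/local/sbin/firewall.sh ] && /usr/local/sbin/firewall.sh
exit 0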

What is left in /etc/rc.d?

Specifically, what about the directories /etc/rc.d/init.d/ and /etc/rc.d/rc?.d/ — do they still contain scripts and symbolic links?

Not much! But what remains does work. You can run the scripts in init.d and systemd will run the scripts in rc3.d or rc5.d when going to the multi-user or graphical target, respectively.

Writing your own service scripts

See these:
Writing systemd service files
How to write a startup script for systemd
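
For a quick start, here is a bare-bones unit file for a hypothetical locally written daemon; the name mydaemon and its path are made up for the example:

# /etc/systemd/system/mydaemon.service
[Unit]
Description=Local example daemon
After=network.target

[Service]
ExecStart=/usr/local/sbin/mydaemon --no-daemon
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then tell systemd about it and turn it on:

# systemctl daemon-reload
# systemctl enable mydaemon
# systemctl start mydaemon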

Simplified "Phrase Book" Comparison of SVR4 init, Upstart, and systemd

There's much more to it than this, but here's what an administrator sees day-to-day:

SVR4 init on CentOS/RHEL 5:
One file, /etc/inittab, configures the init program: what run level to enter by default and what it takes to get there. Other than starting multiple virtual consoles with text logins in run levels 3 and 5, and starting a graphical login in run level 5, it says to use the directory /etc/rc[0-6].d/ corresponding to the target run level. That directory contains symbolic links pointing to the collection of boot scripts in /etc/init.d/. Each link has the same name as the actual script, preceded by either K (to kill) or S (to start) and a two-digit number to impose order. You use the chkconfig program to enable or disable services; it reads specially coded comments in the block at the top of each boot script specifying in which run levels to start and stop the service and at what numerical order position. You directly run the boot script /etc/init.d/servicename to stop, start, or restart it right now.

Upstart on CentOS/RHEL 6:
Very similar to SVR4 init as far as configuration and operation goes. The exception is that /etc/inittab is now almost empty. Its functionality has been expanded and moved into the files /etc/sysconfig/init and /etc/init/*.

Systemd on CentOS/RHEL 7 and 8:
This is very different! Instead of run levels, in which only 1 (maintenance or rescue), 3 (text-only, server), and 5 (graphics, workstation) are useful, it uses "targets". The commonly used ones correspond to the traditional run levels 3 and 5, but you can boot or transition into any combination of the targets found in /lib/systemd/system/*.target. Only a few boot scripts remain in /etc/init.d/. You use the program systemctl to query the current overall system state, to query the state of individual services, to control a service right now, and to enable or disable it for the future.

Simplified "Phrase Book" of Equivalent Commands

Goal:
What run state are we in?
What services were started/stopped to get here, and with what order dependencies?
SVR4 init, Upstart
runlevel

ls /etc/rcN.d
systemd
systemctl get-default
systemctl
systemctl -a
systemctl list-dependencies
systemctl list-sockets
systemctl status crond sshd httpd ...
Goal:
What is the default run state if the system is simply rebooted?
SVR4 init, Upstart
grep initdefault /etc/inittab
systemd
systemctl get-default
Goal:
Change the default run state to newtarget.
SVR4 init, Upstart
vim /etc/inittab
systemd
systemctl set-default newtarget
Goal:
What services are available? Of the available services, which are enabled and disabled?
SVR4 init, Upstart
ls /etc/rc.d/init.d
chkconfig --list
systemd
systemctl list-unit-files
Goal:
Stop service xyz.
Start service xyz.
Stop and restart service xyz.
Signal service xyz to re-read its configuration file.
SVR4 init, Upstart
/etc/init.d/xyz stop
/etc/init.d/xyz start
/etc/init.d/xyz restart
/etc/init.d/xyz reload
systemd
systemctl stop xyz
systemctl start xyz
systemctl restart xyz
systemctl reload xyz
Goal:
Enable service xyz to automatically start at the next boot.

Disable service xyz to not automatically start at the next boot.
SVR4 init, Upstart
chkconfig --add xyz
chkconfig xyz on
chkconfig --levels 345 xyz on

chkconfig --del xyz
chkconfig xyz off
systemd
systemctl enable xyz

systemctl disable xyz
Systemd will automatically enable services that xyz depends upon.
Goal:
What is involved in service xyz?

A short description, what needs to run before it, what else wants it running before they can start, whether it is running or stopped now and since when, its PID if running, and far more?
SVR4 init, Upstart
more /etc/init.d/xyz
ls /etc/rc$(runlevel | awk '{print $2}').d/
/etc/init.d/xyz status
grep xyz /var/log/messages
ls /var/run/xyz
cat /var/run/xyz
ps axuww | egrep 'PID|xyz'
Oof!

You would have to do all of these, plus many more, plus do some careful analysis of all of the output, to get everything you can get from the one systemd command.

This is an area where systemd has an advantage.
systemd
systemctl show xyz
Goal:
Halt or reboot the system.
SVR4 init, Upstart
init 0
halt
poweroff
shutdown -h now -t 0

init 6
reboot
shutdown -r now -t 0
systemd
systemctl halt
systemctl poweroff

systemctl reboot
Goal:
Change to another run state
SVR4 init, Upstart
init 1
init 3
init 5
systemd
systemctl isolate rescue.target
systemctl isolate multi-user.target
systemctl isolate graphical.target
Goal:
The system is shut down; boot it into a non-default run state (typically used for rescue or maintenance).
SVR4 init, Upstart
Interrupt the boot loader's countdown timer and modify the line that will be passed to the kernel. Add the desired target state to the end: 1, 3, or 5 for SVR4 init or Upstart; systemd.unit=rescue.target, systemd.unit=multi-user.target, or systemd.unit=graphical.target for systemd (the bare 1, 3, and 5 will probably work, but don't count on it). The kernel's command line at the last boot is kept in /proc/cmdline.

Smaller Process Trees

With many startup tasks now done by one binary executable instead of a script that spawned many child processes, including other scripts which may have called still more scripts, far fewer processes are spawned to bring the system up.

The aggressive parallelization means a flatter tree of processes.

Here is part of the process tree on CentOS 5 with SysV init:

init(1)-+-acpid(1850)
        |-atd(2290)
        |-crond(2100)
        |-cupsd(1935)
    [ ... ]
        |-gdm-binary(2401)---gdm-binary(2441)-+-Xorg(2446)
        |                                     `-tcsh(2460,cromwell)-+-ssh-agent(2496)
	|                                                           `-startkde(2506)---kwrapper(2572)
    [ ... ]
        |-kdeinit(2559,cromwell)-+-artsd(2586)
        |                        |-autorun(2677)
        |                        |-bt-applet(2691)
        |                        |-eggcups(2591)
        |                        |-kio_file(2582)
        |                        |-klauncher(2564)
        |                        |-konqueror(2598)
        |                        |-konsole(2602)-+-tcsh(2705)
        |                        |               |-tcsh(2707)---su(2854,root)---bash(2922)
        |                        |               `-tcsh(2712)
        |                        |-kwin(2575)
        |                        |-nm-applet(2663)
        |                        |-pam-panel-icon(2590)---pam_timestamp_c(2592,root)
        |                        |-xload(2664)
        |                        |-xmms(2638)-+-{xmms}(2678)
        |                        |            `-{xmms}(2786)
        |                        |-xterm(2593)---tcsh(2603)
        |                        |-xterm(2596)---tcsh(2606)
        |                        |-xterm(2597)---tcsh(2608)---ssh(3251)
        |                        `-xterm(2637)---bash(2640)-+-grep(2645)
        |                                                   |-grep(2646)
        |                                                   `-tail(2644)
    [ ... ]
        |-ntpd(2022,ntp)
        |-sendmail(2061)
        |-sendmail(2070,smmsp)
        |-smartd(2387)
        |-syslogd(1653)
        |-udevd(418)
        |-watchdog/0(4)
        |-xfs(2153,xfs)
	`-xinetd(2001) 

Compare that to this process tree from Mageia with systemd. Shells and other processes aren't nested as deeply:

$ pstree -pu | less
systemd(1)-+-acpid(695)
           |-agetty(3006)
           |-atd(672,daemon)
    [ ... ]
           |-kmix(3278,cromwell)---{kmix}(3676)
           |-knotify4(3241,cromwell)---{knotify4}(3242)
           |-konsole(3288,cromwell)-+-tcsh(3455)-+-audacious(6294)-+-{audacious}(6295)
           |                        |            |                 |-{audacious}(6298)
           |                        |            |                 |-{audacious}(6300)
           |                        |            |                 |-{audacious}(6310)
           |                        |            |                 `-{audacious}(6418)
           |                        |            |-less(6463)
           |                        |            `-pstree(6462)
           |                        |-tcsh(12198)---vim(5903)---{vim}(5904)
           |                        `-{konsole}(3453)
    [ ... ]
           |-named(2365,named)-+-{named}(2366)
           |                   |-{named}(2367)
           |                   |-{named}(2368)
           |                   |-{named}(2369)
           |                   |-{named}(2370)
           |                   `-{named}(2371)
           |-ntpd(2227,ntp)
           |-plasma-desktop(3244,cromwell)-+-ksysguardd(3262)
           |                               |-{plasma-desktop}(3245)
           |                               |-{plasma-desktop}(3246)
           |                               |-{plasma-desktop}(3256)
           |                               |-{plasma-desktop}(3261)
           |                               `-{plasma-desktop}(3263)
    [ ... ]
           |-rpcbind(1683,rpc)
           |-rsyslogd(697)-+-{rsyslogd}(763)
           |               |-{rsyslogd}(764)
           |               |-{rsyslogd}(765)
           |               `-{rsyslogd}(766)
           |-ssh-agent(2847,cromwell)
           |-sshd(1697)
           |-start_kdeinit(3200,cromwell)
           |-systemd-journal(380)
           |-systemd-logind(677)
           |-systemd-udevd(384)
           |-tor(2041,toruser)
           |-udisks-daemon(679)-+-udisks-daemon(683)
           |                    |-{udisks-daemon}(769)
           |                    `-{udisks-daemon}(817)
           |-udisksd(3217)-+-{udisksd}(3218)
           |               |-{udisksd}(3220)
           |               `-{udisksd}(3222)
           |-upowerd(699)-+-{upowerd}(767)
           |              `-{upowerd}(770)
           `-xosview(3799,cromwell)

The Journal and journalctl

You probably noticed that systemctl status servicename took a while the first time you ran it. And you may have stumbled across that large and possibly mysterious /var/log/journal/ directory. This is the systemd journaling system.

The systemd journal captures log information even when the rsyslog daemon isn't running, and stores it in a form that requires the use of the journalctl command.

A unique machine ID was created during the installation: a 128-bit (16-byte) value recorded as 32 hexadecimal ASCII characters in /etc/machine-id. That machine ID names the subdirectory in which the journal files are stored. For example:

# cat /etc/machine-id
3845e210bd0d4dc5b2e5f5fd8fdc6f01
# find /var/log/journal -type d
/var/log/journal
/var/log/journal/3845e210bd0d4dc5b2e5f5fd8fdc6f01

The journal files are all owned by root and associated with group adm or systemd-journal. Put a user in both groups to ensure they can read the journal with journalctl.
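
For example, for the account cromwell seen in the process trees above:

# usermod -a -G adm,systemd-journal cromwell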

The systemd-journald manual page explains that you can grant read access to all members of groups adm and wheel for all journal files existing now and created in the future:

# setfacl -Rnm g:wheel:rx,d:g:wheel:rx,g:adm:rx,d:g:adm:rx /var/log/journal/

Worries About Size and Compliance

On the one hand, you are likely to worry about all this journal data filling your file system. Don't worry — by default it will use no more than 10% of the file system and keep at least 15% free. See the manual page for journald.conf to see how to adjust that in /etc/systemd/journald.conf.
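
For example, to impose a fixed cap rather than relying on the percentage defaults, you might put something like this in /etc/systemd/journald.conf (the values are only illustrations):

[Journal]
SystemMaxUse=500M
SystemKeepFree=20%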

If regulatory compliance requires you to retain log information, you should worry about collecting and archiving this information before its older content is automatically trimmed away. See the manual page for journalctl to see how to have a scheduled job extract the past day. For example, run this script via cron after midnight every night to capture all events from midnight to midnight from the day before. Log output tends to be very redundant and compress down to about 5% of its original size with xz:

#!/bin/sh

# Create the archive directory if this is the first run ever.
ARCHIVE=/var/log/journal-archive
mkdir -p ${ARCHIVE}
cd ${ARCHIVE}

# Capture yesterday's events.  File name will include the
# host on which this was done plus yesterday's date in
# YYYY-MM-DD format.  Then compress it:
HOST=$( hostname )
DATE=$( date --date=yesterday "+%F" )
journalctl --since=yesterday --until=today > journal-${HOST}-${DATE}
xz journal-${HOST}-${DATE}
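
A root crontab entry to run it shortly after midnight might look like this, assuming you saved the script as /usr/local/sbin/journal-archive.sh:

15 0 * * *   /usr/local/sbin/journal-archive.sh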

Useful journalctl Techniques

See the manual page for journalctl for full details. You can accomplish these types of things with rsyslog data, but only with possibly complicated grep or awk commands based on some initial investigation into just when boot events happened. The journalctl command makes these much easier. Some handy commands include:

See just the kernel events logged since the most recent boot:

# journalctl -k -b -0

Or, all logged events since the most recent boot:

# journalctl -b -0

Or, all logged events within the run before this most recent boot. For example, you rebooted this system some time yesterday afternoon and again this morning, and you want to see all events between those two reboots. This would require some initial investigation and then some complex grep strings using rsyslogd data only:

# journalctl -b -1

Just the logged events for one systemd unit, or for two (or more):

# journalctl -u named
# journalctl -u named -u dhcpd

Or, for just these three units since the last boot:

# journalctl -u named -u dhcpd -u httpd -b -0

Or, to emulate tail -f /var/log/messages:

# journalctl -f
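
These options combine, so you could follow the output of just one service:

# journalctl -u named -f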

journalctl or Rsyslog or both?

With journalctl capturing all local events, even those when the Rsyslog daemon isn't running, do we still need to run rsyslogd?

You very likely do want to also run rsyslogd; it's easy to set up and imposes very little additional overhead.

A UNIX socket is created by systemd and rsyslogd will listen to it by default, capturing all messages (whether it saves them, and if so, where, is entirely up to your configuration of rsyslogd).

#  ls -l /run/systemd/journal/syslog 
srw-rw-rw- 1 root root 0 Feb  9 16:28 /run/systemd/journal/syslog=
# file /run/systemd/journal/syslog 
/run/systemd/journal/syslog: socket
# lsof /run/systemd/journal/syslog 
COMMAND  PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
systemd    1 root   25u  unix 0xffff880234b2b800      0t0 1730 /run/systemd/journal/syslog
rsyslogd 787 root    3u  unix 0xffff880234b2b800      0t0 1730 /run/systemd/journal/syslog
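
Exactly how rsyslogd picks those messages up is controlled by /etc/rsyslog.conf. The details vary with the distribution and rsyslog version, but the relevant module lines look something like these:

$ModLoad imuxsock    # local log messages, via the socket shown above
$ModLoad imjournal   # or read the structured journal directly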

journalctl is very nice for querying the existing journal, but rsyslogd can still do some things that the journal cannot.

Centralized logging has a number of advantages. One is security: the integrity and availability of the log data. Yes, Forward Secure Sealing can periodically "seal" journal data to detect integrity violation, but I would feel better about having critical log data stored on a dedicated, hardened remote rsyslog server.

Rsyslog can enforce host authentication and data confidentiality and integrity through TLS; see my how-to page for the details.

Also, with all the log data in one place you're immediately ready to apply a log analysis package like Splunk or ArcSight.

So, for me, systemd journal plus Rsyslog makes sense.