
Mirroring Data and Making Backups with rsync

Mirroring Data and Synchronizing File Systems with rsync

The rsync command can easily and quickly meet several goals on Linux, BSD, and other platforms. You can use it locally, within one host, but it's especially useful for mirroring data from one host to another. You can run it on either the source or destination host.

Backups are absolutely vital. You can use rsync to make a complete mirror image of a file system on another system.

You can recursively copy a directory structure while preserving file permissions and modification and access timestamps. Symbolic links, whether they point to absolute or relative paths, will be recreated. (So would device-special files, named pipes, and sockets, but those are much less likely to be an issue.)

Web site maintenance can become much easier. You can make changes and additions to a local copy, and then use rsync to synchronize what's on the server with the local copy.

The rsync command only transfers data as needed, making it very efficient. And, it transfers via SSH for security.

Advantages: rsync > ssh+tar


You can also use a combination of ssh and tar to make a mirror image of a directory tree, either from local to remote or remote to local. However, a great advantage of rsync is that it only transfers what it needs to.

By default, rsync makes only a quick check based on size and modification time to decide what it needs to transfer. It deals with time zones as needed. My web server runs on UTC and my home system and laptop on local time, but this still works regardless of how the systems are spread across the Internet. The file system stores timestamps in UTC, and tools display them with the local offset from UTC; use the stat command to verify that for yourself. However, you can use the --checksum option to instead compare file content to decide what needs to be transferred.
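
The difference between the quick check and --checksum can be seen locally. The following is a minimal sketch, assuming rsync is installed; the scratch directory and file names are made up for the demonstration. It creates two files with the same size and modification time but different content, so the quick check should skip the file while --checksum should transfer it.

```shell
#!/bin/sh
# Hypothetical scratch area, created only for this demonstration.
work=$(mktemp -d)
mkdir "$work/src" "$work/dst"

# Same size, different content.
printf 'aaaa\n' > "$work/src/file.txt"
printf 'bbbb\n' > "$work/dst/file.txt"

# Give both files the same modification time (format: CCYYMMDDhhmm).
touch -t 202102211200 "$work/src/file.txt" "$work/dst/file.txt"

# Quick check: size and mtime match, so the file is not transferred.
rsync -a "$work/src/" "$work/dst/"
echo "after quick check: $(cat "$work/dst/file.txt")"

# Checksum comparison: the content differs, so the file is transferred.
rsync -a --checksum "$work/src/" "$work/dst/"
echo "after --checksum:  $(cat "$work/dst/file.txt")"
```

The price of --checksum is that every file on both sides has to be read in full, so it is much slower on large trees.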

How to Use rsync

Below is an example of using rsync. I have just edited several files on my web site, editing the files on my local machine. Now I want to upload all the changes to my server. If the changes include creating new directories, files, or symbolic links, all of those will be created on the remote system. As for the options:

-a
Archive mode: Recursively follow the directory tree structure preserving owners, groups, permissions, modification timestamps, symbolic links, and device-special files. You have to do this as root to create device nodes or change ownership. Add the -U or --atimes option to also preserve access timestamps.

-v
Verbose mode, narrate the list of transferred files.

--progress
Show the progress for individual files. This is very useful when large files are involved, to assure you that it's working. It also shows the network performance.

--exclude xyz
Exclude paths matching the pattern xyz. You can use this multiple times. The following example will not copy vi/vim swap files named *.swp or the top-level directories named old-css or old-js.

Here we go with the command, and the output. All the additions and updates in the local /var/www/html directory tree will be copied to the same location on the host cromwell-intl.com. Thousands of files that don't need to be uploaded, won't be.

$ rsync -av --progress	\
	--exclude '*.swp'	\
	--exclude '/old-css/'	\
	--exclude '/old-js/'	\
	/var/www/html/ cromwell-intl.com:/var/www/html
sending incremental file list
sitemap.txt
         63,071 100%   59.48MB/s    0:00:00 (xfr#1, ir-chk=1041/1055)
open-source/
open-source/Index.html
         30,250 100%  113.18kB/s    0:00:00 (xfr#2, ir-chk=1133/3867)
open-source/rsync.html
          6,623 100%   24.78kB/s    0:00:00 (xfr#3, ir-chk=1079/3867)
open-source/tar-and-ssh.html
          7,009 100%   26.23kB/s    0:00:00 (xfr#4, ir-chk=1067/3867)

sent 455,142 bytes  received 1,992 bytes  130,609.71 bytes/sec
total size is 2,249,384,384  speedup is 4,920.62
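
A dry run is one way to check what a command like the one above would do before anything is written. The sketch below is only an illustration: it builds a hypothetical scratch tree under a temporary directory and mirrors it locally rather than to a server. The -n (--dry-run) option makes rsync list what it would transfer without changing the destination. A leading / anchors an exclude pattern at the top of the transfer, so the nested docs/old-css directory here should still be copied.

```shell
#!/bin/sh
# Hypothetical scratch tree standing in for /var/www/html.
work=$(mktemp -d)
mkdir -p "$work/html/old-css" "$work/html/docs/old-css" "$work/mirror"
touch "$work/html/index.html" \
      "$work/html/.index.html.swp" \
      "$work/html/old-css/a.css" \
      "$work/html/docs/old-css/b.css"

# Preview only: list what would be transferred, write nothing.
rsync -av --dry-run \
	--exclude '*.swp' \
	--exclude '/old-css/' \
	"$work/html/" "$work/mirror/"

# The real transfer, with the same excludes.
rsync -a \
	--exclude '*.swp' \
	--exclude '/old-css/' \
	"$work/html/" "$work/mirror/"

# Show what arrived: no swap file and no top-level old-css directory.
(cd "$work/mirror" && find . | sort)
```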

To That Location? Or Into That Location?

In the above example, the destination already exists. I must include the trailing slash on the source. That is, use /var/www/html/ instead of simply /var/www/html.

If I leave that off, then the destination gets a new subdirectory /var/www/html/html containing an extra copy of everything else from one level up.

The trailing / on the source means "Copy the contents of this directory and not the directory itself."

These two commands are equivalent:

$ rsync -av --progress	\
	--exclude '*.swp'	\
	--exclude '/old-css/'	\
	--exclude '/old-js/'	\
	/var/www/html/ cromwell-intl.com:/var/www/html

$ rsync -av --progress	\
	--exclude '*.swp'	\
	--exclude '/old-css/'	\
	--exclude '/old-js/'	\
	/var/www/html cromwell-intl.com:/var/www
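
The trailing-slash rule is easy to try out locally before pointing rsync at a server. This is a small sketch with made-up directory names under a temporary directory: the same source is copied once with the trailing slash and once without.

```shell
#!/bin/sh
# Hypothetical scratch directories for the demonstration.
work=$(mktemp -d)
mkdir -p "$work/html" "$work/a" "$work/b"
echo hello > "$work/html/index.html"

# With the trailing slash: copy the contents of html into a.
rsync -a "$work/html/" "$work/a"

# Without it: copy the directory html itself into b.
rsync -a "$work/html" "$work/b"

# Compare the two results.
(cd "$work" && find a b | sort)
```

The first destination should end up with a/index.html and the second with b/html/index.html.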

Connect via SSH?

Yes, you want to use SSH. And by default you will, on modern systems.

rsync is quite old, dating from the era of rsh and rlogin. No one uses rsh and rlogin, or at least they shouldn't! Use ssh instead!


Similarly, there has been an rsync daemon. Usually inetd listened on TCP port 873, and started rsyncd as needed. In the past, rsync would connect to its own service by default. However, modern versions of rsync use SSH by default. I can't think of a good reason to use anything else.

Performance


Later, I wanted to make a backup copy of a large collection of media files: photographs, video files, and DVD UDF image files.

When I play one of my DVDs, I first extract it into an image file so I can watch it with Kodi or OSMC. I very much prefer the interface, and generally find it easier and more convenient. Here's a segment of output when mirroring media files to another system.

[... lines omitted ...]
iso images/
iso images/2001: A Space Odyssey.iso
  6,923,206,656 100%   97.16MB/s    0:01:07 (xfr#178, to-chk=169/399)
iso images/A Scanner Darkly.iso
  6,870,347,776 100%   98.72MB/s    0:01:06 (xfr#179, to-chk=168/399)
iso images/Abraham Lincoln Vampire Hunter.iso
  6,748,639,232 100%   97.34MB/s    0:01:06 (xfr#180, to-chk=167/399)
iso images/Ad Astra.iso
  7,037,278,208 100%   94.20MB/s    0:01:11 (xfr#181, to-chk=166/399)
iso images/American Hustle.iso
  6,199,312,384 100%   94.76MB/s    0:01:02 (xfr#182, to-chk=165/399)
iso images/Andromeda Strain.iso
  4,519,952,384 100%   97.60MB/s    0:00:44 (xfr#183, to-chk=164/399)
iso images/Apocalypse Now Redux.iso
  8,319,780,864 100%   95.70MB/s    0:01:22 (xfr#184, to-chk=163/399)
iso images/Arrival.iso
  7,966,900,224 100%   94.97MB/s    0:01:20 (xfr#185, to-chk=162/399)
iso images/Atomic Blonde.iso
  6,010,949,632 100%   96.20MB/s    0:00:59 (xfr#186, to-chk=161/399)
[... lines omitted ...]

It isn't clear whether "MB" is intended to report traditional units of MB, or 10^6 bytes, or MiB, units of 2^20 bytes. MB = 1,000,000 bytes while MiB = 1,048,576 bytes. Doing the math, it's in between the two. However, it's significantly closer to units of 2^20, which I use to calculate network speed.

Or maybe it's MB including the TCP, IP, and Ethernet headers? I don't care, I'm getting roughly 80% of 1 Gbps.

The reported speed isn't useful for small files that transfer in a few seconds or less. These large files, however, provide a measure of performance.

This was on a gigabit switch. 94.2 to 98.7 MiB per second means 790 to 828 Mbps, or 79.0% to 82.8% of the full bandwidth. That's not bad, especially when you consider the protocol overhead and the fact that this is not using Ethernet jumbo frames. NetApp has a nice analysis.
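
The conversion from MiB per second to Mbps can be checked with a few lines of awk. This is just a sketch of the arithmetic, taking 1 MiB as 1,048,576 bytes and 1 Gbps as 1,000 Mbps; the function name mib_to_mbps is made up for the example.

```shell
#!/bin/sh
# mib_to_mbps RATE
# Convert a rate in MiB/s (2^20 bytes per second) to megabits per second
# and to a percentage of a 1 Gbps link.
mib_to_mbps() {
    awk -v r="$1" 'BEGIN {
        mbps = r * 1048576 * 8 / 1000000
        printf "%.1f MiB/s = %.0f Mbps = %.1f%% of 1 Gbps\n", r, mbps, mbps / 10
    }'
}

# The slowest and fastest rates from the rsync output above.
mib_to_mbps 94.2
mib_to_mbps 98.7
```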

The top command shows that while ssh and rsync are keeping the CPU cores busy, this transfer isn't limited by processing speed. During this snapshot, ssh was using 84.4% of the available time of one CPU core, rsync was using 32.1% of another. Results on the server, where it was the sshd server instead of the ssh client, were very similar.

$ top
top - 11:07:23 up 1 day, 21:11,  6 users,  load average: 2.22, 2.08, 2.03
Tasks: 285 total,   2 running, 283 sleeping,   0 stopped,   0 zombie
%Cpu(s): 17.8 us,  4.7 sy,  0.0 ni, 63.8 id,  2.2 wa,  0.0 hi, 11.5 si,  0.0 st
MiB Mem :  11892.1 total,    145.6 free,   1661.9 used,  10084.6 buff/cache
MiB Swap:   6144.0 total,   6029.2 free,    114.8 used.   9592.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  43374 cromwell  20   0   10280   7348   5464 S  84.4   0.1  62:03.08 ssh
  43373 cromwell  20   0   73720  17600   2292 S  32.1   0.1  31:31.94 rsync
   1465 cromwell  20   0 4068652 139752  76744 R   4.3   1.1  47:57.37 cinnamon
    989 root      20   0  858440  62740  48688 S   2.6   0.5  35:13.12 Xorg
    101 root      20   0       0      0      0 S   1.3   0.0   1:15.36 kswapd0
   1648 cromwell  20   0  937136  72084  42544 S   1.3   0.6  19:02.48 audacio+
  45387 cromwell  20   0 4724388 186388 117928 S   1.3   1.5   0:36.92 chrome
[... lines omitted ...]

Digging Deeper

The iostat command measures storage I/O in transfers per second (tps); kilobytes read, written, and discarded per second; and the total read and written during that measurement cycle.

Here I'm running it to output a report every 10 seconds, indefinitely, until interrupted with Control-C. As with vmstat, you should ignore the first block of output. It is intended to be cumulative since boot time, but that isn't usually of interest. Worse yet, at least for vmstat, it isn't at all accurate.

Here are the first two useful blocks of information on the client, where the data is being read for transmission to the other end. As you can see, it's being read from the /dev/sdb device.

$ iostat 10
Linux 5.4.0-65-generic (clienthost)   02/21/21        _x86_64_        (4 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.78    0.00    0.92    0.58    0.00   96.72

Device     tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda       0.68        11.85        69.94         0.00    1928364   11385973          0
sdb      22.60      3240.46        55.14         0.00  527539060    8976908          0


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          19.00    0.00   15.53    2.84    0.00   62.63

Device     tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda       3.00        20.00        14.00         0.00        200        140          0
sdb     580.20     98649.60        10.80         0.00     986496        108          0


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          19.09    0.00   16.91    2.39    0.00   61.61

Device     tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda       2.90         0.00        15.60         0.00          0        156          0
sdb     602.70    102860.80         0.00         0.00    1028608          0          0


^C 
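
To compare the disk numbers with the network numbers, the kB_read/s column can be converted to MiB per second. The sketch below assumes that iostat's "kB" means 1,024 bytes, which is worth confirming against the iostat manual page on your system. It reuses the two sdb lines from the output above rather than running iostat itself.

```shell
#!/bin/sh
# Two sdb lines copied from the iostat output above.
# Assumption: iostat's "kB" is 1,024 bytes.
iostat_sample='sdb     580.20     98649.60        10.80         0.00     986496        108          0
sdb     602.70    102860.80         0.00         0.00    1028608          0          0'

# Field 3 is kB_read/s; dividing by 1024 gives MiB/s.
read_rates=$(printf '%s\n' "$iostat_sample" |
    awk '{ printf "%s read: %.2f MiB/s\n", $1, $3 / 1024 }')
printf '%s\n' "$read_rates"
```

Under that assumption the disk is being read at roughly the same rate that rsync reports for the transfer.
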

The ethstat command may be of interest. Again, ignore the first block of output. We get Mbps and packets per second in and out. The interface enp4s0 is moving the heavy traffic.

$ ethstat
total:        0.48 Mb/s In    85.81 Mb/s Out -    631.3 p/s In    7312.4 p/s Out
    eno1:     0.01 Mb/s In     0.00 Mb/s Out -      1.1 p/s In       0.7 p/s Out
  enp4s0:     0.46 Mb/s In    85.81 Mb/s Out -    630.2 p/s In    7311.7 p/s Out
  enp5s0:     0.00 Mb/s In     0.00 Mb/s Out -      0.0 p/s In       0.0 p/s Out
total:        4.68 Mb/s In   879.72 Mb/s Out -   6193.4 p/s In   74653.5 p/s Out
    eno1:     0.13 Mb/s In     0.00 Mb/s Out -     11.5 p/s In       7.1 p/s Out
  enp4s0:     4.55 Mb/s In   879.71 Mb/s Out -   6181.9 p/s In   74646.4 p/s Out
  enp5s0:     0.00 Mb/s In     0.00 Mb/s Out -      0.0 p/s In       0.0 p/s Out
total:        4.69 Mb/s In   891.44 Mb/s Out -   6199.3 p/s In   75361.6 p/s Out
    eno1:     0.13 Mb/s In     0.00 Mb/s Out -     11.1 p/s In       7.4 p/s Out
  enp4s0:     4.56 Mb/s In   891.44 Mb/s Out -   6188.2 p/s In   75354.2 p/s Out
  enp5s0:     0.00 Mb/s In     0.00 Mb/s Out -      0.0 p/s In       0.0 p/s Out
^C 
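
If ethstat isn't installed, the kernel's per-interface byte counters give similar numbers. The following is a minimal sketch, assuming a Linux system that exposes /sys/class/net/IFACE/statistics/tx_bytes; the function names rate_mbps and sample_tx are made up for this example.

```shell
#!/bin/sh
# rate_mbps START_BYTES END_BYTES SECONDS
# Convert the growth of a byte counter over an interval into Mbps.
rate_mbps() {
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (b - a) * 8 / s / 1000000 }'
}

# sample_tx IFACE SECONDS
# Read the transmit byte counter twice, SECONDS apart, and print the rate.
sample_tx() {
    counter="/sys/class/net/$1/statistics/tx_bytes"
    start=$(cat "$counter") || return 1
    sleep "$2"
    end=$(cat "$counter") || return 1
    rate_mbps "$start" "$end" "$2"
}

# Example, using the busy interface from the output above:
#   sample_tx enp4s0 10
```

Swap tx_bytes for rx_bytes to measure the receiving side.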

If you frequently do large transfers, you might look into further performance measurements and possible tuning.

If You're Synchronizing a GoDaddy Web Site

The good news is that GoDaddy doesn't bill for bandwidth.

The bad news is that GoDaddy SSH connectivity is unreliable.

I still have toilet-guru.com hosted there, but I wouldn't do that again. Three clients also have sites there. So, I have to deal with GoDaddy flakiness. Here's the trick:

Add a --timeout=10 option. Linux-based hosting at GoDaddy puts your web root in your ~/public_web directory. My synchronize-websites script contains the following:

[...lines deleted...]
echo toilet-guru.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
	/var/www/html-toilet-guru/ toiletguru@toilet-guru.com:public_web
echo '-------------------------------------------------------------------------'
echo client1.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
	/var/www/html-client1/ clientuser1@client1.com:public_web
echo '-------------------------------------------------------------------------'
echo client2.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
	/var/www/html-client2/ clientuser2@client2.com:public_web
echo '-------------------------------------------------------------------------'
echo client3.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
	/var/www/html-client3/ clientuser3@client3.com:public_web
echo '-------------------------------------------------------------------------'
echo cromwell-intl.com
rsync -av --progress --exclude '*.swp' \
	/var/www/html/ cromwell-intl.com:/var/www/html
[...lines deleted...] 

I frequently see output like this, where one or more GoDaddy servers timed out:

$ synchronize-websites
toilet-guru.com
sending incremental file list

sent 77,483 bytes  received 136 bytes  51,746.00 bytes/sec
total size is 303,892,020  speedup is 3,915.18
-------------------------------------------------------------------------
client1.com
sending incremental file list
rsync error: timeout in data send/receive (code 30) at io.c(137) [receiver=3.0.9]
rsync: connection unexpectedly closed (86 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [sender=3.1.3]
-------------------------------------------------------------------------
client2.com
sending incremental file list

sent 46,495 bytes  received 466 bytes  18,784.40 bytes/sec
total size is 159,567,423  speedup is 3,397.87
-------------------------------------------------------------------------
client3.com
sending incremental file list

sent 43,049 bytes  received 496 bytes  29,030.00 bytes/sec
total size is 912,999,178  speedup is 20,966.80
-------------------------------------------------------------------------
cromwell-intl.com
sending incremental file list
open-source/
open-source/rsync.html
         18,859 100%   17.32MB/s    0:00:00 (xfr#1, ir-chk=1079/3867)

sent 463,173 bytes  received 1,167 bytes  132,668.57 bytes/sec
total size is 2,249,420,775  speedup is 4,844.34

Or, instead, this:

[...lines deleted...]
client1.com
kex_exchange_identification: Connection closed by remote host
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
[...lines deleted...] 

Give it a moment, then try it again. Be careful! Rapidly repeated failures get your client IP address blacklisted, and it takes at least an hour on the phone to get the problem escalated to higher-level support staff who can see and fix these things. Even when you start by explaining exactly what's going on.
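
One way to automate "give it a moment, then try it again" is a small retry wrapper with a long pause and a hard limit on attempts. This is a sketch, not part of the script shown above: the function name retry_slowly, the attempt count, and the delay are all assumptions, and the delay should be generous enough to avoid the rapid repeated failures described here.

```shell
#!/bin/sh
# retry_slowly MAX_ATTEMPTS DELAY_SECONDS command [args...]
# Run the command until it succeeds or MAX_ATTEMPTS is reached,
# sleeping DELAY_SECONDS between attempts.
retry_slowly() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max" ] && return 1
        echo "attempt $attempt failed; waiting ${delay}s before retrying" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example: at most 3 attempts, 5 minutes apart.
#   retry_slowly 3 300 rsync -av --progress --exclude '*.swp' --timeout=10 \
#       /var/www/html-client1/ clientuser1@client1.com:public_web
```

A low attempt limit matters more than a clever schedule: if the server is still refusing connections after a few widely spaced tries, it is better to stop and look into it than to keep knocking.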