
Mirroring Data and Making Backups with rsync
The rsync command can easily and quickly meet several goals on Linux, BSD, and other platforms. You can use it locally, within one host, but it's especially useful for mirroring data from one host to another. You can run it on either the source or destination host.
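For example, a sketch of both directions, with a hypothetical host name and paths; these need a real remote host, so they are not runnable as-is:

```shell
# Push: run on the source host, copying a local tree out to the remote.
rsync -av /var/www/html/ user@server.example.com:/var/www/html

# Pull: run on the destination host, copying the remote tree down.
rsync -av user@server.example.com:/var/www/html/ /var/www/html
```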
Backups are absolutely vital. You can use rsync to make a complete mirror image of a file system on another system. You can recursively copy a directory structure while preserving file permissions and modification and access timestamps. Symbolic links, whether absolute or relative, will be recreated. (So will device-special files, named pipes, and sockets, but those are much less likely to be an issue.)
Web site maintenance can become much easier. You can make changes and additions to a local copy, and then use rsync to synchronize what's on the server with the local copy. The rsync command only transfers data as needed, making it very efficient. And, it transfers via SSH for security.
Advantages of rsync over ssh and tar
You can also use a combination of ssh and tar to make a mirror image of a directory tree, either from local to remote or remote to local. However, a great advantage of rsync is that it only transfers what it needs to.
By default, rsync makes only a quick check based on size and modification time to decide what it needs to transfer. It deals with timezones as needed: my web server runs on UTC while my home system and laptop run on local time, but this still works regardless of how the systems are spread across the Internet. The file system stores the time along with its offset from UTC; use the stat command to verify that for yourself. However, you can use the --checksum option to instead analyze content to decide what needs to be transferred.
How to Use rsync
Below is an example of using rsync. I have just edited several files on my web site, editing the files on my local machine. Now I want to upload all the changes to my server. If the changes include creating new directories, files, or symbolic links, all of those will be created on the remote system.
As for the options:

-a
	Archive mode: recursively follow the directory tree structure, preserving owners, groups, permissions, modification timestamps, symbolic links, and device-special files. You have to do this as root to create device nodes or change ownership. Add the -U or --atimes option to also preserve access timestamps.

-v
	Verbose mode: narrate the list of transferred files.

--progress
	Show the progress for individual files. This is very useful when large files are involved, to assure you that it's working. It also shows the network performance.

--exclude xyz
	Exclude paths containing the string xyz. You can use this multiple times. The following example will not copy vi/vim swap files named *.swp, or directories named old-css or old-js.
Here we go with the command, and the output. All the additions and updates in the local /var/www/html directory tree will be copied to the same location on the host cromwell-intl.com. Thousands of files that don't need to be uploaded, won't be.
$ rsync -av --progress \
    --exclude '*.swp' \
    --exclude '/old-css/' \
    --exclude '/old-js/' \
    /var/www/html/ cromwell-intl.com:/var/www/html
sending incremental file list
sitemap.txt
         63,071 100%   59.48MB/s    0:00:00 (xfr#1, ir-chk=1041/1055)
open-source/
open-source/Index.html
         30,250 100%  113.18kB/s    0:00:00 (xfr#2, ir-chk=1133/3867)
open-source/rsync.html
          6,623 100%   24.78kB/s    0:00:00 (xfr#3, ir-chk=1079/3867)
open-source/tar-and-ssh.html
          7,009 100%   26.23kB/s    0:00:00 (xfr#4, ir-chk=1067/3867)

sent 455,142 bytes  received 1,992 bytes  130,609.71 bytes/sec
total size is 2,249,384,384  speedup is 4,920.62
To That Location? Or Into That Location?
In the above example, the destination already exists. I must include the trailing slash on the source. That is, use /var/www/html/ instead of simply /var/www/html. If I leave that off, then the destination gets a new subdirectory /var/www/html/html containing an extra copy of everything. The trailing / on the source means "Copy the contents of this directory and not the directory itself."
These two commands are equivalent:
$ rsync -av --progress \
    --exclude '*.swp' \
    --exclude '/old-css/' \
    --exclude '/old-js/' \
    /var/www/html/ cromwell-intl.com:/var/www/html
$ rsync -av --progress \
    --exclude '*.swp' \
    --exclude '/old-css/' \
    --exclude '/old-js/' \
    /var/www/html cromwell-intl.com:/var/www
Connect via SSH?
Yes, you want to use SSH. And by default you will, on modern systems.

rsync is quite old, dating from the era of rsh and rlogin. No one uses rsh and rlogin any more, or at least they shouldn't! Use ssh instead.

Similarly, there has long been an rsync daemon. Usually inetd listened on TCP port 873 and started rsyncd as needed. In the past, rsync would connect to its own service by default. However, modern versions of rsync use SSH by default. I can't think of a good reason to use anything else.
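If you want to be explicit about the transport, or pass options through to SSH, use the -e option. A sketch with a hypothetical host and port; not runnable without a real remote host:

```shell
# Equivalent to the default on modern rsync:
rsync -av -e ssh /var/www/html/ user@server.example.com:/var/www/html

# Pass SSH options through, for example a nonstandard port:
rsync -av -e 'ssh -p 2222' /var/www/html/ user@server.example.com:/var/www/html
```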
Performance
Later, I wanted to make a backup copy of a large collection of media files: photographs, video files, and DVD UDF image files.
When I play one of my DVDs, I first extract it into an image file so I can watch it with Kodi or OSMC. I very much prefer the interface, and generally find it easier and more convenient. Here's a segment of output when mirroring media files to another system.
[... lines omitted ...]
iso images/
iso images/2001: A Space Odyssey.iso
  6,923,206,656 100%   97.16MB/s    0:01:07 (xfr#178, to-chk=169/399)
iso images/A Scanner Darkly.iso
  6,870,347,776 100%   98.72MB/s    0:01:06 (xfr#179, to-chk=168/399)
iso images/Abraham Lincoln Vampire Hunter.iso
  6,748,639,232 100%   97.34MB/s    0:01:06 (xfr#180, to-chk=167/399)
iso images/Ad Astra.iso
  7,037,278,208 100%   94.20MB/s    0:01:11 (xfr#181, to-chk=166/399)
iso images/American Hustle.iso
  6,199,312,384 100%   94.76MB/s    0:01:02 (xfr#182, to-chk=165/399)
iso images/Andromeda Strain.iso
  4,519,952,384 100%   97.60MB/s    0:00:44 (xfr#183, to-chk=164/399)
iso images/Apocalypse Now Redux.iso
  8,319,780,864 100%   95.70MB/s    0:01:22 (xfr#184, to-chk=163/399)
iso images/Arrival.iso
  7,966,900,224 100%   94.97MB/s    0:01:20 (xfr#185, to-chk=162/399)
iso images/Atomic Blonde.iso
  6,010,949,632 100%   96.20MB/s    0:00:59 (xfr#186, to-chk=161/399)
[... lines omitted ...]
It isn't clear whether "MB" is intended to report traditional units of MB, 10^6 bytes, or MiB, units of 2^20 bytes. MB = 1,000,000 bytes while MiB = 1,048,576 bytes. Doing the math, it's in between the two. However, it's significantly closer to units of 2^20, which I use to calculate network speed. Or maybe it's MB including the TCP, IP, and Ethernet headers? I don't care, I'm getting roughly 80% of 1 Gbps.
The reported speed isn't useful for small files that transfer in a few seconds or less. These large files, however, provide a measure of performance.
This was on a gigabit switch. 94.2 to 98.7 MiB per second means 790 to 828 Mbps, or 79.0% to 82.8% of the full bandwidth. That's not bad, especially when you consider the protocol overhead and the fact that this is not using Ethernet jumbo frames. NetApp has a nice analysis.
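You can check that arithmetic directly, treating the reported figures as MiB per second:

```shell
# MiB/s to Mbps: 1,048,576 bytes per MiB, 8 bits per byte, 10^6 bits per Mb.
awk 'BEGIN { printf "%.1f Mbps\n", 94.2 * 1048576 * 8 / 1e6 }'   # prints 790.2 Mbps
awk 'BEGIN { printf "%.1f Mbps\n", 98.7 * 1048576 * 8 / 1e6 }'   # prints 828.0 Mbps
```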
The top command shows that while ssh and rsync are keeping the CPU cores busy, this transfer isn't limited by processing speed. During this snapshot, ssh was using 84.4% of the available time of one CPU core, and rsync was using 32.1% of another. Results on the server, where it was the sshd server instead of the ssh client, were very similar.
$ top
top - 11:07:23 up 1 day, 21:11, 6 users, load average: 2.22, 2.08, 2.03
Tasks: 285 total, 2 running, 283 sleeping, 0 stopped, 0 zombie
%Cpu(s): 17.8 us, 4.7 sy, 0.0 ni, 63.8 id, 2.2 wa, 0.0 hi, 11.5 si, 0.0 st
MiB Mem : 11892.1 total, 145.6 free, 1661.9 used, 10084.6 buff/cache
MiB Swap: 6144.0 total, 6029.2 free, 114.8 used. 9592.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
43374 cromwell 20 0 10280 7348 5464 S 84.4 0.1 62:03.08 ssh
43373 cromwell 20 0 73720 17600 2292 S 32.1 0.1 31:31.94 rsync
1465 cromwell 20 0 4068652 139752 76744 R 4.3 1.1 47:57.37 cinnamon
989 root 20 0 858440 62740 48688 S 2.6 0.5 35:13.12 Xorg
101 root 20 0 0 0 0 S 1.3 0.0 1:15.36 kswapd0
1648 cromwell 20 0 937136 72084 42544 S 1.3 0.6 19:02.48 audacio+
45387 cromwell 20 0 4724388 186388 117928 S 1.3 1.5 0:36.92 chrome
[... lines omitted ...]
Digging Deeper
The iostat command measures storage I/O in transfers per second (tps); kilobytes read, written, and discarded per second; and the totals read and written during that measurement cycle. Here I'm running it to output a report every 10 seconds, indefinitely until interrupted with Control-C. As with vmstat, you should ignore the first block of output. It is intended to be cumulative since boot time, but that isn't usually of interest. Worse yet, at least for vmstat, it isn't at all accurate. Here are the first two useful blocks of information on the client, where the data is being read for transmission to the other end. As you can see, it's being read from the /dev/sdb device.
$ iostat 10
Linux 5.4.0-65-generic (clienthost) 02/21/21 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.78 0.00 0.92 0.58 0.00 96.72
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 0.68 11.85 69.94 0.00 1928364 11385973 0
sdb 22.60 3240.46 55.14 0.00 527539060 8976908 0
avg-cpu: %user %nice %system %iowait %steal %idle
19.00 0.00 15.53 2.84 0.00 62.63
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 3.00 20.00 14.00 0.00 200 140 0
sdb 580.20 98649.60 10.80 0.00 986496 108 0
avg-cpu: %user %nice %system %iowait %steal %idle
19.09 0.00 16.91 2.39 0.00 61.61
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 2.90 0.00 15.60 0.00 0 156 0
sdb 602.70 102860.80 0.00 0.00 1028608 0 0
^C
The ethstat command may be of interest. Again, ignore the first block of output. We get Mbps and packets per second, in and out. The interface enp4s0 is moving the heavy traffic.
$ ethstat
total: 0.48 Mb/s In 85.81 Mb/s Out - 631.3 p/s In 7312.4 p/s Out
eno1: 0.01 Mb/s In 0.00 Mb/s Out - 1.1 p/s In 0.7 p/s Out
enp4s0: 0.46 Mb/s In 85.81 Mb/s Out - 630.2 p/s In 7311.7 p/s Out
enp5s0: 0.00 Mb/s In 0.00 Mb/s Out - 0.0 p/s In 0.0 p/s Out
total: 4.68 Mb/s In 879.72 Mb/s Out - 6193.4 p/s In 74653.5 p/s Out
eno1: 0.13 Mb/s In 0.00 Mb/s Out - 11.5 p/s In 7.1 p/s Out
enp4s0: 4.55 Mb/s In 879.71 Mb/s Out - 6181.9 p/s In 74646.4 p/s Out
enp5s0: 0.00 Mb/s In 0.00 Mb/s Out - 0.0 p/s In 0.0 p/s Out
total: 4.69 Mb/s In 891.44 Mb/s Out - 6199.3 p/s In 75361.6 p/s Out
eno1: 0.13 Mb/s In 0.00 Mb/s Out - 11.1 p/s In 7.4 p/s Out
enp4s0: 4.56 Mb/s In 891.44 Mb/s Out - 6188.2 p/s In 75354.2 p/s Out
enp5s0: 0.00 Mb/s In 0.00 Mb/s Out - 0.0 p/s In 0.0 p/s Out
^C
If you frequently do large transfers, you might look into further performance measurements and possible tuning.
If You're Synchronizing a GoDaddy Web Site
The good news is that GoDaddy doesn't bill for bandwidth.
The bad news is that GoDaddy SSH connectivity is unreliable.
I still have toilet-guru.com hosted there, but I wouldn't do that again. Three clients also have sites there. So, I have to deal with GoDaddy flakiness. Here's the trick: add a --timeout=10 option. Linux-based hosting at GoDaddy puts your web root in your ~/public_web directory. My synchronize-websites script contains the following:
[...lines deleted...]
echo toilet-guru.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
    /var/www/html-toilet-guru/ toiletguru@toilet-guru.com:public_web
echo '-------------------------------------------------------------------------'
echo client1.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
    /var/www/html-client1/ clientuser1@client1.com:public_web
echo '-------------------------------------------------------------------------'
echo client2.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
    /var/www/html-client2/ clientuser2@client2.com:public_web
echo '-------------------------------------------------------------------------'
echo client3.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
    /var/www/html-client3/ clientuser3@client3.com:public_web
echo '-------------------------------------------------------------------------'
echo cromwell-intl.com
rsync -av --progress --exclude '*.swp' \
    /var/www/html/ cromwell-intl.com:/var/www/html
[...lines deleted...]
I frequently see output like this, where one or more GoDaddy servers timed out:
$ synchronize-websites
toilet-guru.com
sending incremental file list

sent 77,483 bytes  received 136 bytes  51,746.00 bytes/sec
total size is 303,892,020  speedup is 3,915.18
-------------------------------------------------------------------------
client1.com
sending incremental file list
rsync error: timeout in data send/receive (code 30) at io.c(137) [receiver=3.0.9]
rsync: connection unexpectedly closed (86 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [sender=3.1.3]
-------------------------------------------------------------------------
client2.com
sending incremental file list

sent 46,495 bytes  received 466 bytes  18,784.40 bytes/sec
total size is 159,567,423  speedup is 3,397.87
-------------------------------------------------------------------------
client3.com
sending incremental file list

sent 43,049 bytes  received 496 bytes  29,030.00 bytes/sec
total size is 912,999,178  speedup is 20,966.80
-------------------------------------------------------------------------
cromwell-intl.com
sending incremental file list
open-source/
open-source/rsync.html
         18,859 100%   17.32MB/s    0:00:00 (xfr#1, ir-chk=1079/3867)

sent 463,173 bytes  received 1,167 bytes  132,668.57 bytes/sec
total size is 2,249,420,775  speedup is 4,844.34
Or, instead, this:
[...lines deleted...]
client1.com
kex_exchange_identification: Connection closed by remote host
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
[...lines deleted...]
Give it a moment, then try it again. Be careful! Rapidly repeated failures get your client IP address blacklisted, and it takes at least an hour on the phone to get the problem escalated to higher-level support staff who can see and fix these things. Even when you start by explaining exactly what's going on.
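A small wrapper can do that waiting for you. This is only a sketch: the function name, the three-attempt limit, and the RETRY_DELAY variable (which lets the pause be shortened) are my own choices, not anything rsync provides.

```shell
# Hypothetical retry wrapper: run a command (such as an rsync invocation)
# up to 3 times, pausing between failed attempts so that rapid retries
# don't get the client IP address blacklisted.
try_rsync() {
    attempt=1
    while [ "$attempt" -le 3 ]; do
        if "$@"; then
            return 0                     # the command succeeded
        fi
        echo "attempt $attempt failed, waiting before retrying..." >&2
        sleep "${RETRY_DELAY:-60}"       # default: wait a full minute
        attempt=$((attempt + 1))
    done
    return 1                             # gave up after 3 attempts
}

# Usage, with a hypothetical client site:
# try_rsync rsync -av --progress --timeout=10 \
#     /var/www/html-client1/ clientuser1@client1.com:public_web
```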