Mirroring Data and Making Backups with rsync
Mirroring Data and Synchronizing File Systems with rsync
The
rsync
command can easily and quickly meet several goals on Linux,
BSD, and other platforms.
You can use it locally, within one host,
but it's especially useful for mirroring data from one
host to another.
You can run it on either the source or destination host.
Backups are absolutely vital.
You can use rsync
to
make a complete mirror image of a file system
on another system.
You can recursively copy a directory structure
while preserving file permissions and modification and
access timestamps.
Symbolic links, whether they point to absolute or relative
paths, will be recreated.
(So would device-special files, named pipes, and sockets,
but those are much less likely to be an issue.)
Web site maintenance can become much easier.
You can make changes and additions to a local copy,
and then use rsync
to
synchronize what's on the server with the local copy.
The rsync
command only transfers data as needed,
making it very efficient.
And, it transfers via SSH for security.
Advantages: rsync > ssh+tar
Mirroring Data with ssh
and tar
You can also use a combination of
ssh
and
tar
to make a mirror image of a directory tree,
either from local to remote or remote to local.
However, a great advantage of rsync
is that it only transfers what it needs to.
By default, rsync
makes only a quick check
based on size or modification time to decide what it
needs to transfer.
It deals with timezones as needed. My web
server runs on UTC and my home system and laptop on
local time, but this still works regardless of how
the systems are spread across the Internet.
Linux and BSD file systems generally store timestamps in UTC,
and the stat command displays them with the local offset from UTC.
Run stat to verify that for yourself.
However, you can use the --checksum
option
to instead analyze content to decide what needs to be
transferred.
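The difference between the quick check and --checksum can be sketched in a few lines of shell. This is a simplified model and not rsync's actual implementation: the function name needs_transfer is hypothetical, cmp stands in for rsync's whole-file checksum, and the -nt and -ot file tests are assumed to be available, as they are in bash, dash, and ksh.

```shell
# needs_transfer SRC DST [quick|checksum]
# Exit status 0 means "this simplified model would copy SRC over DST".
needs_transfer() {
    src=$1
    dst=$2
    mode=${3:-quick}

    # A file missing from the destination is always sent.
    [ -e "$dst" ] || return 0

    if [ "$mode" = checksum ]; then
        # Content comparison; cmp stands in for a whole-file checksum.
        if cmp -s "$src" "$dst"; then return 1; else return 0; fi
    fi

    # Quick check: send the file if its size or modification time differs.
    [ "$(wc -c < "$src")" -eq "$(wc -c < "$dst")" ] || return 0
    [ "$src" -nt "$dst" ] && return 0
    [ "$src" -ot "$dst" ] && return 0
    return 1
}
```

In this model, a file edited in place to the same size and then given its old modification time slips past the quick check but is caught in checksum mode, at the cost of reading every file on both sides.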
How to Use rsync
Below is an example of using rsync.
I have just edited several files for my web site
on my local machine.
Now I want to upload all the changes to my server.
If the changes include creating new directories,
files, or symbolic links, all of those will be created
on the remote system.
As for the options:
-a
Archive mode: Recursively follow the directory
tree structure preserving owners, groups,
permissions, modification timestamps,
symbolic links, and device-special files.
You have to do this as root
to create device nodes or change ownership.
Add the -U
or --atimes
option to also preserve access timestamps.
-v
Verbose mode, narrate the list of transferred files.
--progress
Show the progress for individual files.
This is very useful when large files are
involved, to assure you that it's working.
It also shows the network performance.
--exclude xyz
Exclude paths matching the pattern xyz.
You can use this multiple times.
The following example will not copy
vi/vim swap files named *.swp
or directories named old-css or old-js.
Here we go with the command, and the output.
All the additions and updates in the local
/var/www/html directory tree will be copied
to the same location on the host cromwell-intl.com.
Thousands of files that don't need to be uploaded won't be.
$ rsync -av --progress \
    --exclude '*.swp' \
    --exclude '/old-css/' \
    --exclude '/old-js/' \
    /var/www/html/ cromwell-intl.com:/var/www/html
sending incremental file list
sitemap.txt
         63,071 100%   59.48MB/s    0:00:00 (xfr#1, ir-chk=1041/1055)
open-source/
open-source/Index.html
         30,250 100%  113.18kB/s    0:00:00 (xfr#2, ir-chk=1133/3867)
open-source/rsync.html
          6,623 100%   24.78kB/s    0:00:00 (xfr#3, ir-chk=1079/3867)
open-source/tar-and-ssh.html
          7,009 100%   26.23kB/s    0:00:00 (xfr#4, ir-chk=1067/3867)
sent 455,142 bytes  received 1,992 bytes  130,609.71 bytes/sec
total size is 2,249,384,384  speedup is 4,920.62
To That Location? Or Into That Location?
In the above example, the destination already exists.
I must include the trailing slash on the source.
That is, use /var/www/html/
instead of simply /var/www/html.
If I leave that off, then the destination gets a new subdirectory
/var/www/html/html containing an extra copy of
everything else from one level up.
The trailing /
on the source means
"Copy the contents of this directory
and not the directory itself."
These two commands are equivalent:
$ rsync -av --progress \
    --exclude '*.swp' \
    --exclude '/old-css/' \
    --exclude '/old-js/' \
    /var/www/html/ cromwell-intl.com:/var/www/html
$ # Anchored patterns are relative to the top of the transfer, which is
$ # now one level up, so they gain the html/ prefix.
$ rsync -av --progress \
    --exclude '*.swp' \
    --exclude '/html/old-css/' \
    --exclude '/html/old-js/' \
    /var/www/html cromwell-intl.com:/var/www
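The trailing-slash rule can be written down as a tiny shell function. This is only a model of the rule described above and does not run rsync; the name landing_dir is hypothetical.

```shell
# landing_dir SRC DEST
# Print where the contents of SRC end up under DEST, following the rule:
# a trailing / on SRC copies its contents, otherwise SRC itself is copied.
landing_dir() {
    src=$1
    dest=${2%/}                  # drop one trailing slash for tidy output
    case $src in
        */) printf '%s\n' "$dest" ;;
        *)  printf '%s/%s\n' "$dest" "$(basename "$src")" ;;
    esac
}

landing_dir /var/www/html/ cromwell-intl.com:/var/www/html
# prints cromwell-intl.com:/var/www/html
landing_dir /var/www/html cromwell-intl.com:/var/www
# prints cromwell-intl.com:/var/www/html
landing_dir /var/www/html cromwell-intl.com:/var/www/html
# prints cromwell-intl.com:/var/www/html/html
```

The first two calls land in the same place, and the third reproduces the unwanted html/html subdirectory.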
Connect via SSH?
Yes, you want to use SSH. And by default you will, on modern systems.
rsync is quite old, dating from the era
of rsh and rlogin.
No one uses rsh and rlogin,
or at least they shouldn't!
Use ssh instead!
Similarly, there has been an rsync daemon.
Usually inetd listened on TCP port 873, and
started rsync --daemon as needed.
In the past, rsync used rsh as its default remote shell,
and it could also connect directly to its own service.
However, modern versions of rsync use SSH by default.
I can't think of a good reason to use anything else.
Performance
Later, I wanted to make a backup copy of a large collection of media files: photographs, video files, and DVD UDF image files.
When I play one of my DVDs, I first extract it into an image file so I can watch it with Kodi or OSMC. I very much prefer the interface, and generally find it easier and more convenient. Here's a segment of output when mirroring media files to another system.
[... lines omitted ...]
iso images/
iso images/2001: A Space Odyssey.iso
  6,923,206,656 100%   97.16MB/s    0:01:07 (xfr#178, to-chk=169/399)
iso images/A Scanner Darkly.iso
  6,870,347,776 100%   98.72MB/s    0:01:06 (xfr#179, to-chk=168/399)
iso images/Abraham Lincoln Vampire Hunter.iso
  6,748,639,232 100%   97.34MB/s    0:01:06 (xfr#180, to-chk=167/399)
iso images/Ad Astra.iso
  7,037,278,208 100%   94.20MB/s    0:01:11 (xfr#181, to-chk=166/399)
iso images/American Hustle.iso
  6,199,312,384 100%   94.76MB/s    0:01:02 (xfr#182, to-chk=165/399)
iso images/Andromeda Strain.iso
  4,519,952,384 100%   97.60MB/s    0:00:44 (xfr#183, to-chk=164/399)
iso images/Apocalypse Now Redux.iso
  8,319,780,864 100%   95.70MB/s    0:01:22 (xfr#184, to-chk=163/399)
iso images/Arrival.iso
  7,966,900,224 100%   94.97MB/s    0:01:20 (xfr#185, to-chk=162/399)
iso images/Atomic Blonde.iso
  6,010,949,632 100%   96.20MB/s    0:00:59 (xfr#186, to-chk=161/399)
[... lines omitted ...]
The reported speed isn't useful for small files that transfer in a few seconds or less. These large files, however, provide a measure of performance.
This was on a gigabit switch. 94.2 to 98.7 MiB per second means 790 to 828 Mbps, or 79.0% to 82.8% of the full bandwidth. That's not bad, especially when you consider the protocol overhead and the fact that this is not using Ethernet jumbo frames. NetApp has a nice analysis.
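The conversion above is easy to check with awk. This is a minimal sketch, assuming that the MB/s figure rsync reports is in units of 1,048,576 bytes, as the calculation above does; the function name mib_to_mbps is hypothetical.

```shell
# mib_to_mbps RATE
# Convert mebibytes per second to megabits per second (10^6 bits).
mib_to_mbps() {
    awk -v mib="$1" 'BEGIN { printf "%.1f\n", mib * 1048576 * 8 / 1000000 }'
}

mib_to_mbps 94.20    # prints 790.2
mib_to_mbps 98.72    # prints 828.1
```

Dividing those results by the 1000 Mbps of a gigabit link gives the 79.0% and 82.8% figures.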
The
top
command shows that while ssh
and
rsync
are keeping the CPU cores busy,
this transfer isn't limited by processing speed.
During this snapshot, ssh
was using 84.4%
of the available time of one CPU core,
rsync
was using 32.1% of another.
Results on the server, where it was the sshd
server instead of the ssh
client,
were very similar.
$ top
top - 11:07:23 up 1 day, 21:11, 6 users, load average: 2.22, 2.08, 2.03
Tasks: 285 total, 2 running, 283 sleeping, 0 stopped, 0 zombie
%Cpu(s): 17.8 us, 4.7 sy, 0.0 ni, 63.8 id, 2.2 wa, 0.0 hi, 11.5 si, 0.0 st
MiB Mem : 11892.1 total, 145.6 free, 1661.9 used, 10084.6 buff/cache
MiB Swap: 6144.0 total, 6029.2 free, 114.8 used. 9592.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
43374 cromwell 20 0 10280 7348 5464 S 84.4 0.1 62:03.08 ssh
43373 cromwell 20 0 73720 17600 2292 S 32.1 0.1 31:31.94 rsync
1465 cromwell 20 0 4068652 139752 76744 R 4.3 1.1 47:57.37 cinnamon
989 root 20 0 858440 62740 48688 S 2.6 0.5 35:13.12 Xorg
101 root 20 0 0 0 0 S 1.3 0.0 1:15.36 kswapd0
1648 cromwell 20 0 937136 72084 42544 S 1.3 0.6 19:02.48 audacio+
45387 cromwell 20 0 4724388 186388 117928 S 1.3 1.5 0:36.92 chrome
[... lines omitted ...]
Digging Deeper
The
iostat
command measures storage I/O in transfers per second
(tps
); kilobytes read, written, and
discarded per second; and the total read and written
during that measurement cycle.
Here I'm running it to output a report every 10 seconds,
indefinitely until interrupted with Control-C.
The same as with vmstat,
you should ignore the first block of output.
It is intended to be cumulative since boot time,
but that isn't usually of interest.
Worse yet, at least for vmstat
it isn't at all accurate.
Here are the first two useful blocks of information
on the client, where the data is being read for transmission
to the other end.
As you can see, it's being read from
the /dev/sdb
device.
$ iostat 10
Linux 5.4.0-65-generic (clienthost) 02/21/21 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.78 0.00 0.92 0.58 0.00 96.72
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 0.68 11.85 69.94 0.00 1928364 11385973 0
sdb 22.60 3240.46 55.14 0.00 527539060 8976908 0
avg-cpu: %user %nice %system %iowait %steal %idle
19.00 0.00 15.53 2.84 0.00 62.63
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 3.00 20.00 14.00 0.00 200 140 0
sdb 580.20 98649.60 10.80 0.00 986496 108 0
avg-cpu: %user %nice %system %iowait %steal %idle
19.09 0.00 16.91 2.39 0.00 61.61
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 2.90 0.00 15.60 0.00 0 156 0
sdb 602.70 102860.80 0.00 0.00 1028608 0 0
^C
Linux interface names
The
ethstat
command may be of interest.
Again, ignore the first block of output.
We get Mbps and packets per second in and out.
The interface enp4s0
is moving the heavy traffic.
$ ethstat
total: 0.48 Mb/s In 85.81 Mb/s Out - 631.3 p/s In 7312.4 p/s Out
eno1: 0.01 Mb/s In 0.00 Mb/s Out - 1.1 p/s In 0.7 p/s Out
enp4s0: 0.46 Mb/s In 85.81 Mb/s Out - 630.2 p/s In 7311.7 p/s Out
enp5s0: 0.00 Mb/s In 0.00 Mb/s Out - 0.0 p/s In 0.0 p/s Out
total: 4.68 Mb/s In 879.72 Mb/s Out - 6193.4 p/s In 74653.5 p/s Out
eno1: 0.13 Mb/s In 0.00 Mb/s Out - 11.5 p/s In 7.1 p/s Out
enp4s0: 4.55 Mb/s In 879.71 Mb/s Out - 6181.9 p/s In 74646.4 p/s Out
enp5s0: 0.00 Mb/s In 0.00 Mb/s Out - 0.0 p/s In 0.0 p/s Out
total: 4.69 Mb/s In 891.44 Mb/s Out - 6199.3 p/s In 75361.6 p/s Out
eno1: 0.13 Mb/s In 0.00 Mb/s Out - 11.1 p/s In 7.4 p/s Out
enp4s0: 4.56 Mb/s In 891.44 Mb/s Out - 6188.2 p/s In 75354.2 p/s Out
enp5s0: 0.00 Mb/s In 0.00 Mb/s Out - 0.0 p/s In 0.0 p/s Out
^C
If you frequently do large transfers, you might look into further performance measurements and possible tuning.
If You're Synchronizing a GoDaddy Web Site
The good news is that GoDaddy doesn't bill for bandwidth.
The bad news is that GoDaddy SSH connectivity is unreliable and very outdated.
I still had
toilet-guru.com
hosted there, but I wouldn't do that again.
Three consulting clients also have sites there.
So, for the time being I had to deal with GoDaddy
being outdated and unreliable.
I talked to their technical support on the line. They informed me that the Linux-based shared hosting service I was using could not be updated. If I wanted security software that wasn't 9.5 years old, I would have to move to their "dedicated hosting" service which was significantly more expensive.
OK, then I just need a short-term workaround until I can migrate a couple of sites to my Google Cloud Platform server running Nginx on FreeBSD.
Dealing with GoDaddy's Outdated SSH
I updated my laptop from Linux Mint 20.3 to Linux Mint 21 in August 2022. That meant updating from OpenSSH_8.2p1 to OpenSSH_8.9p1. GoDaddy was still using OpenSSH_5.3p1 built against OpenSSL 1.0.1e from February 2013, 9.5 years in the past. Now connections from my updated laptop to a GoDaddy host failed. Asking for details showed this:
$ ssh -v myaccount@toilet-guru.com uptime
[... much output ...]
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: diffie-hellman-group-exchange-sha256
debug1: kex: host key algorithm: (no match)
Unable to negotiate with 23.229.238.72 port 22: no matching host key type found. Their offer: ssh-rsa,ssh-rsa-cert-v01@openssh.com,ssh-dss
The short answer is that GoDaddy's outdated OpenSSH offers only ssh-rsa and ssh-dss host keys, with no support for host keys based on elliptic-curve cryptography, while OpenSSH_8.9p1 no longer accepts the less secure ssh-rsa or ssh-dss by default.
At least at the time of writing this and using OpenSSH_8.9p1,
you can enable ssh-rsa or ssh-dss explicitly and then
the connection will work.
Also include the -v
option to see the details:
$ ssh -oHostKeyAlgorithms=+ssh-rsa myaccount@toilet-guru.com ssh -V
OpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013
And so, my workaround was to add host-specific exceptions
to my ~/.ssh/config
file.
I could have done this with a Host
stanza
for each GoDaddy host,
or with two Host
stanzas with two hosts each.
I listed all four in one stanza; it's up to you.
$ cat ~/.ssh/config
Host *
IdentityFile ~/.ssh/id_ed25519
IdentityFile ~/.ssh/id_ecdsa
# These are needed because in August 2022 with Mint using OpenSSH_8.9p1,
# GoDaddy was still using OpenSSH_5.3p1 from Feb 2013.
Host toilet-guru.com client1.com client2.com client3.com
HostKeyAlgorithms +ssh-rsa
I also needed to edit ~/.ssh/known_hosts
and remove the lines for those outdated GoDaddy hosts.
And now, using that workaround to connect from my updated laptop to an outdated GoDaddy server:
$ ssh -V
OpenSSH_8.9p1 Ubuntu-3, OpenSSL 3.0.2 15 Mar 2022
$ ssh -v myaccount@toilet-guru.com uptime
[... much output ...]
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: diffie-hellman-group-exchange-sha256
debug1: kex: host key algorithm: ssh-rsa
debug1: kex: server->client cipher: aes128-ctr MAC: umac-64@openssh.com compression: none
debug1: kex: client->server cipher: aes128-ctr MAC: umac-64@openssh.com compression: none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(2048<3072<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_GROUP received
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: SSH2_MSG_KEX_DH_GEX_REPLY received
debug1: Server host key: ssh-rsa SHA256:bS8oN8AH5+dgNQGuyB+EcpSQAVdJqP2tw+nrNuMJVrY
[... more output ...]
Here's what we would hope to see, communicating with an up-to-date server:
$ ssh -v cromwell-intl.com ssh -V
[... much output ...]
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ssh-ed25519
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: SSH2_MSG_KEX_ECDH_REPLY received
debug1: Server host key: ssh-ed25519 SHA256:BbuCDDVQOHhVK8WdVfZBKboAp29GEwmqfaRSTWIlRLw
[... more output ...]
OpenSSH_8.8p1, OpenSSL 1.1.1o-freebsd 3 May 2022
debug1: channel 0: free: client-session, nchannels 1
Transferred: sent 2944, received 3100 bytes, in 0.2 seconds
Bytes per second: sent 12486.3, received 13147.9
debug1: Exit status 0
Further suggestions for dealing with legacy versions of OpenSSH are here
Dealing with GoDaddy's Unreliable SSH
The unreliability seems to be due to overly sensitive
load balancers through which you access their web servers
via SSH and thus scp or rsync.
It easily takes over an hour on the phone to reach someone
who has any idea what you're talking about.
You want to avoid that problem.
Add a --timeout=10 option.
Linux-based hosting at GoDaddy puts your web root
in your ~/public_web directory.
My synchronize-websites script contains the following:
[...lines deleted...]
echo toilet-guru.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
    /var/www/html-toilet-guru/ toiletguru@toilet-guru.com:public_web
echo '-------------------------------------------------------------------------'
echo client1.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
    /var/www/html-client1/ clientuser1@client1.com:public_web
echo '-------------------------------------------------------------------------'
echo client2.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
    /var/www/html-client2/ clientuser2@client2.com:public_web
echo '-------------------------------------------------------------------------'
echo client3.com
rsync -av --progress --exclude '*.swp' --timeout=10 \
    /var/www/html-client3/ clientuser3@client3.com:public_web
echo '-------------------------------------------------------------------------'
echo cromwell-intl.com
rsync -av --progress --exclude '*.swp' \
    /var/www/html/ cromwell-intl.com:/var/www/html
[...lines deleted...]
I frequently see output like this, where one or more GoDaddy servers timed out:
$ synchronize-websites
toilet-guru.com
sending incremental file list
sent 77,483 bytes  received 136 bytes  51,746.00 bytes/sec
total size is 303,892,020  speedup is 3,915.18
-------------------------------------------------------------------------
client1.com
sending incremental file list
rsync error: timeout in data send/receive (code 30) at io.c(137) [receiver=3.0.9]
rsync: connection unexpectedly closed (86 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [sender=3.1.3]
-------------------------------------------------------------------------
client2.com
sending incremental file list
sent 46,495 bytes  received 466 bytes  18,784.40 bytes/sec
total size is 159,567,423  speedup is 3,397.87
-------------------------------------------------------------------------
client3.com
sending incremental file list
sent 43,049 bytes  received 496 bytes  29,030.00 bytes/sec
total size is 912,999,178  speedup is 20,966.80
-------------------------------------------------------------------------
cromwell-intl.com
sending incremental file list
open-source/
open-source/rsync.html
         18,859 100%   17.32MB/s    0:00:00 (xfr#1, ir-chk=1079/3867)
sent 463,173 bytes  received 1,167 bytes  132,668.57 bytes/sec
total size is 2,249,420,775  speedup is 4,844.34
Or, instead, this:
[...lines deleted...]
client1.com
kex_exchange_identification: Connection closed by remote host
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
[...lines deleted...]
Give it a moment, then try it again. Be careful! Rapidly repeated failures get your client IP address blacklisted, and it takes at least an hour on the phone to get the problem escalated to higher-level support staff who can see and fix these things. Even when you start by explaining exactly what's going on.
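One way to "give it a moment" without hammering the server is a small retry wrapper with a pause between attempts. This is a sketch and not part of the script shown above: the function name retry_with_delay, the limit of three tries, and the 60-second pause are all assumptions to adjust for your own situation.

```shell
# retry_with_delay MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or MAX_TRIES attempts have failed,
# sleeping DELAY_SECONDS between attempts. Returns the last exit status.
retry_with_delay() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    status=1
    while [ "$attempt" -le "$max_tries" ]; do
        "$@"
        status=$?
        [ "$status" -eq 0 ] && return 0
        echo "attempt $attempt of $max_tries failed with status $status" >&2
        # Pause only if another attempt is coming.
        if [ "$attempt" -lt "$max_tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return "$status"
}

# Hypothetical use with one of the transfers from the script:
# retry_with_delay 3 60 rsync -av --progress --exclude '*.swp' --timeout=10 \
#     /var/www/html-client1/ clientuser1@client1.com:public_web
```

Keeping the number of tries small and the delay long is deliberate, given the risk of getting blacklisted by rapidly repeated failures.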