Speeding up data transfer with tar and SSH
Moving data to a new computer
Do this with rsync instead!
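(Assuming rsync is installed on both machines, a roughly equivalent command would be the following; -a preserves symbolic links, permissions, and timestamps, and -H adds hard links.)
% rsync -aH old:/home/yourname/ /home/yourname/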
I recently had to replace my laptop, and that meant copying about 100 GB from the old system to the new one.
You can recursively copy a directory structure while preserving file permissions and modification and access timestamps with the scp command.
Let's say you're on the new system, and you are copying your data from the old one, with the boring but obvious hostname old. Your home directory is /home/yourname on both, and your login on the remote machine is username.
% cd /home
% scp -pr username@old:/home/yourname .
If your login is the same on both systems, you could simplify this:
% cd /home
% scp -pr old:/home/yourname .
If you run this command as an ordinary user, all the files will be owned by you regardless of their ownership on the remote machine. To preserve the original ownerships, be root on the destination machine. Even that has limits with scp, which transfers permissions and timestamps but not owner and group; the tar pipeline described below does record ownership and restores it when the extracting end runs as root.
There is, however, a serious problem with copying the directory structure with scp: symbolic links are not copied as symbolic links; you instead get the things they point to. Nor would you get device-special files, sockets, or named pipes, but that's far less likely to be an issue.
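If you want to measure the damage, a quick sanity check is to count the symbolic links on each side after such a copy; the local count will come up short:
% ssh username@old 'find /home/yourname -type l | wc -l'
% find /home/yourname -type l | wc -l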
Don't worry, there's a fix.
Solving the problem with ssh and tar
The trick is to use the tar program to create a complete archive on the remote machine, and stream the data across the network to a tar command on the local machine, where it is extracted into place with all the metadata and file types intact.
Here is what you commonly see recommended. Again, we are doing this on the destination machine:
% ssh username@old 'tar czf - -C /home/username .' | tar xvzf - -C /home/username
As for those parameters: czf on the remote (creating) host means "Create (c) an archive, compressing (z) it, and put it in the file (f) whose name is to follow." But the name is simply "-", meaning standard output. The -C option changes to your remote home directory first, and the final "." says to archive everything found there. (Archiving "." rather than the absolute path matters: otherwise the extracted files would be nested under an extra home/username directory on the destination.)
After the pipe, the tar on the local machine is told "Extract (x) the archive, verbosely (v) so you see the names listed as the files are extracted, uncompressing the data stream (z), and the archive is in the file (f) to be named as the following parameter." Again the name is simply "-", in this case meaning standard input. The final -C tells tar to change to the specified directory before extracting.
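If you want to preview what will land where before committing to the full transfer, you can replace extract (x) with list (t) on the local end; nothing is written to disk:
% ssh username@old 'tar czf - -C /home/username .' | tar tzf - | head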
This command pipeline will accomplish exactly what we want, but we can both simplify it and improve its performance.
First, tar writes to standard output by default. However, on some systems tar will default to reading the first tape device /dev/rst0, and so we still need to tell the destination tar what to read. Second, the ssh authentication puts us in that account's home directory, so there is no need for -C on the remote side. We can simplify the command pipeline:
% ssh username@old 'tar cz .' | tar xvzf - -C /home/username
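Once the transfer finishes, a simple cross-check (not part of the transfer itself) is to compare file counts on the two machines; they should match:
% ssh username@old 'find . | wc -l'
% cd /home/username && find . | wc -l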
Speeding up the transfer
I started the transfer running, and saw that it was going to take a long time. I ran top on both sides, and saw that gzip was taking about 60% of the CPU time, with ssh (client) or sshd (server) taking about 30%.
These days our bulky data is already compressed: JPEG, PDF, and MPEG files, and the *.pptx and *.docx files from recent versions of Microsoft Office, are all compressed formats.
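You can test this on your own data: archive a representative directory with and without compression and compare the byte counts. If they are close, compression is doing nothing for you. (Pictures here is just an example directory.)
% tar c Pictures | wc -c
% tar cz Pictures | wc -c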
Leave the compression off at both ends and the transfer will go faster with typical user data:
% ssh username@old 'tar c .' | tar xvf - -C /home/username
I was seeing 80% of the CPU going to ssh/sshd and 15% to tar.
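If the pv (pipe viewer) utility happens to be installed, you can insert it into the pipeline and watch the throughput directly:
% ssh username@old 'tar c .' | pv | tar xf - -C /home/username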
Now, to speed things up even more, you could try asking ssh to skip the encryption and decryption:
% ssh -c none username@old 'tar c .' | tar xvf - -C /home/username
However, for reasons I hope are obvious, modern OpenSSH-based SSH servers will not accept requests for the null cipher. Patches adding non-encrypting SSH for use within isolated enclaves do exist.
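A safer way to cut the cryptographic overhead is to pick a cipher your hardware handles well; on CPUs with AES instructions, one of the AES-GCM ciphers is usually among the fastest. You can list what your client supports and then name one explicitly (aes128-gcm@openssh.com is an example, not a universal recommendation):
% ssh -Q cipher
% ssh -c aes128-gcm@openssh.com username@old 'tar c .' | tar xf - -C /home/username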