Speeding up data transfer with tar and SSH

Moving data to a new computer

I recently had to replace my laptop, and that meant copying about 100 GB from the old system to the new one.

The scp command can recursively copy a directory structure (-r) while preserving file permissions and modification and access timestamps (-p). Let's say you're on the new system, copying your home directory from the old one, which has the boring but obvious hostname old. Your home directory is /home/yourname on both, and your login on the remote machine is username.

% cd /home
% scp -pr username@old:/home/yourname . 

If your login is the same on both systems, you could simplify this:

% cd /home
% scp -pr old:/home/yourname . 

If you run this command as an ordinary user, all the files will be owned by you regardless of their ownership on the remote machine. Running it as root on the destination machine does not help: scp with -p carries over only permissions and timestamps, not ownership, so the copies would simply be owned by root.

There is, however, a serious problem with copying the directory structure with scp: symbolic links are not copied as symbolic links; you instead get copies of the things they point to. Nor do you get device-special files, sockets, or named pipes, but that's far less likely to be an issue. Don't worry, there's a fix.

Solving the problem with ssh and tar

The trick is to use the tar program to create a complete archive on the remote machine, and stream the data across the network to a tar command on the local machine where it is extracted into place with all the metadata and file types intact.
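The idea can be tried out safely on a single machine before involving the network. The following is a minimal sketch, not the transfer itself: both tar commands run locally, and the scratch directory and the names docs, notes.txt, and latest are made up for the demonstration.

```shell
# Local stand-in for the ssh pipeline: both tar commands run on one machine.
# All directory and file names here are invented for the demonstration.
work=$(mktemp -d)
mkdir -p "$work/src/docs" "$work/dst"
echo "hello" > "$work/src/docs/notes.txt"
ln -s docs/notes.txt "$work/src/latest"    # a relative symbolic link

# Create the archive on standard output, extract it from standard input.
(cd "$work/src" && tar cf - .) | tar xpf - -C "$work/dst"

# The link should arrive as a link, not as a second copy of the file.
ls -l "$work/dst"
link_target=$(readlink "$work/dst/latest")
echo "latest -> $link_target"
```

If the listing shows latest as a symbolic link pointing at docs/notes.txt, the same pair of tar commands will treat links the same way when ssh sits in the middle.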

If OpenBSD is involved, use the GNU gtar command instead of tar.

Here is what you commonly see recommended. Again, we are doing this on the destination machine:

% ssh username@old 'tar czf - /home/username' | tar xvzf - -C /

As for those parameters: czf on the remote (creating) host means "Create (c) an archive, compressing (z) it, and put it in the file (f) whose name is to follow." But the name is simply "-", meaning standard output. The rest of the parameters specify the data that is to be archived, your remote home directory in this example.

After the pipe, the tar on the local machine is told "Extract (x) the archive, verbosely (v) so you see the names listed as the files are extracted, uncompressing the data stream (z), and the archive is in the file (f) to be named as the following parameter." Again the name is simply "-", in this case meaning standard input. The final -C means to change the extraction location as specified.
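One detail is worth checking whenever the creating tar is given an absolute path: common tar implementations store member names with the leading / removed, so where the files land depends entirely on the directory given to -C. The sketch below lists an archive instead of extracting it; the scratch directory and file name are assumptions for illustration only.

```shell
# Build a small tree in a scratch directory (the names are invented).
work=$(mktemp -d)
mkdir -p "$work/src"
echo "data" > "$work/src/file.txt"

# Archive by absolute path, then list (t) the member names instead of
# extracting. tar's notice about the leading slash goes to stderr.
members=$(tar cf - "$work/src" 2>/dev/null | tar tf -)
first_member=$(printf '%s\n' "$members" | head -n 1)
echo "archived as: $first_member"
```

The name printed should be relative, that is, the absolute path without its leading slash. Running the real pipeline with tar tvf - in place of tar xvzf - is a cheap way to preview where everything would be extracted.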

This command pipeline will accomplish exactly what we want, but we can both simplify it and improve its performance.

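First, when no f option is given, tar on typical Linux systems writes the archive to standard output, so the remote side can drop the "f -". That default is not universal, though: on some systems tar defaults to the first tape device, such as /dev/rst0, so we still tell the destination tar explicitly what to read.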

Second, ssh runs the remote command in that account's home directory, so the remote tar can simply archive the current directory (.). So, we can simplify the command pipeline:

% ssh username@old 'tar cz .' | tar xvzf - -C /home/username 

Speeding up the transfer

I started the transfer and saw that it was going to take a long time. Running top on both sides showed gzip taking about 60% of the CPU time, with ssh (on the client) or sshd (on the server) taking about 30%.

These days most of our bulky data is already compressed: JPEG, PDF, and MPEG files, and the *.pptx and *.docx files from recent versions of Microsoft Office. Compressing such data a second time costs CPU time and saves almost no bandwidth.
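A quick local experiment shows the effect. This is only a sketch: random bytes stand in for an already-compressed file such as a JPEG, repetitive text stands in for uncompressed data, and the file names are made up.

```shell
work=$(mktemp -d)

# Random bytes behave like already-compressed data: gzip finds no redundancy.
head -c 1048576 /dev/urandom > "$work/random.bin"
# Repetitive text behaves like uncompressed data such as logs.
yes "the same line of text over and over" | head -c 1048576 > "$work/text.txt"

# Compressed sizes in bytes; $(( )) strips any padding that wc adds.
random_gz=$(( $(gzip -c "$work/random.bin" | wc -c) ))
text_gz=$(( $(gzip -c "$work/text.txt" | wc -c) ))
echo "random: 1048576 -> $random_gz bytes"
echo "text:   1048576 -> $text_gz bytes"
```

The random file should come out essentially no smaller than it went in, while the text file shrinks dramatically. If your home directory is mostly source code or logs rather than media, compression may still pay for itself, especially over a slow link.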

Leave the compression off at both ends and the transfer will go faster with typical user data:

% ssh username@old 'tar c .' | tar xvf - -C /home/username 

With compression off, I was seeing 80% of the CPU going to ssh/sshd and 15% to tar.

Now, to speed things up even more, you could try asking ssh to skip the encryption and decryption:

% ssh -c none username@old 'tar c .' | tar xvf - -C /home/username 

However, for reasons I hope are obvious, modern OpenSSH-based SSH servers will not honor requests for the null cipher. Patches that add non-encrypting SSH for use within isolated enclaves are available.
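You can check what your own client offers before trying this. OpenSSH's ssh -Q cipher prints the supported ciphers one per line; the helper function has_cipher below is a hypothetical name introduced for this sketch, not part of OpenSSH.

```shell
# Succeeds if the cipher named in $1 appears, as a whole line, in the
# list read from standard input. (has_cipher is a made-up helper name.)
has_cipher() {
    grep -qxF -- "$1"
}

# Only query ssh if it is installed on this machine.
if command -v ssh >/dev/null 2>&1; then
    if ssh -Q cipher | has_cipher none; then
        echo "this ssh client offers the none cipher"
    else
        echo "this ssh client does not offer the none cipher"
    fi
fi
```

This only reports what the client supports; the server must also accept a cipher for it to be used.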