We recently got an email from our IT department that our workstation OSes will be getting upgraded from Ubuntu 18.04 MATE to Ubuntu 20.04 GNOME. As much as I love MATE and how lightweight it is (LinuxScoop makes wonderful OS overview videos), I also like the “visuals” of GNOME. My personal laptop already runs Ubuntu 20.04 GNOME, so I am excited to have it on my lab workstation as well.
However, this OS upgrade also means that we have to back up our workstations, since the drives will be wiped. Our research group has a generous storage space allocation on Compute Canada’s Cedar, so storage is not a big issue. The problem is that Cedar’s long-term storage space is a “tape-based backup system”, so there is a strict limit on the number of files we can store there. Therefore, the best strategy is to create tar archives of our data and store those on Cedar.
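(If you are curious how many files you would otherwise be pushing to tape, a quick count with find gives a sense of scale; the path below is just an example.)
# Count the files under a directory to see how it compares against the file-count quota.
$ find /local-scratch2 -type f | wc -l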
Unfortunately, both of my drives, /local-scratch (1 TB SSD) and /local-scratch2 (2 TB HDD), are over 80% full, so I do not have the space to first create a local tar.gz archive and then transfer it to Cedar.
Here are four approaches I used to move my data efficiently from my workstation to Cedar over SSH. They cover different scenarios and range in sophistication from “quick and dirty” to “parallelized and automated”.
I. Basic streaming over SSH
The most important point is to stop thinking of backup as two steps (i.e., Archive –> Transfer) and instead think in terms of streams. By piping the output of the tar command directly into the ssh command, we avoid creating an intermediate tar.gz file on the local disk.
# Back up workstation data and stream it to Cedar over SSH, without storing an intermediate archive on local disk.
$ tar cvf - /local-scratch2 | ssh cedar "cat > /scratch/location/local_scratch_2.tar"
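For completeness, restoring works the same way in reverse: stream the archive back from Cedar and pipe it into tar. A sketch, assuming the same “cedar” SSH alias and archive path as above (the destination directory is just an example and must already exist):
# Restore: stream the archive back from Cedar and extract it locally.
$ ssh cedar "cat /scratch/location/local_scratch_2.tar" | tar xvf - -C /path/to/restore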
II. Parallelizing the compression
The standard tar.gz uses gzip for compression, which is a single-threaded process. Most modern workstations have multiple cores, so we can parallelize the compression to speed up the backup. Ideally, the backup speed is then limited only by the network bandwidth, not the CPU.
To parallelize the compression, we use pigz, a parallel implementation of gzip that can use multiple cores. Using it with tar is straightforward:
# Back up workstation data and store the archive on local disk, using multiple CPU cores for compression
$ tar -vc --use-compress-program="pigz -p 8" -f <ARCHIVENAME.tar.gz> <FOLDER>
This uses 8 CPU cores to compress the archive. But notice that the archive is still written to the local disk.
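The -p 8 is hard-coded for my machine; if you would rather just use every core available, nproc reports the core count and can be substituted in, something like:
# Same command, but let pigz use all available cores.
$ tar -vc --use-compress-program="pigz -p $(nproc)" -f <ARCHIVENAME.tar.gz> <FOLDER>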
III. Combining the two: Parallel compression and streaming
Combining the two methods above, we can parallelize the compression and stream the result straight to Cedar, maximizing both CPU and network utilization without creating an intermediate local tar.gz file.
# Back up workstation data and stream it to Cedar over SSH, without storing an intermediate archive on local disk, using multiple CPU cores for compression.
$ tar --use-compress-program="pigz -p 8" -cf - <FOLDER> | ssh username@cedar.computecanada.ca "cat > /scratch/location/folder.tar.gz"
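If you want to keep an eye on how fast the stream is moving, a tool like pv can be spliced into the pipe. This is optional and assumes pv is installed (e.g., via sudo apt install pv):
# Same as above, with pv showing the throughput of the compressed stream.
$ tar --use-compress-program="pigz -p 8" -cf - <FOLDER> | pv | ssh username@cedar.computecanada.ca "cat > /scratch/location/folder.tar.gz"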
IV. Automating the process
A very realistic scenario is wanting to back up a huge directory where creating one massive (say, 2 TB) archive is risky. In that case, it may be safer to archive each subdirectory individually and transfer the archives over SSH. The script below loops over the subdirectories and creates a separate archive for each one.
#!/bin/bash
# Archive each subdirectory separately and stream it to Cedar over SSH.
for dir in */; do
    dir="${dir%/}"   # strip the trailing slash so the archive is named <dir>.tar.gz
    echo "Compressing: $dir"
    tar --use-compress-program="pigz -p 8" -cf - "$dir" | ssh username@cedar.computecanada.ca "cat > /home/username/scratch/location/$dir.tar.gz"
done
You can save this script as, say, backup_directory.sh, place it in the directory you want to back up, and run it from there.
Important: Remember to set up passwordless SSH keys first, otherwise you will keep getting prompted for the password for each folder. I find DigitalOcean’s guide on this to be quite comprehensive.
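The short version, assuming a standard OpenSSH setup, is roughly:
# Generate a key pair (if you do not have one already), then copy the public key to Cedar.
$ ssh-keygen -t ed25519
$ ssh-copy-id username@cedar.computecanada.ca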
Update: Can we go faster than pigz?
I recently found out about zstd, which apparently yields smaller file sizes while also being faster than pigz [source1, source2]. I should try it some time.
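I have not benchmarked it myself, but since tar accepts any compressor via --use-compress-program, swapping in zstd should look something like this (zstd’s -T0 uses all available cores):
# Untested sketch: parallel zstd compression streamed to Cedar.
$ tar --use-compress-program="zstd -T0" -cf - <FOLDER> | ssh username@cedar.computecanada.ca "cat > /scratch/location/folder.tar.zst"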