Last updated on Tuesday 16th of August 2022 06:34:57 PM

©VMWare ©ESXi narrowband off-site backup over high latency link

Overcome poor effective bandwidth due to latency using parallel Rsync streams

We recently had to devise an off-site backup system for a client hosting 1.6 TB of data on a Caribbean island, with an asymmetric fiber-optic connection that would yield barely 1.5 to 2.0 MB/s upstream for an SSH data stream.

The target system was a dedicated server from a well-known ISP in Central Europe, with latencies ranging from 160 to 200 ms, which is not bad at all for the distance, but too much to achieve fast TCP transfer speeds using standard OpenSSH.

Narrowband VM backup
The method explained in this post has been applied to synchronizing ©XSIBackup repositories; nonetheless, the concepts it deploys can easily be applied to any similar situation in which a number of files must be synchronized over a relatively low-bandwidth link with high latency.

To make things even worse, it is not infrequent for these sites to suffer blackouts or hurricane alerts, which results in the servers being switched off for some time to prevent physical damage.

We installed ©XSIBackup locally and deployed some ready-to-use replicas and a backup repository with 20 days of backward restore points. So far so good; the local part of the job was easy to accomplish: a Rocky Linux 8 VM hosted on a new 4 TB local disk.

Things started to become less easy when we tried to upload the whole Rocky Linux VM to the off-site server in Europe. The relatively high latency resulted in very poor effective upload speeds due to the nature of the TCP protocol, which requires acknowledgement from the remote end.
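The ceiling that latency imposes on a single acknowledged stream can be estimated with the bandwidth-delay relation: maximum throughput is roughly the in-flight window divided by the round-trip time. The window size and RTT below are illustrative assumptions, not measured values:

```shell
# Max throughput of one TCP/SSH stream ~= window / RTT.
# Assuming an effective window of ~350 KB and a 180 ms round trip:
awk 'BEGIN { printf "%.1f MB/s\n", (350 / 1024) / 0.18 }'
# prints: 1.9 MB/s
```

That figure is in the same range as the 1.5 to 2.0 MB/s we were observing, and it also explains the fix: running N independent streams in parallel multiplies that ceiling by N.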

We recently wrote a post on this subject: ©VMWare ©ESXi throughput limitations.

- There exist ways to work around that limitation inherent to the TCP protocol, but in this case the tools to use were fixed, and they use synchronous TCP and OpenSSH -

We first started to upload the full backup server Rocky Linux VM to the off-site server, but the speed was very low, and the frequent network outages made this method unfeasible.

Luckily, ©XSIBackup repositories break the VMs into small chunks that are easy to transfer individually, which allows easy accounting of the number of blocks already transferred to the backend.

So we started to upload around 1.1 million blocks, of 680 KB average size, to the target server using Rsync over SSH. The thing was slow, but after some days we had uploaded around 500K blocks and we felt it could be done. Nonetheless, we needed to ensure that the number of new blocks generated daily could be uploaded in time to the off-site backup server.

We then knew we had to give the project an additional twist. We couldn't settle for the speed we were achieving, as any unexpected situation could end up in a clogged queue and an excruciating delay in the off-site backups.

So, as we knew that the low effective bandwidth was due to the synchronous nature of the TCP protocol plus a reduced OpenSSH internal buffer, we came to the conclusion that we needed to parallelize the transfers to improve the saturation of the TCP/IP stream and squeeze out its full potential.

We have been using Rsync for years and we knew we could trust its rock-solid stability for the task. We read some posts by other people applying the same kind of strategy, and their comments were totally favourable, so we started with some preliminary tests. The results were even better than we expected, so we refined the method.

You will need an additional component in your source server, typically the one where you have your primary local backup repository. This component is screen. It's a terminal session manager that allows your scripts to detach from the TTY so that you can run multiple jobs as different processes and monitor them as they run.

There are other methods to detach a terminal session from a TTY, but they are difficult to accomplish and reconnecting is not always possible. The screen binary makes it easy to run multiple child processes attached to a virtual TTY and reattach to them from the main TTY window as needed.

The aim of this post is not to train the reader in the use of screen; still, we will provide some basic hints on its functionality.

- Prepending screen -dm to your Rsync commands will create a separate virtual TTY identified by a subprocess Id that you can use to reattach to the terminal session.

- Running screen -list will print a list of the running screen subprocesses along with their Ids.

- Running screen -r <screen Id> will reattach your current terminal to the output of that subprocess.

- [Ctrl+a] + [Ctrl+d] will detach from the subprocess view without affecting it.

The Bash script we finally prepared to run multiple Rsync processes is the one below. Please note how we use Rsync's --size-only argument to tell it to compare files by their size only.
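The original listing has not survived in this copy of the post, so here is a minimal sketch reconstructed from the dissection in the next section; the repository path, the destination and any Rsync options other than --size-only are assumptions you should adapt to your environment:

```shell
#!/usr/bin/env bash
# Parallel Rsync upload sketch; REPO and DEST are assumed placeholders.
REPO="/backup/repo"                          # local ©XSIBackup repository (assumed)
DEST="user@offsite.example.com:/backup/repo" # off-site server (assumed)

# 1/ one detached Rsync per YYYYMMDDhhmmss restore point folder (.map files)
MAPF=$(ls -d "$REPO"/2*/ 2>/dev/null)
for f in $MAPF; do
    screen -dm rsync -a --size-only "$f" "$DEST/$(basename "$f")/"
done

# 2/ sixteen detached Rsync processes, one per first-level hex folder
#    of the block store: data/0 ... data/f
for i in {0..15}; do
    hexval="$(printf '%x\n' "$i")"
    screen -dm rsync -a --size-only "$REPO/data/$hexval" "$DEST/data/"
done

echo "transfer jobs launched; monitor them with: screen -list"
```

Each screen -dm invocation returns immediately, so the whole script completes in a moment while the Rsync children keep running in the background.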

Removing that flag would result in the Rsync processes calculating the full checksum for every file, which would make extensive use of the local and remote CPUs; you would risk clogging the backup servers, and even assuming you had enough CPU cycles available, it would take much longer indeed, as you would be reading the full repository (1.6 TB in our case study) and calculating the hash checksum for every block.

By using the --size-only argument we are making some assumptions: if the TCP protocol and the SSH tunnel do not return any errors, and the name and size of the local and remote files are the same, we can assume with a fair degree of certainty that the files are identical.

That degree of certainty is lower than a full checksum on each file; still, the possibility that the TCP checksums pass, the SSH integrity checks pass, the resulting file has the same size, and the destination data is nevertheless corrupt is something we can consider practically impossible.

Dissecting the script

There are two main loops in the script where the parallelization is happening:

1/ for f in $MAPF

Here we are copying the YYYYMMDDhhmmss timestamped folders one by one inside the loop. As the screen command triggers the subprocess and returns immediately, the loop finishes almost instantly. This loop will generate as many child processes as there are restore points in the repository, but it copies just the .map files; we will also need the blocks to be able to rebuild the virtual disks or to access them.

2/ for i in {0..15}

This loop takes care of copying the data blocks; this is the bulk of the data, and these are the child processes that will take the most time to complete. Unlike the .map files loop, this one will always produce 16 subprocesses, which correspond to the 16 possible hexadecimal digits of the first character in the first level of the block structure.

The expression hexval="$(printf '%x\n' $i)" converts each number in the sequence 0...15 to hexadecimal 0...f:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,  a,  b,  c,  d,  e,  f
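You can check the whole mapping from a shell in one line:

```shell
# print 0..15 in hexadecimal, space separated
for i in {0..15}; do printf '%x ' "$i"; done; echo
# prints: 0 1 2 3 4 5 6 7 8 9 a b c d e f
```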

Running the above script will launch 10 screen subprocesses to sync the 10 restore points with .map files, from July the 1st to July the 10th, plus 16 additional screen child processes, each synchronizing one of the 16 main folders of the block structure.

If you now list the running screen subprocesses (screen -list), you will see one entry per running job.

The first 10 processes correspond to the restore point synchronization: the YYYYMMDDhhmmss folders containing the .map files.

The last 16 entries correspond to each one of the 16 main folders in the hexadecimal structure of the blocks inside the data folder.

If you run screen -list again after some minutes, you will see that there are fewer processes; they disappear from the list as each individual task is completed and its screen subprocess ends.

You can connect to any of those processes with screen -r <procId> to inspect its output to STDOUT; this will give you an idea of how long the process must still run before it ends. For instance, running screen -r 714071 would attach you to the output of that particular Rsync subprocess.

Once all child processes finish and the command screen -list returns no child processes, the synchronization will have finished.
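If you prefer not to check for that condition by hand, a small polling loop can watch for it; this is just a sketch, and the 60-second interval is an arbitrary choice:

```shell
# Block until screen reports no remaining attached or detached sessions
while screen -list 2>/dev/null | grep -Eq '\((Attached|Detached)\)'; do
    sleep 60
done
echo "all Rsync child processes have finished"
```

The grep pattern matches the "(Detached)" / "(Attached)" state markers that screen -list prints next to each session.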

The first sync, which will copy hundreds of thousands or even millions of blocks, should be run manually and inspected frequently to make sure that the seeding process completes without errors. You may need to do that over a weekend or at night, depending on your infrastructure load.

In our case study, which is real, we were able to increase the saturation of the upload stream by a factor of more than 10.