Last updated on Monday 31st of October 2022 03:04:07 PM

©XSIBackup-DC Thread Powered Improvements

Increasing backup/replica speed by parallelizing tasks

©XSIBackup-DC and its little brother ©XSIBackup-Pro (no CBT) have been making use of threads for some time now. Still, the parts of the code that implemented threading were not directly aimed at increasing data throughput.

It is generally accepted that applying threads does not necessarily improve raw data throughput. That would only happen if the CPU processed the data more slowly than the disks could deliver it, which is generally not the case. That has started to change lately with the arrival of extremely fast NVMe/M.2 disks; still, ©XSIBackup-DC copies data to NFS or over IP most of the time.

In any case, ©XSIBackup-DC does a lot more than just copy data: it calculates a checksum for every chunk of data, compresses the data and encrypts it via SSH to send it to the remote end, among other minor tasks.

It is those other tasks that can be threaded to improve the throughput of the data stream, which is especially important during the first full transfer of the data.


Threaded computing

How ©XSIBackup-DC improves with threading

As stated, ©XSIBackup-DC has to read the data from the source, then calculate the checksum of the data just read and eventually compress it before writing it to the remote end. Once the data has been written to the remote end, it reads a new chunk of data, and so on until the end of the file is reached.

We have divided the logic of the algorithm into smaller pieces so that some tasks are performed in secondary threads while the main thread progresses on its own.

Data is read and hashed while a secondary thread is copying the previous chunk to the destination. When backing up over IP, the data is now compressed at the remote end and then written to disk. In the case of backups/replicas to local NFS storage, the effect of threading is well below what you can achieve when backing up over IP. This is mainly because, over IP, the load is split in two: the client reads and hashes while the server at the remote end compresses and writes the data; that is, we are using two CPUs at the same time to do the same job.
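
The following is a minimal sketch of how such a double-buffer pipeline can be built with POSIX threads. It is only an illustration of the technique, not ©XSIBackup-DC's actual source: the chunk size and the toy checksum are arbitrary, and send_chunk() is a stand-in for the compression/encryption/transfer step.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK (1024 * 1024)                 /* 1 MiB example chunk size */

    struct job { char *buf; size_t len; };

    /* Stand-in for "copy the chunk to the destination": the real program
       would compress/encrypt here, or let the remote end compress when
       backing up over IP. */
    static void *send_chunk(void *arg)
    {
        struct job *j = arg;
        fwrite(j->buf, 1, j->len, stdout);
        return NULL;
    }

    /* Toy checksum, just to give the main thread CPU-bound work to overlap. */
    static unsigned long hash_chunk(const char *b, size_t n)
    {
        unsigned long h = 5381;
        while (n--) h = h * 33 + (unsigned char)*b++;
        return h;
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
        FILE *in = fopen(argv[1], "rb");
        if (!in) { perror("fopen"); return 1; }

        char *buf[2] = { malloc(CHUNK), malloc(CHUNK) };
        struct job jobs[2];
        pthread_t worker;
        int have_worker = 0, cur = 0;

        for (;;) {
            size_t n = fread(buf[cur], 1, CHUNK, in);     /* read chunk N ...      */
            if (n == 0) break;
            unsigned long h = hash_chunk(buf[cur], n);    /* ... and hash it while */
            (void)h;                                      /* chunk N-1 is in flight */

            if (have_worker) pthread_join(worker, NULL);  /* wait for chunk N-1    */
            jobs[cur].buf = buf[cur];
            jobs[cur].len = n;
            pthread_create(&worker, NULL, send_chunk, &jobs[cur]);
            have_worker = 1;
            cur ^= 1;                                     /* swap buffers          */
        }
        if (have_worker) pthread_join(worker, NULL);

        fclose(in);
        free(buf[0]); free(buf[1]);
        return 0;
    }

Compile it with gcc -pthread; the point is simply that reading and hashing chunk N overlaps with the transfer of chunk N-1.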

As a result of applying these techniques, we have observed an increase of around 130% in the throughput of backups made over IP. In the case of replicas, the speed of previous versions, before the extended threading revision, was already close to the capacity of a regular 1 Gbps link, so you will only notice the improvement when using 2.5 Gbps or 10 Gbps hardware.

As regards local backups (backups made to NFS shares) and local replicas (made to any local datastore, including VMFS ones), the improvement in throughput is far from negligible, around 10-20%. Still, since we were starting from much lower speeds in the case of local copies, the results are not as noticeable as those achieved over IP.

How to achieve the best results

As we never lose the chance to point out, ©XSIBackup-DC is much faster when used over IP against a Linux (RHEL, CentOS, RockyLinux, Fedora) server. There is no reason why you can't back up over IP, even in a local environment.

Using a Linux server VM to store the deduplicated backups is a very convenient option, as you can easily replicate the VM containing the backups to some other ©ESXi host.

We offer the option to operate locally as it is an easy and quick way to back up small to medium-sized VMs. Still, as your VMs get bigger, there is an inherent limitation of NFS shares that has nothing to do with ©XSIBackup: every individual FS-related system call (access, unlink, getting the size, etc.) made from the client to the NFS share will necessarily and unavoidably be delayed by the network latency between the ©ESXi host and the NFS server.

This is a limitation of the NFS protocol itself, and of any other file sharing protocol, as they mimic a FS over a network. It is also one of the main reasons why ©XSIBackup is so much slower when used against an NFS share, apart from the other reasons already mentioned.
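
If you want to see this effect for yourself, a quick test like the one below (a standalone example, not part of ©XSIBackup) can be run once against a local path and once against a file on an NFS mount. Bear in mind that the NFS client's attribute cache may hide part of the difference when hitting the same path repeatedly; mounting with noac or stat-ing many different files gives a worst-case picture.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <path> [iterations]\n", argv[0]); return 1; }
        long iters = (argc > 2) ? atol(argv[2]) : 10000;
        struct stat st;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            stat(argv[1], &st);          /* each metadata call pays the round trip on NFS */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / 1e3;
        printf("%ld stat() calls, %.2f us per call\n", iters, us / iters);
        return 0;
    }

Multiply the per-call figure by the thousands of metadata calls that backing up a big VM implies and the penalty becomes obvious.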

Future improvements

There is still room for some additional optimizations, like parallelizing data reads and hashing, which could result in a throughput increase of around 15%.

There is a special case when transferring data over IP links with high latency. We have covered this scenario in these two posts:

©VMWare ©ESXi SSH/SCP Throughput Limitations

©VMWare ©ESXi narrowband off-site backup over high latency link

Those scenarios see their throughput limited mainly by the constrained size of the SSH receive buffer. Although the TCP protocol already has built-in mechanisms (window scaling) to overcome that limitation, cut-down implementations of a full OS, such as ©ESXi, can still prevent window scaling from taking place.
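
On the Linux end of the link you can at least verify that window scaling is enabled and what receive buffer sizes the kernel allows. The small check below (assuming the usual /proc layout; it has nothing to do with ©XSIBackup itself) prints the relevant knobs; there is no equivalent lever we can pull on the ©ESXi side.

    #include <stdio.h>

    /* Print one kernel knob from /proc, or note that it is not exposed. */
    static void show(const char *path)
    {
        char line[256];
        FILE *f = fopen(path, "r");
        if (!f) { printf("%s: not available\n", path); return; }
        if (fgets(line, sizeof line, f))
            printf("%s = %s", path, line);
        fclose(f);
    }

    int main(void)
    {
        show("/proc/sys/net/ipv4/tcp_window_scaling"); /* 1 = window scaling on      */
        show("/proc/sys/net/core/rmem_max");           /* hard cap on receive buffer */
        show("/proc/sys/net/ipv4/tcp_rmem");           /* min / default / max        */
        return 0;
    }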

In those cases the best solution is to parallelize the data transfer over multiple SSH tunnels to saturate the IP link. We will consider using this technique to improve high latency throughput in ©XSIBackup-DC 1.7. Many of you may wonder: why don't you just control the size of the SSH receive buffer? The simple answer is that we can't control it programmatically, at least not with a reasonable level of certainty.

©VMWare ©ESXi has its own build of OpenSSH, every version ships a slightly different one and, if you use some Linux backend as a server, we have virtually no control over what is happening in the receive buffer.

So, we are left with basically two options: implement our own secure protocol over TCP/IP, or use multiple SSH tunnels to overcome the receive buffer limitation when working with huge files over high latency links.

Given the wide acceptance of OpenSSH, which is well deserved apart from this drawback, and the complexity of programming an ad-hoc private/public key data transport mechanism, we are now more convinced that a solution based on multiple parallel OpenSSH streams is the more reasonable one.
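
As a rough illustration of the idea, the sketch below pushes several slices of a big file through independent SSH sessions at the same time, so that no single receive buffer caps the aggregate throughput. The host, paths and slice size are made up, and a real implementation would walk the whole file and reassemble the slices at the far end.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define STREAMS 4                                /* parallel SSH sessions      */

    int main(void)
    {
        /* Hypothetical example values: adjust to your own environment. */
        const char *file = "/vmfs/volumes/datastore1/vm1/vm1-flat.vmdk";
        const char *host = "backup@192.168.1.10";
        long long slice_mb = 1024;                   /* MiB handled by each stream */

        for (int i = 0; i < STREAMS; i++) {
            if (fork() == 0) {                       /* one child per SSH tunnel   */
                char cmd[512];
                snprintf(cmd, sizeof cmd,
                    "dd if='%s' bs=1M skip=%lld count=%lld 2>/dev/null"
                    " | ssh %s \"dd of=/backup/slice.%d bs=1M 2>/dev/null\"",
                    file, (long long)i * slice_mb, slice_mb, host, i);
                execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
                _exit(127);                          /* exec failed                */
            }
        }
        while (wait(NULL) > 0)                       /* wait for all streams       */
            ;
        return 0;
    }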