Last updated on Saturday 5th of August 2023 03:15:31 PM

©XSIBackup backup prune action

NFS and iSCSI volumes: why latency matters

How latency affects pruning

If you are an experienced sysadmin, you will probably smile to yourself as you read this. Nonetheless, getting novice users to understand that a local disk is in no way similar to an NFS or iSCSI mount is one of the battles we have been fighting the longest.

The fact that an NFS or iSCSI mount appears as a path integrated into the local file system is one of the beauties of UNIX-like operating systems, but also one of the main sources of issues, especially when dealing with newcomers.

Even experienced sysadmins sometimes end up writing to a directory where an NFS share is no longer mounted. That is one of Murphy's laws: in the end you open your eyes and swear over 5 lost minutes.

The real problem comes when somebody assumes that an NFS-mounted share will behave like a local disk just because it looks like a local path. That is not true, and when you make decisions based on wrong assumptions, you will most of the time end up learning the "hard way".

NFS, iSCSI, CIFS or whatever other protocol you are using is an over-the-network link to a volume that resides on a different computer. Every time you perform an I/O operation (reading, writing, deleting or moving a file), that I/O system call has to travel across the network, be received by the network protocol server and be translated into a local system call, and then the response has to be sent back to the network protocol client.

This means that a delete system call, for instance, will take the time needed to perform the action locally on the server, plus the time it takes to send the instruction from the client, plus the time it takes the server to send the result back.

If you consider that a healthy network will yield a latency between 300 and 4000 microseconds, that is, from roughly one third of a millisecond to 4 milliseconds, and that some SSDs offer latency figures in the order of tens of microseconds, you will realize that any I/O operation will be far slower when performed over a LAN than when run locally.

[root@XSIBackup-App .xsi]# ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.410 ms
64 bytes from icmp_seq=2 ttl=64 time=0.361 ms
64 bytes from icmp_seq=3 ttl=64 time=0.383 ms
64 bytes from icmp_seq=4 ttl=64 time=0.353 ms
64 bytes from icmp_seq=5 ttl=64 time=0.397 ms
64 bytes from icmp_seq=6 ttl=64 time=0.372 ms
64 bytes from icmp_seq=7 ttl=64 time=0.329 ms
64 bytes from icmp_seq=8 ttl=64 time=0.340 ms
64 bytes from icmp_seq=9 ttl=64 time=0.359 ms
64 bytes from icmp_seq=10 ttl=64 time=0.321 ms
64 bytes from icmp_seq=11 ttl=64 time=0.361 ms
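To put those round-trip times in perspective, here is a rough back-of-the-envelope sketch. It assumes one network round trip per deleted block and strictly sequential deletes, which is a simplification: a real NFS client may need more than one round trip per operation. The function name and the block count are illustrative, not part of ©XSIBackup.

```shell
#!/bin/sh
# Rough estimate of the time spent purely on network round trips when
# blocks are deleted one at a time over a network mount.
# Assumption: one round trip per delete, no parallelism.

# estimate_overhead_seconds BLOCKS RTT_MICROSECONDS
# Prints the accumulated round-trip time in whole seconds.
estimate_overhead_seconds() {
    blocks=$1
    rtt_us=$2
    echo $(( blocks * rtt_us / 1000000 ))
}

# 10 million deletes at about 360 microseconds each,
# close to the ping times shown above.
estimate_overhead_seconds 10000000 360
```

With these figures the network alone adds about 3600 seconds, one full hour, before counting the time the server needs to actually delete each block.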

When you take these considerations into account to decide how to set up your operations with ©XSIBackup, it becomes clear that running prune from the ©ESXi host against an NFS share is not a good idea.

The reason is that, although the time it takes to gather all the disk metadata needed to run the prune algorithm will be reasonably low for small to medium repositories, each I/O operation, namely deleting the pruned blocks, is going to take much longer.

Thus, when you run the --prune action from an ©ESXi host against an NFS share that hosts a big backup repository with, let's say, tens of millions of blocks, you will need to be patient, as it can take up to hundreds of times longer than running the very same prune command from within the backup host OS, or running the --prune action over IP, which has been specially designed so that the server side performs the action and just sends some report lines back to the ©XSIBackup client.

Below you have the output of a prune command run over IP against the remote backup server to prune a VM named WXPMK in the repo01-1MB repository. As you can see, the whole operation took 24 seconds.

Now let's run the equivalent command against the same repository and the same VM, but this time through an NFS share that is connected to the same volume on the remote backup server. As you can see, the very same command took 7 times longer to complete than the over IP --prune operation.

Memory constraints when pruning over NFS/iSCSI

As if the facts above were not enough to convince you not to prune over NFS, there is still another limitation derived from how the tools you are using work.

©ESXi is a hypervisor OS in which not all memory is available to the shell binaries. By default it assigns an 800MB pool, which can be increased by using the --memory-size argument.

The --prune action requires 52 bytes of memory per block stored in the repo, as it has to compare all the blocks there with the blocks contained in the portion to prune. That can easily grow beyond the memory available in the ©ESXi host and produce a SEGFAULT. You could work around that by assigning more memory, but why would you want to run such memory-intensive operations in the virtualization hypervisor itself when you can run them at the backup server and do it 10 to 100 times faster?
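The 52 bytes per block figure makes it easy to estimate when a repository outgrows the default pool. The sketch below is only an approximation: it assumes 1 MB = 1024 * 1024 bytes and ignores any other memory the process needs, and the function names and the 20 million block example are illustrative.

```shell
#!/bin/sh
# Approximate memory footprint of a prune, based on the 52 bytes per
# block figure given above. Other allocations are ignored (assumption).

BYTES_PER_BLOCK=52

# prune_memory_mb BLOCKS
# Prints the memory needed for BLOCKS blocks, in whole MB (rounded down).
prune_memory_mb() {
    echo $(( $1 * BYTES_PER_BLOCK / 1024 / 1024 ))
}

# max_blocks_for_pool POOL_MB
# Prints how many blocks fit in a pool of POOL_MB megabytes.
max_blocks_for_pool() {
    echo $(( $1 * 1024 * 1024 / BYTES_PER_BLOCK ))
}

prune_memory_mb 20000000      # a repo with 20 million blocks
max_blocks_for_pool 800       # the default 800MB pool
```

Under these assumptions a 20 million block repository needs about 991 MB, which already exceeds the default pool, and the 800MB pool tops out at roughly 16 million blocks.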

Thus, run your prune commands from the backup server with a cron script like the one below. Note how we use a simple find command with the -mtime +40 argument, which selects backup restore points older than 40 days.


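# Select the restore point folders (named with a 14 digit timestamp) older than 40 days
TOPRUNE="$( find /home/backup-vol1/xsi/repo01/ -maxdepth 1 -mindepth 1 -type d -mtime +40 | grep -E "[0-9]{14}" | sort )"

# Prune them one at a time, oldest first, pausing between runs
for p in $TOPRUNE
do
    /usr/bin/XSI/XSIBackup-DC/bin2/xsibackup --prune "${p}"
    sleep 30
done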

You should run the above script when the repo is idle and not being accessed by a backup process that holds the repo locked, as in that case you would receive a lock error.
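Before scheduling the script, it may help to preview which restore points the find expression would select, without pruning anything. The helper below is a hypothetical convenience wrapper around the same find, grep and sort pipeline used in the script above; its name is made up for this example.

```shell
#!/bin/sh
# Dry run: list the restore point folders that would be pruned.
# Nothing is deleted; this only prints the candidate paths.

# list_prunable REPO_DIR DAYS
# Prints, one per line and sorted, the folders directly under REPO_DIR
# whose name contains a 14 digit timestamp and that are older than DAYS days.
list_prunable() {
    find "$1" -maxdepth 1 -mindepth 1 -type d -mtime +"$2" | grep -E "[0-9]{14}" | sort
}
```

For instance, list_prunable /home/backup-vol1/xsi/repo01/ 40 prints the same folders the cron script would pass to --prune.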