I'm thinking ahead of possible failures and what would happen to the backup process.
Say, for whatever reason, the ESXi host crashes, or the network between it and the backup destination is lost.
How does XSIBackup handle such scenarios? Does it have some resume support? Will it clean up the failed backup so it can be run again?
And should I worry about garbage in the backup destination?
All routines have been designed to be as crash consistent as they can be. For instance: blocks are copied over to a temporary buffer, and only once the transfer is confirmed to have finished correctly are they put in place with an atomic system call.
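That temp-buffer-then-atomic-rename pattern can be sketched in a few lines. This is illustrative only, not XSIBackup's actual code; the helper name and paths are made up:

```python
import os
import tempfile

def atomic_write_block(data: bytes, dest_path: str) -> None:
    """Write a block to a temp file in the same directory, then move it
    into place with os.rename(). A crash mid-transfer leaves at worst a
    stray temp file, never a half-written block at the final path."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes hit the disk
        os.rename(tmp_path, dest_path)  # atomic within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

On POSIX filesystems a rename within the same filesystem is atomic, which is why the temp file is created alongside the destination rather than in a shared temp directory.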
The new blocks are only added to the general .blocklog manifest once the full backup process has finished; they are added to the file manifests (.map files) on a per-file basis. Thus, if a backup is interrupted, the blocks will be copied again on any subsequent backup, as if they didn't exist, unless you run the --repair action on the backend, which accounts the blocks in the partial files to the .blocklog file and may save you some time.
All these operations follow a simple rule of thumb: if a block can't be completely transferred, it will not appear as a file in the data folder as part of the block structure, and if it is not present in the data folder, it will not be considered by the --repair process.
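The resume semantics boil down to the following sketch, assuming for illustration a .blocklog that lists one block name per line (the real manifest format may well differ):

```python
import os

def load_blocklog(path: str) -> set:
    """Read the set of block names already accounted in the manifest."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def blocks_to_transfer(source_blocks: list, blocklog_path: str) -> list:
    """Blocks not yet accounted in .blocklog are copied again on the
    next run. An interrupted backup is therefore simply redone from the
    manifest's point of view, unless a repair pass accounts the blocks
    of the partially transferred files first."""
    done = load_blocklog(blocklog_path)
    return [b for b in source_blocks if b not in done]
```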
The --check action allows you to verify the data block structure at different levels: whether each block file exists, or whether it exists and its checksum matches the name of the block.
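Those two check levels could be sketched as below, assuming hypothetically that each block file is named after its SHA-1 digest (the actual naming scheme and hash algorithm may differ):

```python
import hashlib
import os

def check_block(data_dir: str, block_name: str, deep: bool = False) -> bool:
    """Shallow check: the block file exists.
    Deep check: its contents also hash back to the block's name."""
    path = os.path.join(data_dir, block_name)
    if not os.path.isfile(path):
        return False
    if not deep:
        return True
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest() == block_name
```

The shallow pass is cheap (one stat per block), while the deep pass reads every byte, so it is the one you would reserve for suspected corruption.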
Regarding the --replica action: it writes data to the same file on the backend. In case anything goes south, you get warnings and errors at two levels: 1/ the block transfer itself, 2/ the final checksum (number of bytes transferred).
By reductio ad absurdum: the probability that a block is transferred wrongly without raising an error at either of those two check levels is extremely low, not to say negligible.
In case a replica is altered or damaged, you can use the --check action on it, which consists of checking every constituent block and rewriting the correct checksum to the corresponding .blocklog file. Thus, the next time a --replica is performed on the remote mirrored image, the blocks whose checksums don't match will be updated.
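That check-then-heal cycle might look like the following, assuming a hypothetical tab-separated name/checksum format for the .blocklog and SHA-1 as the hash (neither is confirmed to be what XSIBackup actually uses):

```python
import hashlib
import os

def rebuild_blocklog(data_dir: str, blocklog_path: str) -> None:
    """Check step: recompute every block's checksum from its actual
    contents and rewrite the manifest with the real values."""
    with open(blocklog_path, "w") as blocklog:
        for name in sorted(os.listdir(data_dir)):
            path = os.path.join(data_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            blocklog.write(f"{name}\t{digest}\n")

def blocks_to_resend(source_log: str, replica_log: str) -> list:
    """Heal step: any block whose replica checksum differs from the
    source's gets re-sent on the next replica pass."""
    def load(path):
        with open(path) as f:
            return dict(line.strip().split("\t") for line in f if line.strip())
    src, dst = load(source_log), load(replica_log)
    return sorted(n for n, c in src.items() if dst.get(n) != c)
```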
You could still suffer from silent corruption on the disk side. This is infrequent and tends to show up far more often as read errors than as write errors. You should rely on your SMART utilities to stay safe from that kind of problem.
Still, there's room for more refinement and additional checks on the data, and that's what we plan to add as our software evolves.