Re-run backup if copy cannot be successfully certified

herrep · 2019-07-30 22:05:36

Hi,

I run a backup job covering a list of virtual machines with option --backup-type=running. Furthermore, backup certification is active with option --certify-backup=yes. I use --backup-prog=onediff so that the resulting backup represents a full backup of each of the virtual machines.

In rare cases, the SHA1 certification fails for one of the virtual machines. In that case, I would like to re-run the backup for exactly the virtual machine whose certification failed. Unfortunately, a SHA1 certification failure is not even treated as an error: the job passed to --on-success is still executed.

Is there any way to delete the backup of that virtual machine where the SHA1 certification failed and to re-run the backup for this virtual machine? If not, what is the recommended way to cope with such a situation?

Best regards,
Peter

admin · 2019-08-02 07:59:56

Most of the time, checksum mismatches are due to read errors. They are very sticky, though, so you might keep getting the same read error on the same block a number of times. Generally, this kind of repetitive checksum error on the same sector is safe to ignore. The chances that your backup is O.K. are high; you should nonetheless keep an eye on the backups, especially if your hardware is not new. Always use high-quality cables, as this will reduce errors.

Read this post for a more complete view on the matter:

[url=https://33hops.com/xsibackup-disk-checksum-verification-silent-corruption.html]Checksum verification what it is and what you should expect from it[/url]

herrep · 2019-08-02 08:13:13

Thanks for linking the article, which I had already consulted before writing the post. At present, I back up four virtual machines, and the certification errors occurred twice within ten days. So I ran around 40 full backups with onediff in total, and 2 of those 40 showed the SHA1 error. When I ran the same backup the next day, everything was fine again.

I understand that the likelihood that a write error actually occurred during the backup is rather low. On the other hand, I think it would be safer to have an option to repeat at least the certification in case of errors, since I had no problems even 24 hours later. However, I am not sure how to implement this, as I would need to detect the certification error at script level and invoke another certification.

Would it be worth considering an option to handle certification errors and decide how to continue?
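The script-level re-certification asked about above could be sketched roughly as follows. This is only an illustration using Python's standard `hashlib`, not XSIBackup's own certification routine; the function names `sha1_of_file` and `recertify` are hypothetical, and the sketch assumes the source file does not change while it is being read (for example, a base disk frozen under a snapshot).

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recertify(source, copy, attempts=2):
    """Compare source and copy up to `attempts` times.

    Returns True on the first matching pair of digests, False if every
    attempt shows a mismatch. Hypothetical helper, not an XSIBackup API.
    """
    for _ in range(attempts):
        if sha1_of_file(source) == sha1_of_file(copy):
            return True
    return False
```

The number of attempts is bounded on purpose, so a persistent mismatch ends in a `False` result instead of an endless loop.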

admin · 2019-08-02 08:50:21

Sorry, we didn't state why we haven't created that option. It's mainly due to the sticky character of these situations, which may enter a non-stop loop in many cases; that makes the option not very practical.

There are some other considerations to take into account, such as whether you are using async or sync NFS.

[url=https://web.mit.edu/rhel-doc/5/RHEL-5-manual/Deployment_Guide-en-US/ch-nfs.html]Network File System protocol description and versions[/url]

Try to use high-quality cables and consider these checksum errors "safe to ignore" mismatches while you are confident about your disks. They will start to show up much more often when the disks wear out.

You can just add a snapshot to your Onediff image and start your VM after one of these episodes. Then get rid of the snapshot and the Onediff cycle will not be interrupted. Some other manufacturers, like Nakivo, use more naive verification systems, such as a screenshot of the VM, most probably to avoid explaining to users that certifying copies may not always show a perfect match, because disks are not perfect devices and data storage can only be treated from a statistical point of view.

herrep · 2019-08-02 12:25:01

I used the VMware vSphere ESXi 6.7 web GUI to create an NFS 4.1 connection to my Synology NAS. On the Synology NAS, I activated "async". I assume that the "async" setting on the NFS server side automatically applies to the connections from the clients.

Question 1: Would you suggest using "sync" instead?

Unfortunately, I did not understand the following two sentences at all:
"You can just add a snapshot to your Onediff image and start your VM after one of these episodes. Then get rid of the snapshot and the Onediff cycle will not be interrupted."

The prerequisite is that my onediff backup showed a certification error, which, as far as I understand, cannot be tracked inside a script. So the backup finishes as usual, and I notice afterwards that there was a certification error. This is my starting point, at which I should add the recommended snapshot to the (potentially) corrupted onediff image.

Question 2: I do not understand what you mean by "onediff cycle will not be interrupted".

---

Some thoughts on automatic recovery of certification errors:

I got the point that the certification requires a statistical point of view and that we need to avoid endless loops. Because of this statistical approach, I believe that all certification errors which remain below a particular threshold could be repaired automatically. For example, if a certification fails once, there might only have been a read error. Therefore, I would expect the certification to be repeated so that another read attempt takes place on the same written data. If this works, we are all fine, and there is no need for manual interaction.

However, if the second certification of the same data fails again, I would automatically re-execute the backup plus the certification. If this is fine, I would not care much about this one-time issue. However, if the error still persists, manual interaction would be the next step.

In my opinion, some basic treatment of certification errors would be quite helpful, especially as xsibackup right now does not appear to trigger an event in case of a certification error.
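The escalation proposed in this post (certify, re-certify once, re-run the backup once, then hand over to a human) can be written down as a small bounded state machine. The sketch below is only an illustration of that proposal; `backup` and `certify` are hypothetical callables supplied by the caller and do not correspond to any existing XSIBackup hook.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    OK_AFTER_RECERTIFY = "ok after re-certification"
    OK_AFTER_REBACKUP = "ok after re-running the backup"
    MANUAL = "manual intervention required"


def handle_vm(backup, certify):
    """Run one bounded escalation for a single VM.

    `backup` performs the backup; `certify` returns True when the
    checksums match. Both are hypothetical, caller-supplied hooks.
    """
    backup()
    if certify():
        return Outcome.OK
    # First failure: possibly a read error, so re-read the same written data.
    if certify():
        return Outcome.OK_AFTER_RECERTIFY
    # Second failure: re-run the backup once, then certify once more.
    backup()
    if certify():
        return Outcome.OK_AFTER_REBACKUP
    # Still failing: stop here so the job can never loop.
    return Outcome.MANUAL
```

Because every path through `handle_vm` ends after at most two backups and three certifications, a sticky error cannot turn into an endless loop; it simply ends in `Outcome.MANUAL`.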

admin · 2019-08-03 11:19:11

A Onediff cycle is a backup instance in which the differential Onediff snapshot is copied from the source VM and integrated or coalesced with the previously existing data, resulting in an exact copy of the source VM.

If you switch on a Onediff mirror, you will modify the mirror base disk with some unique data on the mirrored side. When you perform the subsequent Onediff cycle, the resulting Onediff mirror VM will not match the source one.

A plain approach to prevent this situation, given the fact that you need to start the mirror VM for some reason, would be to make a full copy of the mirror and start that new VM.

Fortunately, there is a simpler approach: you can just add a snapshot to the mirror VM and switch the VM on. All new data generated for that VM will then be stored in the snapshot, and all disks below the new snapshot will remain untouched. Once you are done with whatever you needed to do on the mirror VM, you just discard the whole snapshot; the base disks remain untouched, thus allowing the Onediff chain of differential snapshots to continue without any mismatch.
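Why the base disks stay untouched can be seen in a toy model of the snapshot mechanism described above. This is a conceptual sketch only: the class `SnapshotDisk` and its block dictionaries are invented for illustration and say nothing about how ESXi actually stores snapshot data.

```python
class SnapshotDisk:
    """Toy model: a read-only base plus a delta that receives all writes."""

    def __init__(self, base_blocks):
        self.base = base_blocks   # stands in for the mirror's base disk
        self.delta = {}           # stands in for the snapshot's delta data

    def write(self, block, data):
        self.delta[block] = data  # never touches self.base

    def read(self, block):
        # Blocks written since the snapshot win; everything else comes from the base.
        return self.delta.get(block, self.base[block])

    def discard(self):
        self.delta.clear()        # throw away everything written since the snapshot


base = {0: b"boot", 1: b"data"}
disk = SnapshotDisk(base)
disk.write(1, b"changed while the mirror VM was running")
seen_while_running = disk.read(1)
disk.discard()
```

After `discard()`, reads return the original base content again, which is the state the next Onediff cycle expects to find.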

admin · 2019-08-03 11:27:10

Regarding the certification mismatches, you are assuming that when the mismatch is due to a read error, the next attempt at the same action will succeed. The thing is that even infrequent read errors can become really sticky and persist through many read tries; thus, if you automatically repeat the backup for that VM, you may end up with 40 backups before the error eventually disappears.

A certification error doesn't require any automatic action. It can be safely ignored when it is infrequent and affects new disks, or it may require manual intervention to change cables or disks. Repeating a backup after a failed certification will not ensure that the subsequent certification succeeds, and the chances that you enter a loop are extremely high.

A certification error does require you to evaluate the situation. If you believe your disks are a bit worn out, but not so much that you are worried about their health, you can reinstantiate the Onediff cycle or just switch the VM on as recommended above to make sure it's O.K.
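The stickiness argument can be made concrete with a simple model. Assume, purely for illustration, that each retry fails again with a fixed probability `persist_prob`, independently of earlier tries; neither that assumption nor the function names below come from XSIBackup. Under this model the number of tries until the first success follows a geometric distribution.

```python
def expected_tries(persist_prob):
    """Mean number of tries until the first success (geometric distribution)."""
    if not 0 <= persist_prob < 1:
        raise ValueError("persist_prob must be in [0, 1)")
    return 1 / (1 - persist_prob)


def still_failing_after(persist_prob, tries):
    """Probability that every one of `tries` consecutive tries fails."""
    return persist_prob ** tries
```

With an assumed `persist_prob` of 0.975, `expected_tries` comes out at about 40, so a very sticky error would keep an unbounded retry loop busy for a long time, while for an error that persists only half the time the expected count is 2.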

Forum ©XSIBackup: ©VMWare ©ESXi Backup Software
