Last updated on Monday 28th of February 2022 08:52:48 PM

Virtual Disk Checksum Certification

Using SHA1 checksum to check backup integrity

Please note that this post refers to the old, deprecated ©XSIBackup-Classic software. Some of the facts contained herein may still apply to more recent versions, though.

For new installations, please use the new ©XSIBackup, which is far more advanced than ©XSIBackup-Classic.

We introduced the --certify-backup argument some time ago. It is a really useful tool that lets you certify the integrity of your backed-up VMs with near 100% accuracy. It takes some time, especially when performing a full virtual disk checksum on big disks, but it is worth it, and it can process data at hundreds of MB per second.
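As a rough illustration of the idea behind a full virtual disk checksum, here is a minimal Python sketch. It is not XSIBackup's actual implementation; the function names `sha1_of_file` and `certify_copy` and the 1 MiB chunk size are assumptions made for this example only.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in fixed-size chunks.

    Reading in chunks keeps memory use constant even for very large
    virtual disk files. The 1 MiB chunk size is an arbitrary choice.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def certify_copy(source_path, backup_path):
    """Compare the SHA-1 of the source disk with that of its backup copy.

    Returns (match, source_checksum, backup_checksum).
    """
    source_sum = sha1_of_file(source_path)
    backup_sum = sha1_of_file(backup_path)
    return source_sum == backup_sum, source_sum, backup_sum
```

Both files have to be read in full, which is why the check takes a while on big disks.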

We felt it was time to delve a bit deeper into this feature by discussing its pros and cons, so that you know what to expect from it. Of course, you can always use some other software that does not offer this insight into your data. In the end, "out of sight, out of mind": if your data ever gets corrupted, you won't be able to complain to the backup software vendor, as the cause will lie in your hardware. We prefer to offer you that extra information, so that you can decide whether to ignore it, if mismatches are infrequent, or to take action, if they start to signal that the hardware is wearing out.

Some clients become a bit puzzled when receiving checksum mismatches. Some people think that the checksums must always match, just as long as their VMs are running and they don't receive some higher level application error from ESXi or XSIBackup.

Let's start by saying that hard disks and SSDs are not perfect. They always fail, even when they are brand new. When we use the word "fail" in this context, we are not talking in absolute terms. Again, people tend to think in opposing pairs of terms: cold/hot, fast/slow, working/faulty.

Hard disk health is measured from a statistical point of view. We consider a HD to be faulty when its error rate rises above some pre-established limit. That is the moment when the HD must be retired from the enterprise world and enjoy its last days storing cartoon movies for your children or a desktop OS at home.

Bairavasundaram silent data corruption study: this is an extensive study of all forms of data corruption affecting disks in a data center over a period of 41 months. It is pretty old, but it is still revealing and useful, as it treats the matter from a wide-angle perspective.

Not all brand new hard disks, or the hardware associated with them, are the same. Enterprise grade devices are designed to be more resilient to errors than commodity hardware. In fact, it's not all about hard disks. Controllers and cables, the great unknown, are fundamental to reducing the error rate and keeping it under control.

Gone are the days of the original SCSI implementation and its successors, when you had to build your array of disks, terminate it with an appropriately sized resistor and pray that your Adaptec card would not burn itself up. Still, disks that never fail live only in the Platonic dimension of Ideas.

So, what should you expect from the --certify-backup argument? Well, you should expect a checksum mismatch from time to time, even on brand new disks. How often "from time to time" turns out to be depends on the amount of data that you back up: the more data you process, the more frequently you should expect a mismatch.

If your hardware is new and of good quality, and your cabling is up to specification, you could very well run hundreds of backups without experiencing a single checksum issue. Nevertheless, this is a probability distribution, so you may get a checksum error on the very first backup cycle. That would be extremely unlikely, though.
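The relationship between the amount of data and the chance of a mismatch can be sketched with a simple model. The sketch below assumes that every bit read fails independently with the same small probability, which is a simplification, and the rate used in the usage example is a purely hypothetical figure, not a measured value for any real drive, controller or cable.

```python
import math


def mismatch_probability(bytes_read, error_rate_per_bit):
    """Probability of at least one bit error while reading `bytes_read` bytes.

    Model (an assumption): each bit fails independently with probability
    `error_rate_per_bit`, so P = 1 - (1 - p)^n. The log1p/expm1 form
    avoids losing precision when p is tiny.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-error_rate_per_bit))


def probability_over_backups(bytes_per_backup, backups, error_rate_per_bit):
    """Probability of at least one mismatch across several backup cycles."""
    return mismatch_probability(bytes_per_backup * backups, error_rate_per_bit)


if __name__ == "__main__":
    hypothetical_rate = 1e-15      # illustrative value only
    one_terabyte = 10 ** 12
    for cycles in (1, 10, 100, 1000):
        p = probability_over_backups(one_terabyte, cycles, hypothetical_rate)
        print(f"{cycles:>5} backups of 1 TB: P(mismatch) = {p:.4f}")
```

Under this model the probability grows with the total volume of data processed, which is why a mismatch on the very first cycle is possible but unlikely.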

As disks start to wear out, the number of checksum mismatches will progressively rise. There will be a period in which disk reads produce checksum mismatches while the disks are still in use. This is not a huge concern for home labs or less critical services, but it is when you manage critical enterprise data.

What does it mean when I receive a checksum mismatch?

It means that your original data and the checked backup are not identical bit for bit.
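To see where two copies stop being identical, a block-by-block comparison can help. The following is only a sketch under assumptions: `differing_blocks` is a hypothetical helper, not part of XSIBackup, and the offsets it reports are positions within the files, not physical addresses on the disk.

```python
def differing_blocks(source_path, backup_path, block_size=1024 * 1024):
    """Yield the byte offset of every block that differs between two files.

    Both files are read in lockstep, one block at a time. If one file is
    longer than the other, the extra blocks are reported as differing too.
    """
    with open(source_path, "rb") as src, open(backup_path, "rb") as dst:
        offset = 0
        while True:
            a = src.read(block_size)
            b = dst.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                yield offset
            offset += block_size
```

A call such as `list(differing_blocks("source-flat.vmdk", "backup-flat.vmdk"))`, with file names that are placeholders, returns an empty list when the copies are identical.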

Should I be concerned?

Yes, although the probability that your data is really corrupt is relatively low compared to the probability that it is not. This is because read errors are more frequent than write errors. Errors of this kind are really sticky, first of all because the real reason for the mismatch might be physical damage at a given address on the HD. Even plain read errors tend to persist for a while before they eventually disappear.

So, what should I do?

First of all, check your cabling. If you are unsure whether your cables are up to specification, e.g. old SATA cables on SATA3 disks, change them all; they are cheap. If checksum mismatches persist, move your VMs to a new disk ASAP. The chances that the controller is causing the errors are much lower, but it could still be the cause.

Daniel J. García Fidalgo