33HOPS, IT Consultants

Virtual Disk Checksum Certification

Using SHA1 checksum to check backup integrity

We introduced the --certify-backup argument some time ago. It is a really useful tool that lets you certify the integrity of your backed-up VMs with close to 100% accuracy. It takes some time, especially when performing a full virtual disk checksum on big disks, but it is worth it, and it can process data at hundreds of MB per second.
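To give an idea of what a full-disk checksum involves, here is a minimal Python sketch that hashes a large file with SHA-1 in fixed-size chunks and compares the source against the backup. It is not XSIBackup's implementation: the function names, the chunk size and the file paths are assumptions made for illustration only.

```python
import hashlib


def sha1_of_file(path, chunk_size=4 * 1024 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in chunks.

    Reading in chunks keeps memory usage flat, which matters for
    virtual disks that are tens or hundreds of GB in size.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source_path, backup_path):
    """True when both files produce the same SHA-1 digest."""
    return sha1_of_file(source_path) == sha1_of_file(backup_path)
```

Both files have to be read in full, which is why the check takes a while on big disks.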

We felt it was time to delve a bit deeper into this feature and discuss its pros and cons, so that you know what to expect from it.

Some clients are puzzled when they receive a checksum mismatch. Some people think that checksums must always match as long as their VMs are running and they receive no higher-level application error from ESXi or XSIBackup.

Let's start by saying that hard disks and SSDs are not perfect: they always fail, even when they are brand new. When we use the word "fail" in this context, we are not talking in absolute terms. People tend to think in binary pairs: cold/hot, fast/slow, working/faulty.

Hard disk health is measured statistically. We consider an HD faulty when its error rate rises above some pre-established limit. That is the moment when the HD must retire from the enterprise world and enjoy its last days storing cartoon movies for your children or a desktop OS at home.

Bairavasundaram silent data corruption study: an extensive study of all forms of data corruption affecting disks in a data center over a period of 41 months. It is fairly old, but still revealing and useful, as it treats the matter from a wide-angle perspective.

Brand new hard disks and their associated hardware are not all the same. Enterprise-grade devices are designed to be more resilient to errors than commodity hardware. In fact, it is not all about hard disks. Controllers and, the great unknown, cables are fundamental to reducing the error rate and keeping it under control.

Gone are the days of the original SCSI implementation and its successors, when you had to build your array of disks, terminate it with an appropriately sized resistor and pray that your Adaptec card would not burn itself out. Still, disks that never fail exist only in the Platonic realm of Ideas.

So, what should you expect from the --certify-backup argument? Well, you should expect a checksum mismatch from time to time, even on brand new disks. How often "from time to time" turns out to be depends on the amount of data that you back up: the more data you read and write, the more frequently a mismatch will show up.

If your hardware is new and of good quality, and your cabling is up to spec, you could very well run hundreds of backups without experiencing a single checksum issue. Nevertheless, this is a matter of probability, so you could get a checksum error on the very first backup cycle, although that would be extremely unlikely.

As disks start to wear out, the number of checksum mismatches will progressively rise. There will be a period in which disk reads produce checksum mismatches while the disks are still in use. This is not a huge concern for home labs or less critical services, but it is when you manage critical enterprise data.

What does it mean when I receive a checksum mismatch?

It means that your original data and the checked backup are not identical bit for bit.
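A single checksum only tells you that the two files differ, not where. If you want to locate the difference, one option is to compare both files block by block. The following Python sketch does that. The function name and block size are assumptions for illustration, and this is not something XSIBackup is documented to do.

```python
def differing_blocks(path_a, path_b, block_size=1024 * 1024):
    """Yield the byte offset of every block where the two files differ.

    If one file is shorter, the blocks past its end count as different.
    """
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        offset = 0
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:  # both files exhausted
                break
            if block_a != block_b:
                yield offset
            offset += block_size
```

A handful of isolated offsets suggests a localized problem, whereas differences spread over the whole file point to something more general.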

Should I be concerned?

Yes, although the probability that your data is really corrupt is relatively low compared to the probability that it is not. This is because read errors are more frequent than write errors. These errors are really sticky. First of all, the real reason for the mismatch might be physical damage at a given address of the HD. Even pure read errors tend to persist for a while until they eventually disappear.

So, what should I do?

First of all, check your cabling. If you are unsure whether your cables are up to the specification (e.g. old SATA cables on SATA3 disks), replace them all; they are cheap. If checksum mismatches persist, move your VMs to a new disk ASAP. The chance that the controller is causing the errors is much lower, but it could still be the cause.


Daniel J. García Fidalgo


