©ESXi Snapshot Errors and Solutions.

What snapshots are, how to use them, the issues they raise and how to fix them.

This post was fairly old and needed some revision. The same basic principles still apply to snapshots, though: they have not changed from a conceptual point of view.

One of the most common sources of support requests from our registered users is problems with snapshots: creation, deletion, quiescing, etc.
• Can't create snapshot
• Can't delete snapshot
• How to fix issues related to snapshots

What is a snapshot?

The concept exists in multiple storage abstraction systems, such as LVM, LVM2, ©ESXi, the ZFS file system, etc.

The way snapshots work varies depending on the technology used, although the concept always implies some kind of discontinuity in the stored data that allows some kind of higher-level manipulation by the sysadmin.

In this post we will cover snapshots in an ©ESXi environment.

In the ©VMWare ©ESXi Hypervisor, snapshots are temporal subsets of blocks generated from some moment in time (the time you take the snapshot) which are stored in separate virtual disk files. They can be read for the data they contain and they can be written to if the VM is sitting on top of one. You can later decide whether you want to delete or discard that data. They are the best way to accomplish a hot consistent ©VMWare backup, as the snapshot mechanism already includes the logic to generate consistent discontinuity in the I/O stream.

Deleting a snapshot, in ©ESXi jargon, means committing that data to the base disk, where the rest of the virtual disk data resides. You do so from the point in the snapshot chain where you are positioned, namely: the famous "You are here" marker. This is an important fact, as deleting has a different meaning in real life, which tends to generate a fair degree of confusion among users.

When you "restore" some snapshot you revert your VM to a previous state, that is: you discard the data accumulated since you generated that snapshot file. Again the snapshot related terminology contradicts the implied semantics.

You can chain snapshots, up to 32 of them according to the limit documented by ©VMWare, although you will most probably never reach that limit, as VMs slow down as you add more, and on most hardware just a few will render the VM virtually unusable.

A given snapshot depends on the previous ones, as the data stored during one lapse of time makes no sense on its own. You may want to take some time to think about that fact, although you will find it's obvious.
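This dependency is visible in the small text descriptor of each snapshot disk, which normally names its parent in a parentFileNameHint entry. The following is a minimal sketch, not an official tool: print_chain is a hypothetical helper name, and it assumes that every hint is a plain file name living in the same directory as the descriptor (absolute paths in the hint are not handled).

```shell
# print_chain DESCRIPTOR: print the descriptor and every parent it depends on,
# following parentFileNameHint until a disk without a parent (the base disk)
# is reached. Assumes all descriptors sit in the same directory.
print_chain() {
    dir=$(dirname "$1")
    file=$(basename "$1")
    while [ -n "$file" ] && [ -f "$dir/$file" ]; do
        echo "$file"
        # Take the value between the quotes of parentFileNameHint="..."
        file=$(sed -n 's/^parentFileNameHint="\(.*\)"$/\1/p' "$dir/$file" | head -n 1)
    done
}

# Example call on a host (path is illustrative):
#   print_chain /vmfs/volumes/datastore1/yourVM/yourVM-000002.vmdk
```

Each printed line is one link of the chain, topmost snapshot first; if any of those files is missing, everything above it in the list becomes unreadable.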

Snapshots are a conceptual abstraction that allows you to accumulate temporal subsets of data. Whatever you use them for is up to you. There are some common predefined uses that give them meaning 99% of the time.

What are ©ESXi snapshots useful for?

The most common use of snapshots is to be able to revert a given VM to a previous state. This is very useful in test scenarios, for instance when you want to try a new version of something or install a software update, but you want to make sure you can revert the guest to the previous state should something go wrong.

In an ©ESXi backup and recovery scenario, the snapshots in a VM that has been backed up can be used to revert the backup to any of the restore points the snapshots offer.

Problems with backup snapshots

You may encounter different errors related to snapshot creation and deletion, but they can be summarized as:
- Can't create snapshot.
- Can't delete snapshot.
This might seem to be an oversimplified explanation, but it's enough for now.

I want to speak up in favor of ©VMWare in this case, as the snapshot feature works very well. When you hit an issue with snapshots, it's usually related to something not working well (hardware failure, a full disk) or to something you aren't doing right (mixed HW or VMFS versions, missing files, or misconfigured services in the case of quiesced snapshots).

Can't create a snapshot

When a snapshot can't be created, it's usually because something is preventing it. You should get used to scanning the VM log files and also the host level log files at /scratch/log in search of hints. The most common situations are:
- Lack of space in virtual disk volumes.
- Lack of space in the main /tmp dir.
- Lack of space in the /scratch partition.
- A service in the guest OS refusing to be quiesced and raising an error from ©VMWare Tools.
The commands below will print any errors found in the host general log dir at /scratch/log and in the VM logs.

cat /scratch/log/*.log | grep -i "error"
cat /vmfs/volumes/datastore1/YourVM/*.log | grep -i "error"
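When the logs are large, it can help to first see which files contain errors at all. The following is a small sketch along those lines; count_errors is a hypothetical helper name and it only relies on grep and sort.

```shell
# count_errors FILE...: print "<count> <file>" for every file holding at least
# one line that contains "error" (case-insensitive), busiest file first.
count_errors() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        # grep -c prints 0 and exits non-zero when nothing matches
        n=$(grep -ci "error" "$f" || true)
        [ "$n" -gt 0 ] && echo "$n $f"
    done | sort -rn
}

# Example call on a host (paths taken from the commands above):
#   count_errors /scratch/log/*.log /vmfs/volumes/datastore1/YourVM/*.log
```

The files at the top of the list are usually the best place to start reading.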

These other commands will detect whether the utilization of some partition or virtual file system is above 90%. Please note that only those close to 100% utilization represent a real problem.

df -h | tail -n +2 | awk '{sub("%", "");if($5+0 > 90){print $0}}'
vdf -h | grep -iv "^ramdisk" | awk '{sub("%", "");if($5+0 > 90){print $0}}'

The above will allow you to know if you are lacking space in some volume or file system. If your problem is not related to low space availability, then it's more likely due to trying to quiesce the guest OS.

Should that be the case, you should have found some related errors in your logs by now. You have different ways to address the issue when it comes to quiescing your guest. First of all determine if you really need quiescing.

What is quiescing?

To understand snapshots you need to think about how I/O works in a given OS. We don't need to get into the nitty-gritty details of how the OS and file system work; it is enough to understand that it is a complex system with many parts involved, and that a file server is not the same as a server hosting services that are constantly writing data to the guest's disks, such as a database server.

A file server is easy to handle: the file system itself takes care of writing files consistently. When you host a busy database, e-mail or similar type of service, however, the pending I/O operations need to be flushed to disk before the snapshot can actually be taken. If that did not happen, the database files could get corrupted with partially written data.

This is, to some extent, similar to a power cycle, before which the database service needs to be shut down in a controlled way to make sure that data is written consistently.

The main difference is that a quiesce operation pauses the service only for the time required to take the snapshot, which will usually cause a short glitch in the service. If the quiescing process goes well, users will just notice a short delay, usually shorter than a second.

©VMWare Tools

©VMWare Tools acts as an intermediary service that requests the quiescing operation and confirms it to the hypervisor. Thus, you need ©VMWare Tools installed, as well as any other additional auxiliary service that might be necessary to perform the controlled stop.

Please note that ©Windows Servers running SQL Server will need not only ©VMWare Tools, but also the Virtual Disk VSS Service, plus some additional components depending on the version you are running.

Some other database servers, especially some older versions, may not be eligible to be quiesced. In these cases there are some workarounds that we'll comment on later.

Can't delete a snapshot

This is somewhat different from not being able to create a snapshot. Some of the causes could also be related to lack of space. We won't discuss that any further, as the procedure to detect that situation is the same as above.

As also mentioned above, deleting a snapshot consists of integrating its data into the base disk. The base disk can be a -flat.vmdk file or another snapshot, depending on how many of them you have piled up.
Possible causes of one or more snapshots not being deleted are:
- Lack of space in any of the above mentioned volumes or virtual FSs.
- Incompatibility between VMFS or VM hardware versions.
- Missing or corrupt associated files.
- Other issues, such as buggy behaviour on the part of ©ESXi.

Incompatibility issues

For this operation to be possible, the snapshots in the chain must be compatible. For instance, when you move differential data between ©ESXi servers, as in the case of OneDiff snapshots, the hardware versions of the VM and the underlying VMFS versions must be the same. At a minimum, the VMFS versions must coincide and the hardware version of the originating VM must be compatible with the target server. Namely: you can copy a VM from ©ESXi 5.5 to 6.5 and make it work, but not the other way around if the HW version of the VM running in ©ESXi 6.5 is not supported by 5.5; in that case you will receive an error.
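Both versions can be checked before moving any data. The sketch below reads the hardware version from the virtualHW.version entry of the .vmx file; vmx_hw_version is a hypothetical helper name, and the key is assumed to be written with exactly that capitalization. The VMFS version is reported by vmkfstools, shown here only as a comment because it needs a real datastore.

```shell
# vmx_hw_version VMX_FILE: print the virtual hardware version number stored
# in the virtualHW.version entry of a .vmx file.
vmx_hw_version() {
    sed -n 's/^virtualHW\.version *= *"\([0-9]*\)".*$/\1/p' "$1" | head -n 1
}

# Example calls on a host (paths are illustrative):
#   vmx_hw_version /vmfs/volumes/datastore1/yourVM/yourVM.vmx
#   vmkfstools -Ph /vmfs/volumes/datastore1    # first line shows the VMFS version
```

Running both on the source and on the target host gives the numbers to compare before copying differential data.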

Missing or corrupt files

A snapshot is composed of multiple files. The .vmsd file contains information on how many snapshots are attached to a VM and how they are related to each other, namely: the hierarchy and relationship order in the snapshot chain. When this file gets damaged, ©ESXi cannot figure out how to integrate the data into the base disks. The VM might even be working fine, yet the snapshot deletion fails.

Deleting the broken .vmsd file and creating a new snapshot generates a new .vmsd file. This can be used to repair some broken chains of snapshots when the .vmsd file contains wrong information.
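A slightly more cautious variant of this step is to rename the file instead of deleting it, so the old content can still be inspected afterwards. This is only a sketch under that assumption; set_aside_vmsd is a hypothetical helper name.

```shell
# set_aside_vmsd VM_DIR: rename every .vmsd file in the VM directory to
# <name>.vmsd.bak and print the new names, so that a fresh .vmsd file is
# generated when the next snapshot is taken.
set_aside_vmsd() {
    for f in "$1"/*.vmsd; do
        [ -f "$f" ] || continue
        mv "$f" "$f.bak" && echo "$f.bak"
    done
}

# Example on a host (path and snapshot name are illustrative):
#   set_aside_vmsd /vmfs/volumes/datastore1/yourVM
#   vim-cmd vmsvc/snapshot.create <vmid> fresh-snapshot
```

Once the new snapshot has been created and deleted successfully, the .bak copy can be removed.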

When it is damaged, another approach to recover the data is to clone the VM from the desired snapshot .vmdk file using vmkfstools. This generates a consolidated -flat.vmdk file with all data coalesced into a single virtual disk.

Other issues
A general system error occurred

There is a problem that arises from time to time: you can neither create snapshots nor delete any pre-existing one, and the event log shows "A general system error occurred". Sometimes this problem persists even after discarding the snapshot data files manually (see "Discard the information in the snapshot files"), and even after rebuilding the data from a chain of snapshots (see "Rebuild the VM data from the chain of snapshots"). This drives users crazy and seems impossible to fix.

We have been able to reproduce this problem, which has to be considered an ©ESXi bug, in 5.X and 6.X systems. For some unknown reason, the VM regenerates a .vmsd file with invalid information, even after deleting all the snapshot files manually, including the .vmsd file itself. The bad .vmsd file, which reproduces itself without any apparent reason, contains information about a snapshot that no longer exists.
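One way to spot such stale information is to compare the disk files named in the .vmsd file with what actually exists in the VM directory. The following is a rough sketch: vmsd_missing_disks is a hypothetical helper name, and it assumes the usual snapshotN.diskM.fileName entries holding plain file names relative to the VM directory.

```shell
# vmsd_missing_disks VMSD_FILE: print every disk file referenced by a
# snapshotN.diskM.fileName entry that does not exist next to the .vmsd file.
vmsd_missing_disks() {
    dir=$(dirname "$1")
    sed -n 's/^snapshot[0-9]*\.disk[0-9]*\.fileName *= *"\(.*\)"$/\1/p' "$1" |
    while IFS= read -r disk; do
        [ -e "$dir/$disk" ] || echo "$disk"
    done
}

# Example call on a host (path is illustrative):
#   vmsd_missing_disks /vmfs/volumes/datastore1/yourVM/yourVM.vmsd
```

Any name it prints belongs to a snapshot the .vmsd file still describes but whose disk is gone.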

It does not matter how many times you turn the VM off and delete the .vmsd file, the wrong information reappears over and over. Even if you clone the VM from the topmost snapshot, the wrong information keeps on being thrown into the .vmsd file.

It is clearly not something in the ©ESXi host itself, as unregistering the VM and registering it again with a different id does not solve the problem. Thus, it is most likely something related to ©VMWare Tools.

Solution:

We have found that deleting the .vmsd file once the VM has been turned on and the wrong .vmsd file has been recreated allows a new snapshot to be created. From this point on, the problem seems to be resolved.

As some of the snapshot descriptor files can be damaged if this problem affects you, the best way to remedy it is to clone from the topmost snapshot, switch the newly created VM on, delete the .vmsd file and take a new snapshot.

How to fix issues related to snapshots

We will limit this section to explaining how to fix a broken chain of snapshots, since not being able to create one only requires identifying the cause: there's nothing that has to be undone.

In the case of a broken chain of already existing snapshots, you not only need to find the cause and fix it, but you also need a way to consolidate your data back into the base disks, be they -flat.vmdk files or some other snapshot.

First of all, identify the issue: query your logs for clues. Check whether some inconsistency exists that is preventing the snapshot deletion from taking place, like a previous OneDiff differential operation or the VM having been moved from one host to another.
• Turn the VM off and try to consolidate the snapshot chain.
• Delete the .vmsd file and try to create a new snapshot on top of the previous ones. The .vmsd file should be regenerated.
• Unregister the VM and register it again.
• Restart the host service or reboot the host; the latter is always more decisive.
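The steps listed above can also be run from the ESXi shell with vim-cmd, which addresses VMs by their numeric id. The sketch below only defines a hypothetical helper, vmid_by_name, that picks the id out of the vim-cmd vmsvc/getallvms table; it assumes VM names without spaces. The host commands themselves are left as comments, and "the host service" is assumed here to mean hostd.

```shell
# vmid_by_name NAME: read the table printed by "vim-cmd vmsvc/getallvms" on
# standard input and print the Vmid of the VM whose name matches exactly.
vmid_by_name() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# On the host (not executed here, names and paths are illustrative):
#   vmid=$(vim-cmd vmsvc/getallvms | vmid_by_name yourVM)
#   vim-cmd vmsvc/power.off "$vmid"             # turn the VM off
#   vim-cmd vmsvc/snapshot.removeall "$vmid"    # commit all snapshots
#   vim-cmd vmsvc/unregister "$vmid"            # unregister the VM ...
#   vim-cmd solo/registervm /vmfs/volumes/datastore1/yourVM/yourVM.vmx
#   /etc/init.d/hostd restart                   # restart the host service
```

Note that registering the VM again gives it a new id, so the lookup has to be repeated afterwards.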
If the problem persists, you will have to decide whether to discard the information in the snapshot files or to try consolidating the data into a new -flat.vmdk disk by running a vmkfstools clone operation.

Rebuild the VM data from the chain of snapshots

If you are lucky, the consolidation will work and you will be able to commit those snapshots to the base vmdk files. If you aren't, you will need to rebuild everything into a consolidated base vmdk file, or a set of them if you have more than one virtual disk in that VM. By doing that you will lose your snapshots, but you will save your data. This means you won't be able to go back to a previous state of the VM, but at least you will keep your valuable data. At this point you should consider yourself fortunate that you can do so with only minor hassle.

If your set of snapshots is in a good state, you may choose the snapshot from which you will consolidate the broken VM, and thus the point in time to which you will revert it. In any case, that procedure is much more complex than a full consolidation starting with the topmost snapshot, especially if you want to preserve the remaining snapshots in the chain. In this post we will cover the simplest scenario and recover the VM from the topmost snapshot present. If you want to recover from an earlier one, just follow the same procedure, but from a previous snapshot in the chain, and discard the rest of the data.

The standard procedure to remedy this is to clone the .vmdk files one by one from the topmost available snapshot, using vmkfstools. First find the topmost snapshot .vmdk files: they will be named something like yourVM-00000N.vmdk. You'll find one 00000N.vmdk file per disk in the VM; if you only have one disk, there will only be one set of snapshot .vmdk files. Locate them all and clone each one of them (from the highest N present) by using vmkfstools this way:

vmkfstools -i yourVM-00000N.vmdk /vmfs/volumes/datastore2/yourVM/yourVM.vmdk -d thin
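Locating the highest N for every disk can be scripted. The following is a sketch only: topmost_snapshots is a hypothetical helper name, and it assumes the usual name-00000N.vmdk descriptor naming. The highest number is normally the topmost snapshot, but the disk entries in the .vmx file remain the authoritative reference for what the VM is actually running on.

```shell
# topmost_snapshots VM_DIR: print, for every virtual disk, the snapshot
# descriptor with the highest six-digit number (name-00000N.vmdk).
# Data files such as name-00000N-delta.vmdk do not match the pattern.
topmost_snapshots() {
    dir=$1
    for f in "$dir"/*-[0-9][0-9][0-9][0-9][0-9][0-9].vmdk; do
        [ -e "$f" ] || continue
        basename "$f"
    done |
    awk '{
        base = $0
        sub(/-[0-9][0-9][0-9][0-9][0-9][0-9]\.vmdk$/, "", base)
        top[base] = $0   # names arrive sorted, so the last one is the highest
    }
    END { for (b in top) print top[b] }' | sort
}

# Example on a host (paths are illustrative, names must not contain spaces):
#   cd /vmfs/volumes/datastore1/yourVM
#   for s in $(topmost_snapshots .); do
#       vmkfstools -i "$s" "/vmfs/volumes/datastore2/yourVM/${s%-??????.vmdk}.vmdk" -d thin
#   done
```

The loop in the comment clones each disk under its base name, which matches the single-disk command shown above.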


This will create a new .vmdk file containing all the consolidated information from the base disk plus all the snapshots in the chain. The next step is to copy the .vmx file to the destination dir ("/vmfs/volumes/datastore2/yourVM" in the example) and edit it with the vi editor to reflect the new path to each disk. That's all: you can switch your VM on and, if all steps were carried out properly, you'll have a new sanitized VM.
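If the cloned disks keep the base names, as in the example, the copied .vmx file still points to the old yourVM-00000N.vmdk descriptors. Instead of editing by hand, the entries can be rewritten with sed. This is a minimal sketch under that naming assumption; vmx_point_to_base is a hypothetical helper name, and it is wise to keep a copy of the .vmx file before running it.

```shell
# vmx_point_to_base VMX_FILE: rewrite every disk entry that references a
# snapshot descriptor (name-00000N.vmdk) so that it references name.vmdk.
# The file is overwritten in place through a temporary copy, which keeps
# the original file permissions.
vmx_point_to_base() {
    tmp="$1.tmp"
    sed 's/^\(.*\.fileName *= *".*\)-[0-9][0-9][0-9][0-9][0-9][0-9]\.vmdk"/\1.vmdk"/' "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&
    rm -f "$tmp"
}

# Example call on a host (path is illustrative):
#   vmx_point_to_base /vmfs/volumes/datastore2/yourVM/yourVM.vmx
```

Lines that do not reference a numbered snapshot disk, such as the nvram entry or an ISO image, are left untouched.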

Discard the information in the snapshot files

This is the last resort and consists of giving up the information contained in the snapshots and reverting the VM back to the state it was in before taking them.

Needless to say, the cloning procedure above applies to any snapshot in a chain, that is: you can clone a VM from any of its snapshot .vmdk files. Thus you can try to recover the data from any of the intermediate points before falling back to the -flat.vmdk file.

Manually discarding a snapshot consists of editing the .vmx file and pointing the virtual disks to the previous snapshot in the chain. If you only have one snapshot, or you want to discard all snapshot files, point the .vmx file to the base .vmdk disk, namely: the one that does not contain any -00000N string in its name and has a -flat.vmdk counterpart.

Daniel J. García Fidalgo
33HOPS
This page was last modified on 2021-05-21


