Last updated on Monday 28th of February 2022 08:52:48 PM

Live migration of ©VMware ESXi Virtual Machines

Please note that this post refers to the old, deprecated ©XSIBackup-Classic software. Some of the facts it contains may still apply to more recent versions, though.

For new installations, please use the new ©XSIBackup, which is far more advanced than ©XSIBackup-Classic.

One common scenario in any IT department is migrating Virtual Machines that are in production. This sometimes forces us to stay at the office overnight to have everything ready for the next morning, so that everybody can use their e-mail, SAP, CRM or whatever service they rely on in their daily duties.

There are many cases in which you can't work around the matter that way, though, because the Virtual Machines run some mission-critical service that is used 24x7 and that everybody expects to be up at all times, such as a website or an e-commerce store.

In the latter case there are ways to prepare a seamless transition, but that requires preparation at different levels of the IT stack: deploying a new virtual machine, installing or revising the software, and a strategy for keeping databases consistent while you swap users from the old server to the new one.

It's not rocket science; we have all had to deal with a migration at some point. It does involve many resources and planning in advance, though, so that we can make sure that not even one byte of data is lost in the process. Whenever you face such a case, you normally take the chance to upgrade the Operating Systems and software versions too, as they will need to be installed and thoroughly checked anyway.

In this post we are going to cover an intermediate case: a server, or a series of them, that needs to be migrated to a new host but that, unlike the case described above, can tolerate a ten-minute stop before being put into production again. The length of the stop is arbitrary; there's no reason why it couldn't be less than that. It will mainly depend on your requirements and your skills, but some small stop will be inevitable.

©XSIBackup-Pro is an extremely flexible tool that you can use in many ways. It's used by thousands of people every day to make backups of their Virtual Machines, and it's also a must-have tool for performing "close to no downtime" migrations easily. It includes a number of different backup programs, namely the binaries that are used to move the data around.

The ©XSIBackup-Pro built-in --backup-program we are going to use for our live migration case is OneDiff, which is our proprietary instant differential tool. You can read more about it on the product page. What it basically does is store the data that changes between backups in a snapshot, and send just that snapshot to the backup destination the next time a backup needs to be performed.

How do you use OneDiff to migrate a VM?

What you will basically do is perform a series of backups of the running VM (Alice) to the destination host. Right after each backup cycle, the resulting VM (Bob) is identical to Alice. As time passes, users continue to generate data in the guest OS and Alice diverges from Bob, storing the difference in the OneDiff differential snapshot. If the time between backups is short, the subsequent backup cycle will take little time to complete, as little data will have to be sent to Bob. It is one of these time windows that we will use to stop Alice, perform the last OneDiff backup, start Bob and discard Alice.
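To picture the idea, here is a toy Python model of the cycle just described. It is only a sketch and not how OneDiff is implemented: the alice and bob dictionaries, the dirty set (which stands in for the differential snapshot) and the diff_sync function are all hypothetical names made up for this illustration.

```python
def diff_sync(source, replica, dirty):
    """Copy only the blocks marked as dirty from source to replica.

    Returns the number of blocks sent and empties the dirty set,
    so the next cycle starts tracking changes from scratch.
    """
    sent = 0
    for block_id in sorted(dirty):
        replica[block_id] = source[block_id]
        sent += 1
    dirty.clear()
    return sent


# Alice: the production disk, modelled as block number -> content.
alice = {i: f"v0-{i}" for i in range(1000)}
bob = {}             # Bob starts empty on the destination host
dirty = set(alice)   # first cycle: every block is new to Bob

first = diff_sync(alice, bob, dirty)   # full copy

# Users keep working on Alice, so a few blocks change.
for i in (3, 17, 42):
    alice[i] = f"v1-{i}"
    dirty.add(i)

second = diff_sync(alice, bob, dirty)  # only the changed blocks travel

print(first, second, alice == bob)     # prints: 1000 3 True
```

The first cycle sends every block and the second one only the three that changed, which is why a short interval between cycles keeps the final backup short.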

The key action in the process is to stop the original production Virtual Machine before the last backup takes place, run a backup cycle and, once it completes, switch Bob on. Of course there might be a lot of work associated with routers, NAT, IP assignments and so on. For our post we will consider the simplest possible scenario, which is two ESXi hosts running Virtual Machines in bridged mode in the same network class. This obviously means that we have reduced everything related to network management to almost zero for the sake of highlighting the aspects related to the migration of the VMs themselves. In any case, you can extend this very same concept to a migration to any other network. The only difference is that in a more complex scenario you may need to rely on additional network tools to direct traffic to the right server.

So the timeline for your migration is:

1 - Perform a series of OneDiff backups (2 or 3), progressively reducing the time between them. Use the --certify-backup=yes argument to make sure the first copies are identical.

2 - Once you have performed a couple of OneDiff backup cycles, schedule the migration for not too long after the last backup. How long you can wait after the last backup cycle will depend on the amount of new data you are generating. This won't be a problem most of the time, unless you are migrating a huge server with a lot of I/O activity and the bandwidth available for the migration is poor in comparison with the amount of data. This is very unlikely to happen anyway.

3 - Stop the production VM and run the last backup. You can remove the --certify-backup=yes argument in this last step to minimize the downtime. This is the moment at which the countdown begins until the newly migrated server starts to give service. The process should not last more than a few minutes, even in the case of big, heavily loaded servers.

4 - Once the last backup has finished, just turn Bob on and let the ARP tables refresh. If everything is working properly at the network level, your newly migrated server will start serving users' requests from an exact, consolidated copy of the original production server as it was a few minutes earlier.

This method of VM migration should be suitable for anybody who can deal with a few minutes of downtime on the server to be migrated. You should apply it to each Virtual Machine individually, to reduce the downtime and keep more focused control over each of the individual operations.
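As a back-of-the-envelope check of the timeline above, the length of the last backup can be approximated as the data changed since the previous cycle divided by the available bandwidth. The Python sketch below only illustrates that arithmetic. The function name and the figures are hypothetical, and it ignores the time needed to shut Alice down, to boot Bob and any overhead of the backup tool itself.

```python
def final_backup_seconds(change_rate_mb_per_min, minutes_since_last_backup,
                         bandwidth_mb_per_s):
    """Rough transfer time for the last differential cycle, in seconds."""
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be greater than zero")
    # Data accumulated in the differential since the previous cycle.
    pending_mb = change_rate_mb_per_min * minutes_since_last_backup
    return pending_mb / bandwidth_mb_per_s


# Hypothetical figures: 50 MB of new data per minute, 30 minutes since
# the previous cycle, 100 MB/s of usable bandwidth between the hosts.
estimate = final_backup_seconds(50, 30, 100)
print(estimate)  # 1500 MB at 100 MB/s -> prints 15.0
```

Halving the time since the previous cycle halves the estimate, which is the reason for running the backups closer and closer together before the final stop.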

This method of migration represents a huge saving in time and resources when compared with a fully seamless migration, which would involve work at many different levels. The trade-off is a short downtime versus a complicated deployment that would last hundreds of hours more.

Other big advantages are the abstraction from the underlying software and services, plus the possibility of running a number of intermediate tests to ensure the consistency of each individual step, without disturbing the experience of the users of the services provided by the production server.
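One simple kind of intermediate test is to compare a checksum of a file on the source with a checksum of its copy. The Python sketch below shows that generic technique. It does not describe how --certify-backup works internally, and the helper names and the throwaway demo files are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(path_a, path_b):
    """True when both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo with throwaway files standing in for a source disk and its copy.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.bin")
    copy = os.path.join(workdir, "copy.bin")
    for path in (original, copy):
        with open(path, "wb") as handle:
            handle.write(b"guest data" * 1000)
    identical = files_match(original, copy)

    # A single extra byte in the copy is enough to change its digest.
    with open(copy, "ab") as handle:
        handle.write(b"!")
    after_change = files_match(original, copy)

print(identical, after_change)  # prints: True False
```

Reading in chunks keeps memory use flat, which matters when the files being compared are large virtual disks.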

Pros: fast, reliable, you can abort if something goes wrong and retry any number of times.
Cons: implies some service downtime, although it can be minimized.

Daniel J. García Fidalgo