Hi,
Is there a way to use XSIBackup DC to successfully replicate an MS Windows Domain Controller?
In my case there is only a PDC, no secondaries. So there is no USN rollback to worry about and, with no secondaries, I can't transfer roles.
Will quiesce work? Or does a warm/cold backup need to be taken?
At the moment I need to use DSRM to make the server functional. Any replicas will get a 0xc00002e2 BSOD on boot.
(e.g. [url]https://community.spiceworks.com/topic/1345241-2012-r2-dc-crash-and-won-t-boot-with-error-0xc00002e2[/url] )
Thanks
Offline
That is caused by the AD DB becoming corrupt because of some pending I/O operation.
"This error is an indication that the Active Directory database (NTDS.DIT) is corrupt."
[url=https://support.hostway.com/hc/en-us/articles/360001126259-How-to-fix-Error-0xc00002e2-after-rebooting-Windows-Domain-Controller]How to fix AD 0xc00002e2 error[/url]
It's not difficult to fix it, still obviously the best approach is to have a 100% functional DC after restoring.
You have a number of ways to ensure the integrity of your DC:
1/ The easiest way is through a warm backup, if you can afford to stop the DC for 30 seconds to 1 minute at most.
2/ Review the MS documentation to find out how you must configure your DC to allow the AD DB to be quiesced in coordination with VMware Tools, namely: make sure that it writes any pending data to disk before any snapshot is taken.
3/ Take multiple VSS snapshots during the day and revert to the latest after restoring (not very convenient).
4/ Use pre and post snapshot scripts to stop the AD service or put it in read-only mode before taking the snapshot, and start it up or put it back in R/W mode after the snapshot has been taken. This is what the related MS services should do; still, you can easily implement it on your own.
Offline
Regarding fixing: I have tried this on a replica. In case anyone else needs this, quick instructions are below.
NB only use if you have only one DC
F8 into DSRM (F8 may bring up blue screen first if so choose Boot Normally and keep hitting F8)
Choose Directory Services repair mode
Logon as .\administrator - you need your DSRM admin password.
Make a copy of C:\Windows\NTDS - just in case.
Run > cmd
c:
Cd c:\Windows\NTDS
Del *.log
NTDSUTIL
activate instance ntds
files
info
quit
esentutl /p "c:\windows\ntds\ntds.dit"
md C:\Windows\NTDS\Temp
Cd C:\Windows\NTDS
NTDSUTIL
activate instance ntds
files
info
compact to "C:\Windows\NTDS\Temp"
quit
Cd C:\Windows\NTDS
copy /Y C:\Windows\NTDS\temp\NTDS.dit C:\Windows\NTDS
del *.log
shutdown /r
cross fingers.
I'm keeping a copy of this on the server just in case.
[quote] still obviously the best approach is to have a 100% functional DC after restoring.[/quote]
Couldn't agree more - especially as servers normally die at the wrong times and you need to work on your phone in the middle of the night from a different country whilst at a night club!
Last edited by Corbeau (2022-01-13 14:33:58)
Offline
Some questions
How does XSIBackup trigger the shutdown in a warm backup - is it via vmtools? What I really want to know is how safe it is.
Also using --quiesce what happens -- does xsibackup ask vmtools to quiesce the system using "VMware Tools Quiescence"?
From DC manual
--backup-how[=hot|war|cold] I like the idea of a war backup!:)
Thanks
Offline
Take a look at <install dir>/etc/xsibackup.conf
# When power on/off request is issued, the VM power state is queried every N seconds
power_query_interval=2
# When power on/off request is issued, the VM power state is queried N times
# Thus the power state will be queried a total of power_query_interval*power_query_times seconds
# Should the query_times limit be reached, a plain power off will be issued
power_query_times=10
As explained there, ©XSIBackup will try to perform a controlled shutdown as per the above-mentioned variables before issuing a plain power-off.
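To make the arithmetic in those comments concrete, here is a minimal Python sketch (not part of XSIBackup) that reads the two values from text laid out like the excerpt above and multiplies them. The names `parse_conf` and `graceful_window_seconds` are hypothetical, and the parser only assumes plain `key=value` lines with `#` comments.

```python
def parse_conf(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def graceful_window_seconds(conf):
    """Seconds the guest gets to shut down before a plain power off."""
    return int(conf["power_query_interval"]) * int(conf["power_query_times"])


# Sample text in the same layout as the excerpt quoted above (assumed format).
SAMPLE = """
# When power on/off request is issued, the VM power state is queried every N seconds
power_query_interval=2
# When power on/off request is issued, the VM power state is queried N times
power_query_times=10
"""

if __name__ == "__main__":
    print(graceful_window_seconds(parse_conf(SAMPLE)))
```

With the values shown above the guest gets 2 * 10 = 20 seconds. A domain controller that needs longer than that to shut down would be powered off instead, so raising either value widens the window.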
We like to torture VMs, especially VMs hosting DB servers. We have some CentOS 6.0 / MySQL 5.6 VMs here that we have been excruciatingly powering off in the rudest manner for years and they have never suffered from DB corruption, although that will of course depend on how busy the DB is when you commit the crime.
Yes, --quiesce will issue a quiesce request, thus you can use regular pre-freeze/ post-thaw VMWare Tools scripts to prevent DB corruption.
We already fixed that typo, it will show up in some hours.
Offline
Update on this.
Today I tried a boot of 2 replicas. Neither worked.
I was planning on booting and fixing the AD as per previous post.
Neither a normal boot nor a DSRM boot worked on either replica.
I booted via a server ISO but it's not possible to fix it this way.
So I have updated my xsibackup config to try a warm backup rather than a hot backup. It's not something I'm keen on doing but I will give it a try tonight.
Offline
further update.
Warm backup worked. server down for a couple of minutes.
I suspect rebooting a Windows server regularly like this will likely break it at some point.
Work ongoing....
(I would like to make it very clear to anyone else reading this. The server was Windows Server Essentials. So only one DC.
On testing replicas I discovered 1) it wouldn't boot due to corrupt AD. 2) I couldn't boot into directory services mode to fix things.
So do not use hot backup of a DC )
Last edited by Corbeau (2022-03-07 09:54:10)
Offline
Thank you for your feedback.
This is yet another issue having to do with quiescing your FS. We are writing about this all the time, still we have recently updated the main post relative to this topic and [url=https://33hops.com/esxi-snapshot-errors-and-solutions.html#quiescing-notes]added some specific notes[/url].
Every user should make the effort to see this kind of problem as a broad issue, even though each particular situation may require a slightly different procedure to solve it.
Of course your proposed solution should work, as you are shutting your server down before taking the backup snapshot. Even though it is switched on again immediately afterwards, the snapshot is taken from a stopped state of the VM, so as long as the shutdown was a controlled one, the chances of your Active Directory DB getting corrupted are effectively zero.
If you can afford to stop the VM for some seconds, a warm backup is definitely the simplest solution to this kind of problem. Still, not everybody can afford to stop the DC to back up the AD VMs.
[b]Problem description[/b]
AD information is kept in a DB. That DB could become corrupt, just like any other DB server which is abruptly stopped. The snapshot issue is about the same as a sudden power outage, which before virtualization became popular was the most frequent way to corrupt the AD database.
Whether it actually becomes corrupt is random, and the likelihood is proportional to how busy it is. You might be lucky and the service might be idle just when you take your snapshot; you should not count on that, though.
The DB becoming corrupt does not mean that the whole database goes corrupt. People tend to think in maximalistic terms all the time, which causes terror, doubt and in the end wrong decisions.
Databases become corrupt on power outages or non-quiesced snapshots just because the last pages that are being written get chopped before the end of the page is written to disk. Thus, the system preprocessing routines detect this unfinished write because some page in the DB lacks a footer or closing structure.
Fixing the problem is conceptually the same in every case: detecting the bad pages and removing them, which is usually done with the database repair commands. This obviously varies depending on the DB system. In the case of a DB server like MySQL or MS SQL Server, you would just lose the last writes or updates. In the case of AD, the repair would chop off the latest AD related operations.
Active Directory adds an additional problem, which is that the domain controller depends on the health of the Active Directory DB to boot up. This could be considered an OS design flaw, as it puts you in a technical paradox. The solutions proposed by Microsoft don't seem to work in your case. Still, there should be a fairly easy way to fix that DB: this is an old issue with fixing procedures that have been mature for many years, since power outages were a common source of AD related corruption problems before virtualization snapshots overtook them in frequency.
[b]Quiescing backups[/b]
All issues of this kind are prevented the same way: quiescing the FS before actually taking the snapshot. It amounts to about the same as a controlled shutdown for DB services, but done with the OS running and resuming normal operations ASAP. It usually takes some seconds at most to quiesce the different DB services in a server.
In the [url=https://33hops.com/esxi-snapshot-errors-and-solutions.html#quiescing-notes]notes on quiescing[/url] we describe the procedure to follow in case of DB services in Windows servers.
There are a few services related to quiescing a Windows guest: VSS, VMWare Tools, Virtual Disk and in some cases some additional helper services. Just as long as those services are configured as described in our post and all other related services are installed and configured properly, using a quiesced snapshot should prevent any corruption on the different DB services that may be running in your guest.
Quiescing in a nutshell consists of the ©ESXi server notifying the VMware Tools service in the guest that a snapshot is about to be taken; the VMware Tools service should then coordinate the controlled pause of the running DB services.
Still, if you have some host that is not responding to automatic quiescing, you can control the process on your own. How?
©VMware Tools offers a way to run custom pre and post backup scripts, as described in the post. These scripts can handle three events related to snapshots: FREEZE, THAW and FREEZEFAIL.
FREEZE happens right before the snapshot is taken; THAW happens right after the snapshot has been created (please note that some documents on the web wrongly describe THAW as happening when the snapshot is deleted); finally, FREEZEFAIL is run in the event that some error is triggered.
Controlling the quiescing of your AD services on your own would consist of adding the necessary AD service stop command to FREEZE and the AD service start command to THAW, as well as to FREEZEFAIL. That way you make sure that the AD service is stopped gracefully before your backup snapshot is taken, preventing any data corruption, and that it is started again once the snapshot has been completed.
It is conceptually the same as running a "warm" backup, except that you don't have to reboot the server. It is indeed the same thing that the coordinated services in the server should do when they are configured the right way.
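@echo off
rem VMware Tools calls this script with one argument: freeze, freezeFail or thaw
if "%~1" == "" goto USAGE
if "%~1" == "freeze" goto FREEZE
if "%~1" == "freezeFail" goto FREEZEFAIL
if "%~1" == "thaw" goto THAW
:USAGE
echo "Usage: %~nx0 [ freeze | freezeFail | thaw ]"
goto END
:FREEZE
rem Stop AD gracefully right before the snapshot is taken
net stop YOUR_AD_INSTANCE_NAME
goto END
:FREEZEFAIL
rem The snapshot failed: bring AD back up
net start YOUR_AD_INSTANCE_NAME
goto END
:THAW
rem The snapshot has been created: bring AD back up
net start YOUR_AD_INSTANCE_NAME
goto END
:END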
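The dispatch logic of the script above can also be written as a small lookup, which is handy for reasoning about (or unit testing) which command each event should trigger. This is only an illustrative Python sketch, not something VMware Tools runs. `commands_for_event` is a hypothetical name, `NTDS` is assumed as the service name (the usual name for Active Directory Domain Services, so substitute your own), and the `/y` switch reflects what a later post in this thread reports is needed so that dependent services are stopped too.

```python
def commands_for_event(event, service="NTDS", stop_dependents=True):
    """Return the command (as an argv list) for a quiescing event.

    freeze             -> stop the service before the snapshot
    thaw / freezeFail  -> start the service again afterwards
    """
    if event == "freeze":
        cmd = ["net", "stop", service]
        if stop_dependents:
            # /y answers "yes" to the prompt about stopping dependent services
            cmd.append("/y")
        return cmd
    if event in ("thaw", "freezeFail"):
        return ["net", "start", service]
    raise ValueError(f"unknown event: {event!r}")


if __name__ == "__main__":
    for ev in ("freeze", "freezeFail", "thaw"):
        print(ev, "->", " ".join(commands_for_event(ev)))
```

Keeping `freezeFail` mapped to the same start command as `thaw` mirrors the batch script: whatever happens to the snapshot, the service is brought back up.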
Offline
Hi, I'm interested in trying this, but how can I get the value for "YOUR_AD_INSTANCE_NAME" ?
Thank you.
Offline
The goal of that line in the procedure is to set AD off and then back on once the snapshot has been finally taken.
Active Directory Domain Services usually appears as [b]NTDS[/b] in the Services applet, that may vary depending on your setup and customization level.
Most of the time you will issue:
net stop NTDS
net start NTDS
Please note that this is a straightforward solution that will turn your AD service off for a couple of seconds. VSS services in your DC should take care of holding NTDS writes while the snapshot is being taken, as long as VMware Tools is correctly installed and configured. The checklist that works for most of our users is:
Virtual Disk service is started and startup type is Automatic.
VMware snapshot provider service is stopped and disabled.
VMware Tools services are running.
Ensure that Volume Shadow Copy service start up type is Automatic
Obviously, we can't guarantee that your MS Server DC will behave as you expect it to, as that depends on so many other things. That's why we offer this straightforward procedure, which should work in every case.
Offline
another follow up.
I have 3 networks using this setup. All have old servers which hold replicas of the production server (essentials 2016 or 2019)
All are using warm replicas.
I tested a boot of a replica and got the c00002e2 bsod which means corrupted AD and server unbootable.
All 3 different networks had same result.
I tested with an internet connection and also using cold replicas, same BSOD.
The following got all 3 servers to boot normally. I haven't checked server integrity further yet.
power on replica
choose moved it
keep tapping f8 and boot dsrm
logon as local admin user = .\administrator & dsrm recovery password
in admin command prompt:
cd C:\Windows\NTDS
del *.log
c:\windows\system32\esentutl.exe /p ntds.dit
(agree to prompt window)
shutdown /r
--- ---
For good measure I also use a script to backup the windows system state every day to a local drive.
There is also a restore script that can be run from DSRM. This is a better method probably but takes hour(s) so is not useful to get a replica up running very quickly.
Still not happy with this and time permitting will look into getting the replicas to boot without a corrupted AD.
Offline
Sorry for getting back to you so late, your last response went unnoticed by us.
AD machines are picky to quiesce, probably pickier than they should be. The AD database should be more resilient to blackouts and simple power offs, anyway...
Both warm and cold --backup-how methods should work with any system. The reason is simple: a "controlled shutdown" (we'll get back to this in the next paragraph) precedes the snapshot, thus the snapshot is taken from a stopped state of the VM. That stopped state should be perfectly coherent, as the shutdown process must take care to bring down all services in a controlled manner, including AD services. This controlled shutdown should flush all pending data to disk and make sure that the AD DB is in a perfectly coherent state, just as if you were switching off any hardware based server.
Now the thing is whether the shutdown was really a controlled shutdown. ©XSIBackup tries to perform a normal controlled shutdown and if it can't it issues a power off. You can find a tweakable variable in the [b]etc/xsibackup.conf[/b] file to control the time ©XSIBackup waits before issuing a plain power off.
If your server is busy and it takes longer than the default configured time ©XSIBackup may be powering off your AD server instead of shutting it down gracefully causing the AD database to become corrupt. Check that in the backup log and extend the default timeout if you need to.
You can also use a freeze script via VMWare Tools (see previous post above) to switch AD services off before the snapshot is taken, this would allow you to backup the VM while it is on.
The key is to make sure the AD related services are brought down in a controlled manner to avoid corrupting the AD database.
Offline
Thanks going to test this.
For info purposes:
Running custom quiescing scripts inside Windows & Linux virtual machines (1006671)
[url]https://kb.vmware.com/s/article/1006671[/url]
Offline
Thanks for the link. We published a post with a more [b][url=https://33hops.com/esxi-snapshot-errors-and-solutions.html]comprehensive approach to freeze scripts[/url][/b] some time ago.
It's somewhat awkward that MS has not published detailed information on how to stop AD services in a controlled manner. There is very little information on the subject. Nonetheless, AD is a complex service dependent on a database where it stores its configuration. The principles that apply are the same as with any other system dependent on a consistent I/O scheme.
As long as you are able to stop it in a controlled way, you are done. Controlled freezing (stopping the service before taking the snapshot and restarting it once the snapshot is done) is even better, as you don't need to stop the VM.
Offline
When stopping AD you need to stop a number of other services also. There is an undocumented switch /y that will cope with this.
net stop NTDS /y
Until I worked out what was happening my backups would stall at 'Creating snapshot VM'
---
Current state of play: I have tested a hot replica with quiescing on and am still getting the 2e2 BSOD. That's a head scratcher.
Last edited by Corbeau (2023-10-21 15:39:08)
Offline
Yes, AD is a complex service and it depends on other services.
We assume that you did use a freeze script to issue the [b]net stop NTDS /y[/b] command, although you don't mention it.
Did you check in the Event Viewer that the freeze script was actually run?
Before trying the quiesce from VMWare Tools do try the process manually:
1/ Issue the [b]net stop NTDS /y[/b] command.
2/ Power off the VM. Do issue a hard power off, so that if AD is still active the power off mimics a plain snapshot.
3/ Turn on the VM and check the state of the AD service. The AD service should start normally, the same way as after a controlled shut down.
If the above succeeds.
1/ Issue the [b]net stop NTDS /y[/b] command.
2/ Backup the VM with a plain snapshot, no quiescing.
3/ Turn on the VM and check the state of the AD service. The AD service should start normally, the same way as after a controlled shut down.
If the above succeeds then the AD service is being properly stopped. If it still fails, then it's the [b]net stop NTDS /y[/b] command that isn't fully working.
Offline
Sorry, my previous post was just a short one. Yes, the freeze script stalls without the /y. I have tested that the service stops (I put a timeout of 15 seconds into the script and observed that the NTDS service was stopped).
I actually believe the issue will occur even if I power off the VM and do a cold copy. I think it has something to do with the VM changing ESXi host, but I cannot confirm this yet. I have to reduce the amount of memory and the number of processors for the VM on the backup ESXi server. On start I tell ESXi I 'moved' the VM rather than copied it. When I am able to, I'll stop NTDS, power off the VM, do a copy and then see if it will start. I don't believe it will. (I am restricted as to when I can take the servers down.)
I have a script on each server that I can run from dsrm to make things quick and easy in an emergency but I'd still like to work out what is going on and, of course, boot the replica without a bsod. I will update this thread if and when I find anything else out.
Offline
Thank you for your feedback. There is something going on there. From time to time some user tells us about something similar.
The thing is that a correct shut down or snapshot with proper service management should yield a fully working system. If you get a BSOD and you do know that the BSOD is related to AD, then there is only one possible cause: [b]the AD service wasn't properly shut down[/b]. Your BSOD is clearly stating that point with the [b]0xc00002e2[/b] error (this error is an indication that the Active Directory database NTDS.DIT is corrupt).
Thus, the solution is conceptually simple, yet the usual MS fuss around simple issues seems to reach its peak around this problem, which has been around for decades. Virtualizing Windows Server only makes the problem worse, but this already happened in hardware based servers 20 years ago. The whole problem boils down to: [b]how the H. to shut the AD service down in a controlled way[/b].
It is rather obvious that a normal Windows shutdown should yield a working system, thus using the --backup-how=cold argument has to work. There is only one thing to take into account in that case, and that is to make sure that ©XSIBackup was indeed able to perform a controlled shutdown and not a power off instead. Increase the timeout in xsibackup.conf to make sure this is indeed what's happening. The Event Viewer in the Windows OS should make it clear whether the previous shutdown was performed correctly or was a mere power off.
--backup-how=cold shuts the server down and then backs up the VM from a stopped state. This must yield a fully working system 100% of the time, as long as the shutdown was indeed controlled. Please do make sure that your VM is not being powered off instead; we are sure this is the only possibility in case you are still getting the BSOD after a cold backup or replica.
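As a rough aid for the Event Viewer check mentioned above, the following Python sketch classifies a shutdown from a list of System log event IDs that you have collected yourself (for example by reading them off the Event Viewer). It assumes the commonly cited Windows IDs: 6006 (the Event Log service was stopped, logged on a clean shutdown), 6008 (the previous system shutdown was unexpected) and 41 (Kernel-Power, the system rebooted without cleanly shutting down). Please verify them on your own OS version. `classify_shutdown` is a hypothetical helper, not part of XSIBackup or Windows.

```python
# Assumed Windows System log event IDs; verify them on your OS version.
UNEXPECTED_IDS = {6008, 41}  # unexpected shutdown / Kernel-Power
CLEAN_IDS = {6006}           # Event Log service stopped during a clean shutdown


def classify_shutdown(event_ids):
    """Classify the last shutdown from the event IDs seen around it."""
    ids = set(event_ids)
    if ids & UNEXPECTED_IDS:
        # Any sign of an unexpected shutdown wins over everything else
        return "power-off"
    if ids & CLEAN_IDS:
        return "controlled"
    return "unknown"


if __name__ == "__main__":
    print(classify_shutdown([1074, 6006]))  # shutdown requested and completed
    print(classify_shutdown([1074, 6008]))  # shutdown requested, then power cut
```

Note that a shutdown request on its own (ID 1074) is deliberately not treated as proof of a clean shutdown: if the timeout in xsibackup.conf expires while Windows is still shutting down, the request is logged but the VM is still powered off.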
UPDATE:
The command
net stop NTDS /y
Should produce a controlled shutdown of the AD services. If you are getting a BSOD after running it, then your AD service is not working right, you may need some additional patch for your OS or some additional service installed.
We would start by issuing [b]net stop NTDS /y[/b] manually, then reboot. Also trying [b]net stop NTDS /y[/b] followed by [b]net start NTDS[/b] and making sure that AD is still working right would help determine whether the issue is with the AD service itself or with the VMWare Reboot process.
Also make sure that all services related to snapshots are present in the way described in our [url=https://33hops.com/troubleshooting-windows-snapshots-in-esxi.html]post on Windows snapshots[/url]:
Volume Shadow Copy
Microsoft Software Shadow Copy Provider
VMware Snapshot Provider
COM+ System Application
COM+ Event System
Virtual Disk
Things such as having Secure Boot features enabled in your BIOS can also contribute to this type of issue.
As a workaround we suggest that you backup the C:\windows\ntds\ntds.dit file from a known working state to be able to easily restore it in case of need.
Offline
short update. Had a chance to do a replica from a 'powered off' vm. This was carried out to a local fs and then scp'd over to a second esxi server. The replica on the second esxi server booted first time. A basic test but wanted to make sure.
Offline
Yes, of course, that should always work if the VM is off.
The key matter here is that when you use --backup-how=warm|cold, ©XSIBackup issues a controlled shutdown, thus this technique must always work, unless the timeout configured in the xsibackup.conf file for giving up on the controlled shutdown and performing a simple power off is not long enough.
If it is the power off that is bringing the VM down, then you may end up with a corrupted AD.
Offline