Domain Controller Reboots – Lsass.exe crash

We recently ran into an issue where some of our domain controllers would get stuck in reboot loops in one of the child domains. The domain controllers were recently upgraded to 2022. Not all domain controllers in the domain would do it either; it was random, but once it started, it persisted. The only way we found to stop the reboot was to disconnect the NIC so the machine didn’t get any traffic.

Symptoms

Domain controller would pop up a message saying this machine will reboot in 1 min.

In the event log, event ID 1000 thrown

Faulting application name: lsass.exe, version: 10.0.20348.4163, time stamp: 0xf055926b

Faulting module name: ntdll.dll, version: 10.0.20348.3932, time stamp: 0xc75bbaae

Exception code: 0xc0000005

Fault offset: 0x0000000000072eb0

Faulting process id: 0x330

Faulting application start time: 0x01dc3f6cd20cb038

Faulting application path: C:\WINDOWS\system32\lsass.exe

Faulting module path: C:\WINDOWS\SYSTEM32\ntdll.dll

Report Id: 16d31b63-2138-4bea-9ac3-d19e62f7170b

Faulting package full name:

Faulting package-relative application ID:

Issue found

Opened a ticket with Microsoft to troubleshoot as everything we tried failed. Rebuilding the domain controllers would help for a short while, but the problem would resurface. After many memory dumps and sessions with tech support, we found the issue.

The problem was with a shortcut trust that was established 20+ years ago when the domain was created. This shortcut trust was missing the securityIdentifier attribute and, under very specific circumstances, would cause lsass to crash. This is an issue that Microsoft admitted is a known issue, but not many customers have created the perfect set of conditions to trigger the bug (lucky me!). It is an issue that started with 2022 and newer domain controllers.

To check for this issue, run the following command. You can also look using adsiedit connected to default naming context, trust will be in the system container.

ldifde -f c:\temp\trusts.txt -r “(&(objectClass=trustedDomain)(|(trustType=1)(trustType=2)))” -t 3268

Open the trusts.txt file which will output like below. Look for a securityIdentifier. If its missing that is a problem.

dn: CN=root.com,CN=System,DC=child,DC=root,DC=com
changetype: add
objectClass: top
objectClass: leaf
objectClass: trustedDomain
cn: root.com
distinguishedName: CN=root.com,CN=System,DC=child,DC=root,DC=com
instanceType: 4
whenCreated: 20020314230047.0Z
whenChanged: 20251014125300.0Z
uSNCreated: 18653
uSNChanged: 3818287685
name: root.com
objectGUID:: fGv6ZUgwSUCMfaf3FWIhYw==
securityIdentifier:: AQQABBAAAAUVAAAArEm1Z0g6xjUnNbxw
trustDirection: 3
trustPartner: root.com
trustType: 2
trustAttributes: 4194304
objectCategory: CN=Trusted-Domain,CN=Schema,CN=Configuration,DC=root,DC=com

The Solution

Simple. Delete the shortcut trust and add it back. This will create the missing security identifier and the reboots will stop. I deleted this in the middle of the day on a very busy domain with no issues. Auth will continue to function as you still have your trust chains to your root domain. In fact, one must wonder, why do we need this shortcut trust at all? Evaluate your environment and make the call.

Storage Replica in Server 2016

One of the new features of 2016 Server (Datacenter edition only) that I have been playing with is storage replica. I now have this technology running in production for one of our file servers. Before storage replica came along, we were using DFS-R to replicate this particular set of data (~16TB, 27m files) to our DR facility for faster recovery. Now anyone out there that has been using DFS-R for any length of time knows how much of a pain it can be. Anytime we had a slight hiccup on any side of the replication link it would take weeks to get back to normal. Granted, DFS-R has its use cases that work very well, but for a dataset this large it was nothing but problem after problem for me. This is what made me look for alternatives that could be done quickly and cheaply. In comes storage replica, a new feature in 2016 that is volume block-level replication. I have tested this in a stretch cluster configuration and in a server to server configuration. This post will focus on the server to server replication as that is what I decided to do for production.

Why server to server and not stretch cluster?

Stretch cluster works with Synchronous replication only. This is Microsoft’s official documented stance. I found out that async will work with a stretch cluster but it will not failover automatically, it was a manual process to get it to failover.
Our link did not meet the requirements for Sync replication. Our wan link to DR is 10Gb with around 25ms latency. MS recommends nothing over 5ms latency. Now did I try sync over 25ms? I sure did, I put the cowboy hat on and went to town testing all kinds of scenarios that were not officially documented or supported. But did it work? Yes, it did, quite well actually since the way Microsoft does sync replication it actually behaves more like async in reality. Why not stretch cluster then? Because this is a production set of data and Microsoft does not support sync over 25ms links (as of now). I was not going to put us in an unsupported setup even though I think it may have worked just fine.

So server to server replication it is. Here is the setup:

16 TB of unstructured data on 1 volume
2 VMs running on VMware with a standard 18TB vmdk. RDMs will work too.
D drive is replicated, formatted in GPT NTFS, 8k Allocation unit size
Users access data using DFS-N

Steps to configure

Install features on both servers

Install-WindowsFeature -ComputerName SERVERNAME -Name Storage-Replica,FS-FileServer -IncludeManagementTools -restart

Add Disks and Format

Each server needs 2 disks. 1 for the data and another for the replication log. In my setup, D is the data volume, L is the log volume. Both formatted in GPT NTFS. D drive set to 8k blocks (default is 4k), this was to allow for a volume larger than 17TB. Note: I have seen some info on the interwebs that have stated not to use storage replica for volumes larger than 10TB. However, I can’t find anything in MS documentation that states that so I continued on full steam ahead.

2017-04-12 12_12_32-Clipboard

Verify your partitions sizes are exactly the same on both sides. The best way to see this is to run “Get-Partition -DriveLetter D | select size”. Look at the size of the partitions on both servers. This number should match (represented in Bytes). In my case, these did not match exactly and I got an error back “Data partition sizes are different in those two groups“. From the research I have done this is most likely due to the fact that these VMs were not on identical storage (Primary server on Tintri, Secondary on EMC VNX) and the NTFS format did some rounding differently on each VM. To fix this problem, run “Resize-Partition -DriveLetter D -Size xxxxx”, xxxxx being the size of the smaller volume.

Copy your Data over

I used robocopy with /mir to get the data copied to our new primary 2016 server. You only need to copy to one of the servers, not both. As soon as we fire up storage replica the initial block copy will get the data onto your secondary server for you (much faster than robocopy will, I promise you). MS did mention that seeding the data “may” help the speed of this initial block copy. During testing, I did not see a major difference in sync times if I seeded the data first. Your mileage may vary.

Configure Storage Replica

Before turning this on, make sure windows is fully updated. There were some bugs in the RTM version that were fixed in later updates.

Test Topology

This will create a pretty little HTML output to verify you have everything you need in place.

Test-SRTopology -SourceComputerName PrimaryServerName -SourceVolumeName d: -SourceLogVolumeName l: -DestinationComputerName SecondaryServerName -DestinationVolumeName d: -DestinationLogVolumeName l: -DurationInMinutes 5 -ResultPath c:\temp

If you have errors when running this, fix them before going on.

Create replication connections

New-SRPartnership -SourceComputerName PrimaryServerName -SourceRGName Groups-rg01 -SourceVolumeName d: -SourceLogVolumeName l: -DestinationComputerName SecondaryServerName -DestinationRGName Groups-rg02 -DestinationVolumeName d: -DestinationLogVolumeName l: -ReplicationMode Asynchronous

Warning: Make sure you get the replication direction correct! If backwards you will replicate your empty volume over to your primary and wipe out all that data you just robocopied over.

After you run this, the D drive will go offline on the secondary server. Initial block copy will start. Check status by running “get-srgroup”.

Another thing you will probably want to do is increase the log size. By default, it is set to 8GB. I changed ours to 250GB by running these commands.

Set-SRGroup -LogSizeInBytes 250181844992 -Name Groups-rg01
Set-SRGroup -LogSizeInBytes 250181844992 -Name Groups-rg02

You can also run this handy snippet on the secondary server to see how much data is left to copy.

while($true) {
 $v = (Get-SRGroup -Name "Groups-rg02").replicas | Select-Object numofbytesremaining
 [System.Console]::Write("Number of bytes remaining: {0}`r", $v.numofbytesremaining)
 Start-Sleep -s 5
}

I found this initial replication to be very fast. We moved 16TB of data to the secondary site in just under 24 hours. DFS-R took weeks to move this same data.

Configure DFS-N

In our case, users were already using DFS-N to access the data. However, we need to change some defaults to make this work the way I wanted. By default, the namespace client cache is good for 1800 seconds (30 minutes). The problem with this is if we failover to the secondary server, the users would need to reboot or wait up to 30 minutes for their drives to reconnect. That could be a problem so I lowered the cache time to 30 seconds. Yes, it will create a lot more referrals from your namespace servers but it will be worth it :). To do this, open the DFS management console and open properties for the folder your working with. Under referrals, change the cache duration to 30.

2017-04-12 13_21_34-Clipboard

Next, create the shares and add your primary server in as a target to the needed folders. Now, obviously, you need to be careful here and know what you’re doing with your data. This data is not replicated from your old server to the new, only robocopied over. I am not going into all scenarios here on how that is a problem but just think about that for a few.

Now failover to your secondary server. See the section below on how to do that. You need to do this so the volume will come online to configure the shares on the secondary. Once the shares are created add the secondary server in as a DFS-N target. Failover to the primary.

I keep the secondary server target disabled in DFS to prevent clients from trying to use a volume that is offline.

Failing over to secondary server

Set-SRPartnership -NewSourceComputerName SecondaryServerName -SourceRGName Groups-rg02 -DestinationComputerName PrimaryServerName -DestinationRGName Groups-rg01

Run the above command to bring the volume online on the secondary server and reverse replication. Then run over to DFS and enable secondary target, disable the primary target. Clients should reconnect in 30 seconds or less.

Questions I still have

RPO by default is set to 30. But what that means exactly I don’t know. 30 blocks, 30 seconds, 30 minutes, 30% of the log size? I cannot find anything from Microsoft that explains it in detail. My guess would be it is a percentage of the log. I say that because before I resized the log to 250GB we would fall out of RPO pretty quick, after the resize we almost never fall out of RPO.

Update: Microsoft answered this finally. RPO is defined by seconds.

What is the recommended size of the log? Percentage of active data may be or could be set to change rate times a certain RTO? No clear guidance from MS.

Update: https://docs.microsoft.com/en-us/windows-server/storage/storage-replica/storage-replica-frequently-asked-questions#FAQ15.5

Final thoughts

So far I have been impressed with how storage replica works. It was fairly simple to setup, replicated data very fast and overcomes a bunch of DFS limitations and problems.

Positives

Fast replication
Replicates encrypted and locked files
decent visibility into replication status
Runs over SMB3 (side note, we did try and run this through a riverbed steelhead for testing, got about 30% reduction in WAN traffic)

Negatives

1 to 1 replication. Cannot do 1 to many like DFS.
manual steps to failover (unless you stretch cluster it)

References

https://technet.microsoft.com/en-us/windows-server-docs/storage/storage-replica/storage-replica-overview?f=255&MSPPError=-2147217396

https://technet.microsoft.com/en-us/windows-server-docs/storage/storage-replica/server-to-server-storage-replication?f=255&MSPPError=-2147217396

https://technet.microsoft.com/en-us/windows-server-docs/storage/storage-replica/storage-replica-frequently-asked-questions

Monitor Print Server Activity

I was recently approached by management and asked to monitor an older print server to see who was still using it. They wanted a report that would show which users and machines were still using this print server so they could correct old printer mappings. Well of course Microsoft gives us no easy way to do this out of the box so it was time to get creative. Here is the method I used to enable logging and generate a nice pretty TPS report from the data.

Enable Print service logging on the print server
1. Open the event viewer
2. Expand application and service logs > Microsoft > Windows > PrintService
3. Right-click the operational log, click enable log.

Now that logging is enabled you will begin to see the logs fill up with print jobs (Event 307). I let this run for about a week so we could get plenty of data. You may need to adjust the size of the log because I found it filled up way too quickly. The default is 1MB which is stupid low.
Now on to generate a pretty report for these guys. As it turns out this sounds so easy but in fact is a PITA. You can export the log to CSV but there are several issues with this, the biggest one is it exports the users SID instead of username, which of course makes sense because we all know the SID for each user right? There are also several formatting issues with exporting to CSV and long story short, it just didn’t work, you’re welcome to try that on your own if you think I am crazy. What I really needed was a way to pull the XML params out of the detail section of each event and export that to CSV. To accomplish this we turn to our friend, Powershell.

$PrinterList = @()
$StartTime = "04/10/2014 12:00:01 PM"
$EndTime = "04/18/2014 6:00:01 PM"
$Results = Get-WinEvent -FilterHashTable @{LogName="Microsoft-Windows-PrintService/Operational"; ID=307; StartTime=$StartTime; EndTime=$EndTime;} -ComputerName "localhost"

ForEach($Result in $Results){
$ProperyData = [xml][/xml]$Result.ToXml()
$PrinterName = $ProperyData.Event.UserData.DocumentPrinted.Param5
$hItemDetails = New-Object -TypeName psobject -Property @{
DocName = $ProperyData.Event.UserData.DocumentPrinted.Param2
UserName = $ProperyData.Event.UserData.DocumentPrinted.Param3
MachineName = $ProperyData.Event.UserData.DocumentPrinted.Param4
PrinterName = $PrinterName
PageCount = $ProperyData.Event.UserData.DocumentPrinted.Param8
TimeCreated = $Result.TimeCreated
}

$PrinterList += $hItemDetails
}

#Output results to CSV file
$PrinterList | Export-Csv -Path "C:PrintAudit.csv"

This script runs locally on the print server and pulls all the info I needed and dumps into a CSV that can then be fondled within excel. Optionally, you can run this remotely by changing the –ComputerName param at the end of line 4. Of course, you do need to modify the start and end times to your needs.