FTP Docker container with Active Directory Authentication

I am working on a project that required me to come up with a container that could do FTP and use Active Directory as its authentication provider. After several hours of tinkering and reading blog after blog (thank you all for the inspiration!), I finally have a stable working configuration.

Shout out to the most useful blog I ran across, which helped me get further down the line: https://warlord0blog.wordpress.com/2015/08/04/vsftpd-ldap-active-directory-and-virtual-users/

Source code for this project is located at https://github.com/joharper54SEG/Docker-vsftp-ldap

Running the container

Let's start with the basics. This project is based on vsftpd and Ubuntu 18.04. It uses the libpam-ldapd module for authentication, and confd to dynamically configure the services from environment variables that I pass in via a Kubernetes ConfigMap and Secret.

The environment variables you will need to run this solution as-is are:

LDAP_BASE: DC=domain,DC=com
LDAP_BINDDN: username@domain.com
LDAP_BINDPW: password for bind user
LDAP_FILTERMEMBEROF: memberOf=distinguished name of group
LDAP_SSL_CACERTFILE: /etc/ssl/certs/ca-certificates.crt
LDAP_SSL_ENABLE: "off"
LDAP_URI: ldap://ldap.fqdn.com
VSFTPD_GUEST_ENABLE: "YES"
VSFTPD_LOCAL_ENABLE: "YES"
VSFTPD_PASV_ADDRESS: IP of host
VSFTPD_PASV_ENABLE: "YES"
VSFTPD_PASV_MAX_PORT: "10095"
VSFTPD_PASV_MIN_PORT: "10090"
VSFTPD_SSL_CIPHERS: HIGH
VSFTPD_SSL_ENABLE: "NO"
VSFTPD_SSL_PRIVATEKEY: /etc/ssl/certs/privatekey.key
VSFTPD_SSL_PUBLICKEY: /etc/ssl/certs/publickey.pem
VSFTPD_SSL_SSLV2: "NO"
VSFTPD_SSL_SSLV3: "NO"
VSFTPD_SSL_TLSV1: "YES"
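As mentioned above, I feed these in via a Kubernetes ConfigMap and Secret. A minimal sketch of what that can look like (the object names and values here are illustrative, not from the repo):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vsftpd-config        # illustrative name
data:
  LDAP_BASE: "DC=domain,DC=com"
  LDAP_URI: "ldap://ldap.fqdn.com"
  VSFTPD_PASV_ADDRESS: "192.0.2.10"   # IP of the host
---
apiVersion: v1
kind: Secret
metadata:
  name: vsftpd-secret        # illustrative name
type: Opaque
stringData:
  LDAP_BINDPW: "password-for-bind-user"
```

The pod spec would then pull these in as environment variables with envFrom or individual env entries.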

Notable ones here are:

VSFTPD_PASV_ADDRESS – this needs to be set to the IP address of the host the container is running on. Without it, you will be unable to connect using passive FTP.

LDAP_FILTERMEMBEROF – I only wanted members of a certain group to have FTP rights to this server. This is a standard LDAP filter and can be modified to fit whatever you would like.
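For example, to limit logins to a single AD group, the value is the group's distinguished name (the group DN below is made up):

```
LDAP_FILTERMEMBEROF: memberOf=CN=FTP-Users,OU=Groups,DC=domain,DC=com
```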

Confd Process

I wanted a way to adjust settings dynamically without having to rebuild the entire container. I also did not want to store passwords and other sensitive information in the config files, both because it's best practice not to and because I wanted to share the project with the public. At first I tried to build the config dynamically with bash scripting. That turned out to be a giant PITA, so I started digging for alternatives and came across confd. I am going to do a high-level review of how this works; refer to the confd project for more detail: http://www.confd.io/

To get confd working you need config files that tell confd what to do. These are .toml files located in /etc/confd/conf.d and look like this:

[template]
src = "vsftpd.conf.tmpl"
dest = "/etc/vsftpd.conf"
keys = [
  "/vsftpd/pasv/address",
  "/vsftpd/pasv/min/port",
  "/vsftpd/pasv/max/port",
  "/vsftpd/pasv/enable",
  "/vsftpd/ssl/enable",
  "/vsftpd/ssl/tlsv1",
  "/vsftpd/ssl/sslv2",
  "/vsftpd/ssl/sslv3",
  "/vsftpd/ssl/ciphers",
  "/vsftpd/ssl/publickey",
  "/vsftpd/ssl/privatekey",
]

src = the name of the template file located in /etc/confd/templates. You will have a template file for each config you want confd to manage.
dest = where confd writes the file after it does its thing.
keys = these correspond to the environment variables we want confd to use. "/vsftpd/pasv/address" maps to the environment variable VSFTPD_PASV_ADDRESS that we pass at container runtime. Note: the environment variables have to be all caps.
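The mapping between a confd key and its environment variable is mechanical: drop the leading slash, turn the remaining slashes into underscores, and uppercase. A quick shell illustration of that rule:

```shell
key="/vsftpd/pasv/address"
# Drop the leading slash, map '/' to '_' and lowercase to uppercase
var=$(printf '%s' "${key#/}" | tr '/a-z' '_A-Z')
echo "$var"   # VSFTPD_PASV_ADDRESS
```

This is why every key listed in the .toml has a matching all-caps variable in the list above.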

Next are the templates. These are your config files with the variable strings confd substitutes. Just copy your config files here, rename them with a .tmpl extension, and change the key values to the confd strings.

Example: pasv_address={{getv "/vsftpd/pasv/address"}}
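A slightly larger excerpt of what a vsftpd.conf.tmpl can look like, using the keys declared in the .toml above (this is a sketch, not the exact file from the repo):

```
pasv_enable={{getv "/vsftpd/pasv/enable"}}
pasv_address={{getv "/vsftpd/pasv/address"}}
pasv_min_port={{getv "/vsftpd/pasv/min/port"}}
pasv_max_port={{getv "/vsftpd/pasv/max/port"}}
ssl_enable={{getv "/vsftpd/ssl/enable"}}
```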

Now, in the container start script (/sbin/init.sh), the line below runs confd, which takes all our environment variables (keys), builds the config file, and copies it into the correct location (dest).

confd -onetime -backend env

Active Directory Authentication

To make AD authentication work properly, the attribute mappings had to be configured. These mappings live in /etc/nslcd.conf. If you're looking at the source, that is the /etc/confd/templates/nslcd.conf.tmpl file, which gets copied to /etc/nslcd.conf on container start via the confd process.

Each user in AD needs 3 attributes set in order for them to be picked up by nslcd: uidNumber, gidNumber, and unixHomeDirectory. I'm not going to go into detail on these; I will only say that uidNumber should be unique for each user in your environment. How you do that is up to you.
unixHomeDirectory is not being used in this example, but it still has to be set. If you want to use it, uncomment line 18 in /etc/nslcd.conf. In my case, I am setting the home directory to /home/vftp/samaccountname.
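For reference, the relevant part of an nslcd.conf for AD looks something like this (the filter, group DN, and mappings below are illustrative; check the repo's nslcd.conf.tmpl for the real values):

```
uri ldap://ldap.fqdn.com
base DC=domain,DC=com
binddn username@domain.com
bindpw secret

# Only allow members of one AD group, and map AD attributes to POSIX fields
filter passwd (&(objectClass=user)(memberOf=CN=FTP-Users,OU=Groups,DC=domain,DC=com))
map    passwd uid           sAMAccountName
map    passwd homeDirectory "/home/vftp/$sAMAccountName"
```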

Next up are the PAM modules. We have one module we referenced in our config (/etc/pam.d/vsftpd) as I only want AD users logging in. To use local users this would have to be modified.

Next look at /etc/nsswitch.conf. You will see the ldap entries in there for passwd, group, and shadow.
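The relevant lines look something like this (the exact source list may differ in the repo):

```
passwd: files ldap
group:  files ldap
shadow: files ldap
```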

Now, if everything lines up, you should be able to log in to FTP with your AD accounts. Run "getent passwd" from inside the container and you should see users from AD listed there.

Virtual Users

So before this project, I had no idea what virtual users were. The basic explanation is that these are users that are not local on the system. If you run "cat /etc/passwd" you will not see them listed, but you will see them if you run "getent passwd". These users cannot log in via SSH or any other service, but they are granted access to FTP by our PAM configs. This is why we need the "guest_enable=YES" option in our vsftpd.conf.

This also creates some interesting things when it comes to file permissions, as the vsftpd daemon runs as the "ftp" user when a guest logs in. You can create your own user and tell vsftpd to use it instead via the "guest_username=svc_ftp" option. As you can see, I needed to do this because I wanted to share data with other containers on the system via Docker volumes. I needed a consistent uid across this dataset for permissions to work, and the built-in ftp user had a conflicting uid. This also means that the uid set on the user in AD is not really used here beyond adding them into the cache service as a virtual user.
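The vsftpd.conf options from this paragraph, pulled together (svc_ftp is the custom guest account described above):

```
# Treat authenticated non-local (virtual) users as guests
guest_enable=YES
# Run guest sessions as this local account instead of the built-in "ftp" user
guest_username=svc_ftp
local_enable=YES
```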

Depending on your requirements you may want to tinker with these permission settings a bit more if security is a concern. For my use case, these settings work fine.

Conclusion

It works! Feel free to use this information as you need, and take the time to familiarize yourself with all the settings in each config file, as I am skipping over quite a few of them in this post.

Fix XenDesktop MCS provisioning scheme after VMware network change

We recently changed our VMware switching from standard to distributed switches. After the change, we were no longer able to deploy new VDIs using MCS. We would get an error on the machine catalog that stated "There is no available network enabled for provisioning in this hosting unit. Path not found".

So we consulted the manual (Google) and found this article from Citrix: https://support.citrix.com/article/CTX139460. It gives directions on how to update the hosting unit network after a change on the hypervisor side. We followed the article and updated the network, only to find out that it would not fix our issue. The Citrix article even warns that this will not fix new VDI deployments, and recommends creating a new machine catalog. This thoroughly aggravated me! I should not have to create a new machine catalog to update this parameter! So I called Citrix support and, well... we know how that turned out.

So, fresh off the call with support and unwilling to accept defeat, I found a way! Here is how to fix it. Disclaimer: this is not supported by Citrix and involves modifying the database directly. Proceed with caution!

  1. Follow CTX139460 and update your hosting config.
  2. Create a new machine catalog (don't worry, it's only temporary).
  3. Back up your site database – CYA!
  4. Open SQL Management studio and connect to the SQL server
  5. Open your site database and look for a table named DesktopUpdateManagerSchema.ProvisioningSchemeNetworkMap
  6. Right-click the table and select "Select Top 1000 Rows".
    You will see an output like the below. You should have one entry for each machine catalog that uses MCS. Note that the bottom record is the new catalog we just created, and it has the right network.
  7. Right-click the table again and select "Edit Top 200 Rows".
  8. Copy the correct networkid and networkpath values to your other provisioning schemes that are broken. (Run Get-ProvScheme to see your schemes; look for the UID that matches what you see in the database.)

After you modify these values you should be able to deploy machines using your existing catalog. You can now remove the new catalog we created in step 2.
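If you prefer to script step 8, the edit boils down to one UPDATE per broken scheme. This is a hedged sketch only (the values and UID are placeholders; the table and column names are as described above), and again: back up the database first.

```sql
-- Copy the working network values from the new catalog's row to a broken
-- provisioning scheme. The scheme UID comes from Get-ProvScheme.
UPDATE DesktopUpdateManagerSchema.ProvisioningSchemeNetworkMap
SET    NetworkId   = '<working-network-id>',
       NetworkPath = '<working-network-path>'
WHERE  ProvisioningSchemeUid = '<uid-of-broken-scheme>';
```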

Azure DevOps Agent in Docker with Deployment Groups

I recently started learning how to use Azure DevOps, formerly VSTS, to do automated build and deploy tasks against several Docker hosts. In this POC, I will have about 600 machines running Docker in remote locations acting as edge/branch servers. These Docker hosts will run 5-6 containers to handle basic services such as DNS, FTP, FTE, and PowerShell for automation. I wanted an easy way to deploy and update containers on these 600 remote machines. I looked at various options, from Rancher to custom PowerShell scripts hitting the Docker API and/or SSHing into each machine to run the docker commands. For this post, we are going to focus on how I accomplished these tasks using Azure DevOps.

Microsoft has already released pre-built Docker images for the VSTS agent (https://hub.docker.com/_/microsoft-azure-pipelines-vsts-agent). While this worked great, it only allowed me to register the agent into an agent pool in DevOps. In my case, I wanted the agent to register with a deployment group so I could run the same task on every agent. It turns out Microsoft has no documented way to accomplish that. So I started digging through how they built their container; the agent looks exactly like the one they install on Linux (no surprise there), which they do provide instructions for registering in a deployment group. So now it's just down to getting the container version to accept the options for deployment groups. Here's how I did it.

Pull the image down from the hub to your localhost. In this case, I want the agent to have the docker command line utilities installed so I can interface with docker on the host through the Unix socket. Microsoft has an image already built for this:

docker pull mcr.microsoft.com/azure-pipelines/vsts-agent:ubuntu-16.04-docker-18.06.1-ce

Now, if you look at the directions in DevOps on how to deploy an agent to Linux (Pipelines > Deployment Groups > Register), you will find options that are not in the documentation for the Docker-based agent, nor is the container from Microsoft set up to accept them. Those options are --deploymentgroup and --deploymentgroupname.

After inspecting the image, we can see that it launches the start.sh script: CMD /bin/sh -c ./start.sh

This script can be downloaded from github here:
https://github.com/Microsoft/vsts-agent-docker

To make this do what we need we will need to modify the start.sh script and then build a custom container with our changes.

#Lines 40-46
if [ -n "$VSTS_DEPLOYMENTPOOL" ]; then
  export VSTS_DEPLOYMENTPOOL="$(eval echo $VSTS_DEPLOYMENTPOOL)"
fi

if [ -n "$VSTS_PROJECTNAME" ]; then
  export VSTS_PROJECTNAME="$(eval echo $VSTS_PROJECTNAME)"
fi

#Lines 91-100.  Remove --pool. 
#Add --deploymentgroup, --deploymentgroupname, --projectname

./bin/Agent.Listener configure --unattended \
  --agent "${VSTS_AGENT:-$(hostname)}" \
  --url "https://$VSTS_ACCOUNT.visualstudio.com" \
  --auth PAT \
  --token $(cat "$VSTS_TOKEN_FILE") \
  --work "${VSTS_WORK:-_work}" \
  --deploymentgroup \
  --deploymentgroupname "${VSTS_DEPLOYMENTPOOL:-default}" \
  --projectname "${VSTS_PROJECTNAME:-projectnamedefault}" \
  --replace & wait $!

Create a new dockerfile that will copy in your modified start.sh and build your new container.

FROM mcr.microsoft.com/azure-pipelines/vsts-agent:ubuntu-16.04-docker-18.06.1-ce

COPY start.sh /vsts

WORKDIR /vsts

RUN chmod +x start.sh

CMD ./start.sh

Now you can run your container. Specify your deployment group and project name as environment variables at runtime.

docker run -e VSTS_ACCOUNT=yourAccountName \
-e VSTS_TOKEN=yourtoken \
-e VSTS_DEPLOYMENTPOOL=yourpoolname \
-e VSTS_PROJECTNAME=yourprojectname \
-v /var/run/docker.sock:/var/run/docker.sock \
-d \
--name DevOps-Agent \
--hostname $(hostname) \
-it \
yourprivateregistry/devops-agent

And boom, achievement badge unlocked! You now have agents reporting to a deployment group.

Detach LUN by naa ID

The supported way to remove a LUN from VMware is to unmount and detach it before removing your zoning. The unmount is easy, as there is a GUI option for it, but the detach has to be done on each host's adapter; there is no easy button. PowerCLI to the rescue! This script will run through each host in a specified cluster and detach the LUN.

Connect-VIServer vcenter.domain.com
$LunIDs = @("naa.xxxxxxxxxxxxxx")
$Clustername = "clustername"

function Detach-Disk {
    param(
        [VMware.VimAutomation.ViCore.Impl.V1.Inventory.VMHostImpl]$VMHost,
        [string]$CanonicalName)
    $storSys = Get-View $VMHost.Extensiondata.ConfigManager.StorageSystem
    $lunUuid = (Get-ScsiLun -VmHost $VMHost | where {$_.CanonicalName -eq $CanonicalName}).ExtensionData.Uuid
    $storSys.DetachScsiLun($lunUuid)
}

$ClusterHosts = Get-Cluster $Clustername | Get-VMHost
Foreach ($VMHost in $ClusterHosts)
{
    Foreach ($LUNid in $LunIDs)
    {
        Write-Host "Detaching" $LUNid "from" $VMHost -ForegroundColor "Yellow"
        Detach-Disk -VMHost $VMHost -CanonicalName $LUNid
    }
}


PowerBI report for VMware using streamed dataset

I've been "experimenting" with PowerBI lately and gotta admit, I kinda like it! I am not a data-sciencey kind of guy, but I can appreciate a nice dashboard that supplies data at a glance. During my experimenting I have used PowerBI to plug into all sorts of data: Citrix, AD, CSV files, and databases. Today I will show you what I did with VMware using a PowerBI streamed dataset (API). At the time of writing this feature is in preview, but it works, so it gets my vote for production deployment! 🙂

My first thought was to plug directly into the vCenter database. All my vCenters are appliances using the built-in PostgreSQL DB. PowerBI does have a connector for this DB, so it is an option. However, upon further reading, it looks like I would need to modify some config files on the appliance and install some PowerBI software to make the connector work. All doable, but VMware does not officially support the changes needed to allow remote database access to vCenter, so I was hesitant to mess with production. No doubt it would work and everything would be fine, but I had to take off my cowboy hat, be a good little architect (my boss might read this! 🙂 ), and find a supported path. I tried several different methods to access the data I wanted; for this post we will use the API method.

"Streamed datasets" in PowerBI is a fancy name for an API post. Basically, we collect data and use a REST POST operation to push it into PowerBI. For this to work you will need a PowerBI Pro subscription. Streamed datasets, from what I can tell, are only available in the cloud version of PowerBI, so the desktop app and free edition won't work for this method.

To get started, log in to PowerBI (https://powerbi.microsoft.com). In the top right corner, under your pretty photo, is a Create button. Click that and select Streaming dataset.

powerbi1

Select API and click next.

powerbi2

Next you will see a screen like this. Fill out each field that you plan to post data to, then click Create.

powerbi3

Next you will be presented with a screen that displays the info you need to post data. Select the PowerShell tab and copy the code output. We will use this later.

powerbi4

Now we need to create a PowerShell script that will connect to vCenter, pull the data we need using PowerCLI, and then POST it to PowerBI. (Squirrel! Why are all these product names starting with "power"? Is it supposed to make me feel powerful?) Anywho, here is the script I wrote to get this done:


$vcenter = "vcenter host name"
$cluster = "cluster name"

Import-Module VMware.VimAutomation.Core
Connect-VIServer -Server $vcenter

#This line passes creds to the proxy for auth. If you don't have a PITA proxy, comment it out.
[System.Net.WebRequest]::DefaultWebProxy.Credentials = [System.Net.CredentialCache]::DefaultCredentials

$date = Get-Date
$datastore = Get-Cluster -Name $cluster | Get-Datastore | Where-Object {$_.Type -eq 'VMFS' -and $_.Extensiondata.Summary.MultipleHostAccess}

$hostinfo = @()
ForEach ($vmhost in (Get-Cluster -Name $cluster | Get-VMHost))
{
    $HostView = $VMhost | Get-View
    $HostSummary = "" | Select HostName, ClusterName, MemorySizeGB, CPUSockets, CPUCores, Version
    $HostSummary.HostName = $VMhost.Name
    $HostSummary.MemorySizeGB = $HostView.hardware.memorysize / 1024Mb
    $HostSummary.CPUSockets = $HostView.hardware.cpuinfo.numCpuPackages
    $HostSummary.CPUCores = $HostView.hardware.cpuinfo.numCpuCores
    $HostSummary.Version = $HostView.Config.Product.Build
    $hostinfo += $HostSummary
}

$vminfo = @()
foreach ($vm in (Get-Cluster -Name $cluster | Get-VM))
{
    $VMView = $vm | Get-View
    $VMSummary = "" | Select ClusterName, HostName, VMName, VMSockets, VMCores, CPUSockets, CPUCores, VMMem
    $VMSummary.VMName = $vm.Name
    $VMSummary.VMSockets = $VMView.Config.Hardware.NumCpu
    $VMSummary.VMCores = $VMView.Config.Hardware.NumCoresPerSocket
    $VMSummary.VMMem = $VMView.Config.Hardware.MemoryMB
    $vminfo += $VMSummary
}

$TotalStorage = ($datastore | Measure-Object -Property CapacityMB -Sum).Sum / 1024
$AvailableStorage = ($datastore | Measure-Object -Property FreeSpaceMB -Sum).Sum / 1024
$NumofVMs = $vminfo.Count
$NumofVMCPUs = ($vminfo | Measure-Object -Property "VMSockets" -Sum).Sum
$NumofHostCPUs = ($hostinfo | Measure-Object -Property "CPUCores" -Sum).Sum
$HostVM2coreRatio = $NumofVMCPUs / $NumofHostCPUs
$TotalHostRAM = ($hostinfo | Measure-Object -Property "MemorySizeGB" -Sum).Sum / 1024
$TotalVMRAM = ($vminfo | Measure-Object -Property "VMMem" -Sum).Sum / 1024 / 1024
$NumOfHosts = $hostinfo.count
$NumOfHostsSockets = ($hostinfo | Measure-Object -Property "CPUSockets" -Sum).Sum

## This section is where you paste the code output by powerBI

$endpoint = "https://api.powerbi.com/beta/..."
$payload = @{
"Date" = $date
"Total Storage" = $TotalStorage
"Available Storage" = $AvailableStorage
"NumofVMs" = $NumofVMs
"NumofVMCPUs" = $NumofVMCPUs
"NumofHostCPUs" = $NumofHostCPUs
"HostVM2coreRatio" = $HostVM2coreRatio
"TotalHostRAM" = $TotalHostRAM
"TotalVMRAM" = $TotalVMRAM
"NumOfHosts" = $NumOfHosts
"NumOfHostsSockets" = $NumOfHostsSockets
}
Invoke-RestMethod -Method Post -Uri "$endpoint" -Body (ConvertTo-Json @($payload))


Now that we have data in the dataset, you can create the report. Click the Create button in PowerBI (the one under your pretty picture) and click Report. It will ask you to select a dataset; select the one we just created.

In the fields selection area you will see all the datapoints we set up in the dataset. Each one should contain the data we just posted via the PowerShell script.

powerbi5

I am not going to get into how to use PowerBI in this post, as there are plenty of google-able blogs already written on the topic. But for a quick example, this is what my initial report looks like.

powerbi6

The sky is the limit here when it comes to how you present the data and build your report. PowerBI is a really neat (and currently cheap) tool Microsoft offers for building good-looking dashboards and reports. This example is just what I started with in an attempt to play with streamed datasets. You can add/remove as many data points as you want and build the report out as you wish. You can also use this method on things other than VMware; VMware is just the product I chose to test with.

Some limitations to this method:

  1. Once data is posted into the dataset, it cannot be removed. Streamed datasets are a preview feature at time of posting, so this may not be true in the future.
  2. Data is not real-time. It only refreshes when the PowerShell script runs.
  3. PowerBI cannot manipulate the data, only report on it (if it can, I have not found it yet). This means you cannot do math on two sets of data to come up with a third datapoint. That's why you see the math being done in PowerShell and the result posted to PowerBI for reporting; the pCPU-to-vCPU ratio is a good example of this.


Clear SEL Logs for ESXi Hosts

I recently had a problem where VMware was reporting a memory module failure for all of our UCS blades. After working with Cisco and VMware, we came to the conclusion that the alerts were false and were caused by the SEL logs being full on the hosts. To clear these logs, VMware support requested that I run "localcli hardware ipmi sel clear" on every host and then restart the management agents. I have about 100 hosts that needed this done, so going one by one was not something I wanted to do. PowerCLI to the rescue!

You will need the latest PowerCLI installed for this to run as-is; PowerCLI changed to module-based in the latest versions. You will also need to install Posh-SSH from the PowerShell Gallery. If unsure how to do that, consult the manual (Google)!

#Script to clear the SEL logs of each host in a cluster.
#Requires Posh-ssh to be installed from the powershell gallery
#Requires Latest PowerCLI to be installed (Module based, not snapin based)

Import-Module -Name VMware.VimAutomation.Core

Connect-VIServer vcenter

$cluster = "ClusterName"
$esxilogin = Get-Credential -Credential "root"

$esx_all = Get-Cluster $cluster| Get-VMHost

foreach ($esx in $esx_all){

    $sshService = Get-VmHostService -VMHost $esx.Name | Where { $_.Key -eq "TSM-SSH"}
    Start-VMHostService -HostService $sshService -Confirm:$false
    # Connect with ssh and execute the commands
    New-SSHSession -ComputerName $esx.Name -AcceptKey -Credential $esxilogin
    Invoke-SSHCommand -SessionId 0 -Command "localcli hardware ipmi sel clear; nohup /sbin/services.sh restart > foo.out 2> foo.err < /dev/null &"
    #pause to allow management agents to fire back up
    Start-Sleep -Seconds 75
    Remove-SSHSession -SessionId 0
    Stop-VMHostService -HostService $sshService -Confirm:$false
}

Adding Multiple Disks to VMs with PowerCLI

I don't know about all of you, but it seems like I get a request to build a new SQL server every couple of weeks. Whether or not we should have that many SQL servers is a different matter, but we do, so I got tired of building the same thing over and over. You could build a VM template just for SQL deployments, or you could do it with PowerShell like I am writing about here. Our SQL DBAs have a certain configuration they want laid down on each new SQL server we pump out. The problem... it's 12 disks. Yes, 12. So I have to add 12 disks to the VM, then initialize them, then format them and mount them to the correct locations. It takes a while and it's boring, so I want to get it over with ASAP and move on to more youtu... err, work.

Before we start, the assumption is that you already have a VM deployed from a template and running. For this script, I have added a second SCSI controller (paravirtual) to the VM for all the SQL disks.

Adding disks to the VM

This will add the disks I need to the VM. If you are looking at this and thinking: You're not using RDMs for SQL? You're using thin provisioning for SQL? You're not using different datastores for each disk? Are you crazy? The answer: yes, yes I am. But let us stay on topic and not get into the weeds in this post.

Connect-VIServer vcenter.domain.com
$VM = get-vm -name yourvmname

New-HardDisk -VM $VM -CapacityGB 50 -StorageFormat Thin -Controller "SCSI controller 0"
New-HardDisk -VM $VM -CapacityGB 250 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 250 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 250 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 250 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 50 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 50 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 50 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 50 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 50 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 50 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 50 -StorageFormat Thin -Controller "SCSI Controller 1"
New-HardDisk -VM $VM -CapacityGB 50 -StorageFormat Thin -Controller "SCSI Controller 1"


Do all the “stuff” inside the VM

The script is pretty easy to read and is pretty self-explanatory.  Basically, it is doing all the steps I would have done by hand before.

#Bring Disks Online
get-disk | where operationalstatus -eq Offline | Set-Disk -IsOffline:$false

#Find RAW online disks and init with GPT
get-disk | where PartitionStyle -eq RAW | Initialize-Disk -PartitionStyle GPT

#Make mount point dirs
mkdir D:\DBDataFiles01
mkdir D:\DBDataFiles02
mkdir D:\DBDataFiles03
mkdir D:\DBDataFiles04
mkdir D:\DBLogFiles01
mkdir D:\DBLogFiles02
mkdir D:\DBLogFiles03
mkdir D:\DBLogFiles04
mkdir D:\TempDB01
mkdir D:\TempDB02
mkdir D:\TempDB03
mkdir D:\TempDB04

$i = get-disk -Number 2
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\DBDataFiles01"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel DBDataFiles01 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True

$i = get-disk -Number 3
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\DBDataFiles02"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel DBDataFiles02 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True

$i = get-disk -Number 4
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\DBDataFiles03"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel DBDataFiles03 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True

$i = get-disk -Number 5
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\DBDataFiles04"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel DBDataFiles04 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True


$i = get-disk -Number 6
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\DBLogFiles01"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel DBLogFiles01 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True

$i = get-disk -Number 7
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\DBLogFiles02"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel DBLogFiles02 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True

$i = get-disk -Number 8
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\DBLogFiles03"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel DBLogFiles03 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True

$i = get-disk -Number 9
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\DBLogFiles04"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel DBLogFiles04 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True


$i = get-disk -Number 10
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\TempDB01"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel TempDB01 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True

$i = get-disk -Number 11
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\TempDB02"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel TempDB02 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True

$i = get-disk -Number 12
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\TempDB03"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel TempDB03 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True

$i = get-disk -Number 13
New-Partition -DiskNumber $i.Number -UseMaximumSize
Add-PartitionAccessPath -DiskNumber $i.Number -PartitionNumber 2 -AccessPath "D:\TempDB04"
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Format-Volume -FileSystem NTFS -NewFileSystemLabel TempDB04 -Confirm:$false
Get-Partition -DiskNumber $i.Number -PartitionNumber 2 | Set-Partition -NoDefaultDriveLetter:$True

A couple of interesting notes. "Set-Partition -NoDefaultDriveLetter:$True" needs to be run, or the first time you reboot the VM all your disks will get a drive letter assigned. Since we are using mount points here, we don't need drive letters.

Make sure the disk numbers match up to what is shown in Disk Management, or via get-disk if you're on Core with no GUI.

I realize there are probably ways to code this to make it shorter. If that is something you want to improve upon, knock yourself out. This was the quick and easy way I threw it together to GSD (get stuff done).

Storage Replica in Server 2016

One of the new features of 2016 Server (Datacenter edition only) that I have been playing with is storage replica.  I now have this technology running in production for one of our file servers.  Before storage replica came along, we were using DFS-R to replicate this particular set of data (~16TB, 27m files) to our DR facility for faster recovery.  Now, anyone out there that has been using DFS-R for any length of time knows how much of a pain it can be.  Anytime we had a slight hiccup on either side of the replication link, it would take weeks to get back to normal.  Granted, DFS-R has its use cases that work very well, but for a dataset this large it was nothing but problem after problem for me.  This is what made me look for alternatives that could be done quickly and cheaply.  In comes storage replica, a new feature in 2016 that does volume block-level replication.  I have tested this in a stretch cluster configuration and in a server-to-server configuration.  This post will focus on server-to-server replication, as that is what I decided to do for production.

Why server to server and not stretch cluster?

  1. Stretch cluster works with synchronous replication only.  This is Microsoft’s official documented stance.  I found out that async will work with a stretch cluster, but it will not fail over automatically; failover was a manual process.
  2. Our link did not meet the requirements for sync replication.  Our WAN link to DR is 10Gb with around 25ms latency.  MS recommends nothing over 5ms latency.  Now did I try sync over 25ms?  I sure did; I put the cowboy hat on and went to town testing all kinds of scenarios that were not officially documented or supported.  But did it work?  Yes, it did, quite well actually, since the way Microsoft does sync replication it behaves more like async in reality.  Why not stretch cluster then?  Because this is a production set of data and Microsoft does not support sync over 25ms links (as of now).  I was not going to put us in an unsupported setup even though I think it may have worked just fine.

So server to server replication it is.  Here is the setup:

  • 16 TB of unstructured data on 1 volume
  • 2 VMs running on VMware with a standard 18TB vmdk.  RDMs will work too.
  • D drive is replicated, formatted in GPT NTFS, 8k Allocation unit size
  • Users access data using DFS-N

Steps to configure

Install features on both servers

Install-WindowsFeature -ComputerName SERVERNAME -Name Storage-Replica,FS-FileServer -IncludeManagementTools -restart

Add Disks and Format

Each server needs 2 disks: 1 for the data and another for the replication log.  In my setup, D is the data volume and L is the log volume, both formatted GPT NTFS.  The D drive is set to 8K allocation units (the default is 4K); this was to allow for a volume larger than 16TB, the NTFS limit with 4K clusters.  Note: I have seen some info on the interwebs stating not to use storage replica for volumes larger than 10TB.  However, I can’t find anything in MS documentation that states that, so I continued on full steam ahead.
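A sketch of the formatting step, assuming the data disk is number 1 and the log disk is number 2 (adjust the disk numbers and labels to your own layout):

```powershell
# Data volume: GPT, NTFS, 8K allocation units, drive letter D
Initialize-Disk -Number 1 -PartitionStyle GPT
New-Partition -DiskNumber 1 -UseMaximumSize -DriveLetter D |
    Format-Volume -FileSystem NTFS -AllocationUnitSize 8192 -NewFileSystemLabel Data -Confirm:$false

# Log volume: GPT, NTFS, default allocation units, drive letter L
Initialize-Disk -Number 2 -PartitionStyle GPT
New-Partition -DiskNumber 2 -UseMaximumSize -DriveLetter L |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel Log -Confirm:$false
```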


Verify your partition sizes are exactly the same on both sides.  The best way to see this is to run “Get-Partition -DriveLetter D | Select-Object Size” and look at the size of the partitions on both servers.  This number should match (it is shown in bytes).  In my case, these did not match exactly and I got the error “Data partition sizes are different in those two groups“.  From the research I have done, this is most likely because these VMs were not on identical storage (primary server on Tintri, secondary on an EMC VNX) and the NTFS format did some rounding differently on each VM.  To fix this problem, run “Resize-Partition -DriveLetter D -Size xxxxx”, xxxxx being the size of the smaller volume.
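The size check above can be scripted from one machine; this sketch assumes PowerShell remoting is enabled and that PrimaryServerName/SecondaryServerName are placeholders for your servers:

```powershell
# Compare the D: partition size on both servers (sizes are in bytes)
$sizes = Invoke-Command -ComputerName PrimaryServerName, SecondaryServerName -ScriptBlock {
    (Get-Partition -DriveLetter D).Size
}
$sizes | ForEach-Object { "{0:N0} bytes" -f $_ }

# If they differ, shrink the larger one down to the smaller size.
$target = ($sizes | Measure-Object -Minimum).Minimum
# Then, on the server with the larger partition:
# Resize-Partition -DriveLetter D -Size $target
```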

Copy your Data over

I used robocopy with /mir to get the data copied to our new primary 2016 server.  You only need to copy to one of the servers, not both.  As soon as we fire up storage replica, the initial block copy will get the data onto your secondary server for you (much faster than robocopy will, I promise you).  MS did mention that seeding the data “may” help the speed of this initial block copy.  During testing, I did not see a major difference in sync times if I seeded the data first.  Your mileage may vary.
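A hedged example of the copy command (the source path, share name, log path, and thread count are all placeholders; review the switches against your own requirements before running, since /MIR deletes files at the destination that don’t exist at the source):

```
robocopy \\OldServer\Groups D:\Groups /MIR /COPYALL /DCOPY:DAT /R:1 /W:1 /MT:32 /LOG:C:\temp\robocopy.log
```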

Configure Storage Replica

Before turning this on, make sure windows is fully updated.  There were some bugs in the RTM version that were fixed in later updates.

Test Topology

This will create a pretty little HTML output to verify you have everything you need in place.

Test-SRTopology -SourceComputerName PrimaryServerName -SourceVolumeName d: -SourceLogVolumeName l: -DestinationComputerName SecondaryServerName -DestinationVolumeName d: -DestinationLogVolumeName l: -DurationInMinutes 5 -ResultPath c:\temp

If you have errors when running this, fix them before going on.

Create replication connections

New-SRPartnership -SourceComputerName PrimaryServerName -SourceRGName Groups-rg01 -SourceVolumeName d: -SourceLogVolumeName l: -DestinationComputerName SecondaryServerName -DestinationRGName Groups-rg02 -DestinationVolumeName d: -DestinationLogVolumeName l: -ReplicationMode Asynchronous

Warning: Make sure you get the replication direction correct!  If it’s backwards, you will replicate your empty volume over to your primary and wipe out all that data you just robocopied over.

After you run this, the D drive will go offline on the secondary server and the initial block copy will start.  Check status by running “Get-SRGroup”.

Another thing you will probably want to do is increase the log size. By default, it is set to 8GB.  I changed ours to 250GB by running these commands.

Set-SRGroup -LogSizeInBytes 250181844992 -Name Groups-rg01
Set-SRGroup -LogSizeInBytes 250181844992 -Name Groups-rg02

You can also run this handy snippet on the secondary server to see how much data is left to copy.

while ($true) {
    $v = (Get-SRGroup -Name "Groups-rg02").Replicas | Select-Object NumOfBytesRemaining
    [System.Console]::Write("Number of bytes remaining: {0}`r", $v.NumOfBytesRemaining)
    Start-Sleep -Seconds 5
}

I found this initial replication to be very fast.  We moved 16TB of data to the secondary site in just under 24 hours.  DFS-R took weeks to move this same data.

Configure DFS-N

In our case, users were already using DFS-N to access the data.  However, I needed to change some defaults to make this work the way I wanted.  By default, the namespace client cache is good for 1800 seconds (30 minutes).  The problem with this is that if we fail over to the secondary server, the users would need to reboot or wait up to 30 minutes for their drives to reconnect.  That could be a problem, so I lowered the cache time to 30 seconds.  Yes, it will create a lot more referrals from your namespace servers, but it will be worth it :).  To do this, open the DFS management console and open properties for the folder you’re working with.  Under referrals, change the cache duration to 30.


Next, create the shares and add your primary server in as a target to the needed folders. Now, obviously, you need to be careful here and know what you’re doing with your data. This data is not replicated from your old server to the new, only robocopied over.  I am not going into all scenarios here on how that is a problem but just think about that for a few.

Now failover to your secondary server.  See the section below on how to do that.  You need to do this so the volume will come online to configure the shares on the secondary. Once the shares are created add the secondary server in as a DFS-N target.  Failover to the primary.

I keep the secondary server target disabled in DFS to prevent clients from trying to use a volume that is offline.

Failing over to secondary server

Set-SRPartnership -NewSourceComputerName SecondaryServerName -SourceRGName Groups-rg02 -DestinationComputerName PrimaryServerName -DestinationRGName Groups-rg01

Run the above command to bring the volume online on the secondary server and reverse replication.  Then run over to DFS, enable the secondary target, and disable the primary target.  Clients should reconnect in 30 seconds or less.
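Failing back is the same command with the roles swapped, plus flipping the DFS-N targets.  A sketch, using the server and replication group names from the examples above (the namespace path and share names are placeholders, and the DFSN cmdlets assume the DFS management tools are installed):

```powershell
# Reverse replication back to the primary server
Set-SRPartnership -NewSourceComputerName PrimaryServerName -SourceRGName Groups-rg01 `
    -DestinationComputerName SecondaryServerName -DestinationRGName Groups-rg02

# Flip the DFS-N targets so clients follow the copy that is now online
Set-DfsnFolderTarget -Path "\\domain.com\Namespace\Groups" -TargetPath "\\PrimaryServerName\Groups" -State Online
Set-DfsnFolderTarget -Path "\\domain.com\Namespace\Groups" -TargetPath "\\SecondaryServerName\Groups" -State Offline
```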

Questions I still have

RPO by default is set to 30.  But what that means exactly, I don’t know.  30 blocks, 30 seconds, 30 minutes, 30% of the log size?  I cannot find anything from Microsoft that explains it in detail.  My guess would be that it is a percentage of the log.  I say that because before I resized the log to 250GB we would fall out of RPO pretty quickly; after the resize we almost never fall out of RPO.

Update:  Microsoft answered this finally.  RPO is defined by seconds.

What is the recommended size of the log?  A percentage of active data, maybe, or change rate multiplied by a certain RTO?  There is no clear guidance from MS.

Update:  https://docs.microsoft.com/en-us/windows-server/storage/storage-replica/storage-replica-frequently-asked-questions#FAQ15.5

Final thoughts

So far I have been impressed with how storage replica works.  It was fairly simple to set up, replicates data very fast, and overcomes a bunch of DFS limitations and problems.

Positives

  • Fast replication
  • Replicates encrypted and locked files
  • Decent visibility into replication status
  • Runs over SMB3 (side note: we did try running this through a Riverbed Steelhead for testing and got about a 30% reduction in WAN traffic)

Negatives

  • 1-to-1 replication.  Cannot do 1-to-many like DFS.
  • Manual steps to fail over (unless you stretch-cluster it)

References

https://technet.microsoft.com/en-us/windows-server-docs/storage/storage-replica/storage-replica-overview

https://technet.microsoft.com/en-us/windows-server-docs/storage/storage-replica/server-to-server-storage-replication

https://technet.microsoft.com/en-us/windows-server-docs/storage/storage-replica/storage-replica-frequently-asked-questions

Provisioned XenApp Servers Stop Accepting Connections When the License Server is Unavailable

Recently we had an outage in our environment which was caused by the license server daemon failing.  All of our XenApp servers are provisioned servers, and they reboot nightly to pull in a fresh image for the next day.  In this particular case, the license server failed sometime in the middle of our reboot cycle, so some servers came up and could not contact the license server.  This caused those servers to stop accepting user connections.  You might say, “but what about the 30-day grace period?”  Well, yeah, about that: these are provisioned app servers, so the license cache file that XenApp stores locally gets wiped out when you reboot the server.  Citrix provided a fix for this back in XenApp 6, as documented in http://support.citrix.com/article/CTX131202.  That article is confusing and not written correctly, so I thought I would put the steps we followed in our environment out here for your pleasure.

  1. Create a folder on a file server and share that folder
  2. Modify the permissions on that folder to add the computer account for each XenApp server in your farm.  We created a group in AD with all the computer accounts in it and then assigned the group to this folder.  The accounts need Modify permission to this folder.  Make sure you check your share permissions as well.
  3. Modify the registry key from step 2 of the Citrix article to point to the UNC path of the folder you just created.  This key should be modified on your image while it is in private mode.
  4. Commit your image and assign it to your servers.

The next time your XenApp servers boot, they will create a file called MPS-WSXICA_MPS-WSXICA_SERVERNAME.ini in your new share.  Each server creates its own file; this is the part the Citrix article fails to clarify.  The license server must be online during the first boot so the servers can pull a license; you cannot try this in the middle of a license server outage.

After these changes are made, if your provisioned servers come online while the license server is down, they will enter the grace period and you will prevent an outage.  That is, as long as your newly created file share is accessible.


Monitor Print Server Activity

I was recently approached by management and asked to monitor an older print server to see who was still using it.  They wanted a report that would show which users and machines were still using this print server so they could correct old printer mappings.  Well of course Microsoft gives us no easy way to do this out of the box so it was time to get creative.  Here is the method I used to enable logging and generate a nice pretty TPS report from the data.

Enable Print service logging on the print server
1. Open the event viewer
2. Expand application and service logs > Microsoft > Windows > PrintService
3. Right-click the operational log, click enable log.

Now that logging is enabled, you will begin to see the log fill up with print jobs (Event ID 307).  I let this run for about a week so we could get plenty of data.  You may need to adjust the size of the log, because I found it filled up way too quickly.  The default is 1MB, which is stupid low.
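If you would rather script the enable/resize steps than click through the event viewer, they can be sketched with wevtutil (the 50MB max size here is an arbitrary example, not a recommendation):

```
wevtutil sl Microsoft-Windows-PrintService/Operational /e:true /ms:52428800
```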
Now on to generating a pretty report for these guys.  As it turns out, this sounds easy but is in fact a PITA.  You can export the log to CSV, but there are several issues with this; the biggest one is that it exports the user’s SID instead of the username, which of course makes sense because we all know the SID for each user, right?  There are also several formatting issues with exporting to CSV and, long story short, it just didn’t work.  You’re welcome to try that on your own if you think I am crazy.  What I really needed was a way to pull the XML params out of the detail section of each event and export those to CSV.  To accomplish this, we turn to our friend, PowerShell.

$PrinterList = @()
$StartTime = "04/10/2014 12:00:01 PM"
$EndTime = "04/18/2014 6:00:01 PM"
$Results = Get-WinEvent -FilterHashTable @{LogName="Microsoft-Windows-PrintService/Operational"; ID=307; StartTime=$StartTime; EndTime=$EndTime;} -ComputerName "localhost"

ForEach ($Result in $Results) {
    # Pull the event detail XML so we get real values instead of SIDs
    $PropertyData = [xml]$Result.ToXml()
    $PrinterName = $PropertyData.Event.UserData.DocumentPrinted.Param5
    $hItemDetails = New-Object -TypeName psobject -Property @{
        DocName     = $PropertyData.Event.UserData.DocumentPrinted.Param2
        UserName    = $PropertyData.Event.UserData.DocumentPrinted.Param3
        MachineName = $PropertyData.Event.UserData.DocumentPrinted.Param4
        PrinterName = $PrinterName
        PageCount   = $PropertyData.Event.UserData.DocumentPrinted.Param8
        TimeCreated = $Result.TimeCreated
    }

    $PrinterList += $hItemDetails
}

#Output results to CSV file
$PrinterList | Export-Csv -Path "C:\PrintAudit.csv"


This script runs locally on the print server, pulls all the info I needed, and dumps it into a CSV that can then be fondled within Excel.  Optionally, you can run this remotely by changing the -ComputerName param at the end of line 4.  Of course, you do need to modify the start and end times to your needs.
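If you want a quick summary before handing the CSV over, something along these lines would roll it up per user (a sketch; the column names match the script above, and the output path is the same assumed C:\PrintAudit.csv):

```powershell
# Top users by number of print jobs recorded in the audit CSV
Import-Csv "C:\PrintAudit.csv" |
    Group-Object UserName |
    Sort-Object Count -Descending |
    Select-Object Name, Count
```

Swap UserName for MachineName or PrinterName to slice the same data by workstation or print queue.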