Ask Premier Field Engineering (PFE) Platforms

How a DFS Namespace change ruined my morning…


Hello! Jake Mowrer here to share an “experience” with you all that will hopefully help you avoid running into a nightmare the next time you add a DFS Namespace (DFS-N) server. I find that DFS-N in customer environments works so well that it requires very little maintenance. This tends to lead to admins not wanting to touch the environment unless they really have to, which in turn leads to “We don’t look directly into the eyes of that system; it just works, we’re not sure how, and we’re not sure who manages it.”

Ultimately, you will have to add systems to host DFS-N, whether due to an OS upgrade, a new AD site rolling out, or simply a need to scale out a bit. This is what my client and I were doing late one Tuesday night, and it seemed to go well, until the calls started pouring into the helpdesk the next morning.

We were adding a Windows Server 2008 R2 server into the domain-based DFS-N, which until then was entirely Windows Server 2003. The reason we were making this change was that a site was complaining about slow DFS Namespace enumeration. Upon further investigation, we found this site had to hop over two sites to get to the nearest DFS-N server. So, easy enough, we figured we would add a DFS-N server in a closer site to improve the enumeration time. Here were the steps we used:

1) From a Windows 7 client with the DFS File System RSAT tools installed or a Windows XP machine with the Windows XP Support Tools installed, we ran a DFSUTIL /PKTINFO to confirm what server the client was using for the DFS-N prior to the change. Here was the relevant output:

a. [Screenshot: DFSUTIL /PKTINFO output]

2) Using the DFS console on the newly installed Windows Server 2008 R2 machine, we added the namespace to the new DFS-N server:

a. [screenshot]

b. [screenshot]

c. We clicked Edit Settings to make sure All Users had Read access to the share.

d. Clicked OK and it added successfully.

3) We then did a DFSUTIL /PKTFLUSH on the same client, accessed the namespace, then ran a DFSUTIL /PKTINFO, and everything looked fine:

a. [Screenshot: DFSUTIL /PKTINFO output after the change]

4) The results showed it came up in 2 seconds vs. 20 seconds, so we were happy.
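For reference, the client-side commands from steps 1 through 3 boil down to the following (a quick sketch; run them on the test client, and run the second /pktinfo after browsing the namespace again):

dfsutil /pktinfo

dfsutil /pktflush

dfsutil /pktinfo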

We all went to sleep.

Winter is coming, and so are the helpdesk tickets.

I woke up to my phone ringing and it was the morning shift admin indicating there was a DFS namespace issue. The symptom was that Windows XP clients were receiving this error when accessing a subfolder in the DFS Namespace:

“This file does not have a program associated with it for performing this action. Create an association in the Folder Options control panel.”

“What the heck kind of error is that when trying to open a folder?!?!” I thought to myself.

Here are the things we checked:

1) Ensured the server name was entered correctly in the DFS console; it was.

2) Ran another pktflush on the client; it didn’t resolve the issue.

3) Ensured the share existed on the serveralotcloser machine; it did.

By this time we were about an hour into the outage and running out of time to find root cause; eventually we would need to back out the change.

Before we did, I had the admin capture a network trace using Network Monitor 3.4 (shameless plug) and repro the issue. We removed the new server from the namespace and the problem was resolved. But how are we going to get these servers added with this issue lingering out there?

I took some time to wake up, eat a Cliff bar, and drive into their office. I looked over the network trace and here is what caught my eye:
[Screenshot: Network Monitor trace of the failed folder access]

That looks like a permissions issue, but I checked the share permissions, so it had to be NTFS!

Researching past cases, I found one where my friend David Everett from our DS support team used iCACLS to look at the permissions.

I ran the iCACLS command against the existing servers in the DFS-N (i.e., the working servers) and here is what they looked like:

BUILTIN\Administrators:(OI)(CI)(F)

NT AUTHORITY\SYSTEM:(OI)(CI)(F)

BUILTIN\Administrators:(F)

CREATOR OWNER:(OI)(CI)(IO)(F)

BUILTIN\Users:(OI)(CI)(RX)

BUILTIN\Users:(CI)(AD)

BUILTIN\Users:(CI)(WD)

I saw his comment in the case that if the permissions were inherited they wouldn’t replicate.

We checked the c:\dfsroots directory on the server that we removed (it remained there even after removing the server from the DFS-N configuration) and here is what showed up:

BUILTIN\Administrators:(OI)(CI)(F)

NT AUTHORITY\SYSTEM:(OI)(CI)(F)

BUILTIN\Administrators:(F)

CREATOR OWNER:(OI)(CI)(IO)(F)

Again, all inherited, but notice that the BUILTIN\Users group is not listed. Where were these permissions coming from? We never set them when adding the server to the DFS-N root. And why was BUILTIN\Users missing?

<Pause for suspense>

They were coming down from the root of the C: drive, meaning the customer had changed the default permissions on the C: drive in their build process. This filtered down to the c:\dfsroots directory thus “softly” denying “Users” access to the folder. By default on Windows Server 2008 R2, BUILTIN\Users (or computername\Users if you are not on a DC) should have Read/Execute/List Folders on the C: drive. To harden security, the customer removed this ACE on their Windows Server 2008 R2 build.

To fix this on the server we planned on adding to the DFS-N root, we added the BUILTIN\Users group and gave it Read/Execute to the C:\dfsroots directory. We could have added it back to the root of the C: drive but I didn’t want to make that big of a change. Once we added the server back into the DFS-N root we tested using the same method above and it worked as expected. Problem resolved!
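If you prefer the command line for the check and the fix, icacls can do both; a sketch using the default c:\dfsroots path (the grant includes inheritance flags so the namespace folders underneath pick it up):

icacls C:\DFSRoots

icacls C:\DFSRoots /grant "BUILTIN\Users:(OI)(CI)(RX)"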

 

To wrap things up:

I hear all the time that hardening a server also adds complexity; this is one prime example.

I asked myself, “So what should we have done differently to make the change less impactful?” Here’s my “should have” list:

1) We should have tested with an account that had equal privileges to one that was actually used in the field to access the namespace. I thought we were, but I never asked the customer to confirm, shame on me.

2) We should have checked the NTFS permissions for c:\dfsroots on the new DFS-N root servers after making the change.

I hope this write up helps add one more thing to your checklist when adding a server to your DFS Namespace. Have a great week!

Jake Mowrer


Who Moved the AD Cheese?


Sometimes, we Microsoft engineers get called into a 'forensics' type situation to help a customer try to answer the "W" questions - where someone (WHO?) did something (WHAT?) at some point (WHEN?) in Active Directory (AD) or some other aspect of a Windows infrastructure. Usually, if we get the call, the change had a big (sometimes catastrophic) negative impact on the company's business or operations.

Depending on how auditing was setup before the event, we may or may not be able to help answer those questions.

This post provides details on how I set up a Windows Server 2008 R2 Active Directory environment for effective auditing of certain AD changes. The changes I chose to audit for this post are a direct result of customer incidents and trying to answer those "W" questions. Some of those incidents resulted in massive outages, such as a mis-coded script that deleted a root-level OU and all of its contents, taking thousands of user accounts with it. Of course, there are additional items that can be audited such as:

  • Creating and/or deleting objects - User Accounts, Site Links, Sites, etc
  • Editing/deleting files and folders
  • Users logging in and/or logging out of the Domain – this one can be tricky to pin down
  • FSMO Role transfers, Directory Service Restore Mode password changes
  • Domain/Forest Functional Level changes

This is a post focused on AD object auditing, so I'm not going to cover user logon/log off, file server access or other types of auditing (so little time; so many audit options).

IMPORTANT NOTES AND DISCLAIMERS:

  • The event details and auditing settings in this post are specific to Windows Server 2008 R2; they may not apply, or may differ, in a Windows 2000 or Windows Server 2003 AD.
  • !!**WARNING**!! Improper auditing can, among other things:
    • SWAMP your DCs and other servers – as with anything, vet this information out as I did (in a lab) and proceed with caution.
    • SWAMP your Security Event Logs on your DCs or other servers, overwriting critical data required for audit compliance – proceed with caution.
    • SWAMP your alerting and/or audit collection system and cause Alert storms – proceed with caution.
  • Auditing is not a 'black or white' technology in Windows and there isn't always a clear answer to the "W" questions, even with auditing enabled.
  • I chose not to care about failure to make changes – I only cared about successfully making changes, so I enabled SUCCESS audits but not FAILURE audits. This can help to reduce auditing noise. Some would say this reduces visibility into potential denial of service (DOS) attacks.
  • Turn on auditing at the proper levels. This, too, can help to reduce auditing noise:
    • The AD object level –
      • On a partition (i.e. the configuration or domain partition object)
      • On a certain OU(s) or sub OU(s)
      • Other AD objects, such as a certain group or service account
    • The AD object inclusion level –
      • This and every object below - "This object and all descendant objects"
      • Only instances of the specific object type = "All descendant OUs"
  • Consider not enabling auditing for TEST/DEV objects, OUs, PCs, servers, etc.
    • This can help to reduce auditing noise due to frequent changes in lab environments.
  • This post only brushes the surface of Auditing in Active Directory and is by no means 'all there is.'

Auditing in AD has come a long way since Windows 2000, but many customers haven't taken (or had) the time to set it up so that valuable data can be derived from the infrastructure. As a result, those WHO/WHAT/WHEN questions for sensitive and/or unexpected/unplanned changes or deletions often cannot be answered with empirical evidence.

OK, let's do some stuff!!

Several things need to be addressed before AD auditing can be fruitful.

  • AD events occur on Domain Controllers; hence, we need to enable Advanced Audit Policy settings on the DCs. In my lab, I set these options in the Default Domain Controllers GPO:

     

    Here's the relevant output of AUDITPOL /get /category:* from the DC:

    Here's the setting which forces the newer granular Audit settings to prevent potential conflicts with legacy Audit settings. See the links for a further discussion of this setting: http://technet.microsoft.com/en-us/library/dd408940(v=WS.10).aspx

    http://technet.microsoft.com/en-us/library/dd772710(v=WS.10).aspx

     Here's one of the audit events for enabling "Success" on Directory Service Changes above (this is audited/logged by default).

     Here's a screenshot of the Default Domain Controllers GPO in my lab after my changes:
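     If you'd rather spot-check this from the command line than hunt through the GPO, AUDITPOL can dump just the relevant category or subcategory on a DC (same tool as above, narrower scope):

     auditpol /get /category:"DS Access"

     auditpol /get /subcategory:"Directory Service Changes"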

  • We need to create one or more System Access Control List entries (SACLs) for what we want to audit.
    • IMPORTANT – if you enable the above Audit Policy settings but don't also create SACLs, you won't get any audit events from those Audit Policies. I've seen this unpleasant 'surprise' with customers, too.
    • This is what I set in my lab for this post - adjust to meet your environment's needs/specifics:
      • Open AD Users and Computers MMC (DSA.MSC)
      • Right-click the Domain or the target AD Object > click Properties

 

      • Click the Security tab > Click the Advanced button

 

      • Click the Auditing tab
      • Click Add to begin adding SACL entries

 

SACL Entries

Activity to Audit - Create and Delete Organizational Units (OUs)

      • EVERYONE > CREATE Organizational Unit objects > DOMAIN and all descendant objects

 

      • EVERYONE > DELETE > Descendant Organizational Unit objects

 

 Activity to Audit - Create and Delete Computer Accounts (including a Move)

      • EVERYONE > CREATE Computer objects > DOMAIN and all descendant objects

 

      • EVERYONE > DELETE > Descendant Computer objects

 

 Activity to Audit - Create and Delete Group Policy Objects (GPOs)

      • EVERYONE > CREATE Group Policy Container objects > DOMAIN and all descendant objects

 

      • EVERYONE > DELETE > Descendant Group Policy Container objects

 

Activity to Audit  - Link and unlink GPOs to OUs

      • EVERYONE > WRITE GPLink > DOMAIN and all descendant objects

 

Activity to Audit – Edit Group Memberships and/or Delete Groups

      • EVERYONE > Write all properties + Delete > Descendant Group objects

 

 

SUMMARY SACL LIST TABLE

 

SACL list for the Domain - After the changes to my lab
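If you would rather script these SACL entries than click through DSA.MSC, here is a minimal sketch using the Active Directory module's AD: drive. This is an assumption-laden example, not the exact method above: it assumes a Windows Server 2008 R2 (or later) box with the AD module, an account holding the "Manage auditing and security log" right, and it only adds the "EVERYONE / Create Organizational Unit objects / Success" entry from the table; the other entries follow the same pattern.

##################################
Import-Module ActiveDirectory
$domainDN = (Get-ADDomain).DistinguishedName

# Look up the schema GUID for the organizationalUnit class (used as the "object type" in the audit entry)
$schemaNC = (Get-ADRootDSE).schemaNamingContext
$ouClass  = Get-ADObject -SearchBase $schemaNC -LDAPFilter "(lDAPDisplayName=organizationalUnit)" -Properties schemaIDGUID
$ouGuid   = [System.Guid]$ouClass.schemaIDGUID

# Build a SUCCESS audit rule: Everyone / Create Organizational Unit objects / this object and all descendants
$everyone = New-Object System.Security.Principal.NTAccount("Everyone")
$rights   = [System.DirectoryServices.ActiveDirectoryRights]::CreateChild
$inherit  = [System.DirectoryServices.ActiveDirectorySecurityInheritance]::All
$rule     = New-Object System.DirectoryServices.ActiveDirectoryAuditRule($everyone, $rights, [System.Security.AccessControl.AuditFlags]::Success, $ouGuid, $inherit)

# Read the current SACL from the domain head, add the rule, and write it back
$acl = Get-Acl -Path "AD:\$domainDN" -Audit
$acl.AddAuditRule($rule)
Set-Acl -Path "AD:\$domainDN" -AclObject $acl
##################################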

  • You need to consider the increased amount of data gathered as it relates to the size of the Security Event Log on your DCs. You may need to increase the size of the Security Event Log so the data doesn't roll through the Log before you even know you need it.
  • Some customers enable auditing for Everyone at the Domain level, with all descendent objects, capturing Success and Failure events on everything and think they're all set. However, when they go into the Security Event Log "looking for answers," they're stunned to realize their Security Event Logs wrap in less than 2 hours and the event data they need from the "WHAT THE &*$#@ ?!?" this morning is no longer available.
    • I'm stating the obvious here, but this depends on numerous variables such as:
      • How many objects are in the environment
        • An AD with 5 users will produce a lower volume of Audit data than an AD with 50,000 users. Realize that the default Event Log sizes are the same in both cases, though.
      • How many items are being audited
        • Auditing 4 OUs for deletions will produce a lower volume of Audit data than auditing every attribute on every object in AD for success and failure.
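On the log-size point above, you can check and raise the maximum Security log size from the command line as well as through Group Policy. For example (the value shown is 1 GB, expressed in bytes; size it for your own environment):

wevtutil gl Security

wevtutil sl Security /ms:1073741824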

      

Scenario – Linking a GPO to an OU

If someone links a GPO to an OU, it could produce dramatic results on the contents of that OU, including systems or users falling out of audit compliance. We want to be able to determine the who/what/when for the change.

  1. Event ID 5136 – A directory service object was modified.
  2. This example lists the OU DN path and the linked-GPO's GUID
    1. What OU had the link added?
      1. Object "DN:OU=SERVICE ACCOUNTS,OU=-PRODUCTION OU….,DC=LAB"
    2. What GPO was linked to the OU?
      1. Attribute:
        1. LDAP Display Name "gPLink"
          1. Value: <GUID of the GPO>
    3. How is this differentiated from removing a GPO Link from an OU?
      1. Operation Type: Value Added

 

NOTE: In the screenshots I've included, the relevant information to help answer the "W" questions is called out via the red boxes. I labeled this first screenshot with Who/What/When text and arrows, too, but for clarity, on the rest of the screenshots, I only used the red boxes.

You can use a DSQUERY one-liner to derive the Display Name from the GPO GUID, then use GPMC to review the new GPO, if needed.

dsquery * "cn={E83C3E6F-2864-46CD-B6C1-C29CE4D04A88},cn=policies,cn=system,DC=DOMAIN,DC=LAB" -scope base -attr displayname
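If the Group Policy module is available (Windows Server 2008 R2 with the GPMC feature installed), the same lookup can be done in PowerShell; the GUID below is just the one from my lab event:

Import-Module GroupPolicy

Get-GPO -Guid "E83C3E6F-2864-46CD-B6C1-C29CE4D04A88" | Select-Object DisplayName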

 

 

 

SCENARIO – Deleting a GPO Link from an OU

If someone unlinks a GPO from an OU, it could produce drastic results on the contents of that OU, including systems or users falling out of audit compliance. We want to be able to determine the who/what/when for the change.

  1. Event ID 5136 – A directory service object was modified.
  2. This example event lists the OU DN path and the un-linked-GPO's GUID
    1. What OU had the link deleted?
      1. Object "DN:OU=SERVICE ACCOUNTS,OU=-PRODUCTION OU ….DC=LAB"
    2. What GPO was unlinked from the OU?
      1. Attributes to review:
        1. LDAP Display Name "gPLink"
          1. Value: <GUID of the GPO>
        2. Use the same DSQUERY one-liner from the prior example.
    3. How is this differentiated from linking a GPO to an OU event?
      1. Operation Type: Value Deleted

 

 

SCENARIO – Deleting a GPO

If someone deletes a GPO from AD, it could produce drastic results on the contents of the Site, Domain or OU(s) to which it is linked, including systems or users falling out of audit compliance. We want to be able to determine the who/what/when for the change.

  1. Event ID 5141 – A directory service object was deleted.
  2. This sample event lists the DN path (which is also the GUID) for the GPO that was deleted
    1. I could not find an event that listed the name of the GPO that was deleted, and the DSQUERY command from before won't work because the object is gone.
    2. I checked in the Deleted Objects Container to see if I could get the Name of the GPO from there but that attribute (along with most others) is cleared upon deletion – no luck.
    3. However, I was able to look in my nightly GPO Backups (you do backup your GPOs, right?) and found the GUID for the deleted GPO and got the Name from the GPOReport file that is generated during the GPO backup job.

 

SCENARIO – Delete an OU

Deleting an OU (and everything in it) can produce drastic results. We want to be able to determine who made the change.

  1. Event ID 5141 – A directory service object was deleted
  2. This example event lists the DN path of the OU deleted.

 

 

SCENARIO – Moved a computer account from one OU to another OU

This can produce drastic results on the system moved (i.e. a critical application server), including systems or users falling out of audit compliance. We want to be able to determine the who/what/when for the change.

  1. Event ID 5139 – A directory service object was moved.
    1. This sample event lists the old and new DN path for the Computer Account.
    2. How do we know it was a Computer Account move?
      1. Object > Class: computer

 

 

SCENARIO – An OU was moved (possibly drag-and-dropped by accident?)

Moving an OU (and its contents) can produce drastic results on the systems/users in the OU(s). This includes systems or users falling out of audit compliance. We want to be able to determine the who/what/when for the change.

  1. Event ID 5139 – A directory service object was moved.
    1. This sample event lists the old and new DN path for the OU.
    2. How do we know it was an OU move?
      1. Object > Class: organizationalUnit

 

SCENARIO – Editing a Group's membership

If someone edits membership to sensitive Groups in AD (such as Domain Admins, Enterprise Admins or others), it could produce drastic results, including systems or users falling out of audit compliance. We want to be able to determine the who/what/when for the change.

  1. Event ID 5136 – A directory service object was modified.
  2. This sample event lists who was added or removed from the group
  • Group Member Added
    • Operation > Type > Value Added
  • Group Member Removed
    • Operation > Type > Value Deleted

 

 

SCENARIO – Deleting a Group

If someone deletes a critical Group in AD, it could produce drastic results, including systems or users falling out of audit compliance. We want to be able to determine the who/what/when for the change.

  1. Event ID 5141 – A directory service object was deleted.
  2. This sample event lists the DN path of the group deleted.
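If you prefer pulling these scenario events out of a DC's Security log without scrolling through Event Viewer, something like the following works against a Windows Server 2008 R2 DC (a sketch; the DC name and time window are placeholders):

Get-WinEvent -ComputerName DC01 -FilterHashtable @{LogName='Security'; Id=5136,5139,5141; StartTime=(Get-Date).AddHours(-4)} | Select-Object TimeCreated, Id, Message | Format-List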

 

BONUS AD Auditing nuggets

If you've stuck with me this long, you must enjoy this stuff as much as I do! So, just between us, here are a few bonus Events for AD environment 'awareness':

Domain Functional Level changed (two events)

  • Directory Services Event Log Entry

  

  • Security Event Log Entry

   

Forest Functional Level Changed (two events)

  • Directory Services Event Log Entry

  • Security Event Log Entry

 

Directory Services Restore Mode DC Boot-up

RID FSMO Role Transfer

From the prior FSMO Role DC – Directory Service Event Log – notice the "User"

 

From the new FSMO target DC – Directory Service Event Log – notice the "User"

 

Domain Naming Master FSMO Transfer

 

PDCE FSMO Transfer

 

Infrastructure Master FSMO Transfer

 

Schema FSMO Transfer

 

Links:

Eric Fitzgerald's Blog - http://blogs.msdn.com/b/ericfitz/

TechNet

Advanced Auditing - http://technet.microsoft.com/en-us/library/cc731607(v=WS.10).aspx

AD Object Auditing - http://technet.microsoft.com/en-us/library/cc773209(v=WS.10).aspx

Advanced Security Auditing - http://technet.microsoft.com/en-us/library/dd408940(v=WS.10).aspx

Block accidental deletion of OUs – http://technet.microsoft.com/en-us/library/ee617237.aspx

  • Look into the "ProtectedFromAccidentalDeletion" switch
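    For example, with the AD module you can find OUs that are not protected and turn the protection on (a sketch; test in a lab first):

    Import-Module ActiveDirectory

    Get-ADOrganizationalUnit -Filter * -Properties ProtectedFromAccidentalDeletion | Where-Object { -not $_.ProtectedFromAccidentalDeletion } | Set-ADOrganizationalUnit -ProtectedFromAccidentalDeletion $true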

 

Advanced Auditing FAQ - http://technet.microsoft.com/en-us/library/ff182311(WS.10).aspx

  • Compares legacy and new Audit Policies and how they interact

 

Related Ask PFE blog post – Part One of a two-part series on a real-world forensics scenario –

Go-Dos

  • Get into a lab and run through some of this – a small lab running one DC VM is all you need.
  • Once you're comfortable with the inputs/outputs in your lab, collaborate with your IT peers/teams and consider rolling out some changes to your Production environment.
  • Consider combining this information with Event Forwarding/Subscriptions for small-scale environments or a true Audit/Alerting/Monitoring solution such as Ops Manager to achieve near real-time Alerting delivered to a Console and/or a monitored mailbox and ….

KNOW what's happening in your AD!

 

Additional Screenshot - showing actual User ID

DHCP, Dynamic DNS, and DCs: How about Some PowerShell to Spice Up a Mind-Numbing Topic?


 

If the title of this blog hasn’t already put you off, you’re probably interested in the interaction between Microsoft DNS and DHCP services. Specifically, you should understand how Microsoft DHCP servers can be configured to dynamically register A and PTR records in DNS on behalf of their clients.

The default behavior of a Microsoft DHCP server is to perform dynamic DNS registration on behalf of a client only if the client requests it. The default behavior of a relatively modern Microsoft client (XP or higher) is to register its A record itself, and to allow the DHCP server to register its PTR record.

Things become slightly more complex when you use secure-only dynamic DNS, and I hope you do use secure-only dynamic DNS. Unfortunately, secure-only is not the default configuration for a DNS zone, so you should verify. If DCs run DHCP and perform dynamic DNS registration on behalf of their clients, the potential problem is that DCs are over-privileged with respect to secure dynamic DNS: a DC could, theoretically, hijack any DNS record on behalf of its clients. Thus, you should use alternate credentials for dynamic DNS registration on the DHCP server.

If you’re really into this stuff, fellow PFE Karam Masri has written a nice, deep in the weeds blog about how DHCP and secure dynamic DNS registrations work (or don’t always work) on Domain Controllers.

At the end of the day, there is some very simple guidance for running DHCP on Domain Controllers – configure DHCP with alternate credentials for dynamic DNS registration. How? Simply use the DHCP management tool, open the DHCP server properties (or IPv4 properties in Windows 2008 R2), then follow three simple steps.

[Screenshot: DHCP console – configuring DNS dynamic update registration credentials]

Or if you’re a command-line admin, you can use netsh:

Netsh.exe dhcp server \\servername set dnscredentials username domainname password

Where servername is the name of the DHCP server, username is the name of the user account, domainname is the domain where the user account resides, and password is the password associated with the account.

If you just want to see the credentials already configured for a DHCP server:

Netsh.exe dhcp server \\servername show dnscredentials
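For what it's worth, on Windows Server 2012 and later the DhcpServer PowerShell module has equivalents for checking and setting these credentials; a sketch (the server name is a placeholder):

Get-DhcpServerDnsCredential -ComputerName dhcp01.contoso.com

Set-DhcpServerDnsCredential -ComputerName dhcp01.contoso.com -Credential (Get-Credential)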

Some Basic Best Practices

  • If you’re running dynamic DNS, be sure your zones are allowing “Secure Only” dynamic updates. If you are allowing non-secure and secure updates, alternate credentials are irrelevant and you’ve got bigger security concerns.
  • When you provision a domain account for alternate credentials, DO NOT grant the account any special privileges. You’ve already got too many service accounts that are over-privileged (and don’t think I don’t know it). Don’t add to the problem.
  • Even if you’re running DHCP on member servers (not Domain Controllers), you may want to consider alternate credentials for dynamic DNS registration. This makes for a nice transition when you’ve got to replace your existing DHCP server with a new one. If you use the same account on the old and new DHCP servers, ownership of the DNS records will not have to change.

Didn’t You Mention Something About PowerShell?

The real point of this blog was to help you check the alternate credentials for Dynamic DNS, across all of your domain controllers. Manually checking credentials on more than 2 DCs can be a real pain. In fact, I often run into customers who have dozens, if not hundreds of Domain Controllers running DHCP.

 

Enter the PowerShell Script

If you’ve seen some of my other scripts, some of the code in this new script may look familiar. I basically do the following:

1. Discover DCs (the old-fashioned way, without using the AD cmdlets, because they require a 2008 R2 DC):

##################################
Function EnumerateDCs
{
     $arrServers =@()
     $rootdse=new-object directoryservices.directoryentry("LDAP://rootdse")
     $Configpath=$rootdse.configurationNamingContext
     $adsientry=new-object directoryservices.directoryentry("LDAP://cn=Sites,$Configpath")
     $adsisearcher=new-object directoryservices.directorysearcher($adsientry)
     $adsisearcher.pagesize=1000
     $adsisearcher.searchscope="subtree"
     $strfilter="(ObjectClass=Server)"
     $adsisearcher.filter=$strfilter
     $colAttributeList = "cn","dNSHostName","ServerReference","distinguishedname"
     Foreach ($c in $colAttributeList)
     {
          [void]$adsiSearcher.PropertiesToLoad.Add($c)
     }
     $objServers=$adsisearcher.findall()
     ForEach ($objServer in $objServers)
     {
          $serverDN = $objServer.properties.item("distinguishedname")
          $ntdsDN = "CN=NTDS Settings,$serverDN"
          if ([adsi]::Exists("LDAP://$ntdsDN"))
          {
               $serverdNSHostname = $objServer.properties.item("dNSHostname")
               $arrServers += "$serverDNSHostname"
          }
          $serverdNSHostname=""
     }
     $arrServers
}
##################################

2. Walk through the DCs and use WMI to discover whether or not they are running DHCP:

##################################
Function isRunningDHCP
{
     Param($computer)
     $DHCP = "FALSE"
     $Query = "SELECT Name, Status FROM Win32_Service WHERE (Name = 'DHCPServer') AND  (State = 'Running')"
     Try
     {
          $DHCPRunning = Get-WmiObject -Query $Query -ComputerName $Computer -EA Stop
          If ($DHCPRunning){$DHCP = "TRUE"}
     }
     Catch {$DHCP = "FALSE"}
     Finally {$DHCP}
}
###################################

3. Use Netsh and some string manipulation to determine if DHCP is using alternate credentials for DDNS:

###################################
Function GetAltCreds
{
     Param($computer)
     $AltCreds = $Null
     $Query = Netsh dhcp server "\\$computer" show dnscredentials
     $username = $Query[2].substring(14)
     $domain = $Query[3].substring(14)
     If ($username.length -eq 0){$AltCreds = "NULL"}
     Else {$AltCreds = "$domain\$username"}
     $AltCreds
}
##################################

4. Put it all together and report back the findings. Both on-screen and logged into a CSV file.
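The glue for this last step looks roughly like the sketch below (simplified; the attached script has more error handling and output formatting):

##################################
$results = @()
ForEach ($dc in (EnumerateDCs))
{
     $dhcp  = isRunningDHCP -computer $dc
     $creds = ""
     If ($dhcp -eq "TRUE") {$creds = GetAltCreds -computer $dc}
     $results += New-Object PSObject -Property @{DC = $dc; RunningDHCP = $dhcp; AltCreds = $creds}
}
$results | Format-Table DC, RunningDHCP, AltCreds -AutoSize
$results | Export-Csv .\DHCPDynamicDNS.csv -NoTypeInformation
##################################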

[Screenshots: on-screen and CSV output from the script]

 

Now You Try

Simply download and run the attached script (DHCPDNSCreds.ps1). It requires administrative privileges against your DCs. It will discover all DCs in your forest and report which are running DHCP and what alternate credentials (if any) are being used. Then analyze the output (either on-screen or in the DHCPDynamicDNS.csv file) and look for DCs where Running DHCP=TRUE and AltCreds is blank.

I hope you enjoy the script, and come back for more PowerShell goodness in the future.

Doug Symalla

Update (16 August 2012): Some users reported issues with the script when the NetSH DHCP commands are not available. In that case, the script will execute, but bleed red everywhere as it errors. I added some error control for the case when the NetSH DHCP commands are not available; in this situation, you should install the DHCP remote server administration tools (Windows Server 2008 R2) before running the script. The updated script is attached.

 

How to Track the Who, What, When and Where of Active Directory Attribute Changes – Part II (The Case of the Mysteriously Modified UPN)


Hello, Ray Zabilla and Rick Bergman again. As promised in our previous post on this topic, we will go into the details of how we created the script, the challenges we had during testing, and what the final code looks like. We are even so generous that we are going to share the scripts with you too. :)

How to Track the Who, What, When and Where of Active Directory Attribute Changes – Part I (The Case of the Mysteriously Modified UPN)

Quick Review – The story you’re about to hear is true and the names have been changed to protect the innocent…

Some unknown process, running on some unknown computer, at some unknown time was changing the UPN on the Active Directory user accounts.

Since Contoso is running Windows Server 2003 R2 X64 Domain Controllers, we recommended they search the Security event log for Event ID 642 which indicates a successful “User Account Change”. The Event ID includes information that identifies the attribute which was changed and the “calling account” initiating the change. This means that each domain controller will have to be scanned for the Event ID 642, because you never know on which writable DC the change is going to be made.

Contoso uses an enterprise auditing and collection system, so the logical thing to do was to use that tool to search for Event ID 642 rather than searching each DC independently. Contoso IT asked their security auditing team to pull all Event ID 642 entries for all DCs in the environment from the enterprise collection system so we could search through them. This effort turned out to be unsuccessful: for some reason the archived logs did not contain all the data, and the partial data they were able to provide did not contain any of the specific UPN change events we were hoping to find. We discussed Microsoft's System Center Operations Manager Audit Collection Services (SCOM ACS) as a solution, but the customer declined because it was not a strategic direction for them.

The Solution – Version 1

It was stated in the previous post that the domain controllers had a security event log size of 180 MB, which meant it took less than 15 minutes for the event log to wrap. The security event log size needed to be increased on each DC in order to buy additional time to see if it would be possible to capture the Event ID 642. It was good that Contoso was running Windows Server 2003 x64, because the 64-bit OS can handle larger event log sizes.

Change the Security Event File Size

The first thing the Contoso IT team did was increase the Security event log size to 2 GB. We had researched and found there had been a few incidents reported on Windows Server 2003 x64 domain controllers when the security log files were set to the maximum of 4 GB, so we recommended Contoso set the log file size to 2 GB, which should give us enough data to capture the 642 events while staying well below the maximum. After increasing the log file size, some quick analysis found they now had 3 to 3.5 hours before the security event log wrapped. Our thought was that this should give us enough time to find Event ID 642.

One odd situation occurred on one of the Domain Controllers that had its Security event log size changed by the GPO. The properties showed that it was properly set at 2 GB and the size of the file on disk was 2 GB, but there were not as many entries in the event log as on the others. When changing the size of the Event Viewer logs, best practice is to use the "Clear Log" button to allow the event log to properly resize.

Get Event ID 642 from the Domain Controllers

The information provided by REPADMIN /showobjmeta meant we should only have to search the Security log on the domain controller where the user object's UPN was changed to find the Event ID 642. REPADMIN /showobjmeta gives the precise time when the change occurred and on which domain controller, allowing our search to be specific and limited. Once the Event ID 642 was found in the appropriate security event log, we would know the AD account that made the change and could identify 4 of the 5 key variables (who, where, when, what), which would hopefully provide enough information to lead us to the process making the change.

[screenshot]

Figure 1 – REPADMIN /showobjmeta output

In the example shown above, Figure 1 - REPADMIN /showobjmeta output, you can see the change that was made to test5455 on DC02NA which we got from the metadata. It also shows the other key piece of information, the AD account initiating the change, which in this case is Administrator.

Armed with this knowledge, we started down the path of creating PowerShell scripts to identify user accounts where the UPN had been set to an incorrect value and create a “Bad UPN” report/log file with the associated replication metadata. We quickly realized it was going to take an enormous amount of time to do this for all users; we needed a subset of users to focus on so the process would be quicker.

The Contoso IT team came up with 1600 users that they would watch for changes to the UPN, and we used that as the input file for the script. Great, but how do we script this in PowerShell? We looked at the output from REPADMIN /showobjmeta and quickly surmised that it would take a lot of work to parse. How could we do all this in PowerShell?

We decided to use the PowerShell equivalent of REPADMIN /showobjmeta, the GetReplicationMetadata method of the System.DirectoryServices.ActiveDirectory.DomainController object, for ease of handling the data. We opened up our favorite search engine, www.bing.com, to start looking for examples and we found one really good example.

AD Replication Metadata (when did that change?)

http://bsonposh.com/archives/253

We used the sample from the above website, along with some minor changes, to retrieve the AD user object's metadata and get the OriginatingServer and LastOriginatingServer values. See Figure 2.

[screenshot]

Figure 2 – PowerShell REPADMIN /ShowObjMeta example
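Since Figure 2 is a screenshot, here is a rough text sketch of the same idea (the user DN and domain name below are placeholders, not Contoso's real values):

##################################
$userDN  = "CN=test5455,OU=Users,DC=contoso,DC=lab"
$context = New-Object System.DirectoryServices.ActiveDirectory.DirectoryContext("Domain", "contoso.lab")
$dc      = [System.DirectoryServices.ActiveDirectory.DomainController]::FindOne($context)
$meta    = $dc.GetReplicationMetadata($userDN)

# Metadata for just the userPrincipalName attribute
$meta["userPrincipalName"] | Select-Object Name, Version, OriginatingServer, LastOriginatingChangeTime
##################################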

These values along with the UPN would be written out to a log file, “BadUPN”, so we could use them for searching for Event ID 642. From the “Bad UPN” list, we got the list of domain controllers we needed to get the security event log from.

The Solution – Version 2

Reality set in at this point: it made no sense to repeatedly query the same DCs for Event ID 642 when we only needed to get the data once. We changed the script so that only a filtered list of DCs would be queried rather than all DCs.

Filtered Server List approach

We already had the script looping through each user and identifying their metadata. Part of the information being collected was the name of the server. In PowerShell, when looking at the server data, if a server had been removed from the domain for any reason, the server name returned a Null value. Remember that the server names in AD are actually GUIDs and not the FQDN. This meant we needed to handle the Null value to ensure it wasn't included in the array, and also ensure there weren't any duplicate server names in the array. FYI, handling the Null value in the array was not obvious to us right away, and it took a while to figure out why things weren't working as expected in the script. In Table 1 below, you can see what the REPADMIN /showobjmeta output looks like when the DC is missing: the server name looks like "cd5d12e9-ad2e-4e44-a785-f6757f209d4e" when it is missing, and like "Default-First-Site\DC01" when it is there.

We checked whether the returned server name was Null; if it was, it was skipped and not put in the array. If the server name was not Null, we then checked whether the array already contained that server name. If it did not, we added it; otherwise we continued on without adding it. See Figure 3 for the example code.

 

Loc.USN Originating DSA Org.USN Org.Time/Date Ver Attribute
======= =============== ========= ============= === =========
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 objectClass
1133651 Default-First-Site\DC01 1133651 2011-08-11 12:04:10 1 cn
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 sn
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 st
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 title
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 postalCode
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 physicalDeliveryOfficeName
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 givenName
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 instanceType
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 whenCreated
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 displayName
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 department
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 company
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 391356011 2011-06-30 10:54:29 3 homeMTA
118496097 Default-First-Site\DC02 65408863 2011-10-26 20:41:02 3 proxyAddresses
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 streetAddress
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 nTSecurityDescriptor
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 mDBUseDefaults
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 extensionAttribute9
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 mailNickname
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 employeeType
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 name
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255842 2010-05-11 02:07:58 3 userAccountControl
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255839 2010-05-11 02:07:58 1 codePage
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255839 2010-05-11 02:07:58 1 countryCode
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 employeeID
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255840 2010-05-11 02:07:58 2 unicodePwd
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255840 2010-05-11 02:07:58 2 ntPwdHistory
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255843 2010-05-11 02:07:58 3 pwdLastSet
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255839 2010-05-11 02:07:58 1 primaryGroupID
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255841 2010-05-11 02:07:58 1 supplementalCredentials
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 objectSid
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255839 2010-05-11 02:07:58 1 accountExpires
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255840 2010-05-11 02:07:58 2 lmPwdHistory
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 sAMAccountName
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 division
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 sAMAccountType
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 385295851 2010-05-11 02:08:51 1 legacyExchangeDN
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 userPrincipalName
392184935 Default-First-Site\DC01 392184935 2012-01-23 09:13:47 1 lockoutTime
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 objectCategory
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 msNPAllowDialin
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 385295851 2010-05-11 02:08:51 1 textEncodedORAddress
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 385295851 2010-05-11 02:08:51 1 mail
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 departmentNumber
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 middleName
41048248 Default-First-Site\DC01 41048248 2011-09-07 10:18:35 4 msExchPoliciesIncluded
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 391356011 2011-06-30 10:54:29 3 msExchHomeServerName
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 385296006 2010-05-11 02:09:51 2 msExchALObjectVersion
1133651 cd5d12e9-ad2e-4e44-a785-f6757f209d4e 127255838 2010-05-11 02:07:58 1 msExchHideFromAddressLists
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 385295851 2010-05-11 02:08:51 1 msExchMailboxSecurityDescriptor
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 385295851 2010-05-11 02:08:51 1 msExchUserAccountControl
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 385295851 2010-05-11 02:08:51 1 msExchMailboxGuid
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 610906234 2011-03-02 07:42:52 1 msRTCSIP-InternetAccessEnabled
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 610906237 2011-03-02 07:42:52 1 msRTCSIP-PrimaryUserAddress
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 610906233 2011-03-02 07:42:52 1 msRTCSIP-UserEnabled
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 610906238 2011-03-02 07:42:52 1 msRTCSIP-PrimaryHomeServer
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 610906236 2011-03-02 07:42:52 1 msRTCSIP-OptionFlags
1133651 5c66f360-c067-4e66-959a-d11bba47e42c 610906235 2011-03-02 07:42:52 1 msRTCSIP-FederationEnabled

Table 1 - REPADMIN /showobjmeta output Missing DC

 

[screenshot]

Figure 3 – Filtering DCs for the Array
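Since that figure is a screenshot, here is a rough sketch of the filtering logic in text form (variable names here are illustrative, not necessarily the ones in the attached script):

##################################
$arrDCs = @()
ForEach ($entry in $badUPNList)
{
     $originatingServer = $entry.OriginatingServer
     If (![string]::IsNullOrEmpty($originatingServer))
     {
          If ($arrDCs -notcontains $originatingServer) {$arrDCs += $originatingServer}
     }
}
##################################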

Let’s Test the Script

We were pretty confident that the script would work quite well and that it could be run frequently, say once every hour or two. After the first test took over 3 hours to complete against just one DC, we were back to rethinking how to improve the performance of collecting Event ID 642 from the list of DCs.

The Solution – Version 3.0

After seeing how long it took to return results, we did some more thinking, research and testing on how to improve the performance of collecting the event log entries. While discussing the scenario, it made sense to use a multithreaded approach so we could collect from each DC at the same time. Let's rephrase that into tech speak: "It would be really cool if we could do multithreading with PowerShell."

Researching PowerShell and Multi-Threading

We fired up our favorite search engine, www.bing.com, and started looking for ways to collect from all the DCs in the array list the script created. The following sites were extremely helpful in getting us started with the PowerShell V2 Start-Job cmdlet.

Ryan's PowerShell Blog

Multi-Threading in PowerShell V2

http://ryan.witschger.net/?p=22

TechNet Library

Start-Job

http://technet.microsoft.com/en-us/library/dd347692.aspx

PowerShell Multi-Threading using Start-Jobs

PowerShell 2.0 supports a form of multithreading called "Jobs." This would give us the ability to use a parallel rather than a serial approach for gathering the events from the event logs. Great, let's try using "Jobs" to see if it will speed up the security event log collection for the list of DCs. It took a while to figure out the proper syntax to use when collecting the security event log from multiple domain controllers. While working through the syntax, we learned that we needed to use the "-ScriptBlock" parameter to get what we wanted working correctly. One of the syntax tricks with the Start-Job cmdlet is where the "{" is placed. Normally, for readability, the "{" is placed on the next line, but that doesn't work correctly with Start-Job; the curly bracket needs to go on the same line, after the last parameter, for everything to work correctly. The last interesting tidbit was figuring out how to use variables in the Job. After reviewing samples and reading forums, we determined that global variables do not work and we needed to pass them in, which is done with the "-ArgumentList" parameter.

In Figure 4 – Start-Job Code Section, we show the working code we came up with to collect the security event logs from multiple Domain Controllers. There is more to this part of the script than what we have talked about to this point.

[screenshot]

Figure 4 – Start-Job Code Section
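Again, because Figure 4 is a screenshot, here is a rough sketch of the pattern (the DC list, job names and log path are placeholders; note the "{" on the same line as -ScriptBlock and the variable passed in via -ArgumentList, as described above):

##################################
$jobs = @()
ForEach ($dc in $arrDCs)
{
     $jobs += Start-Job -Name "Get642_$dc" -ArgumentList $dc -ScriptBlock {
          Param($computer)
          Get-EventLog -LogName Security -ComputerName $computer -InstanceId 642 |
               Export-Csv "C:\Logs\642_$computer.csv" -NoTypeInformation
     }
}
$jobs | Wait-Job | Out-Null
##################################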

We tested the script and collected data to prove that less time was spent collecting the events from the event logs. The start and stop times for each DC we collected event logs from were written to a log file. When reviewing the log files, we discovered that some DCs were taking longer than others; a couple of DCs were the bottleneck when it came to collecting the security event logs. The only explanation we came up with was that those slower DCs were averaging more security events per second than the rest.

Event Filtering Performance

We checked to make sure we were using the fastest methods possible for collecting events from the security log. We ended up testing WMI and Get-EventLog from PowerShell. We found that if we did any more than just filtering for a single Event ID, using "Where" clauses, it took longer to get the data. The reason it takes longer is that the entire event log is read first and then filtered before the information is dumped to a file.

Through multiple tests we determined that using a simple filter to retrieve only Event ID 642 and placing that data in a log file worked the fastest. One other interesting observation we need to share is how the amount of audit data being logged into the event logs affects the performance of retrieving them. Especially on Windows Server 2003, the more entries being written, the longer it took to retrieve the events from the security log.

Testing & Performance Thoughts

We ran this version through some testing and determined that it did work; it actually helped us catch one account that was making changes to the UPNs. The testing also showed us that we could not run the full version of the script every hour or two like we were thinking. The reality was that the full version needed to be run right after we knew changes were being made to the UPNs.

We also changed the script to allow it to gather the UPN changes only, without going out and pulling the DCs' security event logs. The script can be started in either mode by using a command line parameter when launching it. If the script launches without any command line parameters, it only collects the changed UPN values in a log file. If the 'Full' command line parameter is used, the security event log scan is done too.

Command Line Parameters Examples

O365UPNCheckV4.ps1 – will log only UPN update information and NOT gather the Security event logs

O365UPNCheckV4.ps1 ‘Full’ – will log UPN update information and gather the Security event log

Analysis of the Event logs

A log file is generated for each DC containing the Event ID 642 entries that were collected. These logs can be quite large and would take a long time to review manually. To speed that process up, Ray developed another script that analyzes each of the log files, looking for the Event ID 642 entries where the UPN value was changed. When it finds one, it outputs a line to a .csv file listing the originating domain controller, the date/time, and the new value of the UPN attribute. This gives us one place to look for the "who, when and where" of the UPN changes.

Parallel Tasks

The Contoso IT team was working in parallel to find out what they could about what could be making changes to AD. From the work they had already completed, once they knew the AD account making the change, they were able to identify the offending process in about 15 minutes and get the migration back on track.

PostScript

We hope you found the additional detail helpful on how we approached solving the problem, the challenges we went through, and the script development process we used. The scripts are attached to the blog post, and we hope you find those helpful too.

We hoped you enjoyed this post.

Ray and Rick

Attached Sample Scripts

Roaming AD Clients, with an Updated Script


 

Three months ago I posted some information on AD Sites, Subnets and Roaming Clients. The heart of the blog was a PowerShell script that collected and collated netlogon.log files across all Domain Controllers in the forest to report a list of hostnames and IP addresses that have authenticated from IP addresses with no corresponding subnet defined in Active Directory. I call these roaming clients, because they randomly seek out Domain Controllers, with no sense of closeness.

In the past three months, I’ve fielded some good questions from customers about roaming clients and the PowerShell scripts. Below are some of the questions, my responses and an updated script.

Question:

Do I even need to define any subnets in Active Directory?

Answer:

There is one very specific scenario where you don't need to define any subnets: if you have exactly one site defined in Active Directory. I've got some customers who have a simple Active Directory (for an extranet, for example) with two Domain Controllers (for redundancy) in a single site (usually Default-First-Site-Name). In this case you don't need to define any subnets. All IPs will be associated with that site, and DCs will report no roaming clients in their netlogon.log files. Beware, though: if you define one subnet, you need to define all subnets.

Question:

Getting subnet information from my network team is harder than catching a greased pig. Can I just deploy a “Catch All” subnet to deal with roaming clients?

Answer:

There is a compelling case for using the "Catch All" subnet. To summarize, a "Catch All" subnet is a subnet with a broader scope, which encompasses most or all of your specific subnets. For example, if you have numerous 10.10.x/24 subnets associated with various sites, you could configure a 10.10.0.0/16 subnet and associate it with a major hub site. If you forget to define one of the 10.10.x/24 subnets, the clients in it will automatically be "caught" by the 10.10.0.0/16 subnet; instead of roaming, these forgotten clients will gravitate to the hub site.
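For what it's worth, if you do go the "Catch All" route and have the Windows Server 2012 (or newer) AD module available, creating and associating the subnet is a one-liner; the site name here is a placeholder:

New-ADReplicationSubnet -Name "10.10.0.0/16" -Site "Hub-Site"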

[Diagram: "Catch All" subnet example]

While there’s nothing wrong with a “Catch All”, I’m not a big fan. First, it hides the problem of subnet definitions. By associating roaming clients with the hub site, you will never see them logged in the netlogon.log file, so you will never be able to properly fix them. Second, while AD doesn’t have a problem with overlapping subnet definitions, some AD-aware applications like SCCM don’t like them. An SCCM client in a specific site that is also covered by a “Catch All” may not be able to determine whether it belongs to the specific site or to the “Catch All” site. AD itself will always use the more specific subnet definition when determining which site a client belongs to.

Question:

Since you don’t recommend a “Catch All” and there is no practical way I can keep subnet definitions up-to-date at all times, is there anything else I can do for roaming clients?

Answer:

If your AD environment spans multiple sites, and roaming clients feel the pain of not finding a “close” DC, you can/should use our Branch Office Recommendations. Specifically, you should configure “remote” domain controllers to NOT register generic SRV records. Generic SRV records are used by roaming clients (or non-AD aware clients) to discover DCs. If you configure your remote DCs to NOT register these records, they will only register site-specific SRV records. Thus, only clients in the remote site will find remote DCs. Roaming clients will not find remote DCs, and they will gravitate to hub DCs. Chapter 4 in the Branch Office Guide describes this in more detail, while KB 306602 contains a summary. The beauty of this configuration is that it addresses a number of scenarios where clients might roam, including undefined subnets, in-site DCs being unavailable, or non-site aware applications that look to DNS to discover DCs.
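The knob behind that recommendation is the Netlogon DnsAvoidRegisterRecords setting described in KB 306602 (it can also be delivered via Group Policy). A quick way to see whether a remote DC already has it configured, as a sketch:

Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters' -Name DnsAvoidRegisterRecords -ErrorAction SilentlyContinue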

Question:

I’ve tried using your script to report roaming clients. However, after I add subnets to AD and re-run your script, it still reports the clients as roaming. What am I missing?

Answer:

You’re not missing anything. The script isn’t intelligent enough to distinguish between old events and new events. So it will report clients as roaming as long as there are entries in the netlogon.log, regardless of the date. The new script (FindRoamingClientsv2.ps1) is now date aware. You simply run the script and pass it the number of days in the past you would like to consider. For example, the following will only consider events in Netlogon.log from the past 5 days:

.\FindRoamingClientsV2.ps1 5

Note that you can run the script without the # of days parameter. In that case, it will go back 7 days.

[screenshot]

[screenshot]

Question:

Are there any other improvements to the script that you’d like to mention?

Answer:

The script now reports progress better, so in large environments you can see which DC you are currently collecting logs from. It also shows the number of the current DC and the total number of DCs, so you have an idea of how much longer it will run. I've run the script in environments of 200-500 DCs spanning multiple continents; in those cases, it took from 2 to 6 hours to run.

I hope you enjoy the new script, and it helps you stay on top of AD subnet definitions.

Update (6 August 2012): Some customers informed me the script was not returning the expected data in their environments. We discovered that in some cases the NO_CLIENT_SITE entries in the netlogon.log file contained 6 fields (columns) and in other cases they contained 5 fields. The script assumed these entries would always have 6 fields, with the client name in field 5 and the IP address in field 6. Since this assumption is not valid in all cases, the script has been modified to deal with either case. Specifically, instead of looking at positions 5 and 6 in the array, we look at positions -2 and -1 (the second-to-last and last positions). The attached script contains this update.

 

Doug Symalla

MCM: So You Want to Be an Active Directory Master, eh?


Back in February 2012, I was lucky enough to take part in the Windows 2008 R2 Directory Services Masters class and I promised that I would blog about my experience. Consequently, this will probably turn into another series as I wouldn’t do it any justice by only writing one entry about it.

 

Introduction


For those unfamiliar with our Microsoft Certified Master (MCM) program, think of it like the Cisco CCIE of the Microsoft world. Microsoft was looking for a way to distinguish the breadth of knowledge and experience of select Microsoft engineers beyond the MCSE and hatched a program about 5 years ago, originally called the Ranger Program. It was first started for Exchange engineers and, due to overwhelming demand, branched out to encompass Active Directory, SQL, OCS/Lync, and SharePoint. I originally heard about this “Ranger” accreditation through an Exchange engineer friend of mine. I heard it was a grueling three-week-long class that would test your deepest technical abilities and the strength of your spirit. I immediately knew I had to do it. :) I told my wife that I eventually wanted to be a Ranger; she honestly thought I was changing careers to become a Forest Ranger, made sure to tell her friends about it, and occasionally made jokes about it. Here is more information about the program:

http://www.microsoft.com/learning/en/us/certification/master.aspx#tab1

I contacted my manager and told her about my desire to get into the program and was told that there was a two-year waiting list. I added my name to the list and waited almost 3 years, and even then it took the recommendation of another accredited Master to get my name into the conversation. Nonetheless, I was now a candidate for the class. This didn't mean I would get in, but I was one hurdle down with many more to go.


Once the excitement wore off, I read the introduction email and quickly became discouraged, as though I were applying for a new job or something. To quickly give you some background on my experience, I’ve been working in IT for over 12 years, ranging from web development to teaching MCSE classes to now being a PFE at Microsoft. And with 8 years now in PFE and almost 200 ADRAPs delivered, I felt like I had seen it all! But even after all of this, I worried whether it would be enough to successfully get through this class.

 

Prerequisites

The prerequisites for the Active Directory Masters class are:

  •  Five or more years of hands-on experience with Windows Active Directory: installing, designing, configuring, and troubleshooting
  • Thorough understanding of Windows Active Directory design and architecture 
  • 300-level understanding of site component topology, forest operations and topology, the Active Directory distributed file system, file replication services, security, client interactions, and Group Policy
  • Basic understanding of Active Directory Certificate Services, Rights Management Services, Active Directory Federated Services, and ADAM/Active Directory Lightweight Directory Services
  • Functional skills in basic protocol analysis, Hyper-V, scripting, PKI, and IP addressing and routing
  • Ability to speak, understand, and write fluent English

And then one of the following certifications:

  • Microsoft Certified Systems Engineer (MCSE) on Windows Server 2003
    Or
  • Microsoft Certified Systems Engineer (MCSE) on Microsoft Windows 2000 Server

And one of the following exams:

  • Exam 70-219 or Exam 70-297
    Or
  • Microsoft Certified IT Professional (MCITP): Enterprise Administrator

 

Once I had met these prerequisites, I then had to complete the following:

  1. Complete the brief application.
  2. Upload your resume or curriculum vitae (CV).
  3. Submit supporting documents including two write-ups on projects that I had been a part of that demonstrated my breadth and knowledge of Active Directory and Microsoft Technologies.
  4. If they can’t verify my experience, I will then be asked to go through a 30 to 60 minute interview.
  5. Register and then pay in full for the program.

It took me a few weeks to pull it all together but I submitted my application and all my supporting documents and waited patiently.  Later that week, I got the email that I had gotten in.

 

The Basics

The MCM class consists of two straight weeks of training in Redmond, WA. During those two weeks, you’ll get only 1 day off although you’ll probably be studying during all your free time. When it starts, it will be 8-10 hours a day Monday through Friday. On Saturday, you’ll have a 3 hour written exam testing you on topics from the previous week. Sunday is the one day off. Then Monday-Friday, classes again are 8-10 hours a day. On that next Saturday, you’ll have another 3 hour exam and the very next day, which is Sunday, you’ll have a very long, grueling 9 hour lab exam. It boils down to about 90 hours of class time, 6 hours of written exam time, and 9 hours of lab exam time. Add this to all the study time and it makes for a very long, exhausting two weeks.

The class covers each of the following topics in depth:

  1. Active Directory Internals
  2. Domain Name Resolutions (DNS)
  3. Client-Side Interactions
  4. AD Site Topology and Replication
  5. RODC
  6. Authentication
  7. Lightweight Directory Services (LDS)
  8. Group Policy
  9. AD Disaster Recovery
  10. PKI
  11. FRS
  12. DFS, including DFS-N and DFS-R

Now remember, this class is not for someone that wants to learn about these topics. I really can’t stress this enough: this class is for those that have extensive experience and knowledge of these topics and want to take it to the next level. If you’re not intimately familiar with each of the above topics, or don’t have the desire to learn the internals of each of them, you probably won’t pass this class. I’m not trying to scare you, but you can’t just read some online brain dump and then pass this class. I’m convinced that successfully getting through this class takes experience + desire + hard work, like most good things in life :)

 

Preparation

As I began preparing for the MCM, I wasn’t sure exactly how to prepare because I didn’t really know what it would entail. Should I go back and read the Microsoft Resource Kits, Windows Internals, or review every ADRAP I had ever done? In between work, travel, and family, how would I have time? As the MCM approached, I thought back to my college days and all those late nights before big final exams. I would stay up all night cramming, walk into the classroom like a zombie, and walk out with a C+. But this wasn’t college anymore; this wasn’t a topic I had been studying for only 4 months. This was my career, something I had been passionate about and worked on every week for almost 14 years; a culmination of my professional career. I decided that if that wasn’t enough, perhaps it just wasn’t meant to be, and either way I was dying to know the Microsoft studs who wrote this class. Even though I wasn’t sure how to prepare for the class, over the course of the month before the MCM I passively went through various scenarios and topics in my head to help fill in any gaps.


The best advice I can give for preparation, besides studying and knowing the above topics inside and out, is to know all the differences and functionality available based on OS version, domain functional level, and forest functional level. Also, be familiar with Active Directory troubleshooting to the extent that you’re comfortable with all the built-in AD tools, support tools, and resource kit tools. For example, do you know why and when repadmin, klist, certutil, or dfsutil are used? I don’t think that knowing these tools will necessarily get you through the class, but if your tool chest doesn’t comfortably include them, you’re probably not where you need to be for this class.
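If you want a quick self-check on that last point, here are a few representative invocations (the targets here are generic, and this is nowhere near an exhaustive list); if commands like these don’t look familiar, that’s a good hint of where to spend your preparation time:

repadmin /showrepl        # replication partners and last-attempt status for this DC
repadmin /replsummary     # quick forest-wide replication summary
klist                     # Kerberos tickets held by the current logon session
certutil -store my        # certificates in the local machine's Personal store
dfsutil /pktinfo          # cached DFS referrals on this client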


Over the course of this new series of mine, I’ll continue to share my experience of going through the MCM class, the challenges, and the mental breakdowns as we slowly start to unfold the mysteries of Active Directory. Stay tuned!

 

Next - Part 2: MCM - Active Directory Internals

Best Practices for Implementing Schema Updates, or: How I Learned to Stop Worrying and Love the Forest Recovery


Note:  This is general best practice guidance for implementing schema extensions, not the testing of their functionality.  There may be some additional best practices around design and functionality of schema extensions that should be considered.  Understand that the implementation of a schema extension may well succeed, but the functionality around the extension may not behave as expected.

As with any change to the Active Directory infrastructure, the two primary concerns around implementing a schema extension are:

1. Have you tested it, so you can be reasonably sure it will behave as expected when implemented in production?

2. Do you have a roll-back plan?  And is it tested?

Digging into the details of each of these is where things get a little stickier.  However, having personally helped customers with dozens of schema updates, I can honestly say that staying within best practices isn’t that hard, and definitely makes implementation less risky and less stressful.

Have you tested your schema update, so you can be reasonably sure it will behave as expected when implemented in production?

The reason this question gets so sticky is that customers either don’t have a test environment, or they don’t have a test environment that reasonably reflects the production environment.  With respect to testing a schema extension, the best test environment is one that has an identical schema to the production environment.  How can you build and/or maintain a test environment that has a schema that is identical to production?

1. Maintain a test Active Directory environment.  On an ongoing basis, be sure to apply all schema extensions to your test environment that you do to your production environment.

2. Build a test Active Directory environment, then synchronize the schema to production.  Specifically:

a. Start by building the test environment to the same AD version as production.  That is, if all your production DCs are Windows Server 2003 or lower, make sure your test environment has a 2003 schema.  If the production schema has been extended to 2008 R2, apply the 2008 R2 schema extensions to your test environment.

b. Apply any other known production schema extensions to the test environment. This includes things like Exchange, OCS, Lync, or SCCM.

c. Fellow PFE Ashley McGlone has a cool PowerShell script that will analyze your production schema for other extensions, to help you “remember” any other schema extensions.

d. AD LDS (formerly known as ADAM) has an awesome schema analyzer tool that will compare two schemas and prepare an LDIF file so you can actually synchronize them. You should definitely use this tool to sync the schemas across your production and test environments.

3. Perform a Forest Recovery Test on your production forest.  (Please be sure you isolate your recovery environment when you test forest recovery).  Your recovered forest will most certainly have an identical schema to production.  Perform your schema update test on this recovered environment.

Typically people will shy away from #3 because it seems the hardest (and potentially most dangerous if you forget to fully isolate the recovered forest). However, based on my experience, I think #3 is the best option. Why? Because it forces you to do something you should be doing anyway (see the section below), and there is no doubt that the schema in your test/recovered environment will be the same as the schema in production.

Do you have a roll-back plan?  And is it tested?

There’s no delicate way of saying this, so I’m just going to say it:

The only supported/guaranteed way to roll back a schema change is a full forest recovery.

Thus, the best (only?) roll-back plan is a well-designed, documented and tested forest recovery plan.  I know it sounds harsh (and it is), but you must be prepared for forest recovery.  A couple points to make this otherwise bitter pill a bit easier to swallow:

1. You should have a documented and tested forest recovery plan anyways.  It’s a general best practice.  You’ve probably been ignoring it for a while, so if you’re serious about a roll-back plan for your schema update, now is the time to get serious about documenting and testing forest recovery plan.

2. It’s not as hard as it appears.  But it is very unforgiving in the details.  We’ve got a great whitepaper to help you through the details.

3. You can actually kill two birds with one stone here.  The forest recovery test will actually generate a great test environment for testing your schema extension (see option #3, above, for testing schema updates).

If you’ve avoided testing forest recovery this long, I expect you won’t go down without a fight.  Here are some of the “alternatives” I’ve heard people used for potential roll-back strategies:

1. Disable inbound/outbound replication on the schema master.  Then perform the schema update on the schema master.  Any badness is contained to the schema master.  If something goes bad, blow up the schema master and repair the rest of the forest (seize schema master on another DC and clean out the old schema master). 

2. Shut down/stop replication on select DCs.  Do the schema upgrade, and if something goes bad, kill all the DCs that were on-line and may have potentially replicated the “badness”.    Light up the DCs that were offline and repair/restore your forest.

Typically, I don’t like to go down those rabbit-holes. First, choosing one of those strategies still does not absolve you from needing a documented and tested forest recovery plan. Second, either of those strategies requires a good bit of work to prepare and execute, and failure to execute properly could be disastrous. Third, if I’m upgrading the schema I like to make sure AD replication is healthy before, during and after the update. Taking DCs offline, or isolating them, significantly impairs the ability to check health; you need to be on your toes to distinguish real errors from self-inflicted errors (caused by the isolation). Finally, be aware that for some schema upgrades (ADPREP specifically), Microsoft recommends against disabling replication on the schema master. Also, check out another strong recommendation against isolation.
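For example, a minimal replication health pass before and after the update might look something like the following (run from an elevated prompt; nothing here is specific to any particular schema extension, so treat it as a starting point rather than a complete checklist):

repadmin /replsummary                          # forest-wide replication summary
repadmin /showrepl * /csv > showrepl.csv       # per-DC inbound replication status, easy to review in Excel
dcdiag /test:replications                      # DC diagnostics focused on replication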

Thus, I would recommend investing your valuable resources in a forest recovery test, and a schema extension test (on the recovered forest).  After that, there’s not a lot of value in additional risk-mitigation strategies like schema master isolation.  If you’ve tested the schema extension and validated recovery you’ve done your due diligence, so know the odds are monumentally in your favor.  Schema extensions, especially Microsoft-packaged schema extensions, have a proven and well-tested track record.  And real-life examples of customers needing to perform a production forest-recovery are almost non-existent.

Put it all together and it’s really quite simple

Get yourself in the habit of preparing for all schema extensions with a one-two step. First, test your forest recovery plans. Second, test your schema extensions in your recovery environment and in any other test/non-production environments you may have. The first time you perform the exercise, be sure to document it. Every subsequent time, be sure to review and update your documentation. You can then be confident that you’ve done everything possible to ensure the schema extension goes off without a hitch.

A Global Enterprise … in your basement?


In case you haven't heard, we're hard at work on the next release of Windows.

As an IT Pro, you have to continually learn new things. A challenge many of us in IT face when a product is released/updated is "how to learn the new product." Additionally, in our day-to-day operations, we need a sandbox-type arena to test planned changes to production, validate our new scripts and win or lose our intra-team bets on the "this should work" ideas. No one tests changes in production, do they?

Some lucky folks work for employers who fund wonderful lab environments full of automation, WAN simulators and other enterprise-class equipment. Most make do with less.

Some lucky folks work for employers who fund wonderful training so their IT talent is retained and up to scratch on current technology. Most make do with less.

Many folks are at least somewhat on their own to manage and grow their knowledge. It's up to the individual to scramble for some lab equipment where she can try things out and learn. For many of us, we do this in some form of a lab in our own homes, often affectionately referred to as a 'basement lab.'

With the leaps and bounds in virtualization technology and hardware advancements, this situation has greatly improved. One is now able to deploy an entire simulated "global infrastructure" in one's basement lab with a minimum of hardware - one or maybe two physical machines.

In this post, I'll offer a few bits and bytes I've come up with for running basement labs of one sort or another.

  • Much of this can apply to a lab beyond your basement - perhaps one you've established at the office that is made up of retired and/or decommissioned systems.
  • Feel free to add comments to the post with your own great basement lab ideas.

Virtualization-capable Hardware

  • You'll need at least one 64-bit physical machine – this is the best bet for a virtual machine host
    • Newer versions of Windows Server OS (and the Hyper-V Role) are only available/compatible with 64 bit hardware platforms
    • Remember, though, that Hyper-V can host both x86 and x64 VMs (I still love this feature)
  • Consider that most small lab virtual environments are typically disk-bound:
    • Many new desktops and laptops today have enough RAM and multicore processing power to run the host OS and at least a couple VMs at a suitable level.
    • However, most desktops and laptops today have only a single disk. No matter how fast the disk is, if you have more than a few VMs and the host OS all running from a single disk, performance is going to suffer, especially if you try to do more than one thing at a time with the system/VMs.
  • You can find a used server-class system from various vendors that would do very well as a virtualization host. Often these older systems cost less than, or about the same as, a new high-end laptop/desktop (I picked up a well-equipped HP Proliant DL 385 G2 for around $700). Be sure to verify the chipset supports virtualization, though.
    • A drawback to a server can be electricity use. Even if you only plug in one of the redundant power supplies, you'll most likely notice it on your electric bill.
    • They are often noisy, too, with lots of internal high-rev'ing fans.
    • Clustering adds obvious cost but sometimes not too much more when looking at used equipment. Let the vendor know if you'd entertain a discount for multiple systems for a cluster.

Once you get some capable hardware, I urge you to pause and create some structure for your lab design. Many of us preach process and standards in our day jobs only to come home to our own basement labs and just start spinning up VMs. Let us "do what we say" and plan this out a bit. This helps us stay sharp from a "process" standpoint and on-going, will make the lab much more useful. This alone can be an eye-opening experience and is part of 'staying current' in technology.

In our labs, we often create AD environments without planning beyond the Wizard interface questions.  We install OSes on VMs and don't follow any standards on naming them within the virtualization console or for the hostname itself. We create users, OUs, etc without really any thought beyond 'what do I need right now?' We ping until we get a free IP address, and then we set it and forget it. 

Then, a few weeks/months pass where we have been away from our basement lab and when we fire up the VMs, we don't recall how we left things.  What were the server names? What did I call the AD domain? What in the world was the password I used? At that point, we often rebuild the lab and this process repeats itself. Let's implement some controls in our 'global IT infrastructure' and practice what we preach.

  • Data Storage – establish a system to manage ISOs, software, utilities, etc. so you can easily find what you need and won't have to download it again (and again and again).
    • ISOs and software – I try to put my source software on a USB thumb drive or other 'less-capable' storage
      • Think about the common pre-reqs in addition to the OS and product ISOs
        • DotNet framework versions
        • Windows Installer updates
        • MMC updates
        • SQL Express
        • Service Packs and key patches/updates for the OSes and products
        • GPP client-side extensions for XP/2003
      • Some folks purchase a TechNet or MSDN subscription and get a TON of full-version software for a bargain. Consider asking your manager if this can be expensed - I've seen that usually gets approved, if requested. Also, this is often a benefit of a Premier contract with Microsoft, yet folks might not be aware that their company already owns this value-add. 
      • Some people use trial versions for their labs since they usually don't require keys 
        • Using trial versions on certain aspects of the lab can result in recurring lab rebuilds when the trial period expires, but that repetition can be good practice, too
      • Speaking of license keys – if you have them, come up with a system to manage them (even just a notepad txt file)
        • How many of us have scrambled around looking for a CD key?
  • System Provisioning – establish the ability to rapidly provision systems. Virtualization offers us some great features here and one of the benefits of this lab effort is learning virtualization.
    • Think about VM naming
      • VMs within the management console/interface
      • VM config file storage within the file system/folder structure
        • Be aware of and consider changing the default storage location for VMs, VM config files and snapshots
      • VHD names associated with the VMs they're attached to
        • Example: 
          • VM name in the VM console "W2K8R2-DC01"
          • In a folder called "W2K8R2-DC01"
          • Based on a VHD file called "W2K8R2-DC01.VHD"
          • Server hostname = "W2K8R2-DC01"
    • VM templates and documentation
      • Build an OS instance, patch it, install virtual integration services, other tools/utilities, fully bake it, then SYSPREP it with the "shutdown" option.
        • Now you have a file that can be copied and then attached to additional VMs to quickly generate a system or an entire global infrastructure in your basement lab (there's a scripted sketch of this copy-and-attach step after this list).
        • Store these VM hard disk images somewhere and copy as needed to the VM folders that are created as part of creating new VMs
        • Consider a common local admin pwd for your base templates so you're not stuck trying to remember what you set it to two months ago
          • In our basement lab enviro, I consider it passable to write pwds down on the whiteboard, or at least give yourself a reminder or hint of what you set the password to be.
          • You can put this right next to the diagram of the lab infrastructure layout you are deploying. You do document your environments, don't you? You do have a whiteboard for your lab area, don't you?  At least a pad and pen?
          • A fellow PFE had a great idea to use SkyDrive to store your lab docs - always available and from pretty much anywhere.
        • I put these VM templates on less-capable storage, too.
        • Also, consider that there are many pre-configured, fully-baked and ready-to-use VHD files from Microsoft that can be directly downloaded for evaluation use.
  • Storage for running VMs
    • Here's where you want to try to get some speedy storage
      • This could be a RAID array on a high-end desktop or possibly a used server with a dedicated storage controller, cache module and a few high speed SAS drives
    • At the least, try to have at least one dedicated drive for the VMs to run from that is separate from the OS drive
    • Having said that, many-a-PFE (including me) has used a laptop with a single drive running a host OS and one or two light-duty VMs for demos with success but I'd aim for "more" and "better" for our basement labs.
    • Consider the space needed if you intend to do a lot with snapshots – these can add up quickly
      • I've seen snapshots used to manage service-packed OS instances
        • Base OS <snap>
        • Base OS + SP1 <snap>
        • Base OS + SP2 <snap>
  • Networking
    • Consider a common internal IP addressing scheme to easily IP systems as needed
    • Consider using a 'private' or 'internal' network for your virtual network/VMs to insulate or isolate the work you do in the lab
    • Consider IP address ranges
      • Server IP 'range' = 100-199
      • PC IP 'range'= 200+
    • Consider a different IP subnet to that of the rest of your home Internet users (i.e. 10.x for your lab space while your home users are on 192.x)
      • It's not fun to tick off your significant other or kids with your lab efforts
      • An alternative is to schedule recurring "change management" meetings with the family – these don't usually pass as fun, or quality family time, though, and turn-out is usually on the low side. :)
    • Remember DHCP is often running on broadband routers and if you install DHCP on a server in your lab and you're sharing an IP subnet, you'll likely cause some disruption
  • Active Directory
    • You can establish a longer-term plan to keep the AD forest(s) around or you can plan to spin-up AD as needed, too.
    • Like the other aspects of your basement enterprise, though, you'll want a simple 'system' to make it easier to use and operate.
      • Think about server names – in my labs, I often base my names on the OS and role for the system
        • W2K8R2-DC01 (domain controller)
        • W2K3R2-INF01 (infrastructure server)
        • W2K8-APP01 (application server)
      • Don't forget a common Directory Services Restore Mode password for your DR practices.
      • Think about AD domain names
        • FQDNs can get loooooonnnnnggg to type over and over as you're testing/learning
          • HILDE.LAB is much shorter than HILDEBRANDBASEMENT.LAB
      • Think about AD contents and naming
        • User ID syntax/format ideas:
          • First InitialLastname
          • FirstNameLastInitial
        • Groups – local, global, universal
          • I use a scheme so I can sort effectively and tell by the name what type of group it is
            • SQL Users-D
            • SQL Users-G
            • SQL Users-U
          • Try to get in the habit of putting something useful in the Description fields
        • Sites and Site Links –
          • I use HQ and BRANCH01, 02, etc for site names
          • I combine the Site names to make the Site Link names
            • HQ-BRANCH01
          • Remember, in the AD Sites GUI, you can enter a single IP address with a /32 for the subnet mask and then link it to an AD Site; your various individual systems will then 'discover' their AD Site
      • Think about OU structures and names
        • GPOs and delegation should drive the creation of OUs, not a pretty folder layout
        • Consider the ease of viewing within the GUI and the sort order
          • Notice below, I added a simple 'dash and a space' before my OU name and that puts it at the top of the ADUC view.
          • I also used all caps to make it stand out in the UI
          • Create some simple standard to use for the Description fields that are found throughout Windows and get in the habit of using them.
          • <Brief description> – <last name> – <date>
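Speaking of provisioning from templates, here is a minimal, hypothetical Hyper-V PowerShell sketch of the copy-and-attach approach from the list above (requires the Hyper-V module in Windows 8 / Windows Server 2012 or later; the paths, switch name, VM name and memory size are placeholders for whatever standard you settle on):

# Copy a sysprepped template VHD into the new VM's folder, then build the VM around it.
$vmName   = "W2K8R2-DC01"                      # follows the naming scheme above
$template = "E:\Templates\W2K8R2-Base.vhd"     # fully-baked, sysprepped base image
$vmFolder = "D:\VMs\$vmName"
New-Item -ItemType Directory -Path $vmFolder | Out-Null
Copy-Item $template "$vmFolder\$vmName.vhd"
New-VM -Name $vmName -MemoryStartupBytes 2GB -VHDPath "$vmFolder\$vmName.vhd" -SwitchName "LabInternal"
Start-VM -Name $vmName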

       

Some closing thoughts...

  • A fellow PFE will be doing a future post with information on using RRAS for setting up subnets/routing for VMs and also some NAT tips for enabling VMs access to the Internet – stay tuned!
  • I just began exploring the use of QoS via GPOs linked to AD Sites to see if I can simulate a slower WAN-type connection between devices in the Sites (as is often the case between AD Sites)
    • How are you in the field simulating WANs in your labs?
  • It is helpful for me to have some "tasks" rather than just wandering around in a lab. So here's a "go-do" list for those of you who'll benefit from it.

Team, the Summer of 2012 is upon us – how far will you go between now and the end of the summer?  Set aside time on the schedule and do some of the things discussed here.  Soon, you will have learned a great deal (and might be ready for another certification test?). 

Have fun in your lab and expand your skills!

Hilde


Slow Boot Slow Logon (SBSL), A Tool Called XPerf and Links You Need To Read


For the last 6 months I’ve been saying I was going to write a series of posts around the topics of slow boot slow logon (SBSL) and how to use Xperf, but stuff kept coming up. While I kept missing the boat, some other awesome engineers totally ate my lunch on this topic and posted them online, but we’ll get to those. This post is really going to aggregate all the posts that are out there into one easy stop, fill in some gaps, and get you started on troubleshooting your boot time.

 

Slow Boot Slow Logon All Up In The End User's Face

A huge impact on a user’s perception of how well a system performs is their actual logon experience. I frequently hear stories while I’m with customers where the user comes in, turns on their computer, starts the login process, and then goes to get coffee while the system does whatever it needs to do. When they get back it’s usually done. A shiver goes up my spine. Don’t think this is a big deal? This is costing you money! Some of the folks in CTS put together an awesome graph so you can calculate just how much money is being lost. So what is your boot time? 1 minute? 5 minutes? 10 minutes? I’ll keep going. PFE is here to help. First and foremost, if you are a Premier member, we have a Windows Desktop Risk Assessment Program service, or WDRAP for short. Simply contact your TAM and tell them you cannot live another minute without it and say this post said so. Don’t have a Premier contract? Contact this blog and I’ll do my best to get you in touch with the right people so you can. Until then, follow this post by CTS to determine how long your boot process takes.
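If you just want a rough number before breaking out any tools, the Diagnostics-Performance event log already records a duration for each boot; a quick way to pull the most recent entries with PowerShell (Windows 7 and later) looks something like this:

# Event ID 100 in this log is written once per boot and includes the measured boot duration.
Get-WinEvent -LogName 'Microsoft-Windows-Diagnostics-Performance/Operational' |
    Where-Object { $_.Id -eq 100 } |
    Select-Object TimeCreated, Message -First 5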

 

A Tool Called Xperf

As SNL’s Stefon would say, Xperf has everything: graphical charts, the ability to do stack tracing, process lifetimes, Spud Webb (kidding), and so much more. Xperf is a FREE tool that is part of the Windows Performance Toolkit. This CTS post walks you through where you can get the install, but I want to point out a tip. When selecting the components to install, you can deselect everything besides the Windows Performance Toolkit and, if you want to install this on other machines, the Redistributable package.

clip_image002

 

 

 

Time To Take A Trace

If you are following the CTS guide from above (you should be), you will take a boot trace with the following command: xbootmgr -trace boot -traceflags base+latency+dispatcher -stackwalk profile+cswitch+readythread -notraceflagsinfilename -postbootdelay 10

Once it’s done, open the trace with xperfview. Here are some hints on what to look for during your analysis. Got it? Alright, I’ll get you started with a few tips. First, scroll down to the bottom and look for a graph called “Boot Phases”. It should look something like this.

clip_image004

A few things should jump out at you. First, the boot process is broken out into very distinct phases. CTS did an AMAZING job of breaking them out; you can read about it here. Something else to notice: the red arrow is a dead giveaway that a major part of our boot time is spent in Winlogon. Now we have something to target in trying to reduce our boot time. Finally, you should notice that we can tell exactly how long our boot is taking: 119 seconds.

 

Another thing to look at is the Services chart. You’ll want to look for any services that seem to be running longer than the others. Here is an example.

BeforePatch

 

 

There are many things wrong with this, but the one that really should stick out is sftlist, which is… APP-V. This is fixed in a hotfix: http://support.microsoft.com/kb/2571168. I documented this entire process over here.

 

Alright, one last hint: in the CPU graphs, keep an eye out for large spikes or things that look interesting, like “teeth”. Constant rows of spikes are not good. Here is another example.

clip_image008

Look at those spikes. Something is going on there. The CTS guys have compiled a list of some common root causes that you can read about here.

A tip I would give is that the more traces you take and look at, the better you will get at reading them. You will quickly spot things that don’t seem normal. If this is a topic of interest to the community, I can write some future posts on SBSL and what the root causes were.

-Mark Morowczynski

The Mysterious Case of a Saved Favorite URL and UAG


Recently I was asked by one of my customers to assist in a project to replace TMG with UAG, specifically for their Remote Desktop RemoteApp publishing portal. I’m not an expert with UAG, but I can usually get it to do what I need it to do, and I have the secret weapon: I work for Microsoft, and I knew I COULD collaborate with the experts when and if I needed to!

The UAG portal gives them a quick, easy way to manage and handle user credentials, including password expiration, alongside the familiar RDWeb view of the published applications. Add to that the ability to extend the portal to Federated Applications, and it piqued my interest.

Unfortunately, there was one caveat: we had to be able to handle the existing documented and saved favorite URL for https://RemoteApp.Contoso.com/RDWeb. At first glance, I thought of a few different ways to do this, but it turned out that it wasn’t quite as straightforward as I had envisioned. After some research via http://www.bing.com I read several posts that said this can’t be done, with various reasons why. “It happens too early in the ISAPI handling in the UAG application” was the one that stuck in my mind. This seemed like something that would have been thought about with UAG, and I felt that there must be a way to accomplish something as rudimentary as redirecting an inbound request to the main portal page.

I made a few calls, whiteboarded a few ideas, and even spent time testing and configuring different options in my lab. My range of failures included exposing another website through UAG that hosted a simple RDWeb / Default.htm that redirected back to the main UAG portal. Needless to say, I was having a much harder time getting this to work than I envisioned.

Finally, after stumbling for a little while, I came across a way to do this using the Manual URL Replacement on the UAG Trunk configuration. Now, this was also one of my first theories to make this work, but I just couldn’t seem to get the syntax right. Through trial and error, I finally discovered the proper configuration, and it was much simpler than anything I had been trying to make it.

As you can see, my configuration was pretty simple. I started with a basic UAG portal and then added the RemoteApp and Remote Desktop applications through the Add Application wizard.

image

Just to be clear, I’m sure there are other, and potentially better, ways to accomplish this URL redirection.

I did state that I’m not a UAG expert, right? In fact, I work with Active Directory as my specialty. This isn’t intended to be the official “THIS IS HOW YOU DO IT” post. I just know the effort that it took for me to find this workaround, and wanted to get it out there in hopes of making someone else’s job a little easier.

I configured the Manual URL replacement policy a few different ways at first and received various error messages when testing from the client portal. The errors ranged from “The URL you have requested is not associated with any application” to “You are not authorized to use this application”. The latter was because I placed “LocalHost” somewhere it wasn’t supposed to be in a redirection rule.

Now for the process I used to actually make the saved Favorites URL redirect back to the main portal page.

The first step was to edit the properties of the Portal application. I needed two things here. The first was to add my public hostname to the list of Web Servers. The second was to copy the Path listed for the portal so I could use it in the manual URL replacement rule later.

image

 

Next, I selected Configure on the Trunk.

clip_image002

And in the Manual URL Replacement rules, I added a new rule:

I placed /rdweb/* in the URL: box. I used /rdweb/* because I wanted to make sure that I covered any request coming in with /RDWeb/ in the URL. Then I pasted the Portal path I copied from the Portal configuration into the To URL: box. In my case this is /SecureOutsideAppsPortalHomePage/.

Next, I selected the Type of action I wanted to perform. I chose Rerouting because I wanted the request rerouted to the main portal page.

Finally in the Server Name box, I used the Public Hostname that I placed in the Web Servers section of the Portal and selected the checkbox for Use SSL.

clip_image004

Now, when my client clicks their saved RDWeb favorite link, it is redirected to the main UAG Portal page without any errors!

Jim Kelly

Becoming an Xperf Xpert: The Slow Boot Case of the NetTCPPortSharing and NLA Services


So now that you are in the loop on the XPERF greatness, let's look at a real world example of how XPERF can help us optimize boot times.

(For those of you that missed the XPERF memo, go back and read Mark's post)

When we first started looking at this client laptop, he was getting to a usable desktop in about 2 minutes. Not bad right? But not great either.

So we took a trace.

For those of you that are curious, the syntax we used to gather the trace is as follows:

xbootmgr -trace boot -traceFlags Latency+DISPATCHER -postBootDelay 120 -stackWalk Profile+ProcessCreate+CSwitch+ReadyThread+Mark+SyscallEnter+ThreadCreate

Note: Notice that we have -stackwalk as one of our switches.

Recall, stackwalking allows us to review modules and function calls in our trace, if needed. Refer to the links in Mark's blog for instructions on setting up and configuring symbols, if needed. Since I have -stackwalk as one of my parameters, before gathering a trace on an x64 machine I must first disable the paging executive, because part of the kernel is paged out. This is easy to do, but it does require a reboot. Simply run the following command from an administrative command prompt and you're good to go:

Reg.exe add "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

After gathering our trace, we next opened it in Performance Analyzer. For those of you new to this, Performance Analyzer (aka xperfview) is the tool we use to analyze XPERF traces. It is installed with the Windows Performance Toolkit which is a part of the Windows SDK. See the previous link for more information.

One of the first things I do is open System Configuration by clicking on the Trace menu and selecting System Configuration. I do this just to see what I'm dealing with. For customer privacy reasons, I have removed the first and second lines containing the Computer Name and Domain Name from my screen shot:

I can see from here that the client is running Windows 7 Professional SP1 (as indicated by the 6.1 OS version) along with lots of other potentially useful information. Make sure to check it out for yourself. My above screen shot is from the General tab, which is one of many tabs available:

After a cursory glance at the default graphs, including the Boot Phases graph that shows the ~ 2 minute time to reach a useable desktop:

I decide to focus on services. Just like in Mark's example trace in his blog, some services quickly jump out:

Namely, it looks like the NetTcpPortSharing service is causing a ~30 second delay during startup while we wait for it to start before NlaSvc starts. Then the NetPipeActivator service is causing a similar delay. Also, why is NlaSvc taking nearly 20 seconds to start?

Let's clean this up a bit by changing the NetworkList service from Manual to Automatic and take another trace. After popping open the trace, we already know these changes have helped, as we are down to a usable desktop in ~99 seconds.

We again scroll down to services, and see the following:

We still have some room for improvement. We move on to disabling the NetTcpPortSharing and NetPipeActivator services. For those of you that are not aware, these are .NET Framework 4.0 services related to .NET development. This user does not do any .NET development, so we figured we were safe to disable these and alert the user we had done so (just in case). We also have them disabled by default on our Microsoft internal image. After disabling these services, we take another trace, and we again see improvement. We reach a usable desktop in ~80 seconds.
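If you decide the same two services are safe to disable in your environment, one way to do it from an elevated prompt is shown below (service names as referenced above; as always, test on one machine first):

sc.exe config NetTcpPortSharing start= disabled    # Net.Tcp Port Sharing Service
sc.exe config NetPipeActivator start= disabled     # Net.Pipe Listener Adapter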

We again scroll down to services. This is more along the lines of what we like to see. We don't like to see a lot of what we call "stair stepping". We still have a little of that going on and suspect that some of this will be cleared up with the application of the Symantec SEP RU6MP3 update mentioned here.

We went on to take traces of some additional machines and confirmed we were seeing similar delays across the board. Just as easy as that, and in about 30 minutes, we reduced our boot time to a usable desktop from 120 seconds to ~80 seconds. How 'bout that? :-) If you experience a similar delay with NlaSvc, rather than change the Network List service to start automatically, try applying the following hotfix, which should correct this problem:

Delay occurs when you log on to a domain from a computer that is running Windows 7 or Windows Server 2008 R2

http://support.microsoft.com/kb/2709630/en-us?sd=rss&spid=14134

(Note: We're still waiting to hear what impact the Symantec update has on the clients and suspect that the 80 second timeframe to a usable desktop will be further reduced.)

~ Charity Shelbourne

Clustering: What exactly is a File Share Witness and when should I use one?


Customers ask from time to time: “What is a File Share Witness (FSW)?” Sometimes they’ve worked with prior versions of clustering and don’t know what a FSW is, or that the option exists. The next question asked is usually: “When should we use one?” Before going into that, I’ll review some subtle differences between legacy cluster quorum options and what options are available from Microsoft today for a Failover Cluster.

Legacy Cluster Quorum Options

Quorum may be defined as the number of members of a group that must be present to conduct business. For a legacy (Windows NT/2000) two-node cluster that lost all communications and became partitioned, whichever node could maintain reservation of the quorum disk first would survive and own it. The quorum disk was a tie-breaker for systems that could not communicate as well as an additional storage location for cluster configuration data. One downside to this model was that if the quorum disk failed, so did the cluster. A legacy two node cluster could not function without it. So if just the disk failed but both nodes remained, the cluster would cease to function. Therefore, the quorum disk was very important for legacy clusters. With Windows Server 2003 clusters with more than two nodes, Majority Node Set was another quorum option. A MNS cluster can only run when the majority of cluster nodes are available. This model is typically not chosen for two node clusters because with this model you have to have two nodes minimum ((2 / 2) +1 = 2)…to maintain majority. As of 2003 SP1, an option was added to allow use of a File Share Witness to add an additional vote so that in the same example above, a two node cluster could function with the loss of one node.

Quorum Options for Modern Clusters

With Windows Server 2008 and later clusters, if all network communication becomes severed between nodes, and quorum must be established, votes are counted. (The exception to this statement would be the disk only quorum model which is similar to the quorum model in legacy clusters where the only vote that is counted is the disk. This option is not recommended, and is rarely chosen.) By default, each node gets a vote and if configured, a single witness may be counted as a vote. A witness may be either a Witness Disk or File Share Witness (FSW). However, you cannot use both. Half of all possible votes + 1 must exist for there to be quorum. Therefore, for an even number of nodes you would want to have a FSW or witness disk. This means that the disk or FSW can cease to exist and as long as enough nodes are online then the cluster may still function.

The following TechNet link is a great reference for 2008 and 2008 R2 quorum options:

http://technet.microsoft.com/en-us/library/cc770620(v=WS.10).aspx

From that link, the following table describes when you would use an alternate method based on even or odd number of nodes.

Description of cluster                                   Quorum recommendation
Odd number of nodes                                      Node Majority
Even number of nodes (but not a multi-site cluster)      Node and Disk Majority
Even number of nodes, multi-site cluster                 Node and File Share Majority
Even number of nodes, no shared storage                  Node and File Share Majority

What is a FSW and how does it differ from a disk?

All this discussion about votes is a perfect segue into what a File Share Witness (FSW) is and how it differs from a witness disk. A FSW is simply a file share, which you may create on a completely separate server from the cluster, that acts like a disk for tie-breaker scenarios when quorum needs to be established. The share could reside on a file server, domain controller, or even a completely different cluster. If you are using the FSW option for quorum, the witness share needs to be reachable by all nodes of the cluster. The purpose of the FSW is to have something else that can count as a vote in situations where the number of configured nodes isn’t quite enough for determining quorum. A FSW is more likely to be used in multi-site clusters or where there is no common storage. A FSW does not store cluster configuration data like a witness disk does. It does, however, contain information about which version of the cluster configuration database is most recent. Other than that, the FSW is just a share. Resources cannot fail over to it, nor can the share act as a communications hub or alternate brain to make decisions in the event cluster nodes cannot communicate.

Remember the old capture the flag game you might have played at summer camp? Typically each team had a flag and had to protect it from capture by the opposing team. However, a variation on that game is where there is only one flag located at an alternate location in the woods and two teams try to find it. When the flag is captured by one team, that team wins. A FSW is somewhat like the flag at an alternate location where a team of surviving nodes that can obtain it are able to reach quorum and other nodes that cannot drop out of the active cluster.

A Witness Disk is similar to a FSW except rather than being a file share somewhere on your network like a FSW, it is an actual disk provided by disk storage that is common to the cluster. Nodes arbitrate for exclusive access to the disk and the disk is capable of storing cluster configuration data. A witness disk counts as a vote.

By this point you should have a good understanding of what a FSW is, when it might be used, and what it isn’t. Now let’s look at a couple of make-believe cluster configurations that each use a FSW and that are similar but quite different.

This is a 4 node cluster that is split between two sites with a FSW in Site C. In this configuration there are 5 possible votes. Bidirectional communication is possible on the Public and Private networks. Site A and Site B have a link to Site C to connect to the FSW, but neither Site A or Site B may connect with each other through Site C’s network. If one of the bi-directional networks fails, there remains one network that may be used for cluster communications for the nodes to determine connectivity and decide the best recovery method. If both bi-directional networks fail, then this cluster is partitioned into two sets of two nodes and the FSW must be leveraged to determine which set of nodes survives. The first set of nodes to successfully gain access to the FSW will survive. IF the network to Site C were a complete network allowing communication between Site A and Site B as well, then there would be an alternate communication path for the cluster to determine the best course of action…and this configuration would be that much better.

This variation is very similar to the first example and is something that customers have been known to implement. In this case, both bi-directional networks are actually VLANs that go through the same network connection. Therefore, the two separate networks have the vulnerability of a single network. Even if the Leased Net were redundant, that piece of the puzzle is still provided by the same network provider, or could be cut by a backhoe somewhere in between. So there exists a possibility the Leased Net segment could go down or become unresponsive. The validation process will warn if there is only a single network, but it has no insight into what the underlying network actually is between the two sites. An extra network or bi-directional communication link through Site C would once again be an improvement.

Without a FSW in these two configurations, there would be no way to break the tie during communication loss as there are two sets of two nodes otherwise.

Let’s take a non-technical example. Imagine you have 4 people on a camping trip. They park the truck at the camp parking lot and split into two groups of two. Each group hikes 5 miles in opposite directions from the truck they arrived in but at most, the groups are 5 miles from each other. They each make camp in a triangular setup like the second diagram above. Each group has a cell phone to communicate (with voice and text capability) and a heater. The truck contains a spare heater with a partial tank of fuel. Late at night, the heater at each camp fails. One runs out of fuel. The other experiences a mechanical problem. One or both groups are beyond cell range and cannot communicate. The first person that hikes back to the truck to get the spare heater has heat for their tent at their campsite. It doesn’t matter if both camps can hike to the truck or can even see the truck directly from their camp. There is only one spare heater. Since there is no communication available, they can’t call or txt message each other. The camp that ran out of fuel ends up getting the spare heater and almost enough fuel to make it through the night. The other camp wastes energy of one person hiking to the truck and back to find there is no spare heater and ultimately freezes with no working heater and extra fuel they can’t use. Ideally, if the camps could communicate by other means, they would know to meet at the truck, swap the broken heater for the spare and divide up all the fuel so that both camps could have heat, or decide to leave and get coffee on the way home. With only one real communication path, the best decision for all 4 people could not be made.

The cluster configuration is similar. With all communication severed, the cluster has to decide what to do based on the information available which may be limited. The FSW in the cluster example is capable of breaking the tie of two votes against two votes. However, the cluster is not able to help discern which site is actually the best one to continue because of other conditions.

How do I know if I’ve chosen the best quorum model?

For a cluster with an odd number of nodes, Node Majority is the typical model chosen. However, with an even number of nodes it makes sense to have a witness resource as a vote. The validation process for clusters in Windows Server 2008 R2 can validate the cluster against best practices and suggest a different quorum model if the current selection is not the best option.
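If you prefer PowerShell to the GUI, the failover clustering module in Windows Server 2008 R2 can show and change the quorum model as well; here is a quick sketch (the share path is a placeholder for your own witness share):

Import-Module FailoverClusters
Get-ClusterQuorum                                                # current quorum model and witness resource
Set-ClusterQuorum -NodeAndFileShareMajority \\FS01\ClusterFSW    # switch to Node and File Share Majority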

Becoming an Xperf Xpert Part 2: Long Running Logon Scripts, Inconceivable!


Login scripts, once implemented in an environment, tend to never really get removed. They are always there, running. Much of this login script functionality, such as mapping drives, could be moved to Group Policy Preferences (http://blogs.technet.com/b/askds/archive/2009/01/07/using-group-policy-preferences-to-map-drives-based-on-group-membership.aspx). The real question is: do you know how long your login scripts take to complete? What about that one script that was supposed to run one time and never again? Are you sure it isn't running on every login? We will dive in and find out.
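If you're not sure how long a given script takes today, one quick (if crude) check before reaching for a boot trace is to time it interactively with Measure-Command; the script path below is just a placeholder for your own logon script:

# Hypothetical example: time how long a logon script takes to run.
Measure-Command { & "\\contoso.com\netlogon\logon.cmd" } | Select-Object TotalSeconds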

In our previous posts (http://blogs.technet.com/b/askpfeplat/archive/2012/06/09/slow-boot-slow-logon-sbsl-a-tool-called-xperf-and-links-you-need-to-read.aspx) we’ve talked about how to take an xbootmgr trace. Open up the trace and scroll down until you see “Process Lifetimes”.

clip_image002

We are going to want to right-click in this area and choose “Select View”. This will then highlight the entire view. Right-click again and select “Process Summary Table”; a new window will open that gives us a detailed view of every process that started during the trace. Since we are investigating login scripts, know that they tend to take a few forms; batch files and VBScripts are often the most popular.

First, we will want to get our columns in order. I recommend the order of Process, Process ID, Parent Process ID, Duration, Start Time, End Time, Hosted Services, Command Line. These can be added or removed from the Columns Menu. Now take a look and see if we have any cmd.exe processes launching during this time frame. In this case we do:

clip_image004

Here we can see that cmd.exe process ID 2,164 is taking 30 seconds to complete. We also see that it kicks off 3 other child processes: 2,236, 2,352, and 2,268. Those run very quickly, but we are able to tell where they came from, as they all share the same Parent Process ID, 2,164. Understanding this parent-child process relationship will help us continue to drill down. For now, scroll to the right and look under Command Line to see what is being called:

clip_image006

This appears to be launching from a group policy, since we are in the directory \sysvol\domain name\Policies\GUID\Machine\Scripts\Startup\ and we are calling a script called startup.cmd. Here is the part where I tell the admin that they now have a 30 second login script. Their responses range from “there is no way” to “not possible” and even “inconceivable!”, especially if they were Vizzini from The Princess Bride. As Inigo Montoya would say, “I do not think that word means what you think it means.” Well, we could stop right here and say this startup.cmd script is the problem, but what if this script is large and does lots of things? We need to get more specific to see what else this script is kicking off to get to the root delay. We do this by dragging the Parent Process ID column all the way to the left so that it is now what we are sorting by. Then, we target the process ID we are interested in, 2,164.

clip_image008

This script is busy; they always are. The one that seems to be our bottleneck is a cscript.exe with the process ID of 2,316, which is taking 28 seconds to complete. If we scroll to the right, we will see the command being launched.

The top arrow is our original script; the second arrow is process 2,316. It is calling another script, startup.vbs. It also appears to be coming from the same group policy, which is confusing in general, but that is a different issue altogether. As we push on, let’s go investigate what child processes are being kicked off from process 2,316.

clip_image012

This process is another cscript with the process ID of 2,364 that is taking 26 seconds to complete. We scroll to the right to see what is being kicked off.

clip_image014

I know what you are thinking: this just keeps on going, doesn’t it? I promise there is a light at the end of the tunnel. We see that we are mapping to a DFS share to launch another VBScript called SDT.VBS. Let’s see what child processes come from process 2,364.

clip_image016

Finally, not another script! We can see that this process, 2,728, spawned from SDT.vbs (2,364), is doing an install of some sort. It is also taking 18 seconds to complete. The line directly above it (33) is its parent and is taking 26 seconds, as we can see above. I don’t have a network trace to confirm this, but I suspect it is taking roughly 12 seconds to map the drive to launch the SDT.VBS script. Let’s scroll over to see what all this trouble was for:

clip_image018

There you have it! The Vcredist.msi installer is taking a large part of our login script time. Chances are that this doesn’t need to be installed EVERY SINGLE TIME. We were able to remove this from our login script and reduce our boot time substantially, by about 30 seconds. Continuing to dig through the parent processes allowed us to identify the actual root cause of the delay.

Hello. My name is Mark Morowczynski. You kill my boot time. Login Script….prepare to die.

The Case of the Missing SRV Records


I was recently on site with a customer performing an ADRAP when we found that several domain controllers were missing certain generic SRV records from DNS. The environment had around one-hundred DCs and thirty of them were missing records. Unsure why this was inconsistent, we started to investigate, first by restarting the netlogon service on one of the domain controllers in question. Restarting the netlogon service is one way to force record registration of SRV records for that DC. We refreshed the zone and found that the missing records, _kerberos._udp.contoso.com, and _kpasswd._udp.contoso.com were present. A few minutes later, allowing some time for replication to occur, we re-ran our test. The records had vanished once more.

Examining the list of DCs in the environment that had missing SRV records, we found that most of them were in remote Active Directory sites, some a few hops from the main hub. Further examination of the DNS configuration showed that the Active Directory namespace in DNS (the contoso.com zone) was configured for aging and that the no-refresh interval was set to two hours, while the refresh interval was set to seventy hours. A single DC was set to scavenge records every three days and all DCs pointed to themselves for primary DNS.

Things clicked once we looked at the aging settings on the zone. Let’s start by briefly reviewing what the refresh and no-refresh intervals mean to DNS clients (or you can read this great post for a longer explanation).

When a client initially registers a record with its DNS server, and aging is enabled on the zone, the record starts off in the “no refresh” period. During this administrator-defined period of time, which defaults to seven days, the client cannot refresh the record’s timestamp by re-registering the same data. If the client’s IP address changes, the record data is updated and a new timestamp is written. Again, the no-refresh period only prevents timestamp-only refreshes.

After this interval has expired, the record enters the “refresh” period. It’s during this period, if the client is still at the same IP address, it is able to update the timestamp on the record. Once the record is timestamped again, it enters the no-refresh period. This cycle continues as long as the record is consistently updated.

If the record passes the no-refresh interval and the refresh interval without being updated, it’s now eligible to be scavenged. Scavenging is a process that is generally carried out by a small number of DNS servers. The scavenging process checks each zone for which it is authoritative for any records that have aged beyond the no-refresh + refresh period. Records which were updated last beyond this interval are scavenged.

Client A (host) records are updated by the DHCP Client service on Server 2003/XP, or on Server 2008/Vista and later by the DNS Client service, every 24 hours. On domain controllers, SRV records are updated by the netlogon service. These updates occur, by default, every 24 hours on Server 2003 DCs, or hourly on Server 2008 and later DCs. On any version of Windows Server, the records are also registered when netlogon starts.

Back to the vanishing SRV records…

The DNS zone in question had set the no-refresh interval to just two hours. As previously mentioned, the netlogon service in Windows Server 2008 and later will attempt to register SRV records every hour, regardless of zone aging settings. Let’s look at how DNS records are stored in AD integrated zones.

DNS records are stored in the directory in a dnsNode object that corresponds to the name of the record. For example, _kerberos._udp.corp.milt0r.com would actually be a single object viewable in ADSIEdit at this path, assuming the replication scope is forest-wide: “DC=_kerberos._udp,DC=corp.milt0r.com,CN=MicrosoftDNS,DC=ForestDNSZones,DC=corp,DC=milt0r,DC=com”
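If you'd rather look at the same object from PowerShell instead of ADSIEdit, an LDAP bind to the dnsNode works. A rough sketch using the corp.milt0r.com path above; the DC name is hypothetical, and you may need to target a DC that actually hosts the ForestDNSZones partition:

# Bind to the dnsNode object and count the values in its multi-valued dnsRecord attribute
$dn = 'DC=_kerberos._udp,DC=corp.milt0r.com,CN=MicrosoftDNS,DC=ForestDNSZones,DC=corp,DC=milt0r,DC=com'
$node = [ADSI]("LDAP://dc01.corp.milt0r.com/" + $dn)
# Each value in dnsRecord is a raw byte array representing one record
$node.Properties['dnsRecord'].Count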

Image1

 

Each of the individual SRV records you see listed in the DNS management console are stored in an attribute of that dnsNode object. This is a multi-value attribute called dnsRecord.

Image2

This dnsRecord attribute stores values that represent information about the SRV record (weight, priority, port, hostname). When an SRV record is created, an entry is added to the dnsRecord attribute. When an existing record is updated, the appropriate value is updated. The screenshot above shows that the _kpasswd._udp dnsNode object has four records under the dnsRecord attribute. If we examine the zone in the DNS management console, we would see this as four individual SRV records. Additionally, because dnsRecord is a multi-valued attribute that does not use linked-value replication, a change to a single value results in replication of the entire attribute. That brings us to the next logical question…

What happens if two DCs, pointing to themselves for DNS, register their own SRV records for the same dnsNode object at the same time?

Let’s say DC1 and DC2 both have no SRV record registered for _kerberos._udp.corp.milt0r.com. At 1:00PM, you restart netlogon on both domain controllers, forcing the records to update. Netlogon will attempt to register all of its necessary SRV records, including the missing _kerberos._udp record. In this case, since they don’t already exist, we would expect registration to succeed.

A little while later, replication takes place. You have two DCs, both with a dnsNode object for _kerberos._udp, but with different values in the dnsRecord attribute. Whichever domain controller wrote the change last will win. If we examine the zone on either DC after replication, we should see that the SRV record for only one of the DCs was successfully registered. When replication occurred, the copies of the object on either DC were found to have conflicting information. To resolve the conflict, the DC replicating the inbound change either kept its own copy or replicated its partner’s copy, depending on which was more current.

What’s that got to do with a 2 hour no-refresh interval?

With the no-refresh interval set to a small span of time, every domain controller will successfully update its time stamp every other hour. In a small environment with just a few DCs, and a relatively low convergence time, administrators may never notice a problem. In a large environment with many domain controllers across many sites, and if those DCs all point to themselves for primary DNS, you’ll begin to see replication conflicts as DCs register records and those registrations overlap replication intervals. In an environment with over 100 DCs where the no-refresh period is just two hours, it stands to reason that you’d have multiple replication conflicts within any given hour period. Depending on the number of sites, the topology, and the replication intervals, you could have ten conflicts within the same 15 minute interval. Eventually, some DCs are going to “lose” and have their records scavenged since they’ll end up being seen as stale by the scavenging DNS server. This assumes that netlogon hasn’t been manually restarted at certain points.

How do I prevent this weird scenario?

Luckily, there’s a pretty easy fix. Set your DNS no-refresh interval to a span long enough that a DC can lose a few registrations to replication conflicts and still refresh its records successfully before they become eligible for scavenging. Changing the interval to something like 24 hours means that, while netlogon will still try to update the record each hour, the timestamp will only be refreshed once a day. This gives all of your DCs a longer window to register without the risk of experiencing multiple replication conflicts.
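For reference, here's what that change looks like as commands. The cmdlet is from the DnsServer module that ships with Windows Server 2012 (dnscmd covers older builds), so verify the exact parameters against your DNS server version; contoso.com stands in for your AD-integrated zone:

# PowerShell (DnsServer module): 1-day no-refresh, 7-day refresh on the zone
Set-DnsServerZoneAging -Name 'contoso.com' -Aging $true -NoRefreshInterval 1.00:00:00 -RefreshInterval 7.00:00:00

# Roughly equivalent with dnscmd (interval values are in hours)
dnscmd /Config contoso.com /Aging 1
dnscmd /Config contoso.com /NoRefreshInterval 24
dnscmd /Config contoso.com /RefreshInterval 168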

Conclusion

I hope that this post gives you a better understanding of the DNS refresh intervals, how DNS records are stored in the directory, and how a low no-refresh interval can impact SRV records and replication. As always, leave questions or comments below!

Once I’d finished writing this post, I discovered a pretty similar post on the subject (though with a different root cause) on the AD blog, located here: http://blogs.technet.com/b/ad/archive/2008/08/08/a-complicated-scenario-regarding-dns-and-the-dc-locator-srvs.aspx. That particular post goes into a bit more detail in some areas, but covers a case where the cause was the sheer number of DCs running Server 2003, as opposed to a very low no-refresh interval. I recommend reading that as well.

 

- Tom Moser

Too Many Admins in Your Domain: Expose the Problem(s) and Find a Solution. (Don’t forget PowerShell)


In my role as a transactional PFE, I have the privilege of visiting 40-50 customers per year. Often I’m called in to perform an Assessment of the Active Directory Infrastructure. Without a doubt, one of the biggest challenges most customers face is managing the membership of the privileged groups in their domain. The challenge manifests itself in a couple of ways, but for simplicity let me call out two dimensions of the problem:

1. Privileged groups in the domain have too many members.

2. Some of these privileged accounts are loosely managed, and have passwords that are never/rarely changed.

Before we proceed, let me specify what I mean, when I say “Privileged Groups”. These are built-in groups in Active Directory that have been granted various levels of rights in the domain, from full administrative rights (Enterprise Admins, Domain Admins, Administrators, Schema Admins) to more limited, but still significant, rights (Account Operators, Server Operators, DNSAdmins). Fellow Microsoftie Laura Robinson has a nice blog that details these groups and explains some of their privileges.

Laura further makes the case that these groups should be empty. From a security perspective, her argument is absolutely sound. Unfortunately, I’m an infrastructure dude, so I worry about all the bad things that might happen and how having some of those built-in privileges might help you fix those bad things. So I’m not willing to argue for no members in your (built-in) privileged groups. However, I think we can all agree that having fewer (privileged accounts) is more (better).

So how many are too many? In our AD risk assessment we flag any privileged group with 20+ members. That’s somewhat arbitrary, but you have to set some goals. I would argue that for almost any organization, you should aspire to fewer than 10 members in each of the most privileged groups (Enterprise Admins, Domain Admins and Administrators). Some of those other groups (Account Operators, Server Operators, Backup Operators, Print Operators) should probably be empty (see the delegation argument, below).

The challenge is how do you get there from here? If you’re like one of the roughly 75% of customers I visit where these groups have 20-500 members, you probably didn’t design an Active Directory where these group memberships are bloated. And the key word here is design. Unless you have a design for your privileged groups, you will likely always have pressure (political and technical) to unnecessarily add users to these groups. Every administrator in IT wants a domain admin account, so they’re not bothered with access problems. Services are so much easier to deploy when you can simply add the service account to the Domain Admins group. (Hence the large number of privileged service accounts with passwords that never change). This continuous pressure leads you to a place where trying to remove accounts from privileged groups is like playing whack-a-mole. You take some out, and even more come back.

So how do you design your AD for membership in privileged groups? Technically, it’s not that hard. Put your nose into our Best Practices for Delegating Active Directory Administration whitepaper. After you’ve blown through those 500+ pages, don’t forget the appendices. Simply put, you must design your AD (with your own privileged groups) so that you can give every person exactly the rights they should have without dropping them into one of the built-in groups and giving them too many rights. If you do this, you will find that the most privileged built-in groups (Enterprise Admins, Domain Admins, Administrators) can contain a small number of accounts, and some of the other built-in groups (Backup Operators, Server Operators, etc.) can be emptied.

Sounds easy, but it does take some time for study, design and testing. It also takes time to get the political backing, secure the resources and allow for eventual implementation. If that’s a battle you’re not ready to fight, start small by exposing the magnitude of the problem. You may even be able to take some small steps in the right direction.

Step 1: Expose the Problem

Below is a PowerShell script which will enumerate the membership of the built-in privileged groups. This includes expanding nested groups to get down to all the individual members. The script also handles duplicates so only unique members are listed – for example, an account that is directly a member of a group and indirectly a member through a nested group. Run the script and it will:

Report On-Screen:

    • The group that is being scanned
    • The number of unique members (both direct and nested) in the group
    • Members with old (greater than 365 days) passwords

image

Dump to an output file (AllPrivUsers.csv):

    • Information on each unique user in each privileged group such as password age and enabled/disabled.
    • Note, users that are members of multiple privileged groups will appear multiple times in the spreadsheet – once for each privileged group of which they are a member.

 

image

Note on the Script: Like all my PowerShell scripts, this PowerShell script does not use the Active Directory module that is built-in to 2008 R2. I know the AD module is cool, and lets you write your code much more succinctly. If you’re really clever you might even be able to replace my script with a one-line command. However, the AD module doesn’t help you if you don’t have the AD web services (built-in to 2008 R2 DCs) in your environment. Since many of my customers still use 2003 DCs, I try to write my code for compatibility. My apologies to those of you who are leveraging the goodness of 2008 R2 DCs. For those of you who don’t, contact me and we (PFE and Microsoft Premier Support) can get you on the upgrade fast track.
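Since the original download doesn't come through in this feed capture, here's a stripped-down sketch of the same idea using [ADSISearcher], so no AD module or AD web services are required. It only handles direct members and skips the nested-group expansion and de-duplication that the full script performs:

# Simplified sketch: direct members of a few privileged groups and their password ages
$groups = 'Domain Admins','Enterprise Admins','Administrators'
foreach ($g in $groups) {
    $groupResult = ([ADSISearcher]"(&(objectCategory=group)(cn=$g))").FindOne()
    if ($groupResult -eq $null) { continue }
    Write-Host "Group: $g"
    foreach ($memberDn in $groupResult.Properties['member']) {
        # Base-scope search against each member so large-integer attributes come back as Int64
        $ds = New-Object System.DirectoryServices.DirectorySearcher([ADSI]("LDAP://" + $memberDn), '(objectClass=*)')
        $ds.SearchScope = 'Base'
        $r = $ds.FindOne()
        $sam = if ($r.Properties['samaccountname'].Count -gt 0) { $r.Properties['samaccountname'][0] } else { $memberDn }
        if ($r.Properties['pwdlastset'].Count -gt 0 -and [int64]$r.Properties['pwdlastset'][0] -gt 0) {
            $ageDays = ((Get-Date) - [DateTime]::FromFileTime([int64]$r.Properties['pwdlastset'][0])).Days
            Write-Host ("  {0}  password age: {1} days" -f $sam, $ageDays)
        }
        else {
            Write-Host ("  {0}  (no pwdLastSet value - likely a nested group or must-change-password account)" -f $sam)
        }
    }
}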

Step 2: Use the Information to Drive Action

So now what? If you’re concerned or disappointed by the number of users in privileged groups, drive towards designing a delegation model that will help you reduce the membership of users in the built-in privileged groups. If you’re concerned about password ages, you might want to change the passwords on privileged accounts.

What if the privileged accounts are service accounts? While granting a service account the privilege of a non-expiring password is perfectly fine, that does not absolve you from ever changing the password. Aren’t you the least bit concerned that every administrator who’s worked with your Active Directory in the past decade (whether they still work with you or not) knows the password to a privileged account? While you may not be able to prevent a service account from entering a privileged group, you should at least require the owner of that service to change the password periodically. If they can’t, or won’t, you probably shouldn’t be granting them administrative privileges.

Another tool to keep in your belt is managed service accounts. These were introduced in Windows Server 2008 R2, and should mature even further in Windows Server 2012. One of the features of managed service accounts is automatic password changes that require no administrative intervention and no service outage.
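Just to make that workflow concrete, here is a minimal sketch using the 2008 R2 Active Directory module cmdlets; the account and server names are made up for illustration:

# On a machine with the AD module: create the managed service account and tie it to the server that will use it
Import-Module ActiveDirectory
New-ADServiceAccount -Name 'svcAppPool'
Add-ADComputerServiceAccount -Computer 'APP01' -ServiceAccount 'svcAppPool'

# Then on APP01 itself: install the account locally so services can run as CONTOSO\svcAppPool$
Install-ADServiceAccount -Identity 'svcAppPool'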

In all honesty, I hope being able to access/generate this information will give the issue the necessary exposure, and drive you to action. If nothing else, you’ve got another PowerShell script that you can add to your collection, or potentially scavenge some code for your own purposes.

Until next time….

Doug Symalla


MCM: Core Active Directory Internals

$
0
0

Disclaimer: For brevity and to get some key points across, quite a bit of detail about Active Directory, the underlying database, and replication has been purposely omitted from this blog.

Part 1 - MCM - So You Want to be a Master, eh?

Now, there is no way to cover every possible detail from every day of the MCM. Consequently, my plan is to cover the concepts and topics that are most important. Before jumping into topics, I want to set the scene for you.

I flew up to Seattle on Super Bowl Sunday back in February 2012 to our Redmond, WA headquarters. When I showed up at 9 AM, I was greeted by a classroom full of students. These students had flown in from various parts of the world to take this class. Some of them were from Microsoft, including PFE and MCS, while others were external. It was immediately evident that these students were seasoned professionals, with anywhere from 7-20 years of experience in IT and experience with Active Directory since the very beginning. One thing that I did like about being amongst professionals of this level is that there were very few technical pissing matches, because everyone knew someone in the classroom was probably much smarter than them.

Secondly, I want to stress the atmosphere of the classroom and materials. When the instructors were presenting the materials, it was pretty much expected that you already partially knew what they were talking about. The slides are pretty minimal on details. This class delivers the goods by:  [Filling in details about AD through presentations] + [Classroom discussion] + [Labs] + [Self-Study] + [Group Study]. This is one of the reasons that the MCM is such a great experience: so much of it involves the class working together as a whole or within smaller groups. By going through the class, you begin to forge good relationships and bonds with people in the class. It’s as much an exercise in professional networking as it is in learning.

Also, I cannot stress this point enough. If you take away one thing from this blog, let it be this: This class is not so you memorize every little detail about Active Directory, or what we call “Geek Trivia”. For example, can you recite from memory, the schema attribute value that enables containerized indexing? The exams or labs will never test you on this sort of thing but you will be expected to know what containerized indexing is, where to set it, and then through your own research, you can figure out what value needs to be set.

So, let’s begin. First off, I want to acknowledge Chris Davis, a PFE from Microsoft. His detailed notes helped ensure that I didn’t miss any important details. The first day was Core Active Directory internals. Some of the things we covered were the following:

  1. What really is Active Directory?
  2. What’s inside the Active Directory database?
  3. Let’s see some AD internals.
  4. What really is a Global Catalog server?
  5. What are linked values?
  6. What are Phantom Objects?
  7. What really is the Infrastructure Master FSMO role?

 

What really is Active Directory?

Had you asked me what Active Directory was before I went to the Masters class, I probably would have just answered, “An LDAP-enabled database with many dependent LDAP-enabled applications and services sitting on top of it including Kerberos, Authentication, DNS, etc.” Now, having gone through the Master class, my answer would change to “A distributed Jet/ESE database that’s exposed through LDAP by the Directory System Agent (DSA) with many dependent LDAP-enabled applications and services sitting on top of it including Kerberos, Authentication, DNS, etc.”

Now that I’ve explained what AD is at its lowest level, a Jet database, why in the world would Microsoft choose Jet over, say, a SQL database? SQL is so very well known, easy to access and manipulate; it almost sounds like a match made in heaven. Jet was chosen because it’s a ridiculously simple and fast database. If Active Directory was going to be the center of many enterprises, it had to be fast, and Jet delivers on that promise in spades.

 

I like to describe the Directory System Agent (DSA) as the man behind the curtain, the bouncer, and the translator. It’s the component that talks to the database but also enables LDAP. Sorry to break it to you, but at the database level, distinguished names like ‘CN=users,DC=Contoso,DC=local’ don’t exist. It’s the DSA that creates this LDAP path based on the data in the underlying Jet database; this will make more sense in the next section. It also enforces data integrity, such as which data types are allowed for certain attributes. It really is the magic that creates this awesome LDAP database we call Active Directory. Jet makes it fast; the DSA makes it LDAP.

Now, within the ntds.dit file, there are actually many tables of data. The tables that are of most interest to us are the data table, which contains all the users, groups, and OUs; the link table, which contains any linked attributes (for example, the members of a group); and lastly the SD table, which contains the security descriptors, or permissions, that are assigned throughout Active Directory.

Structure of NTDS.dit

 

Let’s first take a look at the data table. One easy way to do this without fancy third-party tools is to run LDP.exe and leverage an operational attribute called ‘DumpDatabase’. Do note that this forest is called contoso.local with a child domain named child.contoso.local.

Start Ldp.exe on the domain controller.

  1. Connect locally, and then bind as a Domain Admin.
  2. Click Modify on the Browse menu.
  3. Edit for Attribute: dumpdatabase.
  4. Edit for Values: name ncname objectclass objectguid instancetype. You must leave one space between the attributes.
  5. Click Enter. The Entry List box contains the following entry: [Add]dumpdatabase:name ncname objectclass objectguid instancetype
  6. Click the Extended and Run options.
  7. The %systemroot%\NTDS\Ntds.dmp file is created, or you receive an error message in Ldp.exe that you must investigate.

Source: http://support.microsoft.com/kb/315098

 

Data Table

The file created, ntds.dmp, is a text file that can be opened in Notepad, although the file size will depend on how big your Active Directory database is, and we all know that Notepad doesn’t like huge files. :) Nonetheless, once you open it in Notepad, what you’re looking at is the data table from Active Directory, and it should look something like this:

Disclaimer: I excluded some columns from this picture that wouldn’t fit and weren’t relevant to this blog.

Here is a key for some of the above terms:

DNT: Distinguished Name Tag. Essentially a primary key that identifies each row within the database.

PDNT: Parent Distinguished Name Tag. Indicates which object in the database is the parent of this object. References another object’s DNT.

NCDNT: Naming Context Distinguished Name Tag. Indicates which “partition” this object belongs to. References the root of a partition’s DNT.

The first thing you’ll notice is that all the partitions in Active Directory are represented in this one data table. This is why we call them logical partitions. So, how does Active Directory keep track of the different partitions and which objects belong to which partitions? This is where the DNT, PDNT, and NCDNT values you see above come into play. The PDNT value tells each object what its parent object is, and the NCDNT value tells the object which partition it belongs to.

In the above diagram, you’ll notice that the DNT is just a unique identifier, where each row has a different value. The PDNT on each object tells us which object within the data table is its parent object. Additionally, you’ll notice the NCDNT on the Dave user account tells us that he belongs to the contoso.local domain partition. You’ll notice that the Users container also has an NCDNT of 1788. This just indicates that the Users container also belongs to the contoso.local domain partition. NCDNT tells us which partition each object belongs to.

The DSA then uses this information to map out the hierarchy of all objects and their partitions and delivers them in LDAP syntax. When I realized that almost all data and partitions in Active Directory are in this one data table and just organized by these hierarchal numbers, it forever changed my understanding of Active Directory. You’ll fully understand what I mean in a little bit.
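Since the original screenshots don't survive in this capture, here is a purely illustrative sketch of how those three columns relate; the Users container DNT is made up, while 1788 and 3830 are the values referenced in this post:

DNT    PDNT   NCDNT   Name
1788   ...    ...     contoso.local   (root of the domain partition)
3582   1788   1788    CN=Users        (hypothetical DNT, parented by the domain root)
3830   3582   1788    CN=Dave         (parented by CN=Users, belongs to the contoso.local partition)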

Now, let’s also take a look at a GC at this low level. The official definition of a GC is that it contains a partial attribute set of every object in the Active Directory forest. While that is true, once again, all of this is stored in the one data table in Active Directory and organized by DNT’s, PDNT’s and NCDNT’s:

The diagram above is a dump from a forest-root GC. Once again, you’ll notice the PDNT references the parent object. The NCDNT references what partition this object belongs to. And the PDNT on the child object, which is the root of the child domain, points to the DNT of contoso.local. We know this is a GC because these objects here at the bottom are from the child domain, which only a GC would have.

Key Takeaway: Active Directory does not have different tables to store the different partitions, including the GC partition. Everything is stored in the one data table, which is logically and hierarchically organized.

Now that I knew and understood Active Directory in this way, my mind started to open up and understand things that I couldn’t fully comprehend before.

 

Link Table & Linked Values

Linked values are a way of telling Active Directory that two attributes are related to one another. For example, on groups, we have an attribute called member that contains all the users that belong to that group. On each user account, we have an attribute called memberof that will show you all the groups that that user belongs to. Consequently, the member and memberof attributes are linked values that tell Active Directory they are related. Earlier, I mentioned the link table in the Active Directory database. It contains all the information about these linked values and in this case, who’s a member of these groups. Do remember that the link table may also contain information about other linked values as well, like the attributes ‘DirectReports’ and ‘ManagedBy’. Another tidbit about this that really isn't relevant to this blog is that the link table is new in Windows 2003. It’s also what enables something called Linked Value Replication (LVR) in Active Directory 2003 native mode.  Here is an example of the Dave user account belonging to the administrators group in contoso.local:

So when you go to the properties of the administrators group to see who is a member, the database takes the administrators group’s DNT of 3566, searches the link table for all matching link_DNT values, and then returns the backlink_DNT values, which correspond to the users or groups within the DB that are members of that group.

In reverse, when I go to the properties of the Dave account to see what groups he belongs to, the database takes Dave’s DNT of 3830, searches the link table for all matching backlink_DNT values, and then returns the link_DNT values, which correspond to the groups within the DB that Dave belongs to.

Key Takeaway: Anything that is linked, like member and memberof attributes, must reference a physical object in the database. This is for purposes of referential integrity and it must have a corresponding DNT value, which means it will have its own row in the database. Contrast this with any generic multi-valued attribute within AD. If it isn’t linked, you can go ahead and add any value you want to it.

With that being said, let’s say that I log onto the child domain (child.contoso.local) and want to make the user account Dave, from the forest root, an administrator in the child domain. Now remember that this child DC is NOT a GC, so it wouldn’t have a copy of the Dave user account in its data table. Also, remember that when you add someone to a group, they MUST physically exist in the local data table in Active Directory.

Does that mean that I have to make this child DC a GC so Dave would exist in the data table and we could then add him to the administrators group?

 

Phantom Objects

No, what actually happens under the hood is that the DC creates what’s called a phantom object in the data table that references the Dave account in the forest root. This phantom object is a real object with its own DNT, and it exists in the data table of the child domain on all non-GCs. Now, Dave can properly be added to the administrators group. Let’s take a look at this under the hood from the child DC that is not a GC:

 

 

The first clue that this is a phantom object is that OBJ=False. If we compare this phantom object to the actual user account in the forest-root domain, it looks like this:

Since this child DC isn’t a GC and didn’t have a copy of the forest-root Dave account, but still had to add Dave to the administrators group, it has to create a representation of Dave in its local database, because the rules of linking state that the object must exist in the local data table and have a valid DNT.

Key Takeaway: Remember that GCs don’t have, or need, phantom objects because they have a row in their data table for every object in the forest. Non-GCs only have the objects from their local domain, so they have to create phantom objects to represent accounts from other domains.

Now, let’s take a look at the link table on this DC in the child domain after adding the Dave account from the forest root to the administrators group in the child domain:

 

Tying It all Together

Now, why does any of this matter? Well, do you remember that recommendation that Microsoft made a long time ago about not putting the Infrastructure Master on a Global Catalog server? Everything I explained above is why. Let's step through it one more time to make it clear. Before we do, let's summarize some absolutes about Active Directory:

    1. Every domain controller is personally responsible for maintaining its own data table and how that data is internally linked. Internally, the DB on each DC may not be identical, but the outcome will be the same.

    2. On each DC, to add a user to a group, that user must physically be present in the local data table either as a user account or a phantom object.

    3. A Global Catalog Server has a partial copy of every object in the forest in its data table. Objects from other domains don't have all their attributes populated but nonetheless are present. Because of this, it doesn't need phantom objects because it has the real objects locally.

    4. A Domain Controller that isn't a GC doesn't have a copy of every object in the forest in its data table. It only contains objects from its own domain. Because of this, it has to create phantom objects to reference the real objects from other domains.

    5. The infrastructure master is responsible for updating or deleting phantom objects if/when the real objects change. For example, does the actual Dave account in the forest root still exist? Has he been moved or renamed? This process runs every 2 days, asks these questions, and then either updates or deletes the phantom objects accordingly.

 

One day, the forest-root Dave account gets deleted. The infrastructure Master role is running in the child domain on a global catalog server. Let's go through it step-by-step:

Disclaimer: AD replication occurs at a much higher level than this and does not occur based on DNT values. I am just doing it this way to put it into the context of this blog. Plus, DNT's are local to each DC.

 

1.) The Dave account in the forest-root domain contoso.local gets deleted.

2.) The DC in contoso.local replicates that deletion to the child domain GC by telling it to delete DNT 3830.

 

3.) The non-GC's in the child domain don't have the Dave account with a DNT of 3830. Instead, they have a phantom object that represents Dave with a DNT of 5585. Consequently, the Dave phantom object does not get deleted.

 

4.) This is where the Infrastructure Master comes in. There is one IM per domain. The IM process in this child domain runs every two days and says, "Let me review my phantom objects to make sure that the actual user accounts still exist." Under normal conditions, it would determine that the actual Dave account got deleted, delete the Dave phantom object from itself, and then replicate that to the other DCs in the child domain that aren't GCs. The problem here, though, is that the Infrastructure Master is running on a GC, and we all know by now that GCs don't have any phantom objects. Consequently, the IM determines, "Since I don't have any phantom objects, there’s really nothing for me to do." Therefore, the phantom object for the Dave account remains on all non-GCs in the child domain. If you were to look at the administrators group on any of these non-GCs in the child domain, Dave would still show as present even though the actual user account was deleted from the forest root and that deletion replicated to all global catalog servers in the child domain.

Technically, the best practice should have been "Only put the Infrastructure Master on DC's that have phantom objects" but this would have caused more confusion so Microsoft simplified it and just made it "Don't put the Infrastructure Master on a GC".
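If you want to check your own domains for this condition, comparing the Infrastructure Master owner against the GC list only takes a couple of lines. A quick sketch using the AD module (assumes AD web services are available in your environment):

# Does the Infrastructure Master for the current domain also host the Global Catalog?
Import-Module ActiveDirectory
$im = (Get-ADDomain).InfrastructureMaster
if ((Get-ADDomainController -Identity $im).IsGlobalCatalog) {
    Write-Host "$im holds the Infrastructure Master role AND is a GC - review whether that matters in your forest."
} else {
    Write-Host "$im holds the Infrastructure Master role and is not a GC."
}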

 

Why, Oh Why?

I know you're probably thinking all of this is a convoluted way of adding users from one domain to groups in another domain, right? Well, what are all of the possible options? Let's think about this:

  1. Allow a DC to add a user to a group even though the user account doesn't exist in the local data table. This would break the database and referential integrity. Definitely not a good option.
  2. Don't allow our customers to add users from one domain into groups from another domain. Once again, not a good option.
  3. Recommend that all domain controllers be global catalog servers, which negates the entire phantom object scenario. Wait a minute, we already recommend that!
  4. Create phantom objects on non-GCs in other domains and then allow the Infrastructure Master to keep those phantom objects up to date, which is exactly what we're doing today.

Have you thanked your infrastructure master lately? Perhaps you should. :)

Windows Server and Processor Cores...


Recently someone asked me of my thoughts on how Windows Server handles processor cores.  With newer processors available with more than 2 or 4 cores each, it seemed like a good time to revisit this topic.  If you have a system with multiple processor sockets and a few new processors with 2, 4, 6, 10, or more cores each…what should you expect Windows to do?

With use of multi-core processors becoming more prevalent in not only servers but desktops and maybe even your next cell phone or TV remote…it makes sense to review how Windows Server makes use of processors…since that is where you’re more likely to see higher densities of cores on a single physical processor.

Windows Server 2008/2008 R2 counts processors for licensing purposes by processor socket (physical processor). For example, say the edition of Windows Server you have indicates that it supports up to 4 physical processors. If you have dual-core processors in four available processor sockets, that would provide 8 logical processors for the OS. If the processors also support Simultaneous Multithreading (SMT) (also known to some as Hyper-Threading Technology (HTT), based on Intel's implementation) and the system has this option enabled, then the total logical processor count may be 16. 16 processors as compared to 8…seems like a no-brainer to have twice as many, right?

Don’t confuse processor cores with the extra logical processors (LPs) available with SMT enabled. Some configurations with SMT might present one or more extra LPs per core. Physical cores and LPs from SMT are two different things, and they do not perform the same. Additional processor cores are practically just like additional physical processors without requiring extra sockets for them on the system board.

The way I like to think of SMT is to think of one of those old pizza shops where the pizza maker is tossing the dough in the air in preparation to make a pizza.  You know…the old-fashioned way.  Instead of being able to toss just one pizza at a time…imagine the same person tossing and spinning two of them simultaneously.  Performance of this person compared to two separate people performing the same task may not be equivalent and may be somewhere in-between.  For SMT-compatible processors that provide an extra LP per physical core, SMT allows a processor core to run one additional concurrent thread per exposed SMT LP, while sharing on-chip resources like cache.  If the shared resources on the chip become bottlenecks for threads executing simultaneously on the same processor, then SMT may not contribute to performance and might even limit it.

For purposes of illustration, assume you have a single-core processor that supports SMT and provides a single SMT LP.  In that configuration, both LPs share resources on the chip.  An SMT processor typically will not provide the same performance as two single-threaded processors but may provide better performance than a single processor.  With the expected performance of an SMT LP being somewhere in-between, the performance gains achieved in an SMT configuration will vary by application.  While I truly believe that SMT on today’s hardware is better implemented and performs better than in years past, I don’t factor SMT into sizing a system.  SMT can be a good performance benefit to have on hand if you need it, but I’ve not seen the performance to be that much greater.  I’ve consistently thought of SMT as yielding more benefit for compute-bound work than for I/O-bound work.  You can search the net and find a variety of opinions on this topic.  You may form your own opinion.  There are also some applications that suggest or require disabling SMT because of the impact to the application.  The advice I’ve consistently given has been to size systems according to physical processors and cores.  Then use Performance Monitor to determine if SMT provides additional gain.  And, of course, if an application says don’t use SMT…the vendor may have a reason.
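If you're curious what Windows sees on a given box, WMI exposes both counts, which makes it easy to tell whether SMT is in play. A quick sketch:

# If NumberOfLogicalProcessors is greater than NumberOfCores, SMT is enabled on that socket
Get-WmiObject Win32_Processor |
    Select-Object DeviceID, NumberOfCores, NumberOfLogicalProcessors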

How many cores then will Windows Server allow?

The number of possible LPs prior to Windows Server 2008 R2 was based on the number of bits.  For instance, a 32-bit OS could use 32 LPs; 64-bit could use 64 LPs.   This is confirmed by Mark Russinovich’s presentation on R2’s kernel changes (available on the Microsoft Download Center.)  Windows Server 2008 R2 extends this limit by allowing up to 4 groups of up to 64 processors each.  Doing the math, that translates to a maximum of 256 LPs for Windows Server 2008 R2.    That alone would be enough for me to jump to R2 if I were an administrator using very expensive hardware with lots of processor cores…especially for virtualization.

The Windows Server 2008 R2 kernel establishes processor groups (K-Groups) at boot time; they are not customizable by an administrator after startup.  However, according to KB2506384, there is a way to manually adjust K-Group assignments for the next boot of the OS.   K-Groups may contain one or more NUMA nodes.   Windows attempts to place all processors from a given NUMA node in the same group where possible.  Systems with fewer than 64 LPs will have only a single group.  From a scheduling standpoint, threads are assigned to only one group at a time.  Also, an interrupt may target only the processors of a single group.

What happens when a physical processor has multiple cores or a given core has multiple logical processors? 

The answer to this question depends on whether you're using Windows Server 2008 R2 RTM or a build with the applicable updates that alter the default behavior. Using the RTM version of Windows Server 2008 R2, the kernel attempts to place all cores of a given physical processor in the same group whenever possible.  If the number of cores per chip doesn't divide evenly into 64, then some cores on a physical processor may be split between groups.  For example, if using 12 processors with 6 cores each, the total number of processor cores would be 72.  This would result in one group of 64 processors and a second group of 8.  The eleventh physical processor would have 4 cores in the first group, with the remaining two cores in the second group along with all six cores of processor 12.   For some applications, uneven groups can be problematic.  Additionally, minor hardware differences between seemingly identical systems could result in one with a {64,8} grouping and another with {8,64}.

If using Windows Server 2008 R2 with KB2510206 (or a future service pack containing this update), the kernel will attempt to balance processors amongst groups.  With the preceding example of 72 cores, the resulting groups would each contain 36.   The update provides predictability and balance without requiring the manual K-Group specification described in KB2506384.

If using Windows Server 2008 with more than 64 cores, you would not be able to utilize extra cores above that limit even though they may exist.  Windows Server 2008 R2 can utilize processor groups and allow use of these additional cores up to the maximum of 256.  This isn’t the only reason to consider moving to Windows Server 2008 R2…there certainly are many more. 

How does using Hyper-V affect all of this?

Hyper-V has limits on the number of LPs it supports as a virtualization host.  As a result, if using Hyper-V, the system may be limited in the number of LPs that can be used even though the OS version, service pack, or other updates may support more.   It is important to be aware of these limits when planning or ordering systems to be used for virtualization.  Based on published documents found on TechNet about the upcoming Windows Server 2012 release, indications are that the Hyper-V limits may rise significantly.  For more information, see the link provided in the additional references section, as Windows Server 2012 was not a released product at the time this post was published.

Below is a chart of Windows Server versions and the maximum number of LPs supported.

Windows Server 2008 w/Hyper-V 16 LP
Windows Server 2008 Service Pack 2 w/ Hyper-V 24 LP
Windows Server 2008 R2 w/ Hyper-V 64 LP

Therefore, if you have a computer with Windows Server 2008 R2 installed with updates that normally could support up to 256 LPs, the same system would only utilize the first 64 LPs with Hyper-V enabled.  All remaining LPs would be ignored.   The primary OS runs in the parent partition and does not recognize any LPs above 64 because the hypervisor does not present any additional LPs to the system.  In a configuration like this, you would want to make certain that SMT is disabled so that as many physical cores as possible are used, rather than having many physical cores ignored and unused.  Further, on such a configuration, you may receive a warning from a Best Practices Analyzer scan indicating the number of LPs available exceeds the number supported by Hyper-V.

Additional References

2510206 Performance issues when more than 64 logical processors are used in Windows Server 2008 R2
http://support.microsoft.com/kb/2510206/EN-US

2546706 A Windows Server 2008 R2-based computer that has some NUMA-based processors and more than 256 logical processors runs in SMP mode as a 64-processor system and may experience decreased performance
http://support.microsoft.com/kb/2546706/EN-US

2517752 "0x0000000A" Stop error occurs during the shutdown process on a computer that is running Windows Server 2008 and that has more than 64 processors installed
http://support.microsoft.com/kb/2517752/EN-US

Sysinternals CoreInfo tool can show logical processor to physical processor mapping
http://technet.microsoft.com/en-us/sysinternals/cc835722

Hyper-V: The number of logical processors in use must not exceed the supported maximum
http://technet.microsoft.com/en-us/library/ee941148(v=WS.10).aspx

Requirements and Limits for Virtual Machines and Hyper-V in Windows Server 2008 R2
http://technet.microsoft.com/en-us/library/ee405267(WS.10).aspx

Competitive advantages of Windows Server 2012 Release Candidate

http://download.microsoft.com/download/5/A/0/5A0AAE2E-EB20-4E20-829D-131A768717D2/Competitive%20Advantages%20of%20Windows%20Server%202012%20RC%20Hyper-V%20over%20VMware%20vSphere%205%200%20V1%200.pdf

 

First, Do No Harm


As time passes, the business value derived from a company's IT infrastructure continues to expand. Usually, along with this, the 'interconnectedness' of the infrastructure expands, too.

One of the goals of an IT Pro is similar to a paraphrase of Hippocrates: "First, do no harm."

When we DCPROMO a Domain Controller (DC) out of Active Directory (AD), it is a very easy wizard-driven process. However, without proper forethought and a few precautions, that easy process can possibly have a far reaching and substantial negative impact on production IT Operations. I like to help my customers avoid those situations with good proactive advice and meaningful lessons-learned.

Personally, I love checklists – they are always more complete than my memory, they can be templatized and re-used. Checklists can serve as a paper trail for Change Control audits or histories and they can often help you cover your butt if things go south.

Here are a few items for a DC decommission checklist.

Prior to DCPROMO

Before we yell BANZAI! and kick off a DCPROMO, what should we check?

  • Do we need to modify any DHCP scopes?
    • Are we providing this server's IP address as a DNS server for DHCP clients?
  • Do we need to modify any member servers or other static systems?
    • Are we pointing any systems to this server for DNS, LDAP, etc?
    • SCCM is an awesome tool to inventory these sorts of settings, but if you don't have SCCM, another great way to inventory the IP configurations for your systems is via script.  PowerShell has some handy 'export to CSV' functions built right in (a rough sketch follows this checklist).  A peer PFE and PowerShell wiz is putting the final polish on a post with some script code samples to do just that...stay tuned!
  • Are there startup/logon/other scripts that hard-code this system and need to be modified/edited beforehand?
  • Are there any shares, applications, printers, scheduled tasks or other 'services' on the target system? Check for any additional or unexpected Roles or Features (i.e. WINS – which is far down in the GUI; be sure you scroll!)
    • Those may or may not be affected after you remove AD from a server
    • If the goal is to decommission the server, those services may need to be moved or decommissioned with the target server
  • You'll want to know what you should set for the local Administrator account's password once AD is removed and the server becomes a member server
    • Beware that by default, the username will be "Administrator" but if you've redirected the default COMPUTERS container in AD to another OU, there might be a GPO that renames the local Administrator account
  • You'd be wise to obtain/verify (or reset) the local Directory Services Restore Mode (DSRM) password just in case you need it on this system
    • This is often a weak spot in an AD deployment
      • "Who knows the DSRM password? I've tried the 8 that I thought it might be and none of them work."
      • On 2008 and newer versions, this can be sync'd to a domain account to make it a bit more easily managed.
  • Obtain and verify ILO/DRAC/KVM/VM console or other 'out of band' access to the system in the event the system doesn't reboot properly (sitting at an F1 prompt because someone unplugged the keyboard to a server across the country on a Sunday at 2:00 am is no fun).
  • You'll need the AD Site information for the DC so you can clean up the server 'object' from AD Sites after the DCPROMO
    • That still needs to be manually removed even in Server 2012 (at least in the Release Candidate)
  • Make sure you have your approved and communicated Change Control Request
  • Make sure the Helpdesk is aware
  • I recommend verifying the AD FSMOs can all be reached from the target system
    • From CMD: DCDIAG /TEST:FSMOCHECK <enter>
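As promised above, here is a rough sketch of that inventory idea: pull the DNS client settings from a list of servers and export them to CSV (the input and output file paths are just examples):

# Collect IP and DNS client settings from remote servers and dump to CSV
$computers = Get-Content 'C:\temp\servers.txt'   # one computer name per line
$results = foreach ($c in $computers) {
    Get-WmiObject Win32_NetworkAdapterConfiguration -ComputerName $c -Filter 'IPEnabled = True' |
        Select-Object @{n='Computer';e={$c}}, Description,
                      @{n='IPAddress';e={$_.IPAddress -join ';'}},
                      @{n='DNSServers';e={$_.DNSServerSearchOrder -join ';'}}
}
$results | Export-Csv 'C:\temp\DnsClientInventory.csv' -NoTypeInformation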

Starting DCPROMO

  • Logon to the target Domain Controller
    • Verify you are actually logged on to the proper server.
      • Name resolution can be flaky
      • Names of VMs in the VM console can be wrong
  • Check the NIC settings and verify/set valid DNS server entries OTHER than the server itself
  • When ready, kick off DCPROMO
  • If you are decommissioning the domain, select the checkbox stating that this is the last DC in the domain
    • This will clean up the domain from the AD Forest, if applicable
  • After the system reboots from the DCPROMO, you'll likely need to remove the Role itself via Server Manager
  • Remember, by default, the computer account gets moved into the COMPUTERS container in AD after DCPROMO. If you've redirected the default container to another OU, the computer object will be placed there.
    • You may need to move the computer account object into a more desirable OU to ensure proper policies get applied (if the server isn't going to be decommissioned right away)
  • Remember to go into AD Sites and Services MMC and delete the server 'object'
    • Sometimes AD replication takes time and you'll notice there is still an NTDS Settings object under the server object.
    • If this is the case, allow time for AD Replication to "spread the news" of the recent DCPROMO and check back later to delete the server object.
    • WARNING!! DO NOT delete the NTDS Settings object manually or attempt to delete the server object if there is an NTDS Settings object underneath it
  • If you are decommissioning the server, be sure you delete the computer account from Active Directory to help "keep AD clean."
  • If required – depending on the situation:
    • Delete the AD Site, Site Link and subnet(s) entries from AD
    • Delete any DNS entries
      • A, PTR, CNAME, etc
      • Don't forget to check for any glue or delegation-related records
      • Don't forget to check any NS records where the DC was a secondary DNS server for one or more primary DNS zones hosted elsewhere
      • Don't forget any forwarders/conditional forwarders on other DCs which may have pointed to the target system
    • Delete the record(s) from WINS
    • Delete the "other side" of any external or shortcut trusts if this was a domain decommission in addition to a DC decommission
    • Edit any GPOs that set this domain as a DNS suffix entry
    • Remove the server from any backup, management or monitoring tools
    • Mark the system as appropriate in any asset tracking tools
    • If it was a VM, inform the VM team to delete the VM
    • If it was a physical server, remove it from the rack    
      • Remove the rails, network, power and other cables from the rack
      • Update any documents which track datacenter space/ports/rack allocations
  • When you're all done, inform the proper people that your change control request is complete
  • Sign off on your 'checklist' and file it away

For me, there is always a period of time where I'm still 'waiting' for the phone to ring or the pager to go off, but once you get past that time, usually a few hours during the business day, have a Coke and smile. :)

Hilde

Becoming an Xperf Xpert Part 3: The Case of When Auto “wait for it” Logon is Slow


Hey y’all, Mark here again with an interesting “Real World” XPerf trace from the depths of a WDRAP (Windows Desktop Risk Assessment). If your infrastructure hasn’t had a WDRAP yet, or it’s been a while, work with your Technical Account Manager to see if you can schedule one. I highly recommend it as a great way to expose and diagnose issues that might be impacting 10s, 100s or even 10,000s of users.

It was during a WDRAP that we came across this interesting slow boot/slow logon situation and analyzed it with an XPerf trace. The target system’s overall boot time wasn’t terrible - about 2 minutes and 5 seconds isn’t too bad for a typical corporate environment - but a large chunk of the trace, almost a full minute (60 long seconds), was being taken up by the Winlogon phase. I was hopeful we could improve, or at least identify, what was eating up that looooong 60 second time slice.

clip_image002

To highlight the time range through all the other frames in the trace, we lined everything up within XPerf so it would be easier on the eyes:

· Select an area on the graph

· Right-click and choose “Clone Selection”

We touched on the boot phases in our other XPerf posts - you can check them out here. We know that Group Policy processing and the loading of your profile occur around this time, in addition to other processing.

Notice in the top graph above, we have 3 diamonds that say “GPClient” - what are those actually doing?

· The bottom “GPClient” - that’s the computer policy

· The middle “GPClient” - in our case here, the one lowest in the list, is the user policy

· Finally, the top “GPClient” - is if there are any Group Policy logon scripts

In the screen shot above, we see that these 3 diamonds are finishing quickly, as are most all the other diamonds in this phase.

So where is all this time going?

It is time to dig into the “Generic Events” section of the trace which is located directly under “Boot Phases” to give us a clue of what might be happening.

clip_image004

In Generic Events, right-click the area we have selected, and click “Summary table.”

· This will open up the “Generic Events Summary Table” as shown below

· The area we are interested in here is “Microsoft-Windows-Winlogon”, so expand that

clip_image006

I ordered the columns as follows:

· The first column is the “Provider Name” - we are interested in Winlogon

· Second column is “Time”

o We know pretty closely when the issue is occurring.

· Next, we have “Process” and then “Thread ID”

· Finally, “Task Name”

Let’s jump to where our problem seems to be occurring – you can see from the screen shot there is a delay right after 16-17 seconds – where the time jumps to 70 seconds.

clip_image008

If Barney Stinson from “How I Met Your Mother” was doing this WDRAP he’d say, “Log wait for it…..on!” At about 17 seconds, the system is waiting for credentials. At 70 seconds, the system finally receives them and starts the logon process. Case closed. Or is it?

The customer here was using a McAfee Endpoint Encryption product that would kick-in after the user enters his/her logon credentials. It would start decrypting the drive, then pass the credentials over to Windows and log the user in to the Desktop.

As a test, we temporarily disabled this and used the autologon registry keys in Windows to see if we could repro the issue. We did not experience a delay, so we concluded that it had to be something with how these credentials were being handled. It was time to dig even deeper.
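For reference, the autologon test is just the standard Winlogon registry values. A sketch of setting them by hand (test accounts only; DefaultPassword is stored in clear text, so remove these values when you're done):

# Standard Winlogon autologon values; the account, domain and password shown are examples
$winlogon = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon'
Set-ItemProperty -Path $winlogon -Name AutoAdminLogon    -Value '1'
Set-ItemProperty -Path $winlogon -Name DefaultUserName   -Value 'testuser'
Set-ItemProperty -Path $winlogon -Name DefaultDomainName -Value 'CONTOSO'
Set-ItemProperty -Path $winlogon -Name DefaultPassword   -Value 'P@ssw0rd!'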

We zoomed into this time frame on all graphs by right clicking anywhere on the selection and clicking “Zoom to Selection.”

Something that jumped out right away was the DPC CPU Usage Graph (below):

clip_image010

Anything over 15% should be investigated.

I right-clicked the “DPC CPU Usage” graph and clicked “Summary Table” but I didn’t see anything interesting in the results around McAfee.

clip_image012



We needed to look at other CPU areas so we went to “CPU Sampling by CPU.” If you don’t see it, you may have to add it from the fly out menu on the left. Right-click and select “Summary Table.”

clip_image014

It’s time to do some more column sorting. Since we know from before that we are looking for DPC, we move that column all the way to the left, followed by the “Module” column. This module looks suspiciously like a file that is part of McAfee Endpoint Encryption. We need to verify that by using one last trick – the “Image Summary Table.”

If you ever wanted to find out which version a file is, you can use the “Image Summary Table” view. To see this, you’ll need to go to “Process Lifetime,” right-click the area you have selected, and pick “Image Summary Table” as shown below.

clip_image016

If we sort our columns in the following order, we should be able to get some good information about this file.

The order I have is:

· Image Name

· Binary File Version

· Company Name

· Product Version

Let’s find MfeEEAlg.sys in our table and see if it all adds up:

clip_image018

Sure enough it does. I request the highest of fives.

The next steps for the customer were to contact McAfee through their support channel to further drill down into the issue. Our analysis and drill-down was able to provide the customer a very clear direction on where the delay was occurring. This should help to speed up the rest of the troubleshooting process with the other vendor.

I want to send special “thanks” out to fellow PFE Yong Rhee who helped with this. Maybe I’ll get him to post up here one day and really teach us all a thing or two about Xperf.

-Mark Morowczynski

How Many Coffees Can You Drink While Windows 7 Boots?


Hey y’all, Mark here again with a quick post. Did you attend Tech Ed this year but miss a chance to see Xperf stealing the show? Great news, the session is online along with many others!

 

 

This is a great session led by Stephen Rose (@stephenlrose), PFE’s own Matthew Reynolds (who absolutely kills it in the Xperf demo department), and MSIT’s Vadim Arakelov.

 

If you are not a member of the Springboard Series newsletter, you are missing out; sign up here (http://technet.microsoft.com/en-us/windows/springboard-series-insider.aspx).

This is the type of amazing content you miss, along with the fantastic networking opportunities with peers and the chance to meet Microsoft experts and ask them your toughest questions. You can view more sessions here: http://channel9.msdn.com/Events/TechEd/NorthAmerica/2012

So save the dates of June 3-6 for TechEd 2013; now you have a reason to visit Gambit’s home town of New Orleans if you’ve never been. http://northamerica.msteched.com/#fbid=fBjb8v8NBNZ

 

Finally, I’m on twitter too! Ask me stuff! (@markmorow)

Update: If you want to download the video for offline viewing, the link is below. I should have put that up there to begin with. Sorry guys.

http://channel9.msdn.com/Events/TechEd/NorthAmerica/2012/WCL305

 

-Mark Morowczynski
