
Virtual Machine Management Hangs on Windows Server 2012 R2 Hyper-V Host


Hi, my name is Christian Sträßner from the Global Escalation Services team based in Munich, Germany. Today we will look at a hang scenario that involves user and kernel dump analysis.  Also, we will demonstrate how to analyze a hang from both user and kernel modes without forcing a crash dump of the machine.

 

With the increasing popularity of virtualization using Hyper-V, it is becoming more and more unrealistic to get a full memory dump of a Hyper-V host because of the large amount of RAM installed in these machines.  Fortunately, there are sometimes alternative tools available to debug these servers.

 

Customer Scenario for this particular investigation:

The customer was unable to execute any kind of management activity against virtual machines running on a Windows Server 2012 R2 Hyper-V host.  Virtual machines that were already running continued to operate fine.

 

We discovered that the Get-VM PowerShell cmdlet hung forever when executed on the host.  Likewise, any WMI operation that performed management activities against the virtual machines hung forever or produced errors.

 

Tools used to gather data on the affected host:

Task Manager

Livekd.exe

 

Since management operations against VMs involve the “Hyper-V Virtual Machine Management” service (VMMS.EXE), we first take a look at that service.

 

So we first create a user mode memory dump of that process.  To do so, open Task Manager, switch to the Details tab, right click the vmms.exe process, and select the “Create dump file” option.

[Screenshot: Task Manager Details tab with the right-click “Create dump file” option]

 

Tip: you can copy and paste the path from the following message box:

[Screenshot: Task Manager message box showing the path of the created dump file]

 

Save the dump and copy it to a workstation that has WinDbg installed.
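As an alternative to Task Manager, the Sysinternals ProcDump tool can capture an equivalent full user mode dump from the command line.  A minimal sketch, assuming procdump.exe has been copied to the host and c:\temp exists:

procdump.exe -accepteula -ma vmms.exe c:\temp\vmms.dmp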

 

Analyzing the Usermode Dump:

After opening the dump, set the symbol path if it is not already configured.  A minimal sketch, assuming you want to cache symbols from the Microsoft public symbol server under c:\symbols (any local path will do):
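0:020> .symfix c:\symbols
0:020> .reload

With symbols resolving, check the number of threads in the process: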

0:020> ~

#  0  Id: c08.c0c Suspend: 0 Teb: 00007ff7`765ae000 Unfrozen

   1  Id: c08.c34 Suspend: 0 Teb: 00007ff7`765a9000 Unfrozen

<snip>

181  Id: c08.36a0 Suspend: 0 Teb: 00007ff7`76360000 Unfrozen

182  Id: c08.4c10 Suspend: 0 Teb: 00007ff7`7633a000 Unfrozen

 

The process has 183 threads.  This is high, but we cannot judge whether it is abnormal purely from that number.  Looking at another Hyper-V server with a similar load might provide an idea of whether 183 threads is a lot.

 

Next we take a look at what these threads are doing using the command ~*kcn.  The output is very long, so save it to a .txt file either via copy and paste or using WinDbg’s .logopen functionality.
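For example, to capture the output to a log file (the file name is arbitrary):

0:020> .logopen c:\temp\vmms-threads.txt
0:020> ~*kcn
0:020> .logclose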

 

Looking at the text file in your favorite editor, we quickly see that there are 104 threads that have the function RtlpWaitOnCriticalSection on the stack.

 

Next we use the !cs command to analyze the critical sections in that process.  The output shows 1 critical section in state LOCKED and the owning thread is 43d8.

DebugInfo          = 0x00000092cfd20960

Critical section   = 0x00000092cf31fb38 (+0x92CF31FB38)

LOCKED

LockCount          = 0x68

WaiterWoken        = No

OwningThread       = 0x00000000000043d8

RecursionCount     = 0x1

LockSemaphore      = 0x1B48

SpinCount          = 0x00000000020007d0
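Tip: if a process has a large number of critical sections, the switches documented in the debugger help for !cs can narrow things down; -l should limit the output to locked critical sections and -o should include the owner’s stack:

0:020> !cs -l -o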

 

To verify the output above is a valid critical section, you can dump the DebugInfo member:

 

0:000> dt _RTL_CRITICAL_SECTION_DEBUG 0x00000092cfd20960

vmms!_RTL_CRITICAL_SECTION_DEBUG

  +0x000 Type             : 0

  +0x002 CreatorBackTraceIndex : 0

  +0x008 CriticalSection  : 0x00000092`cf31fb38 _RTL_CRITICAL_SECTION

  +0x010 ProcessLocksList : _LIST_ENTRY [ 0x00000092`d2fc01a0 - 0x00000092`d31a0e20 ]

  +0x020 EntryCount       : 0

  +0x024 ContentionCount  : 0xa27

  +0x028 Flags            : 0

  +0x02c CreatorBackTraceIndexHigh : 0

  +0x02e SpareUSHORT      : 0

 

The CriticalSection member should point back to the Critical Section structure we looked at earlier.
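For completeness, you can also dump the critical section structure itself using the address from the !cs output above; a sketch, using the type information from ntdll:

0:020> dt ntdll!_RTL_CRITICAL_SECTION 0x00000092cf31fb38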

 

From the output of the ~ command run earlier we find owning thread 43d8 is thread 20 in the dump:

. 20  Id: c08.43d8 Suspend: 0 Teb: 00007ff7`76460000 Unfrozen
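Tip: you can also switch to the owning thread directly by its OS thread ID, without looking up the debugger thread index, using the ~~[ ] syntax:

0:020> ~~[43d8]s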

 

Switch to thread 20 and display the stack of the critical section owner:

0:020> ~20s

ntdll!NtDeviceIoControlFile+0xa:

00007fff`97c3124a c3              ret

0:020> kcn

# Call Site

00 ntdll!NtDeviceIoControlFile

01 nsi!NsiAllocateAndGetTable

02 IPHLPAPI!InternalGetIfTable2Ex

03 vmms!VmNetworkInterfaceTable::Initialize

04 vmms!NvspWmiINetCfgHelper::EnumerateNetworkAdapters

05 vmms!NetConfigHelper::GetNetworkBridgeAdapter

06 vmms!NvspWmiUtility::GetBridgedAdapterId

07 vmms!VmEthernetSwitchPort::GetLogicalAdapterInternal

08 vmms!VmEthernetSwitchPort::GetConnectedLogicalAdapter

09 vmms!ExternalConnectionStatusRetriever::GetExternalSwitchAdapter

0a vmms!ExternalConnectionStatusRetriever::GetExternalSwitchAdapterStatusInternal

0b vmms!ExternalConnectionStatusRetriever::GetSwitchStatus

0c vmms!VmEthernetConnectionProvider::GetSwitchConnectionStatus

0d vmms!VmmsRealizedVirtualMachine::VerifyExternalSwitches

0e vmms!VmmsClusterManager::IsVmHealthy

0f vmms!ComVmmsClusterManager::IsVmHealthy

10 rpcrt4!Invoke

11 rpcrt4!Ndr64StubWorker

12 rpcrt4!NdrStubCall3

<snip>

 

Reading the stack from bottom to top, we can make a fairly educated guess at what’s going on.  A function called IsVmHealthy is invoked through an RPC call (frames 10->0f).

 

Note:

Typically, seeing these RPC calls at the bottom of a stack means that this thread is currently serving an incoming RPC call.

 

We then execute multiple functions and end up enumerating the network adapters on the machine (frame 4).  In frame 0 we call ntdll!NtDeviceIoControlFile and transition to kernel mode.

 

Here our journey in user mode ends. Since we only have a user mode dump from the vmms process, we cannot tell what’s going on in the kernel part of that thread.

 

The Kernel Side:

Ideally the next step would be to create a full memory dump of the machine to inspect both user mode and kernel in one piece of data. As mentioned before, this is sometimes not an option on Hyper-V host machines due to the amount of physical memory installed in these machines.

 

This leaves us with a kernel dump.  Traditionally, creating a kernel dump requires inducing a crash (aka “bluescreen”) of the machine, which would take down the host and all running VMs.  We might have to wait for a maintenance window to take such an action.

 

Live migrating the running VMs to a different host will not work because the VMMS process is hung.

 

So we move to LiveKD.exe, a Sysinternals tool that has been around for a long time.  In earlier versions of LiveKD the consistency of the captured memory dumps was not good.  Windows Server 2012 R2 and Windows 8.1 enable the new LiveKD -ML functionality: with the -ML switch, LiveKD asks the operating system to create a mirror dump without crashing the machine.  This captures the kernel memory in a way that keeps the dump content consistent (think of it as a snapshot).

 

Copy Livekd.exe to c:\temp.  Create an empty text file called KD.exe in the same directory.  This works around a limitation in LiveKD.exe which always checks for the existence of kd.exe even when it’s not required for the -ML option.
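One way to create that empty file from a command prompt:

cd /d c:\temp
type nul > kd.exe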

 

Command to create the dump:

livekd.exe -ml -o c:\temp\kerneldump.dmp

 

Once this is finished, copy the dump file to your workstation and open it in WinDbg.

 

From the user mode dump we know that our process is c08 and our thread is 43d8:

20  Id: c08.43d8 Suspend: 0 Teb: 00007ff7`76460000 Unfrozen

 

So let’s search for our VMMS.exe process:

0: kd> !process 0 0 vmms.exe

PROCESS ffffe0005c21c080

    SessionId: 0  Cid: 0c08   Peb: 7ff7765ad000  ParentCid: 0554

    DirBase: 3705e000  ObjectTable: ffffc000717c3040  HandleCount: <Data Not Accessible>

    Image: vmms.exe

 

Switch to the Address Space of that process:

0: kd> .process /p /r ffffe0005c21c080

Implicit process is now ffffe000`5c21c080

Loading User Symbols

PEB is paged out (Peb.Ldr = 00007ff7`765ad018).  Type ".hh dbgerr001" for details

 

The error is expected since we only have the kernel address space available.

 

Run the following command to get the threads inside that process:

!process 0 2 vmms.exe
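Alternatively, since we already know the process address from the earlier !process 0 0 output, you could scope the listing to just that process:

0: kd> !process ffffe0005c21c080 2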

 

In the output, search for our thread’s TID 43d8

THREAD ffffe00069cdb880  Cid 0c08.43d8 Teb: 00007ff776460000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable

            ffffd001b8931ed0  NotificationEvent

 

Switch to that thread and display the stack:

0: kd> .thread ffffe00069cdb880 

Implicit thread is now ffffe000`69cdb880

0: kd> kcn

  *** Stack trace for last set context - .thread/.cxr resets it

# Call Site

00 nt!KiSwapContext

01 nt!KiSwapThread

02 nt!KiCommitThreadWait

03 nt!KeWaitForSingleObject

04 NDIS!ndisQuerySetMiniportEx

05 NDIS!ndisQuerySetMiniport

06 NDIS!ndisIfGetMiniportStatistics

07 NDIS!ndisIfQueryFilterObject

08 NDIS!ndisIfQueryObject

09 NDIS!ndisNsiGetInterfaceRodInformation

0a NDIS!ndisNsiGetAllInterfaceInformation

0b NETIO!NsiEnumerateObjectsAllParametersEx

0c nsiproxy!NsippEnumerateObjectsAllParameters

0d nsiproxy!NsippDispatch

0e nt!IopSynchronousServiceTail

0f nt!IopXxxControlFile

10 nt!NtDeviceIoControlFile

11 nt!KiSystemServiceCopyEnd

 

0: kd> !thread ffffe00069cdb880

THREAD ffffe00069cdb880  Cid 0c08.43d8  Teb: 00007ff776460000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable

   IRP List:

    ffffe801316e7ae0: (0006,0118) Flags: 00060000  Mdl: 00000000

Not impersonating

DeviceMap                 ffffc0006ea0c4e0

Owning Process            ffffe0005c21c080       Image:         vmms.exe

Attached Process          N/A            Image:         N/A

Wait Start TickCount      3667296        Ticks: 14135 (0:00:03:40.859)

Context Switch Count      3149           IdealProcessor: 14            

UserTime                  00:00:00.390

 

So this is the kernel mode stack of our thread 43d8.

 

To recap:  Thread ffffe00069cdb880 is the owner of the locked critical section in vmms.exe.
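As a side note, the !thread output above also lists a pending IRP for this thread.  If you wanted to look at the I/O request itself, you could inspect it with the !irp extension, using the address from the IRP List:

0: kd> !irp ffffe801316e7ae0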

 

Reading the kernel stack from bottom to top, we see the call to nt!NtDeviceIoControlFile which calls through the network stack components nsiproxy.sys and netio.sys before it enters NDIS.sys.

 

The last function called in NDIS is NDIS!ndisQuerySetMiniportEx.  After calling ndisQuerySetMiniportEx the thread has been sitting in a wait for 0:00:03:40.859 (the Ticks value from the !thread output above).  From the name of the function we can again make a good guess at what’s going on: this thread is querying or setting something on an NDIS miniport.

 

The remaining work is simple because the NDIS team ships an excellent debugger extension to investigate the state of NICs, miniports, filters, etc.  Type !ndiskd.help to get the available help.

 

Since the function ndisQuerySetMiniportEx is executing a request against some miniport, we investigate the miniports next:

0: kd> !ndiskd.miniports

    MiniDriver         Miniport            Name                                

    ffffe8012a67eba0   ffffe8012ac921a0    Microsoft ISATAP Adapter #5

    ffffe8012a67eba0   ffffe8012aa371a0    Microsoft 6to4 Adapter

    ffffe8012a67eba0   ffffe8012a97b1a0    Microsoft ISATAP Adapter #3

    ffffe8012a67eba0   ffffe8012a9521a0    Microsoft ISATAP Adapter #4

    ffffe8012a67eba0   ffffe8012a6d21a0    Microsoft ISATAP Adapter #9

    ffffe8012a67eba0   ffffe8012a6d01a0    Microsoft ISATAP Adapter #2

    ffffe0005169b6c0   ffffe000595311a0    Microsoft Network Adapter Multiplexor Driver

    ffffe0005169b6c0   ffffe0005876c1a0    Microsoft Network Adapter Multiplexor Driver #2

    ffffe000516be020   ffffe0005161d1a0    Microsoft Failover Cluster Virtual Adapter

    ffffe000516d0ba0   ffffe0005161f1a0    WAN Miniport (PPPOE)

    ffffe0004f3e2390   ffffe000516211a0    Hyper-V Virtual Ethernet Adapter #4

    ffffe0004f3e2390   ffffe000516231a0    Hyper-V Virtual Ethernet Adapter #3

    ffffe0004f3e2390   ffffe000516251a0    Microsoft Hyper-V Network Switch Default Miniport

    ffffe000516dbba0   ffffe000516271a0    WAN Miniport (L2TP)

    ffffe0005169b6c0   ffffe000516291a0    Microsoft Network Adapter Multiplexor Default Miniport

    ffffe0004e98fba0   ffffe0005162b1a0    WAN Miniport (Network Monitor)

    ffffe000516a55e0   ffffe0005162d1a0    WAN Miniport (PPTP)

    ffffe0004f5b9ba0   ffffe000516391a0    FooBar Network Adapter #7

    ffffe0004f5b9ba0   ffffe0005163b1a0    FooBar Network Adapter #4

    ffffe0004f5b9ba0   ffffe0005161b1a0    FooBar Network Adapter #8

    ffffe0004f5b9ba0   ffffe0004f5f51a0    FooBar Network Adapter #3

    ffffe0004f5b9ba0   ffffe0004f5f71a0    FooBar Network Adapter #2

    ffffe0004f5b9ba0   ffffe0004f5f91a0    FooBar Network Adapter

    ffffe0004e98fba0   ffffe0004f5fb1a0    WAN Miniport (IPv6)

    ffffe0004f2e9ba0   ffffe0004f5fd1a0    WAN Miniport (IKEv2)

    ffffe0004f5883a0   ffffe0004f55b1a0    WAN Miniport (SSTP)

    ffffe0004f58f580   ffffe0004f55d1a0    Hyper-V Virtual Switch Extension Adapter #2

    ffffe0004f58f580   ffffe0004f55f1a0    Hyper-V Virtual Switch Extension Adapter

    ffffe0004e98fba0   ffffe0004e95c1a0    WAN Miniport (IP)

 

Let’s work “bottom to top” through the network stack, beginning with the physical NICs, to check their health state.  Inspect each miniport using !ndiskd.miniport ffffe000516391a0.

 

Since ndiskd uses DML (Debugger Markup Language) links, you can click on each address in the Miniport column until you find a miniport that shows something abnormal (different from the other miniports):

0: kd> !ndiskd.miniport ffffe0004f5f71a0

 

 

MINIPORT

 

     FooBar Network Adapter #2

 

    Ndis handle        ffffe0004f5f71a0

    Ndis API version   v6.40

    Adapter context    ffffe00051a00000

    Miniport driver    ffffe0004f5b9ba0 - FooBarDriver v1.0

    Network interface  ffffe0004e7c0a20

 

    Media type         802.3

    Device instance    PCI\VEN_1234&DEV_5678&SUBSYS_11112222&REV_01\9877ffe0004f5883a0  

    Device object      ffffe0004f5f7050    More information

    MAC address        12-34-56-78-9a-bc

 

 

STATE

 

    Miniport           Running

    Device PnP         Started             Show state history

    Datapath           DIVERTED_BECAUSE_MINIPORT_RESETTING

    NBL status         NDIS_STATUS_RESET_IN_PROGRESS

    Interface          Up

    Media              Connected

    Power              D0

    References         0n173               Show detail

    User handles       2

    Automatic resets   1

    Resets requested   0

    Pending OID        OID_RECEIVE_FILTER_SET_FILTER

    Flags              BUS_MASTER, 64BIT_DMA, SG_DMA, DEFAULT_PORT_ACTIVATED,

                       RESET_IN_PROGRESS, RESTORING_FILTERS,

                       SUPPORTS_MEDIA_SENSE, DOES_NOT_DO_LOOPBACK,

                       MEDIA_CONNECTED

    PnP flags          PM_SUPPORTED, DEVICE_POWER_ENABLED, RECEIVED_START,

                       HARDWARE_DEVICE

    More flags         PROCESSING_REQUEST, REQUEST_TIMEOUT

 

The output above indicates this miniport is in the middle of a reset.  It is also currently processing an OID.

 

Clicking on the DML link:

    Pending OID        OID_RECEIVE_FILTER_SET_FILTER

 

Gives us:

0: kd> !ndiskd.oid ffffe8012c504090

 

 

OID REQUEST

 

    OID_RECEIVE_FILTER_SET_FILTER

 

    Request type:      OID_METHOD

    Completed:         Not completed

    Cloned from:       OID_RECEIVE_FILTER_SET_FILTER

 

 

 

DATA PAYLOAD

<snip>

 

Setting a filter via the OID above is a relatively common operation and should complete very quickly.  It is very unusual to see such an operation “in progress” in a memory dump collected at a random point in time.

 

Note:

You can read more about OIDs, and specifically about OID_RECEIVE_FILTER_SET_FILTER, here:

https://msdn.microsoft.com/en-us/library/windows/hardware/ff569795(v=vs.85).aspx

 

So here we have reached a point where it is worthwhile to approach the vendor of the NIC or the OEM of the server.  To be more confident, we could collect another kernel dump 5 minutes later and check whether that miniport is still in the same state, but just out of interest, let’s dig around a bit more.

 

FooDRV.sys is the driver serving this NIC.  Let’s look at the handlers it registered with NDIS:

!ndiskd.minidriver ffffe0004f5b9ba0 -handlers

 

HANDLERS

 

    Miniport Handler                       Function pointer   Symbol (if available)

*** ERROR: Module load completed but symbols could not be loaded for FooDRV.sys

    InitializeHandlerEx                    fffff80032712fc0   FooDRV+3cfc0

    SetOptionsHandler                      fffff8003274bb10   FooDRV+75b10

<snip>

 

 

Run !stacks 2 FooDRV to search for all threads in the dump that currently have this driver on the stack. We have a hit:

4.001114  ffffe0007510c040 ffc7d369 Blocked    nt!KiSwapContext+0x76

     nt!KiSwapThread+0x14e

     nt!KiCommitThreadWait+0x129

     nt!KeWaitForSingleObject+0x373

     NDIS!NdisMSleep+0x59

     FooDRV+0x8ccbb

     FooDRV+0x8bc8b

     FooDRV+0x70d79

     FooDRV+0x14696

     nt!IopProcessWorkItem+0x80

     nt!ExpWorkerThread+0x69f

     nt!PspSystemThreadStartup+0x58

     nt!KiStartSystemThread+0x16

 

FooDRV.sys is calling NdisMSleep.  The function name already tells us what it does, but here is the documentation: https://msdn.microsoft.com/de-de/library/ff563677(v=vs.85).aspx

 

This is where I stopped.  With minimal impact to the production system we captured enough data to find the problem.  The vendor of the NIC driver will fix this in a newer version of the driver, and in the meantime they found a configuration workaround until that fix is ready.  Also, we found by experimenting that disabling VMQ on the machine seems to be a valid workaround too (see the sketch below).
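For reference, a sketch of how VMQ could be disabled on the affected physical adapters using the in-box NetAdapter PowerShell cmdlets.  The adapter name below is the placeholder used in this article; substitute your own adapter names and test the change in a maintenance window:

# Show the current VMQ state of all adapters
Get-NetAdapterVmq

# Disable VMQ on the affected physical adapters
Disable-NetAdapterVmq -Name "FooBar Network Adapter*"

Remember to re-enable VMQ (Enable-NetAdapterVmq) once a fixed driver is installed.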

