Debugging a Network Connectivity Issue - TrackNblOwner to the Rescue

Hello Debug community, this is Karim Elsaid again. Today I’m going to discuss an interesting recent case in which a server was intermittently losing access to the network. When the issue hit, no communication (not even pings) was possible to or from the server.

We went through the normal exercise and asked the customer to obtain a kernel memory dump from the machine while it was in the problematic state, hoping to find data that would help us demystify the issue.

One of the very first commands we run upon receiving a hang dump is the very famous “!locks” command. This yielded the following:

8: kd> !locks

**** DUMP OF ALL RESOURCE OBJECTS ****

KD: Scanning for held locks..

Resource @ nt!IopDeviceTreeLock (0xfffff80001a81c80) Shared 1 owning threads

Threads: fffffa800cd8a040-01<*>

KD: Scanning for held locks.

Resource @ nt!PiEngineLock (0xfffff80001a81b80) Exclusively owned

Contention Count = 6

Threads: fffffa800cd8a040-01<*>

KD: Scanning for held locks

84372 total locks, 2 locks currently held

What I’m looking for is locks with exclusive owners and waiters. From the above output we can see that thread fffffa800cd8a040 exclusively owns a Plug and Play (Pi prefix) lock and holds shared ownership of an I/O Manager (Io prefix) device tree lock.

There are no waiters for the exclusive lock; however, PnP locks are always worth investigating. While debugging I treat everything as a possible suspect until proven otherwise, so let’s dump this thread:

8: kd> !thread fffffa800cd8a040 e

THREAD fffffa800cd8a040 Cid 0004.005c Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable

fffff88002b0f118 SynchronizationEvent

IRP List:

fffffa8016527510: (0006,0310) Flags: 00000000 Mdl: 00000000

Not impersonating

DeviceMap fffff8a000006100

Owning Process fffffa800cd56040 Image: System

Attached Process N/A Image: N/A

Wait Start TickCount 14791337 Ticks: 15577 (0:00:04:03.002)

Context Switch Count 835317 IdealProcessor: 2

UserTime 00:00:00.000

KernelTime 00:00:26.863

Win32 Start Address nt!ExpWorkerThread (0xfffff8000188f530)

Stack Init fffff88002b0fc70 Current fffff88002b0ee30

Base fffff88002b10000 Limit fffff88002b0a000 Call 0

Priority 12 BasePriority 12 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

*** ERROR: Module load completed but symbols could not be loaded for myfault.sys

Child-SP RetAddr Call Site

fffff880`02b0ee70 fffff800`0187ba32 nt!KiSwapContext+0x7a

fffff880`02b0efb0 fffff800`0188cd8f nt!KiCommitThreadWait+0x1d2

fffff880`02b0f040 fffff800`018e1816 nt!KeWaitForSingleObject+0x19f

fffff880`02b0f0e0 fffff880`01618fcd nt! ?? ::FNODOBFM::`string'+0x12ff6

fffff880`02b0f150 fffff880`0173f54e tcpip!FlPnpEvent+0x17d

fffff880`02b0f1c0 fffff880`00f87b2f tcpip!Fl48PnpEvent+0xe

fffff880`02b0f1f0 fffff880`00f884b7 NDIS!ndisPnPNotifyBinding+0xbf

fffff880`02b0f280 fffff880`00fa1911 NDIS!ndisPnPNotifyAllTransports+0x377

fffff880`02b0f3f0 fffff880`00fa2c5b NDIS!ndisCloseMiniportBindings+0x111

fffff880`02b0f500 fffff880`00f3bbc2 NDIS!ndisPnPRemoveDevice+0x25b

fffff880`02b0f6a0 fffff880`00fa5b69 NDIS!ndisPnPRemoveDeviceEx+0xa2

fffff880`02b0f6e0 fffff800`01aec8d9 NDIS!ndisPnPDispatch+0x609

fffff880`02b0f780 fffff800`01c6c1e1 nt!IopSynchronousCall+0xc5

fffff880`02b0f7f0 fffff800`0197f733 nt!IopRemoveDevice+0x101

fffff880`02b0f8b0 fffff800`01c6bd34 nt!PnpRemoveLockedDeviceNode+0x1a3

fffff880`02b0f900 fffff800`01c6be40 nt!PnpDeleteLockedDeviceNode+0x44

fffff880`02b0f930 fffff800`01cfcd04 nt!PnpDeleteLockedDeviceNodes+0xa0

fffff880`02b0f9a0 fffff800`01cfd35c nt!PnpProcessQueryRemoveAndEject+0xc34

fffff880`02b0fae0 fffff800`01be65ce nt!PnpProcessTargetDeviceEvent+0x4c

fffff880`02b0fb10 fffff800`0188f641 nt! ?? ::NNGAKEGL::`string'+0x5ab9b

fffff880`02b0fb70 fffff800`01b1ce5a nt!ExpWorkerThread+0x111

fffff880`02b0fc00 fffff800`01876d26 nt!PspSystemThreadStartup+0x5a

fffff880`02b0fc40 00000000`00000000 nt!KiStartSystemThread+0x16
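As a side note, the "Ticks: 15577 (0:00:04:03.002)" field in the !thread output above can be reproduced by hand. The sketch below assumes a clock tick of 156001 units of 100 ns (about 15.6 ms), which is a common value but is not stated anywhere in the dump output shown here; the function name is made up for illustration.

```python
from datetime import timedelta

# Assumption: one clock tick lasts 156001 * 100 ns (about 15.6 ms).
# The real value is system dependent and is not shown in the dump output.
TICK_100NS = 156_001


def ticks_to_timedelta(ticks: int, tick_100ns: int = TICK_100NS) -> timedelta:
    """Convert a tick count into a duration using integer arithmetic."""
    total_100ns = ticks * tick_100ns
    # timedelta has microsecond resolution, so drop the last 100 ns digit.
    return timedelta(microseconds=total_100ns // 10)


waited = ticks_to_timedelta(15577)
print(waited)  # 0:04:03.002757
```

Under that assumption the result agrees with the 4 minutes and 3 seconds reported by the debugger.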

Interesting. Looking at the stack above, we can see that the thread is doing some NDIS PnP work. It has been waiting for more than 4 minutes, but hold on: what is “nt! ?? ::FNODOBFM::`string'”? That doesn’t look like a useful function name, and it isn’t. It is a side effect of Basic Block Tools (BBT) optimization. With only public symbols the debugger has a hard time resolving the right symbol, but there is a nice trick you can use to get to the right function.

P.S. For a nice x64 deep dive, please refer to our archive.

Let’s display the function data for the return address fffff800`018e1816:

8: kd> .fnent fffff800`018e1816

Debugger function entry 000000e8`f28f14f8 for:

(fffff800`018c4790) nt! ?? ::FNODOBFM::`string'+0x12ff6 | (fffff800`018c47c8) nt!vDbgPrintExWithPrefixInternal

BeginAddress = 00000000`000da7d0

EndAddress = 00000000`000da81c

UnwindInfoAddress = 00000000`001c8a54

Unwind info at fffff800`019cfa54, 10 bytes

version 1, flags 4, prolog 0, codes 0

Chained info:

BeginAddress = 00000000`000182f0

EndAddress = 00000000`00018358

UnwindInfoAddress = 00000000`001bf910

Unwind info at fffff800`019c6910, 6 bytes

version 1, flags 0, prolog 4, codes 1

00: offs 4, unwind op 2, op info c UWOP_ALLOC_SMALL.

For optimized binaries, you will find a “Chained info” section in the output. Add its BeginAddress to the base address of the module and you should hit the correct function:

8: kd> ln nt+000182f0

(fffff800`0181f2f0) nt!ExWaitForRundownProtectionReleaseCacheAware | (fffff800`0181f358) nt!KeGetRecommendedSharedDataAlignment

Exact matches:

nt!ExWaitForRundownProtectionReleaseCacheAware (<no parameter info>)

Bingo! We have the function. So tcpip!FlPnpEvent was calling ExWaitForRundownProtectionReleaseCacheAware, which basically waits for the rundown protection reference count to drop to 0.
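The address arithmetic behind "ln nt+000182f0" can be sketched in a few lines. The module base used here is not printed in the output above; it is inferred by subtracting the chained BeginAddress from the address that ln reported (fffff800`0181f2f0 - 0x182f0), so treat it as an assumption. The helper names are made up for illustration.

```python
# Assumption: base address of nt, inferred from the ln output above
# (fffff800`0181f2f0 - 0x182f0); it is not printed in the post itself.
NT_BASE = 0xFFFFF80001807000

# BeginAddress values from the "Chained info" section of .fnent.
CHAINED_BEGIN_RVA = 0x000182F0
CHAINED_UNWIND_INFO_RVA = 0x001BF910


def rva_to_va(module_base: int, rva: int) -> int:
    """Turn a module-relative address (RVA) into a virtual address."""
    return module_base + rva


def fmt(addr: int) -> str:
    """Format a 64-bit address the way the debugger prints it."""
    text = f"{addr:016x}"
    return f"{text[:8]}`{text[8:]}"


print(fmt(rva_to_va(NT_BASE, CHAINED_BEGIN_RVA)))        # fffff800`0181f2f0
print(fmt(rva_to_va(NT_BASE, CHAINED_UNWIND_INFO_RVA)))  # fffff800`019c6910
```

The second value matches the "Unwind info at fffff800`019c6910" line in the .fnent output, which is a useful cross-check that the inferred base is consistent.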

A thread can call ExAcquireRundownProtectionEx on a shared object to access it safely. Rundown protection keeps an object from being deleted until all outstanding accesses have finished (run down). ExWaitForRundownProtectionReleaseCacheAware is the other side of that contract: it waits for all outstanding rundown protection references to be released.

The question is which structure’s rundown we are waiting on, and that depends on what we are dealing with. Because of code optimization, the debugger is not showing the full picture. Through code review I found that in this particular dump there is an inlined call to the function “FlpUninitializePacketProviderInterface”.
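To make the idea concrete, here is a toy user-mode model of rundown protection. It is only a sketch of the semantics described above, not how the kernel implements it; the class and method names are invented for illustration.

```python
import threading


class RundownRef:
    """Toy model of rundown protection (not the kernel implementation)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._count = 0                # outstanding accesses
        self._rundown_started = False  # set once someone waits for rundown

    def acquire(self) -> bool:
        """Take a reference; refused once rundown has started."""
        with self._cond:
            if self._rundown_started:
                return False
            self._count += 1
            return True

    def release(self) -> None:
        """Drop a reference and wake the waiter when the count hits zero."""
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_for_release(self, timeout=None) -> bool:
        """Block until every outstanding reference is released."""
        with self._cond:
            self._rundown_started = True
            return self._cond.wait_for(lambda: self._count == 0, timeout)


demo = RundownRef()
demo.acquire()                             # one access still outstanding
print(demo.wait_for_release(timeout=0.1))  # False: the reference was never released
print(demo.acquire())                      # False: rundown has already started
demo.release()
print(demo.wait_for_release(timeout=0.1))  # True: nothing outstanding any more
```

The first wait in the demo is the situation in this dump: as long as one reference is never released, the waiter never gets to proceed.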

So the stack in reality should look like this:

Child-SP RetAddr Call Site

fffff880`02b0ee70 fffff800`0187ba32 nt!KiSwapContext+0x7a

fffff880`02b0efb0 fffff800`0188cd8f nt!KiCommitThreadWait+0x1d2

fffff880`02b0f040 fffff800`018e1816 nt!KeWaitForSingleObject+0x19f

fffff880`02b0f0e0 fffff880`01618fcd nt!ExWaitForRundownProtectionReleaseCacheAware

----inline function---- tcpip!FlpUninitializePacketProviderInterface

fffff880`02b0f150 fffff880`0173f54e tcpip!FlPnpEvent+0x17d

fffff880`02b0f1c0 fffff880`00f87b2f tcpip!Fl48PnpEvent+0xe

…

So we need to un-initialize a network interface, but before doing that we need to make sure that there are no outstanding references to packets and no packets still pending. Starting with NDIS 6, “packets” basically means “NET_BUFFER” and “NET_BUFFER_LIST” structures. So we need to check for any pending NET_BUFFER_LISTs (NBLs); one reference corresponds to one pending NBL.

To the rescue: the “NDISKD” debugger extension has a very handy command, “!pendingnbls”, that displays all pending NBLs and their owners. For the command to work, you must first enable “TrackNblOwner” through the registry. By default, this registry value is not enabled on server SKUs because it may cause a performance hit. On client SKUs it is enabled by default.

When you run !pendingnbls on a clean Windows Server 2008 R2 install you get:

8: kd> !ndiskd.pendingnbls

This command requires NBL tracking to be enabled on the debugee target

machine. (By default, client operating systems have level 1, and servers

have level 0). To enable, set this REG_DWORD value to a nonzero value on

the target machine and reboot the target machine:

HKLM\SYSTEM\CurrentControlSet\Services\NDIS\Parameters ! TrackNblOwner

Possible Values (features are cumulative)

* 0: Disable all tracking.

* 1: Track the most recent owner of each NBL (enables !ndiskd.pendingnbls)

Show me all allocated NBLs so I can manually find the one I want

You can find all allocated NBLs with the command “!ndiskd.nblpool -force -find ((@$extin.Flags)&0x108)==0x100)”, but that still doesn’t tell you who owns them.

So I asked the customer to turn on “TrackNblOwner”, reboot, wait for the next occurrence of the issue, and capture a new memory dump.
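For reference, setting the value could be scripted roughly as follows. This is a sketch that builds a standard "reg add" command line for the key and value named in the ndiskd message above; the "--apply" switch belongs to this example script and is hypothetical. Actually applying the change needs an elevated prompt on the target machine, followed by a reboot.

```python
import subprocess
import sys

# Key and value name as printed by !ndiskd.pendingnbls.
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\NDIS\Parameters"
VALUE_NAME = "TrackNblOwner"


def build_reg_add(level: int = 1) -> list[str]:
    """Build the 'reg add' command that sets TrackNblOwner to `level`."""
    if level < 0:
        raise ValueError("tracking level must be zero or positive")
    return ["reg", "add", KEY, "/v", VALUE_NAME,
            "/t", "REG_DWORD", "/d", str(level), "/f"]


if __name__ == "__main__":
    cmd = build_reg_add(1)
    print(" ".join(cmd))
    # Only touch the registry when explicitly asked to, and only on Windows.
    if sys.platform == "win32" and "--apply" in sys.argv:
        subprocess.run(cmd, check=True)
```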

Two days later we received the new memory dump. I verified that it showed the same issue I found in the previous dump and that TrackNblOwner was configured correctly:

23: kd> dp NDIS!ndisTrackNblOwner L1

fffff880`00ef1a30 00000000`00000001

Then I immediately checked all pending NBLs to claim the prize, and it was no surprise to see why the NIC was not un-initializing:

23: kd> !ndiskd.pendingnbls

PHASE 1/3: Found 20 NBL pool(s).

PHASE 2/3: Found 550 freed NBL(s).

Pending Nbl Currently held by

fffffa801dc559f0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]

fffffa801dc81680 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]

fffffa80131d2aa0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]

……………………………….

Rest of the repeated output omitted

PHASE 3/3: Found 1854 pending NBL(s) of 3005 total NBL(s).

Search complete.
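With thousands of lines of output, it can help to tally the pending NBLs per owner from a saved debugger log. The sketch below assumes the line layout shown above (NBL address, owner handle, a dash, then the owner name); other ndiskd versions may format the output differently, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: "<NBL address> <owner handle> - <owner description>"
PENDING_LINE = re.compile(
    r"^\s*([0-9a-fA-F]{16})\s+([0-9a-fA-F]{16})\s+-\s+(.*\S)\s*$"
)


def tally_owners(output: str) -> Counter:
    """Count pending NBLs per (owner handle, owner description)."""
    owners = Counter()
    for line in output.splitlines():
        match = PENDING_LINE.match(line)
        if match:
            owners[(match.group(2), match.group(3))] += 1
    return owners


sample = """\
Pending Nbl Currently held by
fffffa801dc559f0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
fffffa801dc81680 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
fffffa80131d2aa0 fffffa80142d31a0 - My Ethernet 1Gb 4-port Adapter [Miniport]
"""

for (handle, name), count in tally_owners(sample).most_common():
    print(f"{count:6d}  {handle}  {name}")
```

A single owner dominating the tally, as in this case, points straight at the component to investigate.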

So we currently have 1854 NBLs pending on the NIC miniport “fffffa80142d31a0”. This is the miniport that is currently holding all of those NBLs:

23: kd> !ndiskd.miniport fffffa80142d31a0

MINIPORT

My Ethernet 1Gb 4-port Adapter

Ndis handle fffffa80142d31a0

Ndis API version v6.20

Adapter context fffffa80138cc000

Miniport driver fffffa800d4f7530 - MyMiniPortDriver v1.0

Network interface fffffa800d25e870

Media type 802.3

Device instance PCI\VEN_1111&DEV_1111&SUBSYS_169D103C&REV_01\4&2263a140&0&0010

Device object fffffa80142d3050 More information

MAC address xx-xx-xx-xx-xx-xx

STATE

Miniport Running

Device PnP QUERY_REMOVED

Datapath Normal

Operational status DORMANT

Operational flags DORMANT_PAUSED

Admin status ADMIN_UP

Media Connected

Power D0

References 9

Total resets 0

Pending OID None

Flags BUS_MASTER, 64BIT_DMA, SG_DMA, DEFAULT_PORT_ACTIVATED,

SUPPORTS_MEDIA_SENSE, DOES_NOT_DO_LOOPBACK,

MEDIA_CONNECTED

PnP flags PM_SUPPORTED, DEVICE_POWER_ENABLED, RECEIVED_START,

HARDWARE_DEVICE

…

What you notice from the above is that the device’s PnP state is QUERY_REMOVED and that its operational status is DORMANT, with the DORMANT_PAUSED flag set.

From: http://msdn.microsoft.com/en-us/library/ff566737.aspx:

NET_IF_OPER_STATUS_DORMANT_PAUSED

The operational status is set to NET_IF_OPER_STATUS_DORMANT because the miniport adapter is in the paused or pausing state.

NDIS 6.0 and later allow miniport adapters to be paused, and the NDIS documentation describes what the miniport driver should do when it receives a pause request.

Because the adapter was in a paused state, basic network commands like “ping” stopped working, as described in the symptoms earlier. The next action is to involve the miniport adapter vendor to trace this further and find out why all these pending NBLs were not completed.

Until the next adventure!

Best Regards,

Karim
