Hello folks, Pushkar here. I recently worked on a case where the server was hung at “Applying User Settings” during the logon phase. You might ask what’s going to be new in this post; NTDebugging has a bunch of posts covering such debug scenarios :). In my defense, this case was particularly interesting because it defied the general logic I had about how network connection requests are processed. I am still a newbie in this area, getting my feet wet in the arena of debugging, so I had to seek help from one of the GES gurus, Aman Jain, an expert in the Windows networking stack. He helped me confirm my suspicion.
As we were stuck at “Applying User Settings”, the obvious path was to look at the activity of Winlogon to understand why it was stuck. I could see that there were seven logon requests which had not been processed yet.
1: kd> !session
Sessions on machine: 7
Valid Sessions: 0 1 2 3 4 5 6
1: kd> !process 0 0 winlogon.exe
PROCESS fffffa806e132c10
SessionId: 1 Cid: 0384 Peb: 7fffffdf000 ParentCid: 0298
DirBase: db59d000 ObjectTable: fffff88007fd0430 HandleCount: 139.
Image: winlogon.exe
<snip>
The above command returns 7 instances of Winlogon.exe, one instance per session. I picked one of the sessions, as all seven sessions were in the same state.
As this was a Windows Server 2008 machine, winlogon would open an LPC channel with LogonUI.exe to process the user’s logon and then call into the Local Session Manager (LSM.exe) to spawn the user’s session. I could see that winlogon was waiting on a message reply from LogonUI.exe.
1: kd> !thread fffffa80`73d3d770
THREAD fffffa8073d3d770 Cid 0384.1e2c Teb: 000007fffffd7000 Win32Thread: 0000000000000000 WAIT: (WrLpcReply) UserMode Non-Alertable
fffffa8073d3db00 Semaphore Limit 0x1
Waiting for reply to ALPC Message fffff8801ffd9cf0 : queued at port fffffa804e9d9370 : owned by process fffffa806e134c10
Not impersonating
DeviceMap fffff880000073a0
Owning Process fffffa806e132c10 Image: winlogon.exe
Attached Process N/A Image: N/A
Wait Start TickCount 4192487 Ticks: 1882 (0:00:00:29.359)
Context Switch Count 2
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address 0x0000000076f67cb0
Stack Init fffffa60110a9db0 Current fffffa60110a9670
Base fffffa60110aa000 Limit fffffa60110a4000 Call 0
Priority 13 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffffa60`110a96b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f
fffffa60`110a97f0 fffff800`01eb035b nt!KiSwapThread+0x13a
fffffa60`110a9860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb
fffffa60`110a98f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92
fffffa60`110a9980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44
fffffa60`110a99e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f
fffffa60`110a9b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f
fffffa60`110a9bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13
00000000`01f8f098 00000000`00000000 0x76f9756a
1: kd> !alpc /m fffff8801ffd9cf0
Message @ fffff8801ffd9cf0
MessageID : 0x0464 (1124)
CallbackID : 0x2844F0 (2639088)
SequenceNumber : 0x0000000B (11)
Type : LPC_REQUEST
DataLength : 0x0064 (100)
TotalLength : 0x008C (140)
Canceled : No
Release : No
ReplyWaitReply : No
Continuation : Yes
OwnerPort : fffffa80481ba9b0 [ALPC_CLIENT_COMMUNICATION_PORT]
WaitingThread : fffffa8073d3d770
QueueType : ALPC_MSGQUEUE_PENDING
QueuePort : fffffa804e9d9370 [ALPC_CONNECTION_PORT]
QueuePortOwnerProcess : fffffa806e134c10 (LogonUI.exe)
ServerThread : fffffa804f96d5a0 <----------- LPC Message Server Thread
QuotaCharged : No
CancelQueuePort : 0000000000000000
CancelSequencePort : 0000000000000000
CancelSequenceNumber : 0x00000000 (0)
ClientContext : 0000000000243d00
ServerContext : 0000000000000000
PortContext : 0000000001c27600
CancelPortContext : 0000000000000000
SecurityData : 0000000000000000
View : 0000000000000000
1: kd> !thread fffffa804f96d5a0
THREAD fffffa804f96d5a0 Cid 01ac.09fc Teb: 000007fffffd7000 Win32Thread: fffff900c067dd50 WAIT: (UserRequest) UserMode Non-Alertable
fffffa80516be920 NotificationEvent
Not impersonating
DeviceMap fffff880000073a0
Owning Process fffffa806e134c10 Image: LogonUI.exe
Attached Process N/A Image: N/A
Wait Start TickCount 4192487 Ticks: 1882 (0:00:00:29.359)
Context Switch Count 13 LargeStack
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address 0x000007fefed47780
Stack Init fffffa60114e4db0 Current fffffa60114e4940
Base fffffa60114e5000 Limit fffffa60114df000 Call 0
Priority 14 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffffa60`114e4980 fffff800`01ebb6fa nt!KiSwapContext+0x7f
fffffa60`114e4ac0 fffff800`01eb035b nt!KiSwapThread+0x13a
fffffa60`114e4b30 fffff800`02119608 nt!KeWaitForSingleObject+0x2cb
fffffa60`114e4bc0 fffff800`01eb8ef3 nt!NtWaitForSingleObject+0x98
fffffa60`114e4c20 00000000`76f96d5a nt!KiSystemServiceCopyEnd+0x13
00000000`0444ef18 00000000`00000000 0x76f96d5a
This server thread is processing a message, but we can’t tell what it is waiting for. If this thread had finished processing a message, we would have seen “WAIT: (WrLpcReceive)” as the thread wait reason. Often this type of wait is on a critical section or other synchronization object held by another thread in the process. This called for an investigation of LogonUI.exe, to see what the other threads were doing.
I noticed that another LogonUI.exe thread was waiting on LPC message processing by a thread in LSM.exe.
1: kd> !thread fffffa80`4e9e5060
THREAD fffffa804e9e5060 Cid 01ac.0438 Teb: 000007fffffd9000 Win32Thread: fffff900c06db2e0 WAIT: (WrLpcReply) UserMode Non-Alertable
fffffa804e9e53f0 Semaphore Limit 0x1
Waiting for reply to ALPC Message fffff8801ffaacf0 : queued at port fffffa804e9b6060 : owned by process fffffa806e130c10
Not impersonating
DeviceMap fffff880000073a0
Owning Process fffffa806e134c10 Image: LogonUI.exe
Attached Process N/A Image: N/A
Wait Start TickCount 4192488 Ticks: 1881 (0:00:00:29.343)
Context Switch Count 165 LargeStack
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address 0x000007fefbaa151c
Stack Init fffffa6008d26db0 Current fffffa6008d26670
Base fffffa6008d27000 Limit fffffa6008d20000 Call 0
Priority 15 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffffa60`08d266b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f
fffffa60`08d267f0 fffff800`01eb035b nt!KiSwapThread+0x13a
fffffa60`08d26860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb
fffffa60`08d268f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92
fffffa60`08d26980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44
fffffa60`08d269e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f
fffffa60`08d26b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f
fffffa60`08d26bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13
00000000`0272ea78 00000000`00000000 0x76f9756a
At this point I decided to follow the LPC message chain trail to see where it ends and who is ultimately blocking it.
1: kd> !alpc /m fffff8801ffaacf0
Message @ fffff8801ffaacf0
MessageID : 0x0474 (1140)
CallbackID : 0x2844F6 (2639094)
SequenceNumber : 0x0000000D (13)
Type : LPC_REQUEST
DataLength : 0x0058 (88)
TotalLength : 0x0080 (128)
Canceled : No
Release : No
ReplyWaitReply : No
Continuation : Yes
OwnerPort : fffffa804ea03c80 [ALPC_CLIENT_COMMUNICATION_PORT]
WaitingThread : fffffa804e9e5060
QueueType : ALPC_MSGQUEUE_PENDING
QueuePort : fffffa804e9b6060 [ALPC_CONNECTION_PORT]
QueuePortOwnerProcess : fffffa806e130c10 (lsm.exe)
ServerThread : fffffa805180c700 <---- LPC Server thread for this message in LSM.EXE
QuotaCharged : No
CancelQueuePort : 0000000000000000
CancelSequencePort : 0000000000000000
CancelSequenceNumber : 0x00000000 (0)
ClientContext : 0000000001cf0e90
ServerContext : 0000000000000000
PortContext : 0000000001dc9df0
CancelPortContext : 0000000000000000
SecurityData : 0000000000000000
View : 0000000000000000
I noticed that the server thread which was supposed to process the LPC message for the above thread was itself waiting on a message reply from a thread in svchost.exe. LSM had a bunch of other threads that were also waiting on LPC message processing by server threads in svchost.exe. I dumped out all those threads and their LPC message structures to locate the server threads in svchost.exe, and found that only the server thread for the thread below had anything interesting in it; the other LPC server threads in svchost.exe were sitting on notification timers.
1: kd> !thread fffffa80`50910060
THREAD fffffa8050910060 Cid 02e0.1a70 Teb: 000007fffffda000 Win32Thread: 0000000000000000 WAIT: (WrLpcReply) UserMode Non-Alertable
fffffa80509103f0 Semaphore Limit 0x1
Waiting for reply to ALPC Message fffff8801fe00cf0 : queued at port fffffa804e876b60 : owned by process fffffa804ea4fc10
Not impersonating
DeviceMap fffff880000073a0
Owning Process fffffa806e130c10 Image: lsm.exe
Attached Process N/A Image: N/A
Wait Start TickCount 3187894 Ticks: 1006475 (0:04:21:41.110)
Context Switch Count 97
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address 0x000007fefed47780
Stack Init fffffa60125b1db0 Current fffffa60125b1670
Base fffffa60125b2000 Limit fffffa60125ac000 Call 0
Priority 9 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
Kernel stack not resident.
Child-SP RetAddr Call Site
fffffa60`125b16b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f
fffffa60`125b17f0 fffff800`01eb035b nt!KiSwapThread+0x13a
fffffa60`125b1860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb
fffffa60`125b18f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92
fffffa60`125b1980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44
fffffa60`125b19e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f
fffffa60`125b1b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f
fffffa60`125b1bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13
00000000`0205e538 00000000`00000000 0x76f9756a
I looked at the server thread for this LPC message, and it was also waiting on an event, doing nothing. This again called for an investigation of the state of svchost.exe to find out why the server thread hadn’t replied yet.
1: kd> !alpc /m fffff8801fe00cf0
Message @ fffff8801fe00cf0
MessageID : 0x03C4 (964)
CallbackID : 0x2140C2 (2179266)
SequenceNumber : 0x00000021 (33)
Type : LPC_REQUEST
DataLength : 0x00C0 (192)
TotalLength : 0x00E8 (232)
Canceled : No
Release : No
ReplyWaitReply : No
Continuation : Yes
OwnerPort : fffffa804ef15d30 [ALPC_CLIENT_COMMUNICATION_PORT]
WaitingThread : fffffa8050910060
QueueType : ALPC_MSGQUEUE_PENDING
QueuePort : fffffa804e876b60 [ALPC_CONNECTION_PORT]
QueuePortOwnerProcess : fffffa804ea4fc10 (svchost.exe)
ServerThread : fffffa8051435940
QuotaCharged : No
CancelQueuePort : 0000000000000000
CancelSequencePort : 0000000000000000
CancelSequenceNumber : 0x00000000 (0)
ClientContext : 00000000021a0f60
ServerContext : 0000000000000000
PortContext : 00000000033834d0
CancelPortContext : 0000000000000000
SecurityData : 0000000000000000
View : 0000000000000000
1: kd> !thread fffffa8051435940
THREAD fffffa8051435940 Cid 0574.1ea0 Teb: 000007fffffd3000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Alertable
fffffa601086d3e8 SynchronizationEvent
IRP List:
fffffa80514558b0: (0006,03a0) Flags: 00000884 Mdl: 00000000
Impersonation token: fffff8801fc4b050 (Level Impersonation)
DeviceMap fffff8800e7c50e0
Owning Process fffffa804ea4fc10 Image: svchost.exe
Attached Process N/A Image: N/A
Wait Start TickCount 3187932 Ticks: 1006437 (0:04:21:40.517)
Context Switch Count 111
UserTime 00:00:00.015
KernelTime 00:00:00.000
Win32 Start Address 0x000007fefed47780
Stack Init fffffa601086ddb0 Current fffffa601086d040
Base fffffa601086e000 Limit fffffa6010868000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffffa60`1086d080 fffff800`01ebb6fa nt!KiSwapContext+0x7f
fffffa60`1086d1c0 fffff800`01eb035b nt!KiSwapThread+0x13a
fffffa60`1086d230 fffff800`020d2a2e nt!KeWaitForSingleObject+0x2cb
fffffa60`1086d2c0 fffff800`020d2a87 nt!FsRtlCancellableWaitForMultipleObjects+0x62
fffffa60`1086d320 fffffa60`015da75a nt!FsRtlCancellableWaitForSingleObject+0x27
fffffa60`1086d360 fffffa60`015d8588 mup!MupLocateUncProviderForPath+0x40a
fffffa60`1086d4a0 fffffa60`015d8130 mup!MupiCallUncProvider+0xe8
fffffa60`1086d510 fffffa60`015d90ef mup!MupStateMachine+0x120
fffffa60`1086d560 fffffa60`00f31e91 mup!MupCreate+0x2c3
fffffa60`1086d5e0 fffffa60`00f4e26c fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x211
fffffa60`1086d650 fffff800`021267d9 fltmgr!FltpCreate+0x25d
fffffa60`1086d700 fffff800`0211ecb2 nt!IopParseDevice+0x5f9
fffffa60`1086d8a0 fffff800`0211f842 nt!ObpLookupObjectName+0x593
fffffa60`1086d9b0 fffff800`02124e47 nt!ObOpenObjectByName+0x2f2
fffffa60`1086da80 fffff800`0212ecc8 nt!IopCreateFile+0x287
fffffa60`1086db20 fffff800`01eb8ef3 nt!NtCreateFile+0x78
fffffa60`1086dbb0 00000000`76f9726a nt!KiSystemServiceCopyEnd+0x13
00000000`2674d738 00000000`00000000 0x76f9726a
Bingo! I found two threads in svchost.exe which showed some activity and had been blocked for over 4 hours on a UNC path access. From the IRP issued by the thread above, I found out that it was attempting to contact one of the DCs.
1: kd> !irp 0xfffffa80`514558b0
Irp is active with 4 stacks 3 is current (= 0xfffffa8051455a10)
No Mdl: No System Buffer: Thread fffffa8051435940: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
>[ 0, 0] 0 e0 fffffa8031d1a840 fffffa805145aec0 fffffa6000f332b0-fffffa8047b6c930 Success Error Cancel
\FileSystem\Mup fltmgr!FltpSynchronizedOperationCompletion
Args: fffffa601086d848 01000040 00030000 00000000
[ 0, 0] 0 0 fffffa8031d1b880 fffffa805145aec0 00000000-00000000
\FileSystem\FltMgr
Args: fffffa601086d848 01000040 00030000 00000000
1: kd> dt nt!_FILE_OBJECT fffffa805145aec0 FileName
+0x058 FileName : _UNICODE_STRING "\DC005.contoso.com\PIPE\lsarpc"
For the details on MUP you can refer to Support for UNC Naming and MUP.
At first my thought was that something was wrong with the network stack, because we were waiting on a domain controller to respond: either we had a networking issue or the domain controller wasn’t responding. But the redirector would have timed out; it won’t wait over 4 hours for an unresponsive connection. I started digging into the source code and found that we build a query context in the MupLocateUncProviderForPath() function and issue another IRP, which is sent down the I/O stack. I searched for other IRPs issued by the above thread and found two.
1: kd> !irpfind 0 0 thread fffffa8051435940
Looking for IRPs with thread == fffffa8051435940
Scanning large pool allocation table for Tag: Irp? (fffffa804ea83000 : fffffa804ed83000)
Searching NonPaged pool (fffffa8030605000 : fffffa8c27e00000) for Tag: Irp?
Irp [ Thread ] irpStack: (Mj,Mn) DevObj [Driver] MDL Process
fffffa80514558b0 [fffffa8051435940] irpStack: ( 0, 0) fffffa8031d1a840 [ \FileSystem\Mup]
fffffa8073bc65e0 [fffffa8051435940] irpStack: ( e, 0) fffffa804e842610 [ \FileSystem\mrxsmb]
1: kd> !irp fffffa8073bc65e0
Irp is active with 2 stacks 1 is current (= 0xfffffa8073bc66b0)
No Mdl: No System Buffer: Thread fffffa8051435940: Irp stack trace.
cmd flg cl Device File Completion-Context
>[ e, 0] 0 e1 fffffa804e842610 fffffa804e843f20 fffffa60015d4024-fffffa601086d3b0 Success Error Cancel pending
\FileSystem\mrxsmb mup!MupiQueryPathCompletionRoutine
Args: 00000004 00000050 00140193 fffffa601086d408
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
This is the IRP with which MUP issued the IOCTL_REDIR_QUERY_PATH IOCTL to the redirector. If the driver returns STATUS_PENDING, MUP waits for the IRP to complete by calling FsRtlCancellableWaitForSingleObject.
The 0xfffffa8051435940 thread doesn’t have any redirector function in its stack, which made me suspect that some other thread was already working on a similar operation and that the redirector was busy processing that request. I started looking for a worker thread that was calling into the redirector.
1: kd> !stacks 0 mrxsmb
Proc.Thread .Thread Ticks ThreadState Blocker
[fffffa804e9c7c10 svchost.exe]
248.000e84 fffffa804ee72bb0 ffe0d0ad Blocked nt!FsRtlCancellableWaitForMultipleObjects+0x62
[fffffa804e9e12e0 svchost.exe]
404.000f18 fffffa804edad4f0 ffe03e7e Blocked nt!FsRtlCancellableWaitForMultipleObjects+0x62
The above command walks the kernel stacks of all processes and dumps a one-line summary for each thread that references the given module in its stack. I have snipped the output to keep things in context.
One of the sample threads where redirector is in action is shown below.
1: kd> !thread fffffa804ee72bb0
THREAD fffffa804ee72bb0 Cid 0248.0e84 Teb: 000007fffffa8000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Alertable
fffffa8048f6c6b8 SynchronizationEvent
IRP List:
fffffa8073ac3980: (0006,03a0) Flags: 00000884 Mdl: 00000000
fffffa807257fc60: (0006,03a0) Flags: 00000884 Mdl: 00000000
fffffa80511627b0: (0006,03a0) Flags: 00000884 Mdl: 00000000
Impersonation token: fffff8800d515c40 (Level Impersonation)
Owning Process fffffa804e9c7c10 Image: svchost.exe
Attached Process N/A Image: N/A
Wait Start TickCount 2043731 Ticks: 2150638 (0:09:19:10.167)
Context Switch Count 893
UserTime 00:00:00.031
KernelTime 00:00:00.109
Win32 Start Address 0x000007fefb96c534
Stack Init fffffa60125a1db0 Current fffffa60125a0ab0
Base fffffa60125a2000 Limit fffffa601259c000 Call 0
Priority 10 BasePriority 1 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffffa60`125a0af0 fffff800`01ebb6fa nt!KiSwapContext+0x7f
fffffa60`125a0c30 fffff800`01eb035b nt!KiSwapThread+0x13a
fffffa60`125a0ca0 fffff800`020d2a2e nt!KeWaitForSingleObject+0x2cb
fffffa60`125a0d30 fffff800`020d2a87 nt!FsRtlCancellableWaitForMultipleObjects+0x62
fffffa60`125a0d90 fffffa60`0f625c14 nt!FsRtlCancellableWaitForSingleObject+0x27
fffffa60`125a0dd0 fffffa60`0f634e9e rdbss!RxWaitForStableCondition+0x11c
fffffa60`125a0e10 fffffa60`0f63599f rdbss!RxFindOrCreateConnections+0x572
fffffa60`125a0ef0 fffffa60`0f627318 rdbss!RxConstructVirtualNetRoot+0xb7
fffffa60`125a0f90 fffffa60`0f63815a rdbss!RxFindOrConstructVirtualNetRoot+0x594
fffffa60`125a1100 fffffa60`0f627eee rdbss!RxCreateTreeConnect+0x13e
fffffa60`125a1190 fffffa60`0f6041f6 rdbss!RxCommonCreate+0x20a
fffffa60`125a1250 fffffa60`0f62568a rdbss!RxFsdCommonDispatch+0x786
fffffa60`125a1340 fffffa60`1031f2cf rdbss!RxFsdDispatch+0x21a
fffffa60`125a13b0 fffffa60`015d85f9 mrxsmb!MRxSmbFsdDispatch+0xbf
fffffa60`125a13f0 fffffa60`015d8130 mup!MupiCallUncProvider+0x159
fffffa60`125a1460 fffffa60`015d90ef mup!MupStateMachine+0x120
fffffa60`125a14b0 fffffa60`00f4e342 mup!MupCreate+0x2c3
fffffa60`125a1530 fffff800`021267d9 fltmgr!FltpCreate+0x333
fffffa60`125a15e0 fffff800`0211ecb2 nt!IopParseDevice+0x5f9
fffffa60`125a1780 fffff800`0211f842 nt!ObpLookupObjectName+0x593
fffffa60`125a1890 fffff800`02124e47 nt!ObOpenObjectByName+0x2f2
fffffa60`125a1960 fffff800`0212ecc8 nt!IopCreateFile+0x287
fffffa60`125a1a00 fffff800`01eb8ef3 nt!NtCreateFile+0x78
fffffa60`125a1a90 fffff800`01eb9400 nt!KiSystemServiceCopyEnd+0x13
fffffa60`125a1c98 fffffa60`0f664a5e nt!KiServiceLinkage
<snip>
The above thread is waiting for the condition of the remote connection to become stable; that is why it called RxWaitForStableCondition(), which takes an RX_CONTEXT as its first argument and the reason we entered the wait as its second. As we can see, the thread above had been waiting for over 9 hours and 19 minutes.
For additional reading you can go through RX_CONTEXT and IRP Management.
I disassembled the function below to find out where we store the pointer to the RX_CONTEXT structure. You may remember the challenges of x64 debugging from this post.
1: kd> .frame /r 05
05 fffffa60`11e42dd0 fffffa60`0f634e9e rdbss!RxWaitForStableCondition+0x11c
rax=0000000000000000 rbx=fffffa804f4de010 rcx=0000000000000000
rdx=0000000000000000 rsi=0000000000000001 rdi=0000000000000000
rip=fffffa600f625c14 rsp=fffffa6011e42dd0 rbp=fffffa805113a490
r8=0000000000000000 r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=fffffa805113a4a0 r13=fffffa6011e42fd8
r14=fffffa804f4de010 r15=0000000000000000
iopl=0 nv up di pl nz na pe nc
cs=0000 ss=0000 ds=0000 es=0000 fs=0000 gs=0000 efl=00000000
rdbss!RxWaitForStableCondition+0x11c:
fffffa60`0f625c14 85c0 test eax,eax
1: kd> u rdbss!RxWaitForStableCondition
rdbss!RxWaitForStableCondition:
fffffa60`0f625af8 488bc4 mov rax,rsp
fffffa60`0f625afb 48895808 mov qword ptr [rax+8],rbx
fffffa60`0f625aff 48896810 mov qword ptr [rax+10h],rbp
fffffa60`0f625b03 48897018 mov qword ptr [rax+18h],rsi
fffffa60`0f625b07 48897820 mov qword ptr [rax+20h],rdi
fffffa60`0f625b0b 4154 push r12
fffffa60`0f625b0d 4883ec30 sub rsp,30h
fffffa60`0f625b11 33ff xor edi,edi
1: kd> u
rdbss!RxWaitForStableCondition+0x1b:
fffffa60`0f625b13 833a03 cmp dword ptr [rdx],3
fffffa60`0f625b16 418af1 mov sil,r9b
fffffa60`0f625b19 4d8be0 mov r12,r8
fffffa60`0f625b1c 488bea mov rbp,rdx
fffffa60`0f625b1f 488bd9 mov rbx,rcx <--------- RX_CONTEXT is in RBX register
fffffa60`0f625b22 7c07 jl rdbss!RxWaitForStableCondition+0x33 (fffffa60`0f625b2b)
fffffa60`0f625b24 33c0 xor eax,eax
fffffa60`0f625b26 e968010000 jmp rdbss!RxWaitForStableCondition+0x19b (fffffa60`0f625c93)
1: kd> !pool fffffa804f4de010 2 <---- [Argument 1] Here is pointer to RX_CONTEXT Structure
Pool page fffffa804f4de010 region is Nonpaged pool
*fffffa804f4de000 size: 670 previous size: 0 (Allocated) *RxIr
Pooltag RxIr : RDBSS RxContext, Binary : rdbss.sys
1: kd> !pool fffffa805113a490 2 <---- [Argument 2] Here is the Pointer to the Reason for Wait
Pool page fffffa805113a490 region is Nonpaged pool
*fffffa805113a3b0 size: 280 previous size: 10 (Allocated) *RxSc
Pooltag RxSc : RDBSS SrvCall, Binary : rdbss.sys
1: kd> dc fffffa805113a490 L1
fffffa80`5113a490 00000001 <---------------- Transport is in Transition state
We have a bunch of RX_CONTEXT structures describing IRPs pending in the redirector, all waiting for the transport of the remote connection to stabilize.
The next obvious thought was to investigate the list in which we store the IRPs issued by the redirector to the transport driver, to understand the state of those IRPs and why we were not progressing further. RxCeSend is the API that submits IRPs to the transport driver; they are tracked in the list below. For more information about redirector-related APIs you can refer to Connection Engine Management.
1: kd> x rdbss!RxIrpsList
fffffa60`0f620110 rdbss!RxIrpsList = <no type information>
rdbss!RxIrpsList is a LIST_ENTRY structure, the head of a doubly linked list.
1: kd> dq fffffa60`0f620110 L2
fffffa60`0f620110 fffffa80`510fd0d0 fffffa80`4edb5590
1: kd> dq fffffa80`510fd0d0 L2 <---------------- LIST_ENTRY.FLINK
fffffa80`510fd0d0 fffffa80`50d2e300 fffffa60`0f620110
1: kd> dq fffffa80`4edb5590 L2 <---------------- LIST_ENTRY.BLINK
fffffa80`4edb5590 fffffa60`0f620110 fffffa80`4ff450d0
This being x64 Windows, each entry in this list is 0x30 bytes. This can change with OS version or driver updates, so don’t rely on this layout or any of the offsets I reference here in your driver code. The first column of the output below is the address of a list entry; at offset 0x10 of each entry we store the pointer to the IRP, so add 0x10 to the entry address and dereference it to get the IRP.
1: kd> dl 0xfffffa80`510fd0d0
fffffa80`510fd0d0 fffffa80`50d2e300 fffffa60`0f620110
fffffa80`510fd0e0 fffffa80`7274e5a0 00000000`00000000
fffffa80`50d2e300 fffffa80`50ee4490 fffffa80`510fd0d0
fffffa80`50d2e310 fffffa80`50f16160 00000000`00000000
fffffa80`50ee4490 fffffa80`727f1140 fffffa80`50d2e300
fffffa80`50ee44a0 fffffa80`51491760 00000000`00000000
fffffa80`727f1140 fffffa80`73a95860 fffffa80`50ee4490
fffffa80`727f1150 fffffa80`511b9c60 00000000`00000000
fffffa80`73a95860 fffffa80`73a8e3e0 fffffa80`727f1140
fffffa80`73a95870 fffffa80`73a77c60 00000000`00000000
fffffa80`73a8e3e0 fffffa80`73a9c290 fffffa80`73a95860
fffffa80`73a8e3f0 fffffa80`73a46a30 00000000`00000000
fffffa80`73a9c290 fffffa80`725ca490 fffffa80`73a8e3e0
<Snip>
1: kd> !irp poi(fffffa80`510fd0d0+10)
Irp is active with 2 stacks 2 is current (= 0xfffffa807274e6b8)
No Mdl: No System Buffer: Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
>[ f, 3] 0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa601030df70-fffffa805119fda0 Success Error Cancel pending
\Driver\Smb mrxsmb!RxTdiAsynchronousConnectCompletion
Args: 00000000 fffffa8050d127e0 00000000 fffffa601031b6f0
1: kd> !irp poi(fffffa80`50d2e300+10)
Irp is active with 2 stacks 2 is current (= 0xfffffa8050f16278)
No Mdl: System buffer=fffffa8050d59c88: Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
>[ e, 0] 0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa60103063a4-fffffa8050d59c60 Success Error Cancel pending
\Driver\Smb mrxsmb!RxTdiAsynchronousRequestCompletion
Args: 00000000 00000030 0021c0e8 00000000
1: kd> !irp poi(fffffa80`50ee4490+10)
Irp is active with 2 stacks 2 is current (= 0xfffffa8051491878)
No Mdl: System buffer=fffffa805115dcb8: Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
>[ e, 0] 0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa60103063a4-fffffa805115dc90 Success Error Cancel pending
\Driver\Smb mrxsmb!RxTdiAsynchronousRequestCompletion
Args: 00000000 00000030 0021c0e8 00000000
All these IRPs are waiting on the SMB transport driver (responsible for handling TCP/IP port 445 traffic) for further processing. The obvious next step was to find out whether any threads were calling into SMB functions and see what they were doing.
Note: From Windows 7 and Windows Server 2008 R2 onwards, the redirector doesn’t use SMB as the transport driver directly.
I searched through the dump but couldn’t find a single thread calling any SMB function. This is where it gets more interesting: why is no thread calling into the SMB driver when we have a bunch of IRPs pending on it?
This was the time to collaborate with the networking stack guru Aman Jain to get some insight and figure out the answer to my question.
It turned out that the SMB driver creates a work item, populates it with the routine that should run once a worker thread picks the item up, and queues it to the system delayed work queue.
1: kd> x nt!ExWorkerQueue
fffff800`01ff08a0 nt!ExWorkerQueue = <no type information>
The symbol above is the base of an array holding the system’s three work queues, namely the critical, delayed, and hypercritical work queues, as seen in the output of the !exqueue debugger command. Each entry in this array is 0x58 bytes long and begins with a _KQUEUE.
The enumeration below determines how the various work queues are indexed. This structure is documented in this MSDN article.
typedef enum _WORK_QUEUE_TYPE {
CriticalWorkQueue = 0,
DelayedWorkQueue = 1,
HyperCriticalWorkQueue = 2,
MaximumWorkQueue = 3
} WORK_QUEUE_TYPE;
Critical WorkQueue
1: kd> dt nt!_KQUEUE fffff800`01ff08a0
+0x000 Header : _DISPATCHER_HEADER
+0x018 EntryListHead : _LIST_ENTRY [ 0xfffff800`01ff08b8 - 0xfffff800`01ff08b8 ]
+0x028 CurrentCount : 0
+0x02c MaximumCount : 0x10
+0x030 ThreadListHead : _LIST_ENTRY [ 0xfffffa80`30a4da08 - 0xfffffa80`30a39d68 ]
Delayed WorkQueue
1: kd> dt nt!_KQUEUE fffff800`01ff08a0+58
+0x000 Header : _DISPATCHER_HEADER
+0x018 EntryListHead : _LIST_ENTRY [ 0xfffffa80`50226f00 - 0xfffffa80`5183f9f0 ]
+0x028 CurrentCount : 0
+0x02c MaximumCount : 0x10
+0x030 ThreadListHead : _LIST_ENTRY [ 0xfffffa80`30a398d8 - 0xfffffa80`51157d68 ]
HyperCritical WorkQueue
1: kd> dt nt!_KQUEUE fffff800`01ff08a0+58+58
+0x000 Header : _DISPATCHER_HEADER
+0x018 EntryListHead : _LIST_ENTRY [ 0xfffff800`01ff0968 - 0xfffff800`01ff0968 ]
+0x028 CurrentCount : 0
+0x02c MaximumCount : 0x10
+0x030 ThreadListHead : _LIST_ENTRY [ 0xfffffa80`30a4e1f8 - 0xfffffa80`30a4e1f8 ]
1: kd> dd fffff80001ff08a0+58+40 L1
fffff800`01ff0938 00000010 <----------- We created 0x10 (16) dynamic threads
The system creates additional dynamic threads when the default number of threads in a particular work queue cannot keep up with the work items being queued. In this case we created 16 additional threads, which is the maximum number of dynamic threads that can be created.
1: kd> dd fffff80001ff08a0+58+4c L1
fffff800`01ff0944 00004667
1: kd> ? 00004667
Evaluate expression: 18023 = 00000000`00004667 <-------- 18,023 work Items pending for processing
I could see that there were work items pending in the Delayed WorkQueue because its EntryListHead.Flink does not point back at the list head itself, unlike the other two WorkQueues we dumped; an empty LIST_ENTRY head points to itself.
For details you can refer to the documentation on Work Item Processing and Work Items.
The next obvious step was to dump all the work items pending in the list to see whether any were related to the SMB driver, and if so, why they hadn’t been picked up by a worker thread.
1: kd> !list "-t nt!_LIST_ENTRY.Flink -e -x\"dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1\" fffffa80`50226f00"
<Snip>
dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1
fffffa80`4b7c7e30 fffffa80`50a720e0
fffffa80`4b7c7e50 fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession
dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1
fffffa80`72544e00 fffffa80`31aeb580
fffffa80`72544e20 fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession
dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1
fffffa80`72759d50 fffffa80`51184690
fffffa80`72759d70 fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession
dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1
fffffa80`73b416e0 fffffa80`5143c9c0
fffffa80`73b41700 fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession
<snip>
dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1
fffffa80`518392e0 fffffa80`31fafa60
fffffa80`51839300 fffffa60`00becb58 mpio!MPIOPathVerify
dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1
fffffa80`31fafa60 fffffa80`31fafab0
fffffa80`31fafa80 fffffa60`00becb58 mpio!MPIOPathVerify
dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1
fffffa80`31fafab0 fffffa80`31fafc30
fffffa80`31fafad0 fffffa60`00becb58 mpio!MPIOPathVerify
dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1
fffffa80`31fafc30 fffffa80`5183f9a0
fffffa80`31fafc50 fffffa60`00becb58 mpio!MPIOPathVerify
dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1
fffffa80`5183f9a0 fffffa80`4ea092e0
fffffa80`5183f9c0 fffffa60`00becb58 mpio!MPIOPathVerify
dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1
fffffa80`4ea092e0 fffffa80`5183f9f0
fffffa80`4ea09300 fffffa60`00becb58 mpio!MPIOPathVerify
Voila! Here are my SMB work items in the Delayed WorkQueue, waiting to be processed. Another interesting observation: the queue had around 18,023 work items waiting to be picked up by the worker threads, and most of them were related to MPIO. Investigation of the Delayed WorkQueue’s worker threads revealed that they were blocked processing MPIO work items in the DSM driver for the storage.
I picked one worker entry from the _KQUEUE->ThreadListHead of the Delayed WorkQueue and validated with !pool that it lies within a thread object’s pool allocation. Since the _ETHREAD structure is 0x450 bytes and the pool allocation is 0x490 bytes, the actual thread object starts 0x40 bytes into that pool block.
1: kd> !pool 0xfffffa80`30a398d8 2
Pool page fffffa8030a398d8 region is Nonpaged pool
*fffffa8030a396e0 size: 490 previous size: 90 (Allocated) *Thre (Protected)
Pooltag Thre : Thread objects, Binary : nt!ps
1: kd> ?? sizeof(nt!_ETHREAD)
unsigned int64 0x450
1: kd> !thread fffffa8030a396e0+40
THREAD fffffa8030a39720 Cid 0004.0044 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
fffffa8031e8fe58 SynchronizationEvent
Not impersonating
DeviceMap fffff880000073a0
Owning Process fffffa80309ca110 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 2015907 Ticks: 2178462 (0:09:26:24.225)
Context Switch Count 42461
UserTime 00:00:00.000
KernelTime 00:00:14.976
Win32 Start Address nt!ExpWorkerThread (0xfffff80001ec07c8)
Stack Init fffffa600232bdb0 Current fffffa600232b800
Base fffffa600232c000 Limit fffffa6002326000 Call 0
Priority 15 BasePriority 12 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffffa60`0232b840 fffff800`01ebb6fa nt!KiSwapContext+0x7f
fffffa60`0232b980 fffff800`01eb035b nt!KiSwapThread+0x13a
fffffa60`0232b9f0 fffffa60`00c9c698 nt!KeWaitForSingleObject+0x2cb
fffffa60`0232ba80 fffffa60`00c97a06 MyBadDSM+0xb698
fffffa60`0232bb10 fffffa60`00becca3 MyBadDSM+0x6a06
fffffa60`0232bb60 fffff800`02119693 mpio!MPIOPathVerify+0x14b
fffffa60`0232bcc0 fffff800`01ec08c3 nt!IopProcessWorkItem+0x27
fffffa60`0232bcf0 fffff800`020c3ee7 nt!ExpWorkerThread+0xfb
fffffa60`0232bd50 fffff800`01ef6616 nt!PspSystemThreadStartup+0x57
fffffa60`0232bd80 00000000`00000000 nt!KxStartSystemThread+0x16
At this point it was evident that something had gone wrong with either the storage or the HBA adapter. Inspection of the MyBadDSM.sys code could now reveal whether the problem was in the driver itself, or whether something had gone wrong with the HBA or the fiber connection to the storage.
This brings me to the end of this post. I hope you enjoyed reading it. Until next time!