Hello all, East here. I wanted to give you a hint on how to use a RHS dump to find what thread was part of the Windows Server 2008 R2 cluster RHS recovery deadlock.
First let me start off with letting you know that Windows Server 2008 R2 will create two types of user-mode dumps:
1 - A heap dump (.hdmp) - which is an extended mini-dump that contains additional data such as the process memory.
2 - A mini dump (.mdmp) - which will only contain mainly stack information and not enough info to help you get further.
When a Windows Server 2008 R2 cluster has a resource that does not respond to one of the main cluster queries in a timely fashion, RHS will try to recover that resource by restarting it in an isolated process as well as creating a WER report and a user-mode dump of the deadlock.
NOTE: More info about RHS and WER reports can be seen in the following blog - http://blogs.msdn.com/b/clustering/archive/2009/06/27/9806160.aspx
When a deadlock is detected the cluster log will show:
"00000dd4.00000aac::2011/05/13-16:12:39.559 ERR [RHS] WER report is submitted. Result : WerReportQueued."
You should find the WER report file and dumps located at the default location:
<SystemDrive>\Users\All Users\Microsoft\Windows\WER\ReportQueue
Here is what a cluster created Windows Error Report looks like:
Version=1
EventType=WSFC Resource Deadlock
EventTime=129497767591366327
ReportType=1
Consent=1
ReportIdentifier=d2277837-7d7b-11e0-b5d0-00155d06402c
Response.type=4
Sig[0].Name=Resource Name
Sig[0].Value=New flexdisk
Sig[1].Name=Resource Type
Sig[1].Value=flexdisk
Sig[2].Name=Call Type
Sig[2].Value=ISALIVE
DynamicSig[1].Name=OS Version
DynamicSig[1].Value=6.1.7601.2.1.0.274.10
DynamicSig[2].Name=Locale ID
DynamicSig[2].Value=1033
File[0].CabName=memory.hdmp
File[0].Path=WERC1F2.tmp.hdmp <----------------------------- Name of the dump file
File[0].Flags=2097152
File[0].Type=3
File[0].Original.Path=C:\Windows\Temp\WERC1F2.tmp.hdmp
File[1].CabName=minidump.mdmp
File[1].Path=WERC2CE.tmp.mdmp
File[1].Flags=2162690
File[1].Type=2
File[1].Original.Path=C:\Windows\Temp\WERC2CE.tmp.mdmp
FriendlyEventName=WSFC Resource Deadlock
ConsentKey=WSFC Resource Host Monitor
AppName=Failover Cluster Resource Host Subsystem <--------- The application that had an issue - RHS
AppPath=C:\Windows\Cluster\rhs.exe
ReportDescription=WSFC Resource Deadlock
Now you have the area to look for in the cluster log, and an idea of which resource caused RHS to deadlock, and what it was doing at the time:
The Resource Name = New flexdisk
The Resource Type = flexdisk
The type of call the resource was engaged in = ISALIVE
You have all this information in the cluster log as well:
00000dd4.00000aac::2011/05/13-16:12:39.136 ERR [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'New flexdisk'. RHS deadlock during an Isalive check for the New flexdisk resource.
00000dd4.00000aac::2011/05/13-16:12:39.136 INFO [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.
00000dd4.00000aac::2011/05/13-16:12:39.136 ERR [RHS] Resource New flexdisk handling deadlock. Cleaning current operation and terminating RHS process.
00000dd4.00000aac::2011/05/13-16:12:39.136 ERR [RHS] About to send WER report.
000004dc.00000fb4::2011/05/13-16:12:39.136 WARN [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'New flexdisk', gen(3) result 4.
000004dc.00000fb4::2011/05/13-16:12:39.136 INFO [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'New flexdisk' consecutive failure count 1.
00000dd4.00000aac::2011/05/13-16:12:39.559 ERR [RHS] WER report is submitted. Result : WerReportQueued. WER report and user-mode dump created
000004dc.00000fb4::2011/05/13-16:12:39.559 ERR [RCM] rcm::RcmMonitor::RecoverProcess: Recovering monitor process 3540 / 0xdd4
000004dc.00000fb4::2011/05/13-16:12:39.575 INFO [RCM] Created monitor process 3280 / 0xcd0
00000cd0.00000200::2011/05/13-16:12:39.575 INFO [RHS] Initializing.
000004dc.00000fb4::2011/05/13-16:12:39.591 INFO [RCM] rcm::RcmResource::ReattachToMonitorProcess: (New flexdisk, OfflineCallIssued)
000004dc.00000fb4::2011/05/13-16:12:39.591 WARN [RCM] Canceling pending control GET_RO_PRIVATE_PROPERTIES for resource 'New flexdisk' due to monitor crash.
Now if you want to get adventurous and review the dump file to see the call stacks of RHS and what the resource was doing, you will look at the WER dump with the extension “.hdmp”. Since we are looking at a process with multiple threads, we need to find the one that is showing the WER reporting calls and the one showing the RHS deadlock.
Run the following command after opening the dump under your debugger “~*k”. In the below example, thread 3 is the one we are interested in. You can tell this because the process and thread ID matches the cluster log above. If you don’t have a cluster log, this is the thread with wer.dll on the stack.
0:000> ~*k
. 0 Id: dd4.9d4 Suspend: 0 Teb: 000007ff`fffdd000 Unfrozen
Child-SP RetAddr Call Site
00000000`001cf128 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa
00000000`001cf130 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8
00000000`001cf230 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0
00000000`001cf2c0 00000000`ff74802c rhs!cxl::WaitHandle::WaitAny+0x3f
00000000`001cf3b0 00000000`ff7458a1 rhs!Rhs::WaitForTermination+0x74
00000000`001cf450 00000000`ff769099 rhs!wmain+0x4b1
00000000`001cfbd0 00000000`772c652d rhs!__wmainCRTStartup+0x13d
00000000`001cfc10 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd
00000000`001cfc40 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
1 Id: dd4.d0c Suspend: 0 Teb: 000007ff`fffdb000 Unfrozen
Child-SP RetAddr Call Site
00000000`00c3fa98 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa
00000000`00c3faa0 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8
00000000`00c3fba0 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0
00000000`00c3fc30 00000000`ff75785c rhs!cxl::WaitHandle::WaitAny+0x3f
00000000`00c3fd20 00000000`ff766988 rhs!PingAliveCallScheduler::Run+0x184
00000000`00c3fda0 00000000`772c652d rhs!cxl::Thread::ThreadStartRoutine+0x18
00000000`00c3fdf0 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd
00000000`00c3fe20 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
2 Id: dd4.e78 Suspend: 0 Teb: 000007ff`fffd9000 Unfrozen
Child-SP RetAddr Call Site
00000000`00baf998 00000000`774eb007 ntdll!ZwWaitForMultipleObjects+0xa
00000000`00baf9a0 00000000`772c652d ntdll!TppWaiterpThread+0x14d
00000000`00bafc40 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd
00000000`00bafc70 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
3 Id: dd4.aac Suspend: 0 Teb: 000007ff`fffd7000 Unfrozen
Child-SP RetAddr Call Site
00000000`00cbe5e8 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa
00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8
00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3
00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a
00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46
00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20
00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54
00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4
00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb
00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386
00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a
00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f
00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187
00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324
00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd
00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
4 Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen
Child-SP RetAddr Call Site
00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa
00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab
00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9
00000000`00e2f1a8 00000000`00000000 0x326aa0
<snipped to save space>
We will switch context to thread 3 by using the “ ~3s ” command, and follow that with “ kn “ to show the stack with frame numbers. Debug Tip: Since we already know that its thread id is “aac” we can also switch directly to thread 3 by using its thread id in the following command syntax: “ ~~[aac]s ”
0:003> kn
# Child-SP RetAddr Call Site
00 00000000`00cbe5e8 000007fe`fd611420 ntdll!NtWaitForMultipleObjects+0xa
01 00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8
02 00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3
03 00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a
04 00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46
05 00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20
06 00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54
07 00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4
08 00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb
09 00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386
0a 00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a
0b 00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f
0c 00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187
0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324
0e 00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd
0f 00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
The frame of our concern will be frame d. You can either go the context ( /c ) or just display the registers ( /r ) for that frame context:
0:003> .frame /r d
0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324
rax=000007fffffd7000 rbx=0000000000346920 rcx=00000000007316c0
rdx=0000000000000000 rsi=000000000035f8b0 rdi=000000000035f8b0
rip=00000000ff74b270 rsp=0000000000cbfb60 rbp=0000000000000000
r8=00000000ffffffff r9=00000000000004ff r10=0000000000000000
r11=0000000000cbe810 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=000000004dcd5877
iopl=0 nv up ei pl zr na po nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
rhs!RhsCall::DeadlockMonitor+0x324:
00000000`ff74b270 448b442430 mov r8d,dword ptr [rsp+30h] ss:00000000`00cbfb90=ffffffff
Here is where we get into tricky territory about where to look. I know from experience that the RDI register from above is going to end up containing my information for this resource. Another way to verify you are in the right area, dumping RDI will show the CallType.
Dump that out to find the thread containing the stack with:
0:003> dqs 000000000035f8b0
00000000`0035f8b0 00000000`ff704eb8 rhs!IsAliveCall::`vftable' == CallType
00000000`0035f8b8 00000000`00000004
00000000`0035f8c0 00000000`001e7da0
00000000`0035f8c8 00000000`00000e48 <-- Thread id of concerning resource
00000000`0035f8d0 00000000`00346920
00000000`0035f8d8 0000000c`00000001
00000000`0035f8e0 00000000`00000003
00000000`0035f8e8 00000000`4dcd5876
00000000`0035f8f0 00000000`048d2dd3
00000000`0035f8f8 00000000`00000100
00000000`0035f900 00000001`00000100
00000000`0035f908 00000000`00000000
00000000`0035f910 00430000`00000031
00000000`0035f918 8a0000a4`52e48988
00000000`0035f920 004c0041`00530049
00000000`0035f928 00450045`00560049
If we scan our previous “~*k” output, we should find the thread id above to match our thread with the problem resource, FlexRes:
4 Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen
Child-SP RetAddr Call Site
00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa
00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab
00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9
00000000`00e2f1a8 00000000`00000000 0x326aa0
In this case I was using an in house test app that just went to sleep instead of responding to the LooksAlive() check, which caused IsAlive() to fail.
Also if you check the cluster log 5 minutes prior to the RhsCall::DeadlockMonitor error you will see the resource and thread id as well, showing what was going on prior to the Deadlock:
00000dd4.00000e48::2011/05/13-16:07:37.270 INFO [RES] flexdisk <New flexdisk>: Set new entry point = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:1 Error = 0.
000004dc.00000fd4::2011/05/13-16:07:37.286 WARN [RCM] ResourceControl(SET_PRIVATE_PROPERTIES) to New flexdisk returned 5024.
00000dd4.00000e48::2011/05/13-16:07:38.023 INFO [RES] flexdisk <New flexdisk>: Set new entry point = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:0 Error = 0.
00000dd4.00000e48::2011/05/13-16:07:38.023 INFO [RES] flexdisk <New flexdisk>: Going to Hang in IsAlive.
In most cases the cluster log can provide enough information to help you narrow down possible cluster resource issues. If not, the user-mode dump may be able to help verify additional data, especially if you are debugging a third party resource dll that is deadlocking RHS.