This is Ron Stock from the Global Escalation Services team and I recently worked with a customer to determine which misbehaving driver was crashing their critical server. This particular crash was a STOP 0x00000020 which maps to KERNEL_APC_PENDING_DURING_EXIT.
The KERNEL_APC_PENDING_DURING_EXIT bugcheck type indicates the APC disable count for a thread was not equal to zero when the thread exited. The APC disable count is a field in the _KTHREAD structure and it is decremented when drivers disable APCs by calling functions such as KeEnterCriticalRegion, FsRtlEnterFileSystem or by acquiring a mutex. Disabling APC delivery to a thread is the equivalent of hanging the “Do Not Disturb” sign on your door. When drivers need to perform a critical operation they ‘hang the sign on the door’ to prevent interruption from APCs. When the same driver fails to ‘take the sign off the door’ by calling KeLeaveCriticalRegion, FsRtlExitFileSystem or KeReleaseMutex, the APC disable count is never incremented back to its original value. This forgetful behavior causes a bugcheck because the APC disable count is checked when the thread is exiting. The OS expects this value to be zero on thread exit.
In my case the value was 0xffff (negative 1) indicating a driver had forgot to remove the ‘Do Not Disturb’ sign.
kd> !analyze –v
KERNEL_APC_PENDING_DURING_EXIT (20)
The key data item is the thread's APC disable count.
If this is non-zero, then this is the source of the problem.
Arguments:
Arg1: 0000000000000000, The address of the APC found pending during exit.
Arg2: 000000000000ffff, The thread's APC disable count
Arg3: 0000000000000000, The current IRQL
Arg4: 0000000000000001
Because the value is decremented earlier in time the current call stack is not particularly useful. It merely shows the thread exiting under normal conditions.
0: kd> !thread -1 e
THREAD fffffa8049f04b50 Cid 0004.0998 Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
DeviceMap fffff8a000007ee0
Owning Process fffffa8048cad9e0 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 11503325 Ticks: 0
Context Switch Count 185715 IdealProcessor: 0
UserTime 00:00:00.000
KernelTime 00:00:06.078
Win32 Start Address srv2!SrvProcWorkerThread(0xfffff88003c4b400)
Stack Init fffff88005078db0 Current fffff880050789b0
Base fffff88005079000 Limit fffff88005073000 Call 0
Priority 15 BasePriority 15 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffff880`05078b08 fffff800`01984bd9 nt!KeBugCheckEx
fffff880`05078b10 fffff800`019a1a3d nt!PspExitThread+0xffffffff`fffe3ae9
fffff880`05078c10 fffff800`0195bc8a nt!PspTerminateThreadByPointer+0x4d
fffff880`05078c60 fffff880`03c56769 nt!PsTerminateSystemThread+0x22
fffff880`05078c90 fffff880`03c4b5b6 srv2!SrvProcTerminateWorkerThreadInternal+0x99
fffff880`05078cc0 fffff800`01966e5a srv2!SrvProcWorkerThread+0x1b6
fffff880`05078d40 fffff800`016c0d26 nt!PspSystemThreadStartup+0x5a
fffff880`05078d80 00000000`00000000 nt!KxStartSystemThread+0x16
Driver Verifier is the ideal tool for this type of bugcheck. It has a feature called Critical Region logging which tracks the call stack and KTRHEAD value for each call to either KeEnterCriticalRegion() and KeLeaveCriticalRegion(). I had the customer enable this logging by selecting the “Miscellaneous checks” option in Driver Verifier using these steps-
- Run Verifier.exe
- Select “Create custom settings (For code developers)”
- Select individual settings from a full list
- Select Miscellaneous checks
- Select Driver Names from a list
- Manually choose all of the third-party drivers.
- Reboot after making the changes.
After running through the steps above, we gathered another STOP 0x00000020 dump. I confirmed the “Miscellaneous checks” option was enabled by using the !verifier command
0: kd> !verifier
Verify Level 800 ... enabled options are:
Miscellaneous checks enabled
The stack in this new dump was in the same SrvProcWorkerThread thread exit path so we had a consistent pattern. The thread with the negative APC Disable count was fffffa804b5be040.
0: kd> !thread -1 e
THREAD fffffa804b5be040 Cid 0004.082c Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
DeviceMap fffff8a000007ee0
Owning Process fffffa8048cad9e0 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 4458237 Ticks: 0
Context Switch Count 36067 IdealProcessor: 0
UserTime 00:00:00.000
KernelTime 00:00:01.218
Win32 Start Address srv2!SrvProcWorkerThread(0xfffff88004827400)
Stack Init fffff88005cc6db0 Current fffff88005cc69b0
Base fffff88005cc7000 Limit fffff88005cc1000 Call 0
Priority 15 BasePriority 15 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffff880`05cc6b08 fffff800`0198dbd9 nt!KeBugCheckEx
fffff880`05cc6b10 fffff800`019aaa3d nt!PspExitThread+0xffffffff`fffe3ae9
fffff880`05cc6c10 fffff800`01964c8a nt!PspTerminateThreadByPointer+0x4d
fffff880`05cc6c60 fffff880`048326d9 nt!PsTerminateSystemThread+0x22
fffff880`05cc6c90 fffff880`048275b6 srv2!SrvProcTerminateWorkerThreadInternal+0x99
fffff880`05cc6cc0 fffff800`0196fe5a srv2!SrvProcWorkerThread+0x1b6
fffff880`05cc6d40 fffff800`016c9d26 nt!PspSystemThreadStartup+0x5a
fffff880`05cc6d80 00000000`00000000 nt!KxStartSystemThread+0x16
I dumped the Critical Region log by using the !verifier 200 command. The Critical Region log has enough room for 128 stacks. After dumping the log, the first thing to do is to find the KTHREAD value of the thread with the non-zero APC disable count. Unfortunately in my case thread fffffa804b5be040 didn’t appear in the log. In fact all 128 stacks had a driver named Suspect.sys calling KeEnterCriticalRegion or KeLeaveCriticalRegion. Note:To protect our vendor friends, I renamed the actual sys file in this article to suspect.sys.
The customer disabled the suspect.sys driver hoping this was the driver forgetting to re-enable APCs. If nothing else, this would perhaps remove the noisy suspect.sys from the log in the next dump.
0: kd> !verifier 200
Enter/Leave Critical Region log:
There are up to 0x80 entries in the log.
Displaying all the log entries.
======================================================================
Thread fffffa8048ce4b50
fffff80001b74293 nt!VerifierKeLeaveCriticalRegion+0xc3
fffff8800100aafa Suspect.sys+0xaafa
fffff88001001e30 Suspect.sys+0x1e30
fffff80001abc68c nt!IopLoadUnloadDriver+0x1c
fffff800016e1641 nt!ExpWorkerThread+0x111
fffff8000196ee5a nt!PspSystemThreadStartup+0x5a
fffff800016c8d26 nt!KiStartSystemThread+0x16
======================================================================
Thread fffffa8048ce4b50
fffff80001b6b0a2 nt!VerifierKeEnterCriticalRegion+0x92
fffff880010062a3 Suspect.sys+0x62a3
fffff8800100a7e2 Suspect.sys+0xa7e2
fffff88001001e30 Suspect.sys+0x1e30
fffff80001abc68c nt!IopLoadUnloadDriver+0x1c
fffff800016e1641 nt!ExpWorkerThread+0x111
fffff8000196ee5a nt!PspSystemThreadStartup+0x5a
Unfortunately, the system continued to crash and in the next dump the critical region log was completely empty. My guess is the complier was optimizing the KeEnterCriticalRegion and KeLeaveCriticalRegion calls in the driver, causing them to be inlined and skipping the call to VerifierKeLeaveCriticalRegion/VerifierKeEnterCriticalRegion. I needed another attack plan.
There is another Verifier option called I/O Verification and it works in a similar way to the steps below. Please note that this functionality is not documented and may be subject to change at any time.
- A call to IoCallDriver() is made to send an IO packet to a driver associated with a device.
- Verifier hooks the call.
- Verifier creates a structure to record state info.
- Verifier fills in the structure with data including the thread’s APC Disable Count.
- Next Verifier calls the normal IoCallDriver() routine to “continue” the call made in step 1.
- The driver does its work (disables and re-enables APCs as needed)
- The call to IoCallDriver() returns when the driver is finished.
- Verifier checks the real APC count in the thread. Next it compares the value to the recorded value in the structure from step 4. If the two values do not match, then Verifier crashes the machine so we can pull the bad driver out of the dump.
I had the customer enable I/O Verification using these steps-
- Run Verifier.exe
- Select “Create custom settings (For code developers)”
- Select individual settings from a full list
- Select I/O Verification
- Select Driver Names from a list
- Manually choose all of the third-party drivers.
- Reboot after making the changes.
As we expected, the machine crashed again because of the APC Disable issue. Because we enabled I/O Verification, the bugcheck type changed to DRIVER_VERIFIER_DETECTED_VIOLATION and now we have a smoking gun.
I used the !verifier command to review the Verifier settings. The output below shows “I\O subsystem checking enabled” which confirms I/O Verification was been enabled.
0: kd> !verifier
Verify Level 810 ... enabled options are:
Io subsystem checking enabled
The parameters to KeBugcheck reconfirmed the APC disable count was -1 (ffff). And this time we have an additional breadcrumb, the driver dispatch routine address.
DRIVER_VERIFIER_DETECTED_VIOLATION (c4)
A device driver attempting to corrupt the system has been caught. This is
because the driver was specified in the registry as being suspect (by the
administrator) and the kernel has enabled substantial checking of this driver.
If the driver attempts to corrupt the system, bugchecks 0xC4, 0xC1 and 0xA will
be among the most commonly seen crashes.
Arguments:
Arg1: 00000000000000c5, Thread APC disable count changed by driver dispatch routine.
Arg2: fffff88001345610, Driver dispatch routine address.
Arg3: 000000000000ffff, Current thread APC disable count.
Arg4: 0000000000000000, Thread APC disable count before calling driver dispatch routine.
The APC disable count is decremented each time a driver calls
KeEnterCriticalRegion, FsRtlEnterFileSystem, or acquires a mutex. The APC
disable count is incremented each time a driver calls KeLeaveCriticalRegion,
FsRtlExitFileSystem, or KeReleaseMutex. Since these calls should always be in
pairs, this value should be zero when a thread exits. A negative value
indicates that a driver has disabled APC calls without re-enabling them. A
positive value indicates that the reverse is true.
Notice the Verifier functions on the call stack which we I leveraged for the “saved state” information I discussed above in the I/O Verification architecture.
0: kd> kn
# Child-SP RetAddr Call Site
00 fffff880`05db1b08 fffff800`0174a9c0 nt!KeBugCheckEx
01 fffff880`05db1b10 fffff800`01b66b4ent!VfBugCheckNoStackUsage+0x30
02 fffff880`05db1b50 fffff800`01b6cc2e nt!VfAfterCallDriver+0x33e
03 fffff880`05db1ba0 fffff880`04054756 nt!IovCallDriver+0x57e
04 fffff880`05db1c00 fffff880`0404b7b0 srv2!Smb2ExecuteRead+0x9a6
05 fffff880`05db1c80 fffff880`0404b6fb srv2!SrvProcessPacket+0xa0
06 fffff880`05db1cc0 fffff800`01960e5a srv2!SrvProcWorkerThread+0x2fb
07 fffff880`05db1d40 fffff800`016bad26nt!PspSystemThreadStartup+0x5a
08 fffff880`05db1d80 00000000`00000000nt!KiStartSystemThread+0x16
Next I dumped the driver dispatch routine noted in the KeBugCheckoutput above using the ln command. This points to fltmgr!FltpDispatch which tells me we have a filter manager minifilter driver making calls to disable APCs but rudely forgetting to re-enable them. As I noted above we save the state info before the call to IoCallDriver().
0: kd> ln fffff88001345610
(fffff880`01345610) fltmgr!FltpDispatch | (fffff880`01345710) fltmgr!FltReleasePushLock
Exact matches:
fltmgr!FltpDispatch (<no parameter info>)
Now the goal was to determine which minifilter is leaving the “Do Not Disturb” sign on the door and forgetting to remove it. We can find this using the fltmgr device object. The “saved state” structure is passed to VfAfterCallDriver as the first parameter so I switched to the VfAfterCallDriverframe (second frame) to dig it out. I used the /r flag to show the original values of the registers for this frame.
0: kd> .frame /r 2
02 fffff880`05db1b50 fffff800`01b6cc2e nt!VfAfterCallDriver+0x33e
rax=0000000000000000 rbx=fffffa804b729790 rcx=00000000000000c4
rdx=00000000000000c5 rsi=fffffa804a3b0000 rdi=fffff8000183ce80
rip=fffff80001b66b4e rsp=fffff88005db1b50 rbp=fffffa804de8c290
r8=fffff88001345610 r9=000000000000ffff r10=fffff80001b7a640
r11=0000000000000000 r12=000000004de8c290 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei ng nz na pe nc
cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00000282
nt!VfAfterCallDriver+0x33e:
fffff800`01b66b4e cc int 3
Parameter 1 on x64 is always passed via rcx. I dumped the assembly for VfAfterCallDriver and confirmed the value in rcx (the base of the saved state structure) is moved to rbx.
0: kd> u nt!VfAfterCallDriver
nt!VfAfterCallDriver:
fffff800`01b66810 48895c2410 mov qword ptr[rsp+10h],rbx
fffff800`01b66815 48896c2418 mov qword ptr[rsp+18h],rbp
fffff800`01b6681a 4889742420 mov qword ptr[rsp+20h],rsi
fffff800`01b6681f 57 push rdi
fffff800`01b66820 4154 push r12
fffff800`01b66822 4155 push r13
fffff800`01b66824 4883ec30 sub rsp,30h
fffff800`01b66828 488bfa mov rdi,rdx
fffff800`01b6682b 488bd9 mov rbx,rcx
The device object is stored in the save state information at offset 0xa0.
0: kd> ? fffffa804b729790 + 0xa0
Evaluate expression: -6046048151504 = fffffa80`4b729830
0: kd> dq fffffa80`4b729830 l1
fffffa80`4b729830 fffffa80`4a3b0060
0: kd> !devobj fffffa80`4a3b0060
Device object (fffffa804a3b0060) is for:
\FileSystem\FltMgr DriverObject fffffa80491fb7c0
Current Irp 00000000 RefCount 0 Type 00000008 Flags 00040000
DevExt fffffa804a3b01b0 DevObjExt fffffa804a3b0208
ExtensionFlags (0x80000800) DOE_DEFAULT_SD_PRESENT, DOE_DESIGNATED_FDO
Characteristics (0000000000)
AttachedTo (Lower) fffffa804a3b1030 \FileSystem\Ntfs
Device queue is not busy.
0: kd> !devstack fffffa80`4a3b0060
!DevObj !DrvObj !DevExt ObjectName
> fffffa804a3b0060 \FileSystem\FltMgr fffffa804a3b01b0
fffffa804a3b1030 \FileSystem\Ntfs fffffa804a3b1180
As http://msdn.microsoft.com/en-us/library/ff541610(v=vs.85).aspxexplains – “The filter manager is installed with Windows, but it becomes active only when a minifilter driver is loaded. The filter manager attaches to the file system stack for a target volume. A minifilter driver attaches to the file system stack indirectly, by registering with the filter manager for the I/O operations the minifilter driver chooses to filter.”
Using the power of the fltkd extension, I dumped the volume information associated with this device object. From the output below, we can extract the name of the filter attached to the volume. The culprit is named BadDriver.sys. The customer removed the driver and the problem went away long enough for the vendor to create an update for BadDriver.sys. Happy Ending!
0: kd> !fltkd.volume fffffa80`4a3b0060
FLT_VOLUME: fffffa804a3b0800 "\Device\HarddiskVolume3"
FLT_OBJECT: fffffa804a3b0800 [04000000] Volume
RundownRef : 0x000000000000020a (261)
PointerCount : 0x00000001
PrimaryLink : [fffffa804ae06810-fffffa804a2b16f0]
Frame : fffffa8049fcd420 "Frame 0"
Flags : [00000064] SetupNotifyCalledEnableNameCaching FilterAttached
FileSystemType : [00000002] FLT_FSTYPE_NTFS
VolumeLink : [fffffa804ae06810-fffffa804a2b16f0]
DeviceObject : fffffa804a3b0060
DiskDeviceObject : fffffa804a1c0350
FrameZeroVolume : fffffa804a3b0800
VolumeInNextFrame : 0000000000000000
Guid : "\??\Volume{552791b0-455d-11de-b7b9-00145eed6acc}"
CDODeviceName : "\Ntfs"
CDODriverName : "\FileSystem\Ntfs"
TargetedOpenCount : 258
Callbacks : (fffffa804a3b0910)
ContextLock : (fffffa804a3b0cf8)
VolumeContexts : (fffffa804a3b0d00) Count=0
StreamListCtrls : (fffffa804a3b0d08) rCount=2871
FileListCtrls : (fffffa804a3b0d88) rCount=0
NameCacheCtrl : (fffffa804a3b0e08)
InstanceList : (fffffa804a3b0890)
FLT_INSTANCE: fffffa804b5b1010 "BadDriver.sys Instance" "189600"