Quantcast
Channel: Ntdebugging Blog
Viewing all 141 articles
Browse latest View live

Part 2: Got Stack? No. We ran out and kv won’t tell me why!

$
0
0

Hello. It’s Ryan again with the second installment of my stack depletion walkthrough.  Part 1 of this blog covered the initial analysis of a kernel memory dump captured due to a Stop 0x7f EXCEPTION_DOUBLE_FAULT.  Our initial analysis revealed that kv was not able to provide us with a useful stack backtrace. Background information relating to Task States and Double Faults were also covered. If you haven’t yet reviewed this blog, you can find it at Part 1. 

 

Previously, in part one of this blog, we reviewed the memory dump of a stop 0x7f EXCEPTION_DOUBLE_FAULT  and found that we were not provided with a valid stack backtrace. Without a valid stack backtrace, we were unable to identify what depleted the stack. Whenever I am in the situation where kv is not able to properly walk the stack, my next course of action is to manually dump out the memory within the stack range using the dps command.

 

In the previous blog, we ran the !thread command to obtain the stack base and limit to view the thread in our fictiously named process, StackHog.exe.

 

!thread   

Owning Process            874c6800       Image:         StackHog.exe  

...

Base b8ae9000 Limit b8ae6000

 

Let’s pass this address range to the dps command. The dps command will display the contents of memory and treat the data dumped out as pointers. It will then try to match up these pointer addresses with symbol information in the same manner that kv does.

 

dps  b8ae6000 b8ae9000

.

.

.

 

omitting extremely long output

 

I'll spare the lengthy dps output here and instead describe what was observed.  When dumping the stack in this manner, I noticed that one product’s drivers were appearing on the stack over and over again. This product also provided the executable that was running as the current process (I obtained this information from the !thread command that I ran to obtain the limit and base values).   Running lm kv m and !lmi against these drivers verified that they were in fact all from the same product (StackHog.exe). 

 

Using dps in this manner will often provide a good idea of what may have been using the stack. However, there are a few problems that may affect the reliability of this method. Dps simply dumps out whatever data is present anywhere on the stack.  One problem is there can often be trash left over on the stack from previous activity that is unrelated to our present stack. Also, there may be trap handlers and other data that shows up to further complicate the call flow picture. In addition, the information will not clearly show the order of calls like a kv stack backtrace output will. The output can also be very lengthy since the entire range is displayed. On a stack where each call frame averages 8 frames, you may be only interested in the symbols from the return address in each frame. I have observed some stack frames where dps may resolve two or three symbols for that frame.  All of this means that while dps is a good tool and often useful, on occasion some of the output from dps may be confusing instead of revealing.  Sometimes when multiple unrelated code from various vendors (including Microsoft) are on the stack, you may need a more complete understanding of the true call flow.  For these reasons, whenever possible, I want to be able to see the kv stack backtrace to display the call flow that led up to the crash. 

 

Let's see if we can help kv to reconstruct the stack. Since the stack backtrace did not display properly on its own, we will need to help out a little to get the backtrace started. If we can provide  good starting point  values, the debugger can often dump out the rest of the stack. Also, when dealing with an overflowed stack, we will want to use the kf option instead of kv.  The kf command will give us a better idea of how much space a driver and the calls it made are occupying on the stack.  It does this by listing the amount of stack space between the stack frames. If you take these stack usage values as fact, you are assuming that the stack was backtraced correctly. Sometimes this isn’t the case when you don’t have symbols for all the code on the stack.  So to investigate the stack usage, we need to get a good backtrace. The k command accepts parameters that will help it display the stack using address values that you provide. We need to provide a few addresses to any of the various versions of the k command by using the equal sign as I will demonstrate shortly. We need to provide the BasePtr, the StackPtr, and the InstructionPtr.   For more information on the k command, refer to the msdn documentation .

Since the real problem here is that various modules have used up all the stack, we simply need to dump out most of that stack to see where most of the usage is. It isn't important that we identify what was going on when we died or what was at the very top of the stack. We just want to see as much of the stack as we can dumped out in an easy to understand format by using kf. 

 

 So where can we find these values? I'll start by dumping out the stack using the address of the stack limit. Let's dump out this area of memory. I'll start by clearing the screen so that I can examine the output

 

3: kd>.cls

 

3: kd> dps  b8ae6000 b8ae9000

 

                               <omitting output>

 

I wasn’t able to locate any valid patterns in the very top of the stack. This was probably due to lack of symbols and valid stack related register addresses combined with FPO or other optimizations. Things started making sense further down the stack in the stack range that I have listed below. Observe the patterns I have highlighted. Note that the symbol output is listed next to the return addresses. I will try to display my stack starting at this point. B8ae6100 is very close to the stack limit so we won’t be missing much of the output. We will get most of the stack output that we need to see if we start here.

 

 

b8ae60ec  b8ae6100    <----------------------------------points to the next stack frame pointer (saved ebp) below

b8ae60f0  8081df65 nt!IofCallDriver+0x45 <---possible Return address. Start here.

b8ae60f4  8763f718

b8ae60f8  87758bd8

b8ae60fc  8b4abb00

b8ae6100  b8ae6128     This points to the next frame

b8ae6104  f7a2ec45 fltmgr!FltpDispatch+0x6f <----possible return address

b8ae6108  8b4abb00

b8ae610c  87758bd8

b8ae6110  00000000

b8ae6114  89751350

b8ae6118  00000080

b8ae611c  00000000

b8ae6120  b8ae6130

b8ae6124  8084cff9 nt!MmIsAddressValid+0xf

b8ae6128  b8ae613c

b8ae612c  8081df65 nt!IofCallDriver+0x45

b8ae6130  8765b6a8

b8ae6134  87758bd8

b8ae6138  89751350

b8ae613c  b8ae6144

b8ae6140  b958e196 BossHog+0x1196

b8ae6144  b8ae61c4

b8ae6148  b958f4bc BossHog!StackEater+0x80c

 

I'll start by identifying all of the stack address in the hopes of finding any candidates for saved stack frame pointers (saved ebp values). Based on the limit and base, these addresses will all start with b8ae. The next digit will be 6, 7, 8, or 9.  I have identified all of the possible values. Next, I looked for patterns of addresses that are pointing to other pointers below them to create a chain.  Next, I will look for possible return addresses. They should appear on the line right under the saved EBP value.

 

I'll try dumping the stack using the values from the first possible frame. The format is

 

kf=BasePtr, StackPtr, InstructionPtr

 

So now, it appears that we have a pattern, let me pass in these numbers to the kf command. If this works, then the stack walker will dump the stack out from this point down up to the maximum that you have set by using the .kframes command. I’ll start by raising the number of stack frames displayed using .kframes.

 

3: kd> .kframes 200

 

Default stack trace depth is 0n512 frames

 

Now, lets dump the stack out using kf. Kf will display the number of bytes of stack space used in hex on each line before it displays the  frame.

 

3: kd> kf=b8ae6100 b8ae60ec 8081df65

  Memory  ChildEBP RetAddr 

          b8ae6100 f7a2ec45 nt!IofCallDriver+0x45

       28 b8ae6128 8081df65 fltmgr!FltpDispatch+0x6f

       14 b8ae613c b958e196 nt!IofCallDriver+0x45

WARNING: Stack unwind information not available. Following frames may be wrong.

        8 b8ae6144 b958f4bc BabyHog+0x1196

       80 b8ae61c4 8081df65 BabyHog!HogFarm+0x88c

       14 b8ae61d8 b80c18a6 nt!IofCallDriver+0x45

       84 b8ae625c b80cf367 BossHog+0x78a6

       10 b8ae626c b80cf3b7 BossHog+0x15367

       28 b8ae6294 8081df65 BossHog!DEVICEDISPATCH::DispatchPassThrough+0x48

       14 b8ae62a8 b76fbcf7 nt!IofCallDriver+0x45

       84 b8ae632c b7709ae6 BossHog01+0x7cf7

       10 b8ae633c b7709b36 BossHog01+0x15ae6

       28 b8ae6364 8081df65 BossHog01!DEVICEDISPATCH::DispatchPassThrough+0x48

       14 b8ae6378 8081e4ed nt!IofCallDriver+0x45

       18 b8ae6390 8085114a nt!IoPageRead+0x109

       9c b8ae642c 8085ea66 nt!MiDispatchFault+0xece

       84 b8ae64b0 8088c798 nt!MmAccessFault+0x89e

        0 b8ae64b0 808b64a6 nt!_KiTrap0E+0xdc

       c8 b8ae6578 bae5af2d nt!CcMapData+0x8c

       20 b8ae6598 bae5849b Ntfs!NtfsMapStream+0x4b

       74 b8ae660c bae5adf0 Ntfs!NtfsReadMftRecord+0x86

       38 b8ae6644 bae5afac Ntfs!NtfsReadFileRecord+0x7a

       38 b8ae667c bae19903 Ntfs!NtfsLookupInFileRecord+0x37

      110 b8ae678c bae1a6c4 Ntfs!NtfsLookupAllocation+0xdd

      1d0 b8ae695c bae1a87c Ntfs!NtfsPrepareBuffers+0x25d

      1dc b8ae6b38 bae1b1a6 Ntfs!NtfsNonCachedIo+0x1ee

       ec b8ae6c24 bae1b0c9 Ntfs!NtfsCommonRead+0xaf5

      1ac b8ae6dd0 8081df65 Ntfs!NtfsFsdRead+0x113

       14 b8ae6de4 f7a2ec45 nt!IofCallDriver+0x45

       28 b8ae6e0c 8081df65 fltmgr!FltpDispatch+0x6f

       14 b8ae6e20 b958e196 nt!IofCallDriver+0x45

        8 b8ae6e28 b958f4bc BabyHog+0x1196

       80 b8ae6ea8 8081df65 BabyHog!HogFarm+0x88c

       14 b8ae6ebc b80c18a6 nt!IofCallDriver+0x45

       84 b8ae6f40 b80cf367 BossHog+0x78a6

       10 b8ae6f50 b80cf3b7 BossHog+0x15367

       28 b8ae6f78 8081df65 BossHog!DEVICEDISPATCH::DispatchPassThrough+0x48

       14 b8ae6f8c b76fbcf7 nt!IofCallDriver+0x45

       84 b8ae7010 b7709ae6 BossHog01+0x7cf7

       10 b8ae7020 b7709b36 BossHog01+0x15ae6

       28 b8ae7048 8081df65 BossHog01!DEVICEDISPATCH::DispatchPassThrough+0x48

       14 b8ae705c 8081e4ed nt!IofCallDriver+0x45

       18 b8ae7074 8085114a nt!IoPageRead+0x109

       9c b8ae7110 8085ea66 nt!MiDispatchFault+0xece

       84 b8ae7194 8088c798 nt!MmAccessFault+0x89e

        0 b8ae7194 808b64a6 nt!_KiTrap0E+0xdc

       c8 b8ae725c bae5af2d nt!CcMapData+0x8c

       20 b8ae727c bae5d9d5 Ntfs!NtfsMapStream+0x4b

       30 b8ae72ac bae5f5e4 Ntfs!ReadIndexBuffer+0x8f

      174 b8ae7420 bae5f786 Ntfs!NtfsUpdateFileNameInIndex+0x62

       fc b8ae751c bae5f8c6 Ntfs!NtfsUpdateDuplicateInfo+0x2b0

      208 b8ae7724 bae5c8d9 Ntfs!NtfsCommonCleanup+0x1e82

      170 b8ae7894 8081df65 Ntfs!NtfsFsdCleanup+0xcf

       14 b8ae78a8 f7a2ec45 nt!IofCallDriver+0x45

       28 b8ae78d0 8081df65 fltmgr!FltpDispatch+0x6f

       14 b8ae78e4 b958e196 nt!IofCallDriver+0x45

        8 b8ae78ec b958f472 BabyHog+0x1196

       80 b8ae796c 8081df65 BabyHog!HogFarm+0x842

       14 b8ae7980 b80c18a6 nt!IofCallDriver+0x45

       84 b8ae7a04 b80cf367 BossHog+0x78a6

       10 b8ae7a14 b80cf3b7 BossHog+0x15367

       28 b8ae7a3c 8081df65 BossHog!DEVICEDISPATCH::DispatchPassThrough+0x48

       14 b8ae7a50 b770a8ac nt!IofCallDriver+0x45

       28 b8ae7a78 b76fb994 BossHog01!DEVICEDISPATCH::LowerDevicePassThrough+0x48

       7c b8ae7af4 b76fbbaf BossHog01+0x7994

       94 b8ae7b88 b7709ae6 BossHog01+0x7baf

       10 b8ae7b98 b7709b36 BossHog01+0x15ae6

       28 b8ae7bc0 8081df65 BossHog01!DEVICEDISPATCH::DispatchPassThrough+0x48

       14 b8ae7bd4 808f9732 nt!IofCallDriver+0x45

       30 b8ae7c04 80934bac nt!IopCloseFile+0x2ae

       30 b8ae7c34 809344ad nt!ObpDecrementHandleCount+0xcc

       28 b8ae7c5c 80934546 nt!ObpCloseHandleTableEntry+0x131

       44 b8ae7ca0 80934663 nt!ObpCloseHandle+0x82

       10 b8ae7cb0 8088978c nt!NtClose+0x1b

        0 b8ae7cb0 8082e811 nt!KiFastCallEntry+0xfc

       7c b8ae7d2c b8d8ec2d nt!ZwClose+0x11

       50 b8ae7d7c b8d8ede5 MamaHog+0x5c2d

       54 b8ae7dd0 b8d8fa85 MamaHog+0x5de5

      164 b8ae7f34 b8d917fe MamaHog+0x6a85

       40 b8ae7f74 b8d8d22a MamaHog+0x87fe

      2c4 b8ae8238 b958ecdf MamaHog+0x422a

       24 b8ae825c b958eee0 BabyHog!HogFarm+0xaf

       34 b8ae8290 8081e103 BabyHog!HogFarm+0x2b0

       30 b8ae82c0 bae1a22c nt!IopfCompleteRequest+0xcd

       10 b8ae82d0 bae5c00a Ntfs!NtfsCompleteRequest+0xc8

      104 b8ae83d4 8081df65 Ntfs!NtfsFsdCreate+0x48c

       14 b8ae83e8 f7a3c458 nt!IofCallDriver+0x45

       2c b8ae8414 8081df65 fltmgr!FltpCreate+0xe4

       14 b8ae8428 b958e196 nt!IofCallDriver+0x45

        8 b8ae8430 b958f71c BabyHog+0x1196

       60 b8ae8490 8081df65 BabyHog!HogFarm+0xaec

       14 b8ae84a4 b80d012b nt!IofCallDriver+0x45

       28 b8ae84cc b80c1862 BossHog!DEVICEDISPATCH::LowerDevicePassThrough+0x48

       8c b8ae8558 b80cf367 BossHog+0x7862

       10 b8ae8568 b80cf3b7 BossHog+0x15367

       28 b8ae8590 8081df65 BossHog!DEVICEDISPATCH::DispatchPassThrough+0x48

       14 b8ae85a4 b76f9639 nt!IofCallDriver+0x45

       4c b8ae85f0 b76fbb42 BossHog01+0x5639

       94 b8ae8684 b7709ae6 BossHog01+0x7b42

       10 b8ae8694 b7709b36 BossHog01+0x15ae6

       28 b8ae86bc 8081df65 BossHog01!DEVICEDISPATCH::DispatchPassThrough+0x48

       14 b8ae86d0 808f8f71 nt!IofCallDriver+0x45

       e8 b8ae87b8 80937942 nt!IopParseDevice+0xa35

       80 b8ae8838 80933a76 nt!ObpLookupObjectName+0x5b0

       54 b8ae888c 808eae25 nt!ObOpenObjectByName+0xea

       7c b8ae8908 808ec136 nt!IopCreateFile+0x447

       48 b8ae8950 b76ff4ae nt!IoCreateFileSpecifyDeviceObjectHint+0x52

       9c b8ae89ec b76ff0e6 BossHog01+0xb4ae

       58 b8ae8a44 b7685f8b BossHog01+0xb0e6

       50 b8ae8a94 b76867a3 daddyHog+0x3f8b

       28 b8ae8abc b76f66e7 daddyHog+0x47a3

       38 b8ae8af4 b76f6e76 BossHog01+0x26e7

      124 b8ae8c18 b770a0d5 BossHog01+0x2e76

       44 b8ae8c5c 808f5e2f BossHog01!DEVICEDISPATCH::DispatchPassThrough+0x5e7

       a4 b8ae8d00 808eed08 nt!IopXxxControlFile+0x255

       34 b8ae8d34 8088978c nt!NtDeviceIoControlFile+0x2a

        0 b8ae8d34 7c8285ec nt!KiFastCallEntry+0xfc

          0335e534 00000000 0x7c8285ec

 

Problem solved, we have our stack. Please note that the technique demonstrated above is only applicable to an x86 platform. The x64 architecture is completely different. For more information, please refer to Trey Nash’s previous blog

 

The same drivers I observed previously using the dps output are listed; however, it’s much easier to follow the call flow this time. Let’s see how much stack these hogs are using:

 

3: kd> ? 0x8+0x80+0x84+0x10+0x28+0x84+0x10+0x28+0x8+0x80+0x84+0x10+0x28+0x84+0x10+0x8+0x80+0x84+0x10+0x28+0x28+0x7c+0x94+0x10+0x28+0x50+0x54+0x164+0x40+0x2c4+0x24+0x34+0x8+0x60+0x28+0x8c+0x10+0x28+0x9c+0x58+0x50+0x28+0x38+0x124+0x44

Evaluate expression: 4176 = 00001050

 

How many kb is this?

 

3: kd> ? 4176/1024

Evaluate expression: 4 = 00000004

 

I would also like to dump out one of the functions that is using a large amount of stack space to show you how you can dig deeper into problems like this. Sometimes this level of granularity is needed when a vendor is attempting to optimize their code by showing them what caused their large allocations to occur.

 

Let’s review the frame listed below that used 0x2c4 bytes of stack space

 

       40 b8ae7f74 b8d8d22a MamaHog+0x87fe

      2c4 b8ae8238 b958ecdf MamaHog+0x422a

       24 b8ae825c b958eee0 BabyHog!HogFarm+0xaf

 

I have underlined the return address of this stack frame. This is the address right after the instruction where the call to MamaHog took place. After MamaHog would have completed, execution would of course have continued with the line of code right after the call. To view the call to MamaHog, let’s unassemble backwards one instruction to see the call being made.

 

3: kd> ub b958ecdf  L1

BabyHog!HogFarm+0xa8:

             

call    dword ptr BabyHog!HogFarm +0x7460 (b9596090)   <-stores the location we called

 

Let’s dump out this location to see what we actually called. The code dereferences this location to make the actual call.

 

3: kd> dps b9596090 L1

b9596090  b8d8c0d0 BabyHog +0x30d0<-----highlighted address is the function called

 

3: kd> uf b8d8c0d0

b8d8c0d0 55              push    ebp               

b8d8c0d1 8bec            mov     ebp,esp

b8d8c0d3 6aff            push    0FFFFFFFFh

b8d8c0d5 6840e2d9b8      push    offset BabyHog +0x15240 (b8d9e240)

b8d8c0da 6808b2d9b8      push    offset BabyHog +0x12208 (b8d9b208)

b8d8c0df 64a100000000    mov     eax,dword ptr fs:[00000000h]

b8d8c0e5 50              push    eax

b8d8c0e6 64892500000000  mov     dword ptr fs:[0],esp

b8d8c0ed 81c470fdffff    add     esp,0FFFFFD70h         

 

To subtract 656, we are adding -656 to esp to bump the stack up. This is where the compiler is allocating stack space for storage of local variables. I am guessing there were multiple structures being allocated directly on the stack.  If the programmer had instead called ExAllocatePoolWithTag to obtain memory, we could have instead only stored the pointers to this memory on the stack. Simply using the stack for storage is of course faster than calling out to get memory. The developer should balance the performance needs of the code with the need to conserve the stack space which is a limited resource.

 

3: kd> .formats 0x0FFFFFD70

Evaluate expression:

  Hex:     fffffd70

  Decimal: -656

 

So this one instruction raised the stack more than .5 kb all by itself. The function made a bunch of other pushes (some of which are listed above) which when combined with the return address pushed by the call resulted in 708 (0x2c4 was listed by kf above) bytes of stack space to be allocated by this one function.

 

3: kd> .formats 2c4

Evaluate expression:

  Hex:     000002c4

  Decimal: 708

 

Two allocations like this will use up almost 1.5kb of the 12k stack space. By itself, this isn’t necessarily an issue; however, you can see from the output above how liberal stack usage by multiple calls from stacked up drivers can quickly add up. So what have we discovered here? Without even considering the fact that this products drivers may also be responsible for some of the stack space used by calls that they may have initiated, they have used at least 4k of the 12k stack. Also, the full 12k of stack space is not available for driver use. This is because the operating system also requires stack space for such overhead as the I/O operation, the file system components, thread startup, etc. The customer removed the product which prevented further bugchecks while the vendor was being engaged to assist or provide leaner hogs.

 

Summary

 

This two part blog has covered CPU task states, stack overflows, x86 stack reconstruction techniques, and examining functions to observe the stack allocations taking place. Hopefully, this blog will help you to understand what went wrong the next time that you encounter a Stop 0x7f (EXCEPTION_DOUBLE_FAULT).

 

You don’t have to wait until you encounter an UNEXPECTED_KERNEL_MODE_TRAP (7f)  Arg1: 00000008, EXCEPTION_DOUBLE_FAULT to get familiar with this. Open up any dump file and see if you can find the limit and base. Then observe the values of your ebp and esp registers. If you are feeling truly geeky, dump the stack range out, identify patterns, and try passing various values to kf.

 

Please check out our previous blog post on this topic:

http://blogs.msdn.com/ntdebugging/archive/2008/02/01/kernel-stack-overflows.aspx

 

Keep in mind that this two part blog has only discussed stop 0x7f bugchecks where Arg1 lists value 0x8 EXCEPTION_DOUBLE_FAULT. There are other causes for a 0x7f bugcheck. For more information, please refer to:

 

314102  General Causes of STOP 0x0000007F Errors

 

 

Bug Check 0x7F: UNEXPECTED_KERNEL_MODE_TRAP

 

 

References:

“Windows Internals, Fifth Edition” By Mark E. Russinovich, David A. Solomon, with Alex Ionescu

Chapter 9, page 786, “Kernel Stacks”

Share this post :


Part 3: ETW Methods of Tracing

$
0
0

Introduction and Overview

 

This is Ivan from the Platforms OEM team and this is the final installment of the ETW series. In this article, we are going to continue our exploration of the ETW tracing available in Windows. This post is going to cover some of the other methods available to enable and capture ETW logs. In previous posts we covered

 

-ETW Introduction and Overview
-Exploring and Decoding ETW Providers using Event Log Channels

In this post we will explore some of the methods Microsoft support may use to enable and capture tracing, in order to troubleshoot issues on a box. In fact, some of these methods may ultimately be quite transparent when you use them; but we wanted to dig a bit into what his happening behind the scenes.

Please keep in mind that the previous post, Exploring and Decoding ETW Providers using Event Log Channels, was geared more towards self-discovery and decoding. This post covers the more advanced scenarios and tools that you may see Microsoft support using.

We plan on starting with some of what’s available in the box, then using a command line tool (logman) to enable further logging to a kernel buffer – both of these are useful for dumps. The second set of scenarios we are going to cover is using the command line (logman) and GUI to enable tracing to a sequential file. We will then finish up with some other special types of tracing available.

Tracing already available (and likely running) in the box

 

Starting with Vista/2008, there is a variety of tracing already running inbox and accessible via a kernel dump, or logging to an on-disk ETL file (limited tracing is available in 2003 & XP). This screenshot was taken from a default Vista box with no additional logging enabled or turned on. While we aren’t going to cover these ETW providers in detail, it’s good to know these are available for analysis. One logger in particular is a useful kernel logger, which is the Circular Kernel Context Logger or CKCL, a sort of In Flight Data Recorder (IFR). The CKCL is one of the two available kernel loggers, and is intended as a high-performance, high volume log to record the last 0.5-3 seconds (assuming high load) of low level kernel activity on the box. The CKCL is set to circular mode meaning newer events will overwrite older events, and with a 4MB total buffer space available.

This can be very useful, because along with all the rich information a full kernel .dmp provides, the CKCL can enhance that and provide recent historical information about how the kernel arrived at the state it’s currently in. The CKCL is enabled by default in Vista and can be enabled in Server 2008 in the GUI or via the command line.

image_thumb[3]

The CKCL Keywords of 0x237f correspond to the following flags that are enabled on this box: process, thread, img, proccntr, cswitch, dpc, isr, disk, file, hf. The screenshot below shows information from the “NT Kernel Logger” session whose provider name is “Windows Kernel Trace”. “Windows Kernel Trace” is the provider for the CKCL and is shown here because it is the easiest/most reliable way to view what the kernel flags mean.

The green boxes below highlight the flags that were automatically enabled on this Vista machine and available in dumps. You might notice that tools such as XPerf will modify and collect data from the “NT Kernel Logger” session. XPerf post processes/merges the trace (along with other potential providers) to add valuable information to be able read the trace in more human readable form as well as reliably view the trace offline. However, even without the post processing added by Xperf, traces can be pulled from kernel dump, decoded, and even viewed by Xperf (although they will be missing information such as process names, ISR names, symbol information, etc).

To review how the kernel providers and sessions are related:

Session                                                                                                 Provider                                               Comment____________________________

Circular Kernel Context Logger(CKCL)                                     Windows Kernel Trace                   In-Flight Recorder for recent activity on the box
NT Kernel Logger                                                                      Windows Kernel Trace                  Kernel Logger used by XPerf but can be enabled manually w/o XPerf

 

Win7/2008R2:

In order to cover the newer flags available, we illustrate the “NT Kernel Logger”, which is enabled because xperf was enabled with: xperf –on Base+Diag. To reiterate, the same flags are available in the CKCL, but whose configuration and purpose is suited toward an IFR scenario.

image_thumb[34]

image_thumb[35]

image_thumb[36]

Vista/2008:

As covered in the picture above, the flags are very similar to Win7, with the exception that dispatcher and virtalloc are not available.

Server 2003:
ETW Tracing in 2003 is limited. You can see the Kernel Flags are much smaller and do not cover flags such as ISR or DPC.

image_thumb[37]

Using the !wmitrace.strdump debugger extension, we can find the Logger Id for the CKCL which is 2.

image_thumb[38]

And use !wmitrace.logsave to save the log to an ETL file

image_thumb[39]

Here we use Xperf to load the ETL file and are looking at ISR history over the lifetime of this particular trace (roughly last 250s of time before dump). This is a trace from a normally functioning box and is showing that ISR at 0x83b44b65 had the highest count over our selected ~100ms timeframe.

image_thumb[40]

As previously mentioned, you will need to hand decode much of the trace data, because the XPerf merge step is missing. Here we find that the ISR at 0x83b44b65 is from ndis!ndisMiniportIsr

image_thumb[41]

Finally, we can dump the log out in the debugger in plain old text format using !wmitrace.logdump, which we can decode because Windbg ships with a system.tmf which will decode most Kernel Events from the CKCL.

image_thumb[42]

Using Logman to enable tracing to a circular buffer

 

Logman is a built in tool that ships with the OS in Vista/Win7, and can be used to enable and control ETW tracing. We are going to cover it first because it can be used to automate starting and stopping ETW logging without using the Computer Management GUI. It is also baked into the OS, making it ideal because no extra tools are required to start using or enabling ETW tracing. While logman is very useful, the syntax is often inconsistent, and it’s often difficult to figure out the right way to run a command, which is why we are going to cover some examples here.

You may even receive a set of batch scripts that basically automate the starting and stopping of tracing. For example:

-Start.bat <- Run this to start tracing
-Stop.bat <- Run this to stop tracing

Here we are using logman to start the tracing of the just previously mentioned “NT Kernel Logger” (used by XPerf) to enable the “Windows Kernel Trace” provider. This logging would be useful to enable kernel events that aren’t enabled by default in the CKCL, or if you want a larger time than the 4MB buffer the CKCL provides. The logging is flushed to disk after every 1MB buffer fills up.

<Start.bat>

REM – This creates a kernel trace using 20MB buffer in a circular file mode that logs to kernel.etl (and available in a memory dump as a circular buffer)

logman start "NT Kernel Logger" -p "Windows Kernel Trace" (process,thread,disk,isr,dpc,net,img,registry,file) -bs 1024 -nb 20 20 -mode Circular -o kernel.etl -ct perf -max 20 -ets

</Start.bat>

<Stop.bat>

logman stop "NT Kernel Logger" -ets

</Stop.bat>

 

Using Logman to enable tracing to a file

 

You may receive logging scripts that are tuned specifically to the problem at hand, including the necessary providers, and within each provider, applicable Flags and Levels. In the previous blog posts, we already covered some information about how to find out providers, so we let’s assume we already have that information. Using logman to enable tracing in this manner allows for an ETW trace .etl to be captured and sent to Microsoft for decoding. Using the Microsoft-Windows-WinINet example, the tracing may look like this:

<Start.bat>

REM - This creates a circular file trace with all Flags and Levels on the Microsoft-Windows-WinINet provider, with a max size of 20MB

logman start customTrace -o c:\temp\customTrace.etl -p "Microsoft-Windows-WinINet" 0xff 0xff -mode Circular -max 20 -ets

REM This creates a file trace with all Flags and Levels on the Microsoft-Windows-WinINet provider, logging until stopped

logman start customTrace -o c:\temp\customTrace.etl -p "Microsoft-Windows-WinINet" 0xff 0xff -ets

 </Start.bat>

<Stop.bat>

logman delete customTrace

</Stop.bat>

Using the Computer Management – Performance to enable tracing manually

 

The Computer Management performance MMC GUI snap-in is useful for viewing and changing trace sessions. In previous blog posts, we briefly touched on the Event Trace Sessions GUI to view sessions auto-created when enabling a Windows Event log channel. Channels are useful because they don’t require much knowledge of the individual keywords/flags and levels to enable – they are pre-defined for a scenario. Most events that are delivered to a channel or decoded due to Event Manifests. However, there is another class of ETW providers called WPP Software Tracing, which was originally intended for driver tracing, but lacks a lot of the self-described meta information that manifest based providers give.

This is where Computer Management -> Performance -> Data Collector Sets -> Event Trace Sessions, once again comes in handy. Not only can you view existing sessions created by the system (Autologgers/Start Event Trace Sessions), but you can modify tracing, and even create new tracing; which is what we are going to cover. This is the only way to create tracing using the GUI for WPP style ETW tracing, such as iSCSI.

image_thumb[43]

To create a new session, right click Event Trace Sessions and choose New -> Data Collector Set

image_thumb[44]

In our example, we already know what type of tracing to enable (you may be directed by Microsoft Support), or you may be given an .xml template to automatically setup the tracing. In our example, we are going to setup tracing for iSCSI, which is WPP based.

image_thumb[45]

The list of providers that displays unfortunately isn’t searchable, but is sorted alphabetically, so we can use our ‘logman query providers | findstr /I iscsi” to find providers named iscsi.

image_thumb[46]

Which we should find in the GUI of Event providers:

image_thumb[47]

Next we need to choose what Keywords (Any) are used for this provider. Keywords (Any) is an OR filter meaning any bit that is set on the Keyword/Flag mask will be traced. Keywords usually specify functional sub-components while Levels control at which level of detail (Error, Informational, Verbose) events are logged. Components are free to choose what Keywords/Flags they use, and what Levels the use; and don’t have to use both. We will need to set or check both Keywords(Any) and Level(s) in order to get any ETW tracing, otherwise events won’t be logged.

image_thumb[48]

When we edit Keywords (Any) to include these four flags, you should notice the mask is set automatically to 0xf, which will reflect under Keywords(Any). When we go to edit the level, we notice levels don’t seem to be exposed or used by the msisci_iScsi provider. That normally means a provider doesn’t use levels, but it is recommended to always be on the safe side, so to make sure not to lose events we can set the level anyways. Here we set the level to 0xff.

image_thumb[49] image_thumb[50]

 

image_thumb[51]

What we should have now is a trace session called iSCSI with the msisci_iSCSI provider with Keywords(Any) of 0xf and Level of 0xf. There are just a couple of more items we need to check before starting to collect data on this provider, while is changing the Trace Buffer parameters, and checking stream mode and log location.

image_thumb[52] image_thumb[53] image_thumb[54]

Here we change the Buffer Size to 1MB, flushed to disk every second, with 10 buffers available in case there is so much event load that events can’t be flushed to disk fast enough. We simply check the Stream Mode and see it is set to File, and check the file name which is under %LOCALAPPDATA%\iSCSI.etl. Now we can start the ETW session and start collecting data!

image_thumb[55]

Using the Computer Management – Performance to enable tracing with a template

 

Now, while it’s useful to know how to manually create sessions, either with the GUI or logman command line; you may only want to quickly enable tracing provided by someone else, such as Microsoft Support, or save off commonly used tracing. This is where templates come in handy. Continuing with the last session, we can right click on our iSCSI session and choose “Save Template”, and then save the template as iSCSI.xml.

image_thumb[56] image_thumb[57]

Now when creating a new trace with the GUI we can use a template to capture the previous saved settings, and quickly setup tracing.

image_thumb[58]

Using tracelog (DDK) to enable tracing to a file

 

Tracelog is an ETW tracing tool that ships with the DDK. It has some more advanced features than logman that ships with the OS, with a disadvantage that you will need to download the WDK/DDK. You will find tracelog under %DDK_INSTALL_DIR%\tools\tracing\%PROCESSOR_ARCHITECTURE%\tracelog.exe, along with other useful tools such TraceView, and tracefmt.

While you can use the built-in Computer Management GUI or logman to manipulate most tracing, a couple of key features that tracelog provides are:

                -Realtime output to the kernel debugger (-kd switch)
                -Private process heap tracing (-heap -pids <n> <pid1 pid2 …>). Heap tracing is also available in xperf, and is more useful due to stack walking (more on this later)
                -Private process critical section tracing (-critsec -pids <n> <pid1 pid2 …>)
                -Ability to create autologgers/traces that persist on reboots (-addautologger). Note: This ability does exist in logman, but is hidden. It can be done by appending “autosession\” to the name of session using logman.

Here we are going to use tracelog to demonstrate a special type of logger in ETW called a private session logger, which uses private buffers inside a process (normally ETW buffers live in kernel mode). Private buffers are the only way to collect heap or critical section ETW tracing.

Critical section private tracing

First we start by finding the notepad process, for which we are going to collect critical section tracing.

C:\temp>tasklist | findstr /i notepad

Image Name                     PID Session Name        Session#    Mem Usage
========================= ======== ================ =========== ============
notepad.exe                  7752 Console                    2     10,052 K

Now we use tracelog with the –critsec switch to

<Start.bat>

tracelog -start critSecTrace -f c:\temp\critSecTrace.etl -critsec -pids 1 7752

</Start.bat>

<Stop.bat>

logman -stop  critSecTrace

</Stop.bat>

The critical section trace can be decoded with the DDK tracefmt tool using the system.tmf in the DDK. Here is an example portion of the decoded log.

image_thumb[59]

Heap private tracing

 

We can use tracelog in a similar manner to enable and decode heap tracing. The heap trace can also be decoded with the DDK tracefmt tool using the system.tmf in the DDK.

<Start.bat>

tracelog -start heapTrace -f c:\temp\heapTrace.etl -heap -pids 1 7752

</Start.bat>

<Stop.bat>

logman -stop  heapTrace

</Stop.bat>

image_thumb[60]

While the heap trace is useful in its own right, we can even get more useful info using XPerf. Xperf is a great way to view most types of ETW data (especially performance), and in this particular case is vital to get full stack traces that led up to the heap entries. Xperf can be downloaded from the Windows Performance Analysis Dev Center. Here we enable stacktraces for heap allocations and reallocations so we know who is creating the heaps. You may want to explore other heap stackwalking available with the xperf help -  ‘xperf -help stackwalk’

image_thumb[32]

<Start.bat>

xperf -start HeapSession -heap -PidNewProcess "notepad.exe" -BufferSize 1024 -MinBuffers 128 -MaxBuffers 128 -stackwalk HeapAlloc+HeapRealloc

</Start.bat>

<Stop.bat>

xperf -stop HeapSession -stop -d c:\temp\xperf_heap.etl

</Stop.bat>

Now we can load up the ETW trace in xperf, and view heap allocation information (in additional to kernel info such as CPU from the NT Kernel Logger), with stack traces at the time of allocation!

image_thumb[61]

Wrapping it up

 

Hopefully, we dug further into ETW and explored some of the various methods to put the power of ETW to work. There is a lot of tools and power exposed with the built-in ETW tools, and some other scenarios that can be enabled with add-on tools from the DDK and XPerf.

 

Share this post :

 

Pool Fragmentation

$
0
0

Hello! My name is Stephen, an escalation engineer on the Microsoft Global Escalation Services Team. Today I'm going to share my experience of a pool fragmentation issue I came across recently. Let’s jump right in with the dump file.

This is the output of !vm

*** Virtual Memory Usage ***

      Physical Memory:      917368 (   3669472 Kb)
      Page File: \??\C:\pagefile.sys
        Current:   4190208 Kb  Free Space:   4090220 Kb
        Minimum:   4190208 Kb  Maximum:      4190208 Kb
      Available Pages:      649161 (   2596644 Kb)
      ResAvail Pages:       860271 (   3441084 Kb)
      Locked IO Pages:         210 (       840 Kb)
      Free System PTEs:      14629 (     58516 Kb)
      Free NP PTEs:           4230 (     16920 Kb)
      Free Special NP:           0 (         0 Kb)
      Modified Pages:          791 (      3164 Kb)
      Modified PF Pages:       785 (      3140 Kb)
      NonPagedPool Usage:    25463 (    101852 Kb)
      NonPagedPool Max:      32647 (    130588 Kb)
      PagedPool 0 Usage:      8717 (     34868 Kb)
      PagedPool 1 Usage:      6113 (     24452 Kb)
      PagedPool 2 Usage:      6100 (     24400 Kb)
      PagedPool 3 Usage:      6033 (     24132 Kb)
      PagedPool 4 Usage:      6116 (     24464 Kb)
      PagedPool Usage:       33079 (    132316 Kb)
      PagedPool Maximum:     60416 (    241664 Kb)
      Session Commit:         1870 (      7480 Kb)
      Shared Commit:          5401 (     21604 Kb)
      Special Pool:              0 (         0 Kb)
      Shared Process:         8957 (     35828 Kb)
      PagedPool Commit:      33120 (    132480 Kb)
      Driver Commit:          1939 (      7756 Kb)
      Committed pages:      227031 (    908124 Kb)
      Commit limit:        1929623 (   7718492 Kb)

Using the!poolused /t5 2, I dumped out the highest users of nonpaged pool.


   Sorting by  NonPaged Pool Consumed
  Pool Used:
            NonPaged            Paged
 Tag    Allocs     Used    Allocs     Used
 MmCm     3187 17452976         0        0      Calls made to MmAllocateContiguousMemory , Binary: nt!mm
 NDpp     1125  4519648         0        0      packet pool , Binary: ndis.sys
 File    24911  3992376         0        0      File objects
 abcd        8  3305504         0        0      UNKNOWN pooltag 'abcd', please update pooltag.txt
 LSwi        1  2576384         0        0      initial work context
 TOTAL  239570 65912104    200276 66610504

The big difference between the totals reported by !vm(101 MB) and !poolused(65 MB), tells us there is a pool fragmentation issue!

After some research, I found a lot of pool pages with the following allocation pattern:

3: kd> !pool fa808000
Pool page fa808000 region is Nonpaged pool
*fa808000 size:  a20 previous size:    0  (Free)      *MFE0
 fa808a20 size:   18 previous size:  a20  (Allocated)  ReEv
 fa808a38 size:  5c8 previous size:   18  (Free)       NtFs

3: kd> !pool fa550000
Pool page fa550000 region is Nonpaged pool
*fa550000 size:  860 previous size:    0  (Free)      *Io 
 fa550860 size:   18 previous size:  860  (Allocated)  MFE0
 fa550878 size:  788 previous size:   18  (Free)       Irp

3: kd> !pool f8feb000
Pool page f8feb000 region is Nonpaged pool
*f8feb000 size:  648 previous size:    0  (Free)      *Ntfr
 f8feb648 size:   18 previous size:  648  (Allocated)  ReEv
 f8feb660 size:  9a0 previous size:   18  (Free)       MFE0

The page fa808000 has only one pool chunk in use, and its size is about 0x18=24 Bytes. The top and bottom portion of the entire page are freed pool chunks and could be re-allocated for any use. For this page, 24 out of 4096 bytes are in use.

It is the same story on pages at fa550000, f8feb000, etc. So, the question is, how could this have happened and how do we avoid this in the future?

From the dump, I also found many MmCm pool allocations:


fe592000 size:  f18 previous size:    0  (Allocated) MmCm
fe593000 size:  f18 previous size:    0  (Allocated) MmCm
fe597000 size:  f18 previous size:    0  (Allocated) MmCm
fe5ac000 size:  f18 previous size:    0  (Allocated) MmCm
fe5ad000 size:  f18 previous size:    0  (Allocated) MmCm
fe5ae000 size:  f18 previous size:    0  (Allocated) MmCm
fe5af000 size:  f18 previous size:    0  (Allocated) MmCm
fe5b0000 size:  f18 previous size:    0  (Allocated) MmCm
...

This is most likely how the fragmentation happened

1)  A driver requests a pool block of size 0xF18. Notice the 3 pages I displayed above have enough free space in total. The free blocks inside one page are split in two, one in the top, and the one in the bottom. Neither the top nor the bottom are big enough for the pool request of size 0xF18.

2)  So the OS creates a new pool page, gives the top portion to the driver, and the bottom will be marked as freed pool.

3)  Now there is a request for a small pool allocation. The OS might take the new pool page’s bottom portion to satisfy the request.

4)  Now, the driver frees the MmCm pool usage. The bottom portion is still in use so the whole page could not be freed. As time goes on, it is very possible that some other portion will be re-allocated for some use.

5)  Now, there is another request for a pool block of size 0xF18. The previous pool block is not good because there might be pool allocations in it. So the OS might create another new page again.

6)  If the above things happen repeatedly, it has the potential to contribute to pool fragmentation as evident in this crash memory dump.

Ways to avoid this issue - Instead of requesting an allocation of size 0xf18, the driver should request an entire page. There will be some small wasted portion in the page, but that is the trade-off to avoid this type of fragmentation issue. By the way, MSDN suggests drivers should use the MmCm for long term. In a live debug, you will see the driver continually allocating and freeing MmCm.

Links to related article:

http://msdn.microsoft.com/en-us/library/ms801986.aspx


http://blogs.msdn.com/ntdebugging/archive/2006/12/18/Understanding-Pool-Consumption-and-Event-ID_3A00_--2020-or-2019.aspx


 

Case Study - Software Restriction Policies and Large EXE Files

$
0
0

Recently I received a debugging request for a customer having problems running large executables. On their systems, they could run most EXEs without any problems, but they had two programs that were over 1.8 GB which, when run, would display the following error:

clip_image002

If they tried to run them in a command prompt, they received the message “Access is denied.” Both attempts were made with an administrator account and in neither case were the processes created. Through testing, they found that the programs worked if they were scheduled to run as System and also worked when run in safe mode as an administrator.

When the case was brought to my attention, it was noted that when the failing executables were run, the following appeared in process monitor logs:

clip_image004

The engineer did not see this when one of the problematic EXEs was run (successfully) on his test machine. The customer provided a VM image of their system which we set up in HyperV with a named pipe kernel debugger. I then started kernel debugging to find the cause of the INVALID PARAMETER error, hoping that resolving it would fix the issue.

To start, I looked at the call stack within process monitor for the invalid parameter:

clip_image005

The problem is this isn’t exactly where we return invalid parameter. Looking at the source code for Fltmgr, it doesn’t return invalid parameter – this was just where the error was caught in procmon. This call stack did provide some ideas for good starting places to debug, however. First, I looked up the hex value for STATUS_INVALID_PARAMETER in ntstatus.h – 0xC000000D. Knowing this, I decided to set a breakpoint on nt!IofCallDriver and ran the program. Once the debugger broke in, I planned to use wt -oR. This will trace through the calls displaying the return values next to each call. From there, I would just need to find 0xC000000D on the return column. Unfortunately, what I had forgotten was wt does not display return codes in kernel debugging, only when debugging user mode.

With wt not an option, I decided to use a combination of debugger commands to approximate the output of wt. I knew the return value I was looking for, and I was also confident that I would find that code in the EAX register after the problem occurred. As such, I needed to write a loop that would walk through the instructions until it found 0xC000000D in EAX. The debugger provides two main options for walking instructions: p and t. p (Step) will execute a single instruction and display the register values. If the instruction is a call, it will not enter that function, but just display the results after that subroutine has been executed. t (Trace) also executes a single instruction, but it will enter into the function and will display each instruction.

In this case I wanted trace so I could see which function was returning the invalid parameter status. Tracing though that many instructions/functions would take a long time, but there are some variations on t (and p) that can help. tc (or pc)will execute instructions until a call statement is reached, where it will break and show the register values. tt (or pt) will execute instructions until a return instruction is reached. tct (or pct) will run until either a call or return is reached. In this case, I opted for tct.

Knowing that I would use tct, I had to find a way to execute tct statements until EAX was the value I was looking for. This can be accomplished with the z (While) debugger command. The syntax is pretty easy, it’s just z(expression) and it works just like a do-while loop. Putting it all together, I used this command in the debugger:

tct; z(eax!=0xc000000d)

I then waited for the debugger to break in so I could see where this status was being thrown. Regrettably, the code called ended up going in to some recursion which made my while loop take way too long. To resolve this, I set a new breakpoint just before we entered the recursion, reran the program, used p to step past the call then ran the tct loop.

This time I was quickly brought to the code I was looking for. As soon as it broke in, I ran k to view the callstack:

kd> k

ChildEBP RetAddr
b9541a3c f7b7fab9 Ntfs!NtfsCommonDeviceControl+0x40
b9541aa0 f7b8b02f Ntfs!NtfsFsdDispatchSwitch+0xe4
b9541bbc 8081df85 Ntfs!NtfsFsdDispatchWait+0x1c
b9541bd0 f7876d28 nt!IofCallDriver+0x45
b9541bfc 8081df85 fltmgr!FltpDispatch+0x152
b9541c10 f7876d28 nt!IofCallDriver+0x45
b9541c3c 8081df85 fltmgr!FltpDispatch+0x152
b9541c50 808f5437 nt!IofCallDriver+0x45
b9541c64 808f61bf nt!IopSynchronousServiceTail+0x10b
b9541d00 808eed08 nt!IopXxxControlFile+0x5e5
b9541d34 808897bc nt!NtDeviceIoControlFile+0x2a
b9541d34 7c82860c nt!KiFastCallEntry+0xfc
0012e960 7c826fe9 ntdll!KiFastSystemCallRet
0012e964 77e416f9 ntdll!NtDeviceIoControlFile+0xc
0012e9c8 77e6738d kernel32!DeviceIoControl+0x137
0012ec44 77e67032 kernel32!GetVolumeNameForRoot+0x16d
0012ec68 77e67782 kernel32!BasepGetVolumeNameForVolumeMountPoint+0x73
0012ecd0 7d20b01d kernel32!GetVolumePathNameW+0x1c7
0012ef18 7d20ae2c ADVAPI32!CodeAuthzFullyQualifyFilename+0xbc
0012eff8 7d20b33f ADVAPI32!__CodeAuthzpIdentifyOneCodeAuthzLevel+0x19f
0012f07c 77e6df9e ADVAPI32!SaferIdentifyLevel+0x163
0012f278 77e6ce03 kernel32!BasepCheckWinSaferRestrictions+0x60c
0012fa90 77e424b0 kernel32!CreateProcessInternalW+0xc0e
0012fac8 4ad0256f kernel32!CreateProcessW+0x2c
0012fc24 4ad01a2b cmd!ExecPgm+0x221
0012fc58 4ad019b3 cmd!ECWork+0x84
0012fc70 4ad03c58 cmd!ExtCom+0x40
0012fe9c 4ad01447 cmd!FindFixAndRun+0xa9
0012fee0 4ad0c30b cmd!Dispatch+0x137
0012ff44 4ad07786 cmd!main+0x216
0012ffc0 77e6f23b cmd!mainCRTStartup+0x12f
0012fff0 00000000 kernel32!BaseProcessStart+0x23

If we look at the assembly around Ntfs!NtfsCommonDeviceControl+0x40, we see that only if our return from NtfsDecodeFileObject is not equal to 4 it would move 0xC000000D in to esi, and then move esi to eax, :

f7b7faf9 e8e904fdff call Ntfs!NtfsDecodeFileObject (f7b4ffe7)
f7b7fafe 83f804 cmp eax,4
f7b7fb01 0f848873ffff je Ntfs!NtfsCommonDeviceControl+0x54 (f7b76e8f)

Ntfs!NtfsCommonDeviceControl+0x40:
f7b7fafe 83f804 cmp eax,4
f7b7fb01 0f848873ffff je Ntfs!NtfsCommonDeviceControl+0x54 (f7b76e8f)
f7b7fb07 be0d0000c0 mov esi,0C000000Dh
f7b7fb0c 56 push esi
f7b7fb0d 53 push ebx
f7b7fb0e 57 push edi
f7b7fb0f e83506fdff call Ntfs!NtfsCompleteRequest (f7b50149)
f7b7fb14 8bc6 mov eax,esi

I looked at the source code for these functions, and it didn’t make sense that a failure here would cause the problems we were seeing; especially specific to large executables. Out of curiosity I ran notepad on the VM again with procmon and found that it too displayed INVALID PARAMETER, but the program started and ran correctly:

clip_image007

Since this wasn’t the problem, I stopped reviewing the code and decided on a new approach. We knew that when running the EXE in a command prompt we received an “Access is denied message”. At that point it made sense to switch to user mode debugging and take a look at the cmd.exe process that was trying to launch install.exe

Doing user mode debugging in a VM can be a bit of a challenge, especially if you are trying to minimize changes to the VM (and in my case, avoid putting any symbols on the customer’s VM image). Since I already had a kernel debugger attached, one option would be to run ntsd.exe (debugger provided in the Debugging Tools for Windows) on the VM with the -p switch specifying the PID of the cmd.exe process I wanted to debug and -d switch which forwards the i/o of ntsd to the kernel debugger. The problem with this approach is the kernel debugger just becomes a front end for issuing commands and seeing the output from ntsd. That means all symbol resolution is still done on the target system running ntsd.

I didn’t want to give the customer VM Internet or corporate network access, so I instead opted to run dbgsrv.exe on the VM. Running “dbgsrv -t tcp:port=9999” tells the debug server to listen on TCP port 9999 for debugger connections. Then, on the HyperV server computer I could just run windbg -premote tcp:server=(IP of VM),port=9999 -p (PID of cmd on VM) to debug it.

I suspected that we may be calling CreateProcess but it was failing, so I set a breakpoint on kernel32!CreateProcessW. Sure enough, it hit when I tried to run install.exe in the command prompt:

0:000> k
ChildEBP RetAddr
0012fac8 4ad0256f kernel32!CreateProcessW
0012fc24 4ad01a2b cmd!ExecPgm+0x221
0012fc58 4ad019b3 cmd!ECWork+0x84
0012fc70 4ad03c58 cmd!ExtCom+0x40
0012fe9c 4ad01447 cmd!FindFixAndRun+0xa9
0012fee0 4ad0c30b cmd!Dispatch+0x137
0012ff44 4ad07786 cmd!main+0x216
0012ffc0 77e6f23b cmd!mainCRTStartup+0x12f
0012fff0 00000000 kernel32!BaseProcessStart+0x23

This time I could use wt -oR since this was a usermode debug. Looking in ntstatus.h again, the code for STATUS_ACCESS_DENIED is 0xC0000022. Running wt can take a very long time, so I used the -l switch, which limits the number of levels deep it will display. This would be something like using tct as I did above until you were a few calls deep then using pct. Using wt -l 3 -oR gave me the following:


575      291       [ 1] kernel32!CreateProcessInternalW
35        0           [ 2] kernel32!BasepCheckWinSaferRestrictions
25        0           [ 3] ntdll!RtlEnterCriticalSection eax = 0
48        25         [ 2] kernel32!BasepCheckWinSaferRestrictions
1          0           [ 3] ntdll!NtOpenThreadToken
3          0           [ 3] ntdll!ZwOpenThreadToken eax = ffffffff`c000007c
57        29         [ 2] kernel32!BasepCheckWinSaferRestrictions
1          0           [ 3] ntdll!ZwOpenProcessToken
3          0           [ 3] ntdll!NtOpenProcessToken eax = 0
116      33         [ 2] kernel32!BasepCheckWinSaferRestrictions
113      0           [ 3] ADVAPI32!SaferIdentifyLevel eax = 0
130      146       [ 2] kernel32!BasepCheckWinSaferRestrictions
4           0          [ 3] ntdll!ZwClose eax = 0
132      150       [ 2] kernel32!BasepCheckWinSaferRestrictions
22        0           [ 3] ntdll!RtlLeaveCriticalSection eax = 0
138      172       [ 2] kernel32!BasepCheckWinSaferRestrictions
5           0          [ 3] kernel32!__security_check_cookie eax = ffffffff`c0000022

 Now we are getting close! I set a new breakpoint for kernel32!BasepCheckWinSaferRestrictions and reran the test. This gave me the following line:

63 0 [ 3] ADVAPI32!__CodeAuthzpCheckIdentityHashRules eax = ffffffff`c0000022

One last run with a new breakpoint at ADVAPI32!__CodeAuthzpCheckIdentityHashRules and I found what I was looking for:

58 218 [ 1] ADVAPI32!__CodeAuthzpEnsureMapped eax = ffffffff`c0000022

The depth is shown in brackets. As this call was 1 deep from __CodeAuthzpCheckIdentityHashRules and I was using 3 as my maximum depth in wt, I knew this is where the STATUS_ACCESS_DENIED was coming from. I reviewed the source code and found that this is the code that performs Software Restriction Policy checking. Specifically, we are attempting to map the executable into memory to perform hash checking on it. Since there isn’t 1.8 GB of contiguous available memory, it failed. Looking at the VM, I discovered that the customer had implemented a number of software restriction policies. As a test, I removed their restrictions on the VM, and the programs ran successfully. A search of the KB revealed that a hotfix was published for this problem: 973825. Installing the hotfix in the article also resolved the issue with their policies intact.

-Matt Burrough

Ironical Case of a Performance Monitoring Tool causing a Performance Issue!

$
0
0

Hello folks, this is Pushkar and I recently worked an interesting case dealing with high CPU usage. The case was particularly interesting but it was not a business critical application consuming the CPU. I was tasked to identify why a third party Performance Monitoring tool was causing a performance issue on the server by consuming 100% of the CPU cycles. The irony of the issue itself made this particular case immensely interesting for me and I decided to investigate it further. Typically issues with third party products are addressed by the Application Vendor.

In this case the monitoring tool, Monitor.exe (disclaimer: Identity of the actual product has been removed), was consistently consuming 100% of the CPU cycles, and as expected if the tool was stopped, the CPU usage was back to normal. As I didn’t have information about how the tool worked, I decided to gather data to give me an idea about the underlying API's the tool was calling. This would give me an understanding of its behavior.

To begin with the troubleshooting I gathered Kernrate.exe logs from the server along with a memory dump. On platforms which are Windows Vista and upwards you can use the Windows Performance Toolkit (aka. Xperf), a better alternative to Kernrate.

Note: To learn more about kernrate.exe check here

A quick review of the Kernrate Log showed a lot of CPU time in the Kernel, and the function mostly called was NtQueryDirectoryObject()

 

 

TOTAL K 0:07:35.343 (49.2%)  U 0:02:40.734 (17.4%)  I 0:05:09.171 (33.4%)  DPC 0:00:52.984 (5.7%)  Interrupt 0:00:20.312 (2.2%)

       Total Interrupts= 4208713, Total Interrupt Rate= 9098/sec.

 

-----------------------------

 

Results for Kernel Mode:

-----------------------------

 

OutputResults: KernelModuleCount = 106

Percentage in the following table is based on the Total Hits for the Kernel

 

ProfileTime   116300 hits, 65536 events per hit --------

 Module                                Hits   msec  %Total  Events/Sec

PROCESSR                              45160     462654    38 %     6397017

NTOSKRNL                              43573     462654    37 %     6172215

SYMMPI                                18258     462654    15 %     2586287

 

 

----- Zoomed module NTOSKRNL.EXE (Bucket size = 16 bytes, Rounding Down) --------

Percentage in the following table is based on the Total Hits for this Zoom Module

 

ProfileTime   43573 hits, 65536 events per hit --------

 Module                                Hits   msec  %Total  Events/Sec

NtQueryDirectoryObject                 8433     462654    19 %     1194553

memmove                                6584     462654    14 %      932638

KeInsertQueueDpc                       5593     462654    12 %      792261

KeFlushMultipleTb                      3118     462654     7 %      441671

 

Further investigation of the dump indicated that the function listed above was invoked to parse through the \BaseNamedObjects namespace in the kernel. BaseNamedObjects is a directory in the kernel object namespace where various named kernel objects (events, semaphores, mutexes, waitable timers, file-mapping objects, and job objects) are created. The purpose of this global namespace is to enable processes in multiple client sessions to communicate with a service application. Another use is for an application that uses named objects to detect that an instance of the application is already running in the system, across all sessions.

I started looking under the \BaseNamedObjects namespace in the kernel and found that there were over 2900 objects under that namespace. You can use the Sysinternals tool WinObj.exe to browse through the kernel object namespace and the list of objects within it. In my case I had no prior idea of the cause of the issue, so I had to look at the dump file. I pasted a snapshot of the WinObj.exe output below along with a stack from the kernel dump.

WinObj Screenshot

kd> !object \BaseNamedObjects

Object: fffffa8000d72ec0  Type: (fffffadfe7acb1e0) Directory

    ObjectHeader: fffffa8000d72e90 (old version)

    HandleCount: 36  PointerCount: 10275        <- This indicates the number of objects under this namespace

    Directory Object: fffffa8000004060  Name: BaseNamedObjects

 

    Hash Address          Type          Name

    ---- -------          ----          ----

     00  fffffadfe5d79660 Job           TestJobObj_9920

         fffffadfe5d7a8e0 Job           TestJobObj_9913

         fffffadfe5d7b8e0 Job           TestJobObj_9907

         fffffadfe5d84060 Job           TestJobObj_9850

         fffffadfe5d863e0 Job           TestJobObj_9843

         fffffadfe5d873e0 Job           TestJobObj_9837

         fffffadfe5d8e060 Job           TestJobObj_9790

         fffffadfe5d903e0 Job           TestJobObj_9783

         fffffadfe5d913e0 Job           TestJobObj_9777

         fffffadfe5dad660 Job           TestJobObj_9611

         fffffadfe5dae660 Job           TestJobObj_9605

         fffffadfe5db7660 Job           TestJobObj_9551

         fffffadfe5db8660 Job           TestJobObj_9545

         fffffadfe5db98e0 Job           TestJobObj_9538

<Snipped>

In my case the third party performance monitoring tool was experiencing high CPU usage when it queried the “Job Objects” and “Job Object Details” performance counters.

Since this was happening with a third party tool, I tried to recreate the issue in-house and wrote a small utility to create a lot of named job objects under the \BaseNamedObjects namespace using the CreateJobObject() API (a minimal sketch of such a utility is shown below). Then I tried to use the plain old Perfmon tool built into Windows. The moment I tried to add the “Job Objects” & “Job Object Details” counters, the CPU utilization from MMC.EXE hit 90% (Perfmon runs as an MMC). Now that I could reproduce the issue in-house, I investigated what happens when we try to query those counters.
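The original utility isn’t included in the post; the following is only a rough sketch of how such a repro could look. The object count, name prefix, and error handling are my own assumptions.

#include <windows.h>
#include <stdio.h>

// Hypothetical repro utility: flood the global object namespace with named job
// objects so the "Job Objects" performance counters have many instances to walk.
// The object count and name prefix are arbitrary choices for illustration.
static HANDLE g_Jobs[10000];

int main(void)
{
    WCHAR name[64];
    int i;

    for (i = 0; i < 10000; i++)
    {
        // The "Global\" prefix forces the name into the global \BaseNamedObjects
        // directory regardless of which session the utility runs in.
        swprintf_s(name, 64, L"Global\\TestJobObj_%d", i);
        g_Jobs[i] = CreateJobObjectW(NULL, name);
        if (g_Jobs[i] == NULL)
        {
            wprintf(L"CreateJobObjectW failed for %s, error %lu\n", name, GetLastError());
            break;
        }
    }

    wprintf(L"Created %d job objects; press Enter to close them and exit.\n", i);
    getchar();   // keep the handles open so the named objects stay visible

    while (--i >= 0)
    {
        CloseHandle(g_Jobs[i]);
    }
    return 0;
}

Keeping the process alive keeps the handles open, so the named objects remain under \BaseNamedObjects while you add the counters in Perfmon.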

Here is what I found:

When we query the Job Object or Job Object Details performance counter, Perfmon invokes the following functions, respectively, in Perfproc.dll. These in turn call NtQueryDirectoryObject() to locate the objects under the \BaseNamedObjects namespace, and then call QueryInformationJobObject() to gather the performance data for each object. One of the thread stacks is shown below to illustrate the flow:

 

CollectJobDetailData()

CollectJobObjectData()

 

kd> kb

RetAddr           : Args to Child                                                           : Call Site

fffff800`01040abd : 00000000`000ad490 fffff800`01385b47 00000000`00002990 00000000`00000000 : nt!NtQueryInformationJobObject

00000000`77ef171a : 00000000`77d423cb 00000000`000ad498 00000000`00000000 00000000`c0000135 : nt!KiSystemServiceCopyEnd+0x3

00000000`77d423cb : 00000000`000ad498 00000000`00000000 00000000`c0000135 00000000`00000000 : ntdll!ZwQueryInformationJobObject+0xa

000007ff`5a794a16 : 00000000`000ad990 00000000`00000008 00000000`00000400 00000000`000addf8 : kernel32!QueryInformationJobObject+0x77

000007ff`5a7932b9 : 00000000`0000dd20 000007ff`5a79aa84 00000000`00000001 00000000`00000040 : perfproc!CollectJobObjectData+0x356

000007ff`7feeb497 : 00000000`00000000 00000000`000adee0 00000000`00000000 00000000`00000001 : perfproc!CollectSysProcessObjectData+0x1f9

000007ff`7fef09d1 : 00000000`000ae1d0 00000000`00000000 00000000`0000007a 00000000`00000000 : ADVAPI32!QueryExtensibleData+0x951

000007ff`7fef0655 : 0039002d`00350038 00000000`77c43cbd 00000000`00000000 00000000`00000000 : ADVAPI32!PerfRegQueryValue+0x66d

000007ff`7ff0b787 : 00000000`00000000 00000000`00000000 00000000`000ae8a0 00000000`000ae888 : ADVAPI32!LocalBaseRegQueryValue+0x356

000007ff`5b17ba27 : 00009ae8`b8ee62a9 00000000`000c46a0 00000000`00200000 00000000`00000000 : ADVAPI32!RegQueryValueExW+0xe9

 

You can see the type of performance data queried for each Job Object in the Win32_PerfFormattedData_PerfProc_JobObject class. Although that page documents a WMI interface for querying the performance data, it represents the same data set that the native API queries.
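To get a feel for the per-object work described above, here is a hedged sketch of the kind of query that is performed for each job object once it has been opened. It is not the perfproc.dll code itself, just the documented user-mode API seen in the stack above; the structure choice and output are mine.

#include <windows.h>
#include <stdio.h>

// Rough sketch of the per-object query: given an open job handle, pull the
// basic accounting counters for it.
BOOL DumpJobAccounting(HANDLE hJob)
{
    JOBOBJECT_BASIC_ACCOUNTING_INFORMATION info;
    DWORD returned = 0;

    if (!QueryInformationJobObject(hJob,
                                   JobObjectBasicAccountingInformation,
                                   &info,
                                   sizeof(info),
                                   &returned))
    {
        printf("QueryInformationJobObject failed, error %lu\n", GetLastError());
        return FALSE;
    }

    printf("Active processes: %lu, total processes: %lu, page faults: %lu\n",
           info.ActiveProcesses, info.TotalProcesses, info.TotalPageFaultCount);
    return TRUE;
}

Multiply this by thousands of objects, add the directory enumeration needed to locate each object by name, and repeat the whole thing every few seconds, and the CPU cost described below adds up quickly.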

Once I understood what was happening under the hood, it was easy to deduce that every iteration of the query would be a CPU intensive operation. For each object we do a recursive query for the performance data as described in the document above. If the number of objects is large, and we are running a broad query of the “All Counters” & “All Instances” variety, then every single iteration will consume a lot of CPU. To validate this, I got help from one of my colleagues, Steve Heller, who modified the sample code received from the vendor of the monitoring tool (which demonstrated what they were doing). We noticed they were querying this information every 2.5 seconds. The sample tool demonstrated that with about 10,000 job objects, a single iteration of the query took roughly 12.5 seconds to complete. No wonder the CPU usage remained high: before a single iteration of the query for all the job objects could finish, four additional queries were already queued, and the CPU usage continued to remain at 90% or more.

The conclusion that can be drawn from this test is that querying the \BaseNamedObjects namespace when it contains a large number of objects will invariably result in a fair amount of CPU usage. This gets worse if the query is performed for complex objects (e.g., job objects, for which we run nested queries for the individual performance data of each object), and if the data is queried too frequently.

The two aspects responsible for the high CPU usage are:

1.       The frequency at which the Performance Data was being queried

2.       The number of objects under \BaseNamedObjects namespace.

 

Though there is no defined limit on the number of objects that can be created under the various kernel object namespaces, the absence of a documented limit doesn’t mean that we should simply create a lot of objects. Applications should be designed carefully to use such kernel resources judiciously.

At the same time, just because we can query a lot of performance data through publicly available interfaces doesn’t mean we should query it at a very high frequency. Certain operations can be CPU intensive, as we have seen here, so we should identify what we really need to query and at what frequency.

Quoting from the MSDN article Job Objects:

 “A job object allows groups of processes to be managed as a unit. Job objects are namable, securable, sharable objects that control attributes of the processes associated with them. Operations performed on the job object affect all processes associated with the job object.”  

The above point clearly indicates that the job object framework is intended for processes that are participating in a job to monitor and control each other’s resource usage.

There could be different reasons for which a system administrator needs to gather information about the resource usage of a particular process. One of them could be to meet a Service Level Agreement. In that case we can monitor individual processes running on the server with the counters available under “Performance Object -> Process”.

Equipped with the knowledge that querying performance data for job objects may result in excessive CPU usage on the server, an administrator should evaluate, or consult the application owner if needed, whether they really need to gather this performance data.

The following documents are available for application developers to assist with the creation of application specific namespaces. Creating a private namespace and creating objects under that namespace ensures that application specific objects are not exposed to the Performance Library, because they won’t appear under the \BaseNamedObjects namespace. (A minimal usage sketch follows the links below.)

Object Namespaces

http://msdn.microsoft.com/en-us/library/ms684295(VS.85).aspx  

 

CreatePrivateNamespace Function

http://msdn.microsoft.com/en-us/library/ms682419(VS.85).aspx

 

Job Objects

http://msdn.microsoft.com/en-us/library/ms684161(VS.85).aspx

 

CreateJobObject Function

http://msdn.microsoft.com/en-us/library/ms682409(VS.85).aspx
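As referenced above, here is a hypothetical sketch (Windows Vista and later) of creating a private namespace and placing a named object in it instead of \BaseNamedObjects. The boundary name, alias prefix, object name, and security descriptor string are examples only, not code from this post.

#include <windows.h>
#include <sddl.h>
#include <stdio.h>

#pragma comment(lib, "advapi32.lib")

int main(void)
{
    SID_IDENTIFIER_AUTHORITY worldAuth = SECURITY_WORLD_SID_AUTHORITY;
    SECURITY_ATTRIBUTES sa = { sizeof(sa), NULL, FALSE };
    PSID worldSid = NULL;
    HANDLE boundary, ns, job;

    // Boundary descriptor: defines who may open the namespace (Everyone here).
    boundary = CreateBoundaryDescriptorW(L"MyAppBoundary", 0);
    AllocateAndInitializeSid(&worldAuth, 1, SECURITY_WORLD_RID,
                             0, 0, 0, 0, 0, 0, 0, &worldSid);
    AddSIDToBoundaryDescriptor(&boundary, worldSid);

    // Simple DACL granting Everyone access to the namespace object.
    ConvertStringSecurityDescriptorToSecurityDescriptorW(
        L"D:(A;;GA;;;WD)", SDDL_REVISION_1, &sa.lpSecurityDescriptor, NULL);

    ns = CreatePrivateNamespaceW(&sa, boundary, L"MyAppNS");
    if (ns == NULL)
    {
        printf("CreatePrivateNamespaceW failed, error %lu\n", GetLastError());
        return 1;
    }

    // Objects created with the alias prefix land in the private namespace, so
    // they never show up under \BaseNamedObjects for the performance library.
    job = CreateJobObjectW(NULL, L"MyAppNS\\TestJobObj_1");
    printf("Job object handle: %p\n", (void *)job);

    // Cleanup (ClosePrivateNamespace, CloseHandle, LocalFree, FreeSid) omitted
    // for brevity in this sketch.
    return 0;
}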

 

This problem would be predominantly noticed on older systems. To mitigate it to a certain extent, Windows Vista (and all later operating systems) have a per-session BaseNamedObjects namespace (i.e., application-specific objects are created under a namespace of the form \Sessions\<SessionID>\BaseNamedObjects). Hence these objects will not be exposed to the Performance Libraries and will not be available to query via performance monitoring tools. Unfortunately the changes that were made to the kernel to make this work are too intricate to be back-ported to Windows 2003 SP2.

This brings me to the end of this post. I hope this information gives you better insight into how an oversight in an application’s design can lead to breaking some other component at a later stage.

 

 

Understanding !PTE, Part 1: Let’s get physical


Hello. It’s Ryan Mangipano again (Ryanman). Today’s blog will be the first in a multi-part post designed to help you understand the output of the !PTE debugger command along with the basics of hardware virtual addressing.  To better understand virtual addressing, we will use the debugger to manually translate a PAE virtual address that maps to a 4-KByte page into the actual physical address, in order to understand what !PTE is displaying. I’ll provide relevant information about virtual addresses and virtual memory along the way.

 

We’ll start by translating a non-prototype valid hardware VA from an x86 PAE system

The actual process of manually decoding a virtual address is going to vary according to the architecture (x86, x64), the size of the page, whether or not the virtual address is a large page, whether the page is marked as valid, whether it’s a hardware or software PTE, and whether PAE is enabled. For simplicity, we will not be going over the table entry (PTE/PDE) flags until part two of the blog. For my first example, I am going to demonstrate how to use the information in the processor manuals together with the debugger to decode a valid non-prototype virtual address into the physical memory that it references.  You can then try this on your own using windbg and the !pte command to validate your findings.

 

Finding an address to translate

To start, we'll need to locate a virtual address that maps to a valid PTE. I am going to use the following highlighted virtual address which I found in a memory dump.  

 

f9a12d0c ff155400a1f9    call    dword ptr [sfilter+0x4054 (f9a10054)]

 

Get out your processor manuals

The AMD and Intel manuals both contain helpful reference material on this subject. PDFs of these manuals are available online. Since my CPU is an Intel, I’m going to refer to the Intel manuals.

 

1: kd> !cpuinfo

CP  F/M/S Manufacturer  MHz PRCB Signature    MSR 8B Signature Features

 0 15,6,4 GenuineIntel 3192 0000000400000000                   a0073fff

 

My Intel manuals arrived on CD via snail mail a few days after placing my free order:

http://www.intel.com/products/processor/manuals/order.htm

 

Is PAE in use?

On this Intel x86 system, we first need to determine if we are using PAE. In the Intel “System Programming” manual, section 3.6.1 “Paging Options”, I found that the PAE (Physical Address Extension) flag can be found in bit 5 of the CR4 register. The PG (Paging) flag in CR0, which enables paging, must of course also be set. Register bit numbering starts at zero, so bit 5 is the sixth bit position from the right. Let’s examine cr4 and convert the value it contains into binary:

 

1: kd> r cr4

cr4=000006f9

 

1: kd> .formats 000006f9

  Binary:  00000000 00000000 00000110 11111001

 

You can use the debugger to check for other flags in the same manner

You can use the above method to check for other flags that you find documented in the processor manuals. For example, you can see that bit 4 is also set in the cr4 register output above. This is the Page Size Extensions (PSE) bit which enables large page sizes.
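If you prefer to let the debugger do the bit math for you, you can test an individual bit directly. Using the cr4 value shown above (0x6f9), the following illustrative expressions confirm that both the PAE bit (bit 5) and the PSE bit (bit 4) are set:

1: kd> ? (0x6f9 >> 0n5) & 1
Evaluate expression: 1 = 00000001

1: kd> ? (0x6f9 >> 0n4) & 1
Evaluate expression: 1 = 00000001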

 

Terminology

Paging is simply a method of dividing up the linear address space into chunks, and pages are the name that we give to the chunks that result. The size of these chunks is referred to as the page size. On x86 systems, the standard page size is 4 KBytes.  A large page means that the page is larger than the standard size (2 MB on PAE x86 or 4 MB on non-PAE x86).

 

Keep the last three

On a standard 4-KByte page size virtual address, the address f9a10054 can be thought of as being split up as follows:

 

 Information needed to locate the base of the page in physical memory      f9a10       

Offset into that physical page once it is found                               054

 

This means that the last three hexadecimal digits of our physical address will also be 0x054. Once we find the base of our physical page (which will always end in three zeros), we can simply add the offset 0x54 to it and we will have our physical address.
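You can verify this split with the debugger's expression evaluator: masking off the low 12 bits gives the byte offset within the page, and shifting them away gives the portion used to locate the page (output shown from an illustrative session on the same address):

1: kd> ? 0xf9a10054 & 0xFFF
Evaluate expression: 84 = 00000054

1: kd> ? 0xf9a10054 >> 0n12
Evaluate expression: 1022480 = 000f9a10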

 

Are there any commands that work with physical addresses?

We will also need a way to work with these physical addresses. Most of us are familiar with using the dd command with virtual addresses. There is also a command that accepts a physical address instead of a virtual address:  !dd  (notice the ! before the command). There are also variants of the !dd command, such as !db , !du , !dq.

 

What data do we need to obtain to perform this conversion?

We need to determine how to use the f9a10 portion of the virtual address to find the physical page base. The Memory Management System sets up and maintains a hierarchy of tables that keep track of the mappings between virtual addresses and physical addresses. You will need the following information to traverse the tables yourself and convert this address.

 

1.       Starting point. This will be in the form of a pointer to the base of the first table that you need to check

2.       The number of table levels in use on the system and the size of each entry in the tables on our system

3.       The offsets into the different tables and the bases of each table.

 

Once you have all the above information, you can use the debugger to traverse these tables.

 

Let’s get our plan together for how we are going to obtain these three pieces of information

1.      Starting point. This can be obtained from the PDBR. The CR3 register is known as the Page Directory Base Register (PDBR) which points to the physical address of the base of the first table.

 

2.      The number of table levels in use on the system and the size of each entry in the tables on our system. This information can also be obtained from the Intel processor manuals. As previously mentioned, the number of tables, the size of the entries in the tables, the flags, where the bits are split up, and the names of the tables vary according to platform and if you are using PAE or not.  All of these, however, use the same basic concepts:

Register pointing to a -> table with entries pointing to a table -> with more entries pointing to -> another table -> pointing to the physical address of the page in memory.

 

image

Table1: By reading the Intel Processor Manual “System Programming:  Section 3.8, Physical Address Extension”, I was able to determine that the first table used for x86 PAE virtual addressing is called the Page-Directory-Pointer Table. It’s a table with only 4 Entries that are 64 bits (which is 8 bytes) wide. Each entry is referred to as a Page Directory Pointer Entry and abbreviated as PPE, PDP or PDPE depending on the source.  These entries provide the index into the page directory, which is known as the Page Directory Index (PDI). One of these four pointers will lead you to the physical address of the base of the next table that you need to visit in the x86 PAE hierarchy.  Just like the pointer that we used from the CR3 register, some of the bits in these table entries are not used as part of the index (referred to as a pointer in some documentation). We will grab the relevant bits and add the appropriate number of zeros to the index to obtain our physical address pointer.  I will cover what these other bits are used for in part two of this blog. We must substitute zeros for these bits.

 

Table2: The table at the second level of tables in the x86 PAE hierarchy (which is referenced by the pointers in Table Level One) is called the Page Directory Table.  Don’t confuse this with the Page-Directory-Pointer Table. Each Page Directory table can hold 512 entries which are 64 bits in size. The entries in this table are called Page Directory Entries (PDE) and they provide Page Table Index (PTI). Just like the last table, these entries contain indexes (which we convert to a pointer by simply adding zeros) to the base of the next table in the hierarchy.

 

Table3:  The last table is referred to as the Page Table. Each entry in the page table is called a Page Table Entry (PTE) and provides the Page Offset.  Just like the last table, not all the bits are used for pointers. Each page table contains up to 512 entries which are also 64 bits in size. Each 64-bit entry in the table contains a pointer to the base of the page in physical memory.

 

In summary there are 3 levels of tables when using x86 PAE.  These are the Page Directory Pointer Table, Page Directory Tables, and the Page Tables.

 

3.      The offsets into the different tables and the bases of each table.

Each table above is a listing of indexes that will be used to locate the base of the next table. However, once we arrive at each table, we will need to know the index, or offset, into that table in order to know which entry leads to the next one. These offsets into the tables can be obtained from the virtual address itself.  Let’s review our virtual address again.  However, this time we will break the address down in binary:

 

 Virtual Address: f9a10054

 

1: kd> .formats 0xf9a10054

  Binary:  11111001 10100001 00000000 01010100

 

Page Directory Pointer Index(PDPI)       11                        Index into 1st table(Page Directory Pointer Table)

 Page Directory Index(PDI)                    111001 101          Index into 2nd table(Page Directory Table)

Page Table Index(PTI)                            00001 0000          Index into 3rd table(Page Table)

Byte Index                                               0000 01010100    0x054, the offset into the physical memory page

 

So as you can see, the virtual address is nothing more than a bunch of indexes/offsets. 
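The same indexes can be pulled out with shifts and masks, which makes a handy cross-check against the .formats breakdown above (2 bits, then 9, 9, and 12 bits; output shown from an illustrative session):

1: kd> ? 0xf9a10054 >> 0n30
Evaluate expression: 3 = 00000003

1: kd> ? (0xf9a10054 >> 0n21) & 0x1FF
Evaluate expression: 461 = 000001cd

1: kd> ? (0xf9a10054 >> 0n12) & 0x1FF
Evaluate expression: 16 = 00000010

These results (3, 0x1cd, and 0x10) are the same values as 0y11, 0y111001101, and 0y000010000 in the table above.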

 

Putting all this data together to find the physical address

Now that we have all the required data, let’s proceed to locate our physical address

 

1.       Obtain our base pointer to the first table.  As we discussed earlier, we can obtain this value from the cr3 register.  Bits 0-4 of this register are not used for the pointer to the table base.  This means that in order to get the base pointer, we will need to replace these 5 least-significant bits with zero. This will result in the table base being located on a physical address that is always aligned on a 32-byte boundary.

 

Keep in mind that CR3 will have a different value here for each process. You must make sure that you are in the appropriate process context before proceeding.  This is because user mode tables are specific to a particular process. Notice that I said processes, not threads.  CR3 will not be changed when swapping between threads of the same process, since each thread in a given process shares the same address space.  Tables relating to the system address space (kernel mode) are shared between all processes.

 

We’ll need to dump out the CR3 register (PDBR) in a format where we can view the last 5 bits. As you can see in the .formats output below, the 5 least significant bits are already set to zero. This means that the hexadecimal value located in the cr3 register is a pointer to the base of the first table. So we now have our starting point.  Physical address 023406e0 is the base of the first table.

 

1: kd> .formats @cr3

  Hex:     023406e0

  Binary:  00000010 00110100 00000110 11100000

 

The proper way to extract the pointer from the value would be to AND it against the following mask:

 

1: kd> ? 0y11111111111111111111111111100000

Evaluate expression: -32 = ffffffe0

 

1: kd> .formats (@cr3 & ffffffe0)

  Binary:  00000010 00110100 00000110 11100000

 

You can use the !process command to get and/or verify this value.

 

1: kd> !process

PROCESS ff981a58  SessionId: 0  Cid: 0d54    Peb: 7ffde000  ParentCid: 0550

    DirBase: 023406e0  ObjectTable: e1541510  HandleCount:  30.

               

You can also obtain this information from the EPROCESS structure.

 

1: kd>  dt nt!_EPROCESS ff981a58 Pcb.DirectoryTableBase

   +0x000 Pcb                    :

      +0x018 DirectoryTableBase     : [2] 0x23406e0

 

2.       Obtain our base to the second table by finding our index into the first table.  The first table (Page Directory Pointer Table) only has 4 entries and each entry is 64 bits wide. We know from the first two bits of the Virtual Address above that our offset into this table is 0y11  (The y tells the debugger the value is binary, instead of eleven). Eleven would be represented in the debugger as 0n11. We can simply multiply the offset (0y11) by the size of each entry and add the result to the base of the table to get our entry. The entries in the table are 8 bytes wide. We shall use sizeof() as shown below to obtain this value.  We can pass this math to the !dq command to dump the data at these physical addresses.

 

1: kd> !dq (@cr3 & 0xffffffe0)+(0y11*@@(sizeof(nt!_MMPTE))) L1

# 23406f8 00000000`05503801

 

The processor manuals indicate that the low 12 bits (bits 0-11) of the entry above are not part of the pointer and must be discarded.  This causes alignment on 4KB boundaries. Since each hex digit above represents 4 bits, this means that we need to change 801 to 000. That gives us the physical address of the base of our second table, 05503000. We will accomplish this in the next step by ANDing this entry against a mask.

 

3.       Obtain our base to the third table by finding our index into the second table. The second table (Page Directory Table) works the same way, except that this table contains up to 512 entries (on PAE systems).  Keep in mind that there can be more than one Page Directory Table for each process on a PAE system, however we are only concerned with the one that contains the data relating to our virtual address.  We know from the virtual address that the offset into this table is 0y111001101.  Calculate the address in the same manner as before. We will also need to set the last 12 bits to zero, just like before.

 

1: kd> !dq (0x05503801 & 0xFFFFFF000) +( 0y111001101*@@(sizeof(nt!_MMPTE))) L1

# 5503e68 00000000`0102d963

 

The last three hexadecimal digits must be changed to zero, since they are not part of the pointer. This gives us the base address of our last table, the page table, 0x0102d000

 

4.       Find the base of the physical page our memory resides in by finding our index into the third table. Using the base of the page table from the previous step, let’s add the index into this table that we obtained from the virtual address, 0y000010000. As before, the last three hexadecimal digits of the entry from the previous step are set to zero.

 

1: kd> !dq (0x102d963&0xFFFFFF000)+(0y000010000*@@(sizeof(nt!_MMPTE))) L1

# 102d080 00000000`02010121

 

5.       So now we have the physical address of the base of the page, 2010000. Add the base of the page to our offset from the virtual address, 0x054.

 

1: kd> ? (2010121 & 0xFFFFFF000) +0x054

Evaluate expression: 33620052 = 02010054

 

A shortcut is to simply change the 0x121 to 0x054 in the previous step.

 

6.       Now, let’s dump out the data at our physical address and the virtual address

 

1: kd> !dd (2010121 & 0xFFFFFF000) +0x054 L2

# 2010054 804ef09c 804ef12c

 

1: kd> dd 0xf9a10054 L2

f9a10054  804ef09c 804ef12c

                You can see that the data displayed by the two commands is the same.

 

7.       Now let’s dump out the virtual address using the !PTE extension. Notice the values that it provides you with. Compare the values we computed above to what is displayed below. You should now understand where each of these fields comes from.

 

1: kd> !pte 0xf9a10054

               VA f9a10054

PDE at 00000000C0603E68    PTE at 00000000C07CD080

contains 000000000102D963  contains 0000000002010121

pfn 102d       -G-DA--KWEV    pfn 2010       -G--A--KREV

 

The virtual addresses shown for the PDE and the PTE are the virtual addresses of the PDE (Page Directory Entry) and the PTE (Page Table Entry). Also, please note that PFN represents Page Frame Number. PFN is the term used to describe what I referred to as “pointers to the base of the next table” in the hierarchy.  This is because it really isn’t a pointer; it’s an index into the table.

 

Hopefully, the output of !PTE makes a lot of sense to you now. In part two of this blog, I’ll discuss what the PDE/PTE flags (-G-DA--KWEV) represent and provide an example of manual conversion of x86 PAE Large Page Virtual Addresses to Physical.

X64 Deep Dive Presentation from the 2010 Global Engineering Conference


We just hosted our annual Global Engineering Conference for members of the Escalation Team at Microsoft. It’s always great to meet up with our peers from other regions and see folks like Mark Russinovich. During the conference we had the opportunity to listen to our friend and independent consultant from CodeMachine.com, T.Roy. This year he presented a deck on x64 debugging which I think you’ll find useful. Enjoy!

 

x64 Deep Dive, presented at Microsoft GEC 2010

 http://www.codemachine.com/presentations/GES2010.TRoy.Slides.pdf

 

In case you missed his presentations from past Microsoft sessions over the years here is a list of links.

 

Writing Debugger Extension, presented at Microsoft GEC 2007

http://www.codemachine.com/presentations/GES2007.TRoy.Slides.pdf

 

Supporting Support, presented at Microsoft GEC 2009

http://www.codemachine.com/presentations/GES2009.TRoy.Slides.pdf

 

 

Cheers,

Ronsto

We Need Your XPERF Feedback


Hi,

 

This is Tate asking for your direct and valuable feedback on XPerf (The Windows Performance Toolkit).  We are knee deep in working with the creators of the tool (The Windows Fundamentals Team) to provide end user feedback which will drive features for the next version. Now that you as our readers have seen several blog posts on the utility and hopefully have used it for a few of your own debugs, what would you like to see in terms of new features, fixes, data collection steps, etc?

 

Particular areas of interest to the team are:

 

-        Data Collection – easy or difficult, what would you change?

-        Usage Scenarios – stack walking is powerful, what would you like the tool to expose that it does not?

-        Any blocking factors keeping you from using the toolkit, what are they, how could we remedy that for you?

-        Data Analysis – what would make your life as a debugger using the toolkit better?

 

Please send us your feedback below and thanks in advance.

 

-Tate


Video: Using XPERF to root cause CPU consumption


Hi Debuggers,

 

Here is a short video showing how to root cause CPU consumption issues on Vista/Server 2008 and higher operating systems with Xperf.

 

For this scenario, Xperf is an easy to implement and powerful replacement for previous approaches like gathering a series of dumps during the high CPU condition, kernrate data, perfmon logs, other profilers, etc.

 

Xperf allows one to quickly turn on profiling and see where the CPU time is being spent by using tracing already in the box!

 

 

-Tate

 

 

Note - Feel free to post questions here, or jump into a discussion about the video on our Twitter page ( @ntdebugging ).

 

 


Critical Device Database TIP


On a fairly regular basis, Bob Golding, our resident GES storage guru, sends out debugging tips to our group. We thought our blog readers would find value in these tips so we’re posting one here. Let us know what you think of the nugget.

Thanks,

Ron

Introduction

Hi everyone, Bob here.  Today I thought we’d have a little discussion about the Critical Device Database (CDDB) in the registry and an issue that can be caused when a device is not contained there.  This database stores configuration data for new devices that must be installed and started before the Windows components that normally install devices have been started.  The idea behind the critical device database is to allow the installation of devices without using the user mode plug-and-play manager.   If a device is determined to be new, this database is queried to see if the information needed to install it is present.

What was the issue?

A customer was getting a Stop 0x7B (Inaccessible_Boot_Device) after they installed a BIOS update.  Further investigation via the debugger using the !devnode command showed the following issue with a few devices:

         DevNode 0x8631f008 for PDO 0x8631f8e0

           InstancePath is "PCI\VEN_15B3&DEV_5A44&SUBSYS_5A4415B3&REV_A1\6&38f4f1f2&0&00080310"

           State = DeviceNodeInitialized (0x302)

           Previous State = DeviceNodeUninitialized (0x301)

           Problem = CM_PROB_NOT_CONFIGURED

The above device is a bridge, and according to the definition of CM_PROB_NOT_CONFIGURED, it does not have a ConfigFlags registry entry.  I saw that this same problem occurred with a few bridges.  If the bridge cannot be enumerated, devices on the bridge will not be discovered either.  The Instance ID is used to look up the particulars such as driver name and class in the HKLM\System\CurrentControlSet\ENUM key in the registry.  What happened here is the lookup failed and the system thought it was a new device.  Since based on the device class this device was needed for boot, a Stop 0x7B occurred.

What caused the issue?

When the BIOS was updated the Instance ID included the version number of the bridge.  The update increased the version number, so the Instance ID was changed.

       DevNode 0x8635ca40 for PDO 0x8634c4a8

          InstancePath is "PCI\VEN_8086&DEV_1A38&SUBSYS_02DD1014&REV_B1\3&11583659&0&40"

          State = DeviceNodeInitialized (0x302)

          Previous State = DeviceNodeUninitialized (0x301)

          Problem = CM_PROB_NOT_CONFIGURED

Looking at the registry data we could see:

PCI\VEN_8086&DEV_1A38&SUBSYS_02DD1014&REV_93

So a revision change caused the issue.

What stops this from happening?

Certain devices in the system are required for boot.  The device class determines if the device is in the boot device family.  If so, the hardware ID is written to the CDDB in the registry, so that if the device is determined to be new, it can be found there during boot.

Below is an example of a bridge entry.  The contents of the pci#ven_8086&dev_244e key contain the driver and the class.  This is enough to get the device started for boot.  The user mode PNP manager will complete the installation.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CriticalDeviceDatabase\pci#ven_8086&dev_244e

How come this bridge was not there?

When the INF was run, the device class was set to “unknown” class, so the OS did not know to write the information in the CDDB.  

What was done to correct the problem?

The BIOS update was temporarily reverted, and then the correct install package was found with the correct INF that has the bridges defined as a system device.  The device was re-installed (pre-update) so the device could be written properly in the CDDB, then the BIOS update was reapplied without a problem.

 

 

How to use the DedicatedDumpFile registry value to overcome space limitations on the system drive when capturing a system memory dump


Prior to Windows Vista and Windows Server 2008 we had to keep a large page file on the system drive (typically drive C:) in order to properly capture system memory dumps for troubleshooting.  This presented problems as systems with very large amounts of RAM became more common, resulting in requirements for very large amounts of free space on the C: drive, or requiring that system visible memory be artificially limited for troubleshooting purposes.  This is no longer a requirement thanks to the Dedicated Dump File feature, which is available for use in Windows Vista and later operating systems.

 

What is it?

The dedicated dump file is basically a page file that is reserved for use only by the system crash dump routines.  It is not used for paging virtual memory.  Like a page file, the system process keeps an open handle to the dedicated dump file, which prevents it from being deleted.  When you manually initiate a memory dump, or the system crashes on its own, the data is written into the dedicated dump file instead of the page file on the system drive.  This feature is enabled by setting the following registry value:

Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

Name: DedicatedDumpFile

Type:  REG_SZ

Value: A dedicated dump file together with a full path, such as D:\dedicateddumpfile.sys

Where can it be stored?

The dedicated dump file can be stored on any local volume that can support a page file.  This volume must be considered a fixed disk device and must not be hosting Windows clustered storage.  For example, this means you can store this file on a second partition of your server’s primary hard disk, or on a SAN volume that is mounted early in the boot process. 

Fun fact: Unlike a page file, you can actually direct the dedicated dump file to a sub-folder, even one that is actually a volume mount point.  However, the sub-folder must pre-exist during boot, and there is generally little reason to actually do this.

 

How much space will it use?

You can automatically allow the system to determine the size of the dedicated dump file, or you can manually set the size by using the DumpFileSize registry value. 

 

System determined size:

If you choose to let the system determine the size, then regardless of the dump file type set via the CrashDumpEnabled value, the dedicated dump file will be roughly the size of system visible RAM plus 256 MB.  The extra 256 MB allows room for secondary dump data that may be written by driver bug check callback routines.  In order to write this data, the drivers would have to first call KeRegisterBugCheckReasonCallback() to register the routines that would execute when the system bug checks.   This is currently not very common due to the limitations imposed by the high IRQL at which bug check callbacks run.  Allowing the system to determine the size is a good idea, especially when disk space is not of huge concern.  Also, the dedicated dump file size will be adjusted automatically at boot if additional RAM is installed or removed later.  This includes artificial memory reduction implemented via the BCDEdit option, truncatememory.  This eliminates the need to manually resize your page file to support proper collection of a memory dump.

 

Manual size specified via the registry:

The following registry value will allow you to manually set the dedicated dump file size in megabytes:

Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

Name: DumpFileSize

Type:  REG_DWORD

Value: The dump file size in megabytes

 

 

One good reason to set this value manually is the case of a very large memory system from which only a kernel memory dump is desired.  In this case, a dedicated dump file the size of RAM+256 MB is not required.  Generally, even on 64-bit systems, kernel memory dumps are no larger than a few gigabytes, although they have the potential to be much larger.  In nearly all cases reserving one third the size of RAM will be more than sufficient to capture kernel memory.  If you have plenty of room on the drive and want to play it safe, you can simply allow the system to determine the size for you by setting the DumpFileSize value to zero, or by not creating the value at all.

Even if you set DumpFileSize to a value larger than needed, or the system chooses a size larger than the selected dump type requires, the resultant file after a bug check will only be as large as the data that is stored in it.  This is most commonly the size of physical RAM plus room for the dump header information.  If you choose to set the DumpFileSize value manually, then you should also set the IgnorePagefileSize value under the CrashControl key to 1.  This will help ensure a dump file is created.
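As an illustration only (the 8 GB value below is an arbitrary example, not a recommendation), the manual sizing described above can be expressed as a .REG file in the same style as the DedicatedDumpFile sample later in this post:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]

"DumpFileSize"=dword:00002000

"IgnorePagefileSize"=dword:00000001

DumpFileSize is specified in megabytes, so dword:00002000 (8192) reserves an 8 GB dedicated dump file.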

 

What about the final dump file?

After a bug check, the data will be written to the dedicated dump file on disk.  However, the final memory dump location will be based on the following registry value:

Location:   HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

Name: DumpFile

Type:  REG_EXPAND_SZ

Value: The target dump file name together with the full path, such as D:\memory.dmp

 

 

The system will extract the dump data from the dedicated dump file and save the resultant file to the location and file name specified for the DumpFile value.  The default for this value is %SYSTEMROOT%\MEMORY.DMP.  Since we are using the dedicated dump file to overcome space limitations of the system drive, we will likely want to also set the DumpFile value to point to an alternate location than the default. 

If you set this value to point to the same volume as the dedicated dump file, it will save WERFAULT.EXE the step of performing an additional copy of the temporary dump file after reboot.  Instead the file will simply be renamed on the target volume.   If there is not enough room for both the dedicated dump file and the memory dump after the bug check, then the dedicated dump file will not be recreated, and only the memory dump will remain.  If this happens and the system bug checks again, only the original dump will remain because there will be no dedicated dump to write the second bug check to until space is freed up on the drive and the system is rebooted.

 

DedicatedDumpFile Usage Example:

The following example will set the dedicated dump file and resultant dump to be stored on the root of drive D:

 

Windows Registry Editor Version 5.00

 

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]

"DumpFile"=hex(2):44,00,3a,00,5c,00,4d,00,45,00,4d,00,4f,00,52,00,59,00,2e,00,\

  44,00,4d,00,50,00,00,00

"DedicatedDumpFile"="D:\\DedicatedDump.sys"

 

 

(Note: The lines above can be saved as a .REG file and imported into the registry)

 

The screenshot below demonstrates a system configured to capture a complete 4 GB memory dump to a manually sized dedicated dump file, and will store the final MEMORY.DMP on the D: drive.

 image

Important information:

949052 - Kernel memory dump files may not be generated on Windows Server 2008-based and Windows Vista SP1 or later based computers when physical memory is larger than the size of the page file

 

957517 - A dedicated complete memory dump file may not be successfully generated if the volume that stores the dedicated dump file has insufficient free space

950858 - Dedicated dump files are unexpectedly truncated to 4 GB on a computer that is running Windows Server 2008 or Windows Vista and that has more than 4 GB of physical memory

 

Upcoming Blog Articles


Working on the Global Escalation Services Team at Microsoft is really a cool gig. We’re privileged to work on several different Windows components at a very deep level so life is never boring. Here’s a list of the articles coming from the group in the next few weeks. Hope you enjoy!

Ron Stock

STORPORT Logging - Bob Golding discusses Storport logging. Beginning with the new versions, it is now possible to measure timing statistics for requests made to a system’s disk unit. These measurements are taken at the lowest possible level of OS interaction with the storage adapter hardware, making it much easier to diagnose storage performance issues.

Walking through a crash dump analysis – Chad Beeder walks through a debug.

Understanding !PTE Part 2 and Part 3 – Continuation of Ryan’s 3 part series

How DPM implements VSS – Dennis Middleton gives a high level view of how DPM implements the VSS APIs to manage replicas. 

How Hyper-V does backups of running VMs - Another article from Dennis

Pushlocks – Mark Lloyd discusses why it's important to use only published APIs.

Debugger Extension Blog – Ryan’s work with debugger extensions.

Understanding !PTE, Part 2: Flags and Large Pages


Hello, it's Ryan Mangipano with part two of my PTE series. Today I'll discuss PDE/PTE flags, the TLB, and show you a manual conversion of x86 PAE Large Page Virtual Addresses to Physical. If you haven’t read the first part of this series please find it here. It's a good primer before proceeding.

PDE and PTE flags 

I'll start with a discussion about the PDE/PTE flags. If you recall from part one, not all of the bits of the Page Directory Entry (PDE) are related to the index (used to form the pointer to the base of the next level).  This is true of the table entries at all the levels. For example, on a PAE x86 system only 9 bits of the virtual address are used as the index into the page table. During our previous conversion, we only used some of the bits for the index into the next table; the rest of the data we simply dropped off and replaced with zeros as needed. So what are the other bits used for? They are used for a series of flags. You will observe the state of these flags in the !PTE output in the following manner: (-G-DA--KWEV).

These flags are documented in the Intel Manuals. Intel and AMD reserved some of the flags for use by the Operating System. All of these are also documented in chapter 9 (Memory Management) of “Windows Internals, 5th edition”. Let’s dump the PDE from the virtual address we dissected last time. This will allow you to see some of the flags that are present in the other bits
   

Obtaining the Virtual Address of the PDE

1: kd> !pte 0xf9a10054

VA f9a10054

PDE at 00000000C0603E68    PTE at 00000000C07CD080

contains 000000000102D963  contains 0000000002010121

pfn 102d       -G-DA--KWEV    pfn 2010       -G--A--KREV

 

Here is the data-type of our PDE

1: kd> dt nt!_MMPTE u.Hard

   +0x000 u      :

      +0x000 Hard   : _MMPTE_HARDWARE

 

Dumping the PDE and flags

1: kd> dt _MMPTE_HARDWARE 00000000C0603E68

nt!_MMPTE_HARDWARE

   +0x000 Valid            : 0y1

   +0x000 Writable         : 0y1

   +0x000 Owner            : 0y0

   +0x000 WriteThrough     : 0y0

   +0x000 CacheDisable     : 0y0

   +0x000 Accessed         : 0y1

   +0x000 Dirty            : 0y1

   +0x000 LargePage        : 0y0

   +0x000 Global           : 0y1

   +0x000 CopyOnWrite      : 0y0

   +0x000 Prototype        : 0y0

   +0x000 Write            : 0y1

   +0x000 PageFrameNumber  : 0y00000000000001000000101101 (0x102d)

   +0x000 reserved1        : 0y00000000000000000000000000 (0)

 

Take note of the Letters in the PDE and PTE section of the !pte output, such as -G-DA--KWEV    . These letters represent various flags. The presence or absence of the letter in the !PTE output tells you the state of the flag. These flags can also be seen in the hardware pte output above. 

Valid                   (V)         - Indicates that the data is located in physical memory. If this flag is not set, then the software can use ALL of the rest of the bits for whatever it wants (like storing the pagefile number and offset where the page is stored).
Write                 (W/R)     - Indicates if the data is writeable or read-only. Multiprocessor or Vista or later. Hardware bit is documented in the processor manuals. Reserved Bit 11’s use is documented in Windows Internals, Chap. 9.
Owner               (K/U)      - Indicates whether the page is owned by kernel mode or user mode. Kernel if cleared, user if set.
WriteThrough  (T)           - When set indicates Writethrough caching policy. When not set indicates write-back caching policy
CacheDisable   (N)           - If set, the page translation table or physical page it points to cannot be cached.   
Accessed           (A)           - Set when the page itself, or the table referencing it has been read from or written to
Dirty                   (D)          - Indicates if any data on this page has been updated
LargePage         (L)            - This field is only used on PDEs, not PTEs. It indicates whether the PDE is the last table level (meaning that this entry references an actual page in memory) or whether it is instead referencing a Page Table. If this bit is set in the PDE, the PDE points directly to a 2-MB page when PAE is in use. If PAE is not being used, the large page size being referenced is 4 MB. So basically, this is the page size bit. If this bit is cleared, the final destination page is 4k and can be found in the page table that this PDE points to. If this bit is set, then the final destination page is equal to the size of a large page on your system (2 MB when PAE is in use) and can be located using the index value of this particular PDE, since it becomes the last level. Keep in mind that a larger offset will be needed to reference all the positions in this larger page. To use this feature, the PSE bit (bit 4, which is the 5th bit over) must be set in CR4. The setting in CR4 is a global setting that enables the use of large pages on the system; the flag in the PDE only applies to the individual PDE.
Global                 (G)        - If not set, translation cache flushes affect this entry. If set, other processes use this translation also, so it is not flushed from the Translation Lookaside Buffer (TLB) cache upon process context switches.
CopyOnWrite    (C)       - Intel states this is a software field. Windows uses this for processes to share the same copy of a page. The system will give the process a private copy of this page (by copying it) if the process attempts to write to the page. Any attempt to execute code in such a page on a No-Execute system will cause an access violation.
Prototype           (P)        - Intel states this is a software field. Windows uses this to indicate that this is a prototype PTE.
Reserved0                       - These Bits are Reserved
E                            (E)       - Executable page. E is always displayed on platforms that do not support hardware No-Execute.

Inspecting the state of the flags is important when attempting to manually convert addresses from Virtual Addresses to Physical. For example, since the valid bit is not set in the following invalid PTE, all of the fields are available for Windows to use. This means the information in the processor manuals doesn’t apply. Instead it is an nt!_MMPTE_SOFTWARE which references data located in the page file.

3: kd> !pte b8ae900c

               VA b8ae900c

PDE at 00000000C0602E28    PTE at 00000000C05C5748

contains 000000000B880863  contains 000B8AF500000000

pfn b880       ---DA--KWEV    not valid

                       PageFile:  0

                       Offset: b8af5

                       Protect: 0

 

For more information on the different types of invalid PTEs, refer to page 775 of “Windows Internals, 5th edition”.

Manually Converting x86 PAE Large Page Virtual Address to Physical


In part one of this blog, we manually translated a PAE  4-KByte Page Virtual Address (VA). Now we are going to manually translate a VA that represents a Large Page from our PAE system. As discussed in the previous section on PTE flags, a large page allocation means that the page size is larger and the PDE points directly to the page itself. The PDE will not point to the base of a page table.  This means that there will be one less level of tables used in the translation.  This also means that more bits will be needed to represent the offsets in the large page.  I found the following address on my system that references a Large Page, 8054099e. Once again, all the required information was obtained from the processor manuals, debugger help file, and Windows Internals Book.

1: kd> !pte 8054099e

               VA 8054099e

PDE at 00000000C0602010    PTE at 00000000C0402A00

contains 00000000004009E3  contains 0000000000000000

pfn 400        -GLDA--KWEV    LARGE PAGE pfn 540  

 

Below is the Virtual Address in binary.

1: kd> .formats 8054099e

  Binary:  10000000 01010100 00001001 10011110

 

I have split this VA into its three parts.

10                     Page Directory Pointer Table Offset

000000 010      Page Directory Table Offset

10100 00001001 10011110     This is the Offset into the large page

 

Let’s get the base of the Page Directory Pointer Table and identify which of the four entries we will need to follow.

1: kd> !dq (@cr3 & 0xffffffe0) + ( 0y10 * 8) L1

# 23406f0 00000000`06c46801

 

Now take our address from above, add our zeros, and we have the base of the Page Directory Table. Then add the offset from our virtual address and we'll dump out the PDE.

1: kd> !dq (6c46801 & 0xFFFFFF000) + ( 0y000000010 * 8) L1

# 6c46010 00000000`004009e3

 

Let’s convert the PDE to binary format to analyze the lower 12 bits. This will allow us to analyze the flags. The low twelve bits (0-11) are not used for the PFN; they are used for the flags that we discussed earlier.

1: kd> .formats (00000000`004009e3 & 0x0000000FFF)

  Binary:  00000000 00000000 00001001 11100011

 

Let’s analyze the flags from this PDE using the information we learned earlier....

·         Bit Zero is set indicating that the page is Valid and located in physical memory, so the remaining bits are interpreted as the hardware flags below

·         Bit One is set indicating that this page is Writeable (Hardware Field)

·         Bit Two is cleared indicating that this is a Kernel Mode Page

·         Bit Three is cleared indicating a Write-Back Caching policy (caching of writes to the page is enabled)

·         Bit Four is cleared indicating that caching is not disabled for the page.

·         Bit Five is set indicating this page has been Accessed

·         Bit Six is set indicating that this page is Dirty

·         Bit Seven is set indicating that this is a Large Page. This PDE points directly to a page, not a Page Table.

·         Bit Eight is set indicating other processes share this Global PDE. It is not removed upon TLB cache flushes for process context switches.

·         Bit Nine is cleared indicating this page is not Copy-On-Write

·         Bit Ten is cleared indicating this is NOT a Prototype PTE

·         Bit Eleven is set also indicating this page is Writeable (Reserved Field, See Windows Internals, Chap. 9.)

...and compare our findings to the flags output from !PTE, -GLDA--KWEV. My system doesn’t support No-Execute, so the E is also displayed.  For more information, run .hh !PTE in windbg.

We know this is a Large Page and is Valid, so we can obtain the base of our 2-MB large page (on this PAE system) from this PDE. The Intel manual states that in our PDE the low 21 bits aren’t part of the address base.

1: kd> .formats (004009e3 & 0y11111111111000000000000000000000)

  Binary:  00000000 01000000 00000000 00000000

 

 

So let’s combine the data from the PDE (the high bits shown above) with the offset from the VA (Virtual Address).

00000000 010  10100 00001001 10011110  

Now I'll remove the spaces, precede this binary value with 0y, and send it to .formats.

1: kd> .formats 0y00000000010101000000100110011110

  Hex:     0054099e

We could have obtained the same data in this manner

1: kd> ? (004009e3 & 0y11111111111000000000000000000000) + (8054099e & 0y00000000000111111111111111111111)

Evaluate expression: 5507486 = 0054099e

Now let’s dump the data in memory at this physical address

1: kd> !db 0054099e

#  54099e 33 db 8b 75 18 8b 7d 1c-0f 23 fb 0f 23 c6 8b 5d 3..u..}..#..#..]

#  5409ae 20 0f 23 cf 0f 23 d3 8b-75 24 8b 7d 28 8b 5d 2c  .#..#..u$.}(.],

#  5409be 0f 23 de 0f 23 f7 0f 23-fb e9 43 ff ff ff 8b 44 .#..#..#..C....D

Now let’s dump the same data using the virtual address

1: kd> db 8054099e

8054099e  33 db 8b 75 18 8b 7d 1c-0f 23 fb 0f 23 c6 8b 5d  3..u..}..#..#..]

805409ae  20 0f 23 cf 0f 23 d3 8b-75 24 8b 7d 28 8b 5d 2c   .#..#..u$.}(.],

805409be  0f 23 de 0f 23 f7 0f 23-fb e9 43 ff ff ff 8b 44  .#..#..#..C....D

So now you can see how I used the debugger to translate virtual addresses to physical addresses. This concludes part two of this blog; in part three we will cover x86 non-PAE virtual address translation, x64 address translation, and the TLB.

Archive of the Debug Ninja’s Twitter debug tips


Every Wednesday I post a debug tip to our twitter page at www.twitter.com/ntdebugging. This blog is an archive of these tips to allow our readers to find this information easily. We will update this blog every few weeks with the new tips; follow us on twitter if you want to see the tips as I post them.

The goal of these tips is to share debug commands, and forms of commands (parameters, flags, etc) that my colleagues and I find useful. I hope you can add these commands to your toolkit and they will help you debug more efficiently.

Tips

!thread/!process [address] e - on x64 will not show you the meaningless Args to Child information.

.frame /c [FrameNumber] - sets context to specified stack frame. Provides more reliable information than .trap on x64.

kn - Dumps call stack with frame numbers, easier than counting stacks for .frame.

.frame /r [FrameNumber] - same as .frame /c, but shows registers without changing context.

Note: With .frame /c or /r you can only trust the nonvolatile registers. See http://msdn.microsoft.com/en-us/library/9z1stfyw(VS.80).aspx for vol/nonvol regs.

k=rbp rip FrameCount - Dumps call stack starting at rbp/rip on x64. Useful when the stack is corrupt. #debug ^DN

.process/.thread /p /r [address] - sets new process context, sets .cache forcedecodeuser, and reloads user symbols. #debug ^DebugNinja

!process [address] 17 - Sets the context for this command, avoids the need for .process to see user stacks. Try !process 0 17 #debug ^DN

~~[ThreadID]s - Changes threads in user mode. Use Thread ID number from output such as !locks. Ex: ~~[1bd4]s #debug ^DN

runas /netonly /u:<account> windbg.exe - Launch windbg with domain account. Use when dbg computer isn't in domain and symbol server is. ^DN

!heap -p -a <address> - Shows information about the heap block containing <address>, even if you aren't using pageheap. #debug ^DN

ub - Unassembles starting at a location prior to your address. Accepts l<number> to specify how many instructions to go back. ub . l20 ^DN

!stacks 2 [FilterString] - Finds kernel mode call stacks that contain the FilterString in a symbol. #debug ^DN

!thread [address] 17 (or 1e on x64) - Sets context for this command, avoids the need for .thread/.process for user stacks. #debug ^DN

.hh [Text] - Opens the debugger help. [Text] is the topic to lookup in the index. Example: .hh !pte   #debug ^DN

?? can dump structs using C++ style expressions. Ex: ??((nt!_KTHREAD*)(0xfffffa800ea43bb0))->ApcState #debug ^DN

bp /t EThread - Sets a kernel mode breakpoint that only triggers when hit in the context of this thread. #debug ^DN

bp /p EProcess - Sets a kernel mode breakpoint that only triggers when hit in the context of this process. #debug ^DN

gc - If you run 'p' and hit a breakpoint, gc takes you where p would have gone if you had not hit the bp.  #debug ^DN

gu - Go until the current function returns.  Effectively this unwinds one stack frame.  #debug #windbg ^DN

pc - Steps through until the next 'call' instruction. Combine with other commands to find who returned your error> pc;p;r eax #debug ^DN

pt - Steps through until the next 'ret' instruction. Similar to gu, but pt stops on the ret and gu stops after the ret. #debug ^DN

.ignore_missing_pages 1 - suppresses the error: "Page 2a49 not present in the dump file. Type ".hh dbgerr004" for details" #debug ^DN

.exr -1 shows the most recent exception.  Useful in user dumps of crashes, especially for no execute crashes (NX/DEP). #debug ^DN

wt - Trace calls until they return to the current address. More useful with -or to get return values. Use -l for depth. ^DN #debug

.thread /w - Changes to the WOW64 32-bit context from 64-bit kernel mode. Wow64exts doesn't work in kernel mode. #debug ^DN

??sizeof(structure) - Gets the size of a structure, it's easier than counting. #debug ^DN

sxe ld:module.dll - Enables an exception which will break into the debugger when module.dll is loaded. #debug ^DN

A Case of Mysterious Memory Corruption


Hi, this is Chad from the OEM team. You may remember me from such posts as “Debugging a bluescreen at home.”

Some time ago I debugged a bluescreen on a friend’s home computer, and I thought the results were interesting enough to share. My friend had an older Windows XP system that had been experiencing random crashes for a while. I had looked at a memory dump or two, and while there wasn’t enough information to pinpoint a specific cause, I noticed what appeared to be pool corruption, so I recommended he enable Driver Verifier against all third-party drivers on the system in an effort to track down the offending driver.

(You can learn more about Driver Verifier in the following Knowledge Base article: “Using Driver Verifier to identify issues with Windows drivers for advanced users”).

With Verifier enabled the machine crashed with a bugcheck, and I debugged the memory dump. As always, I started with the “!analyze -v” command:

1: kd> !analyze -v

*******************************************************************************

*                                                                             *

*                        Bugcheck Analysis                                    *

*                                                                             *

*******************************************************************************

 

DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL (d5)

Memory was referenced after it was freed.

This cannot be protected by try-except.

When possible, the guilty driver's name (Unicode string) is printed on

the bugcheck screen and saved in KiBugCheckDriver.

Arguments:

Arg1: 88328eac, memory referenced

Arg2: 00000000, value 0 = read operation, 1 = write operation

Arg3: 86c6929b, if non-zero, the address which referenced memory.

Arg4: 00000000, (reserved)

 

Debugging Details:

------------------

 

READ_ADDRESS:  88328eac Special pool

 

FAULTING_IP:

+5283c2a1

86c6929b 8b423c          mov     eax,dword ptr [edx+3Ch]

 

MM_INTERNAL_CODE:  0

 

DEFAULT_BUCKET_ID:  DRIVER_FAULT_DELL

 

BUGCHECK_STR:  0xD5

 

PROCESS_NAME:  services.exe

 

TRAP_FRAME:  f7516bf0 -- (.trap 0xfffffffff7516bf0)

ErrCode = 00000000

eax=00000000 ebx=88328e70 ecx=00000003 edx=88328e70 esi=806e6410 edi=86c31af8

eip=86c6929b esp=f7516c64 ebp=f7516ca4 iopl=0         nv up ei pl zr na pe nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010246

86c6929b 8b423c          mov     eax,dword ptr [edx+3Ch] ds:0023:88328eac=????????

Resetting default scope

 

LAST_CONTROL_TRANSFER:  from 8052037a to 804f9f43

 

STACK_TEXT: 

f7516b70 8052037a 00000050 88328eac 00000000 nt!KeBugCheckEx+0x1b

f7516bd8 80544588 00000000 88328eac 00000000 nt!MmAccessFault+0x9a8

f7516bd8 86c6929b 00000000 88328eac 00000000 nt!KiTrap0E+0xd0

WARNING: Frame IP not in any known module. Following frames may be wrong.

f7516ca4 86c695ec 86cc71bd 86d3c318 88328e70 0x86c6929b

f7516ccc ba631459 f7516cf8 86c69605 86d3cbc8 0x86c695ec

f7516cd4 86c69605 86d3cbc8 88328e70 88328e70 sr!SrPassThrough+0x31

f7516cf8 8057f982 f7516d64 0007f964 80579e64 0x86c69605

f7516d0c 80579ec1 86d3cbc8 88328e70 86c31af8 nt!IopSynchronousServiceTail+0x70

f7516d30 8054162c 00000578 00000000 00000000 nt!NtQueryDirectoryFile+0x5d

f7516d30 7c90e514 00000578 00000000 00000000 nt!KiFastCallEntry+0xfc

0007f9ac 00000000 00000000 00000000 00000000 0x7c90e514

 

 

STACK_COMMAND:  kb

 

FOLLOWUP_IP:

nt!IopSynchronousServiceTail+70

8057f982 807d1c00        cmp     byte ptr [ebp+1Ch],0

A good way to get more information about a particular bugcheck code is to search for it in the Windows Debugger help file. Under the entry “Bug Check 0xD5: DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL” we learn that this particular bugcheck occurs when “The Driver Verifier Special Pool option has caught the driver accessing memory which was earlier freed.”

So, as a result of having previously enabled Driver Verifier, we have some memory allocations coming out of Special Pool. (Incidentally, you can use the “!verifier” command in the debugger to get a list of which drivers are being verified and various information about them.) Accesses to Special Pool memory undergo additional verification checking, and in this case, the verifier has thrown a bugcheck because the memory in question is free.

Using !pte against the address in question (88328eac) shows that, in fact, it’s not a valid virtual address at all:

1: kd> !pte 88328eac

                    VA 88328eac

PDE at C0602208            PTE at C0441940

contains 000000000676C963  contains FFFFFFFF00000000

pfn 676c      -G-DA--KWEV   not valid

                            PageFile:  0

                            Offset: ffffffff

                            Protect: 0

So, some code tried to read from a memory location that was completely invalid. This isn’t altogether uncommon, but there is something more unusual about this crash: If we look at the call stack leading up to the crash, the debugger isn’t even displaying a module name for the function that did the bad memory access! Let’s use the .trap command (helpfully supplied in the !analyze output above) to look at the instruction that actually failed, and dump the stack again.

1: kd> .trap 0xfffffffff7516bf0

ErrCode = 00000000

eax=00000000 ebx=88328e70 ecx=00000003 edx=88328e70 esi=806e6410 edi=86c31af8

eip=86c6929b esp=f7516c64 ebp=f7516ca4 iopl=0         nv up ei pl zr na pe nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010246

86c6929b 8b423c          mov     eax,dword ptr [edx+3Ch] ds:0023:88328eac=????????

 

1: kd> kb

  *** Stack trace for last set context - .thread/.cxr resets it

ChildEBP RetAddr  Args to Child              

WARNING: Frame IP not in any known module. Following frames may be wrong.

f7516ca4 86c695ec 86cc71bd 86d3c318 88328e70 0x86c6929b

f7516ccc ba631459 f7516cf8 86c69605 86d3cbc8 0x86c695ec

f7516cd4 86c69605 86d3cbc8 88328e70 88328e70 sr!SrPassThrough+0x31

f7516cf8 8057f982 f7516d64 0007f964 80579e64 0x86c69605

f7516d0c 80579ec1 86d3cbc8 88328e70 86c31af8 nt!IopSynchronousServiceTail+0x70

f7516d30 8054162c 00000578 00000000 00000000 nt!NtQueryDirectoryFile+0x5d

f7516d30 7c90e514 00000578 00000000 00000000 nt!KiFastCallEntry+0xfc

0007f9ac 00000000 00000000 00000000 00000000 0x7c90e514

So, we were doing some file I/O (NtQueryDirectoryFile) and somehow ended up running some code which is loaded in memory around 0x86c6929b. But the debugger isn’t even able to match this code up with a module name. Why not? Well, because there’s nothing in the kernel’s loaded module list that matches up with this address. (You can dump the loaded module list with the “lm” command.)

This most likely means that this module was not loaded into memory using the standard Win32 APIs, since these APIs would always add the module to the loaded list. Alternately, the loaded module list may have been corrupted or tampered with in some way.

Let’s do a “!address” on the location of this function to see if we can tell anything more:

1: kd> !address 0x86c6929b

  86c68000 - 00037000                          

          Usage       KernelSpaceUsageNonPagedPool

Now things are looking really strange indeed – this code resides in nonpaged pool memory. We do not ordinarily execute code out of pool memory. Worse, it’s not even allocated pool:

1: kd> !pool 0x86c6929b

Pool page 86c6929b region is Nonpaged pool

86c69000 is not a valid small pool allocation, checking large pool...

unable to get pool big page table - either wrong symbols or pool tagging is disabled

86c69000 is freed (or corrupt) pool

Bad previous allocation size @86c69000, last size was 0

This all looked pretty strange, so I started dumping memory to get a look at the code in question. Turns out there is a module header at the start of the previous page, at 0x86c68000. You can always identify these by the “MZ” string at the beginning of the header. (Fun trivia fact: These are the initials of Mark Zbikowski, the Microsoft developer who originally designed the .exe file format, way back in the early days of MS-DOS.)

1: kd> dc 0x86c68000

86c68000  00905a4d 00000003 00000004 0000ffff  MZ..............

86c68010  000000b8 00000000 00000040 00000000  ........@.......

86c68020  00000000 00000000 00000000 00000000  ................

86c68030  00000000 00000000 00000000 00000248  ............H...

86c68040  0eba1f0e cd09b400 4c01b821 685421cd  ........!..L.!Th

86c68050  70207369 72676f72 63206d61 6f6e6e61  is program canno

86c68060  65622074 6e757220 206e6920 20534f44  t be run in DOS

86c68070  65646f6d 0a0d0d2e 00000024 00000000  mode....$.......

If you have an executable file in memory, you can dump the header using the “!dh” command.  The resulting output is pretty long, so I’ve trimmed some of it for purposes of this post.  But there was one thing that really stood out, in the Debug Directories entry at the end of the output below:

1: kd> !dh 0x86c68000

 

File Type: EXECUTABLE IMAGE

FILE HEADER VALUES

     14C machine (i386)

       5 number of sections

4A53A574 time date stamp Tue Jul 07 12:43:48 2009

 

       0 file pointer to symbol table

       0 number of symbols

      E0 size of optional header

     102 characteristics

            Executable

            32 bit word machine

 

 

SECTION HEADER #2

  .rdata name

     106 virtual size

    1980 virtual address

     180 size of raw data

    1980 file pointer to raw data

       0 file pointer to relocation table

       0 file pointer to line numbers

       0 number of relocations

       0 number of line numbers

48000040 flags

         Initialized Data

         Not Paged

         (no align specified)

         Read Only

 

 

Debug Directories(1)

       Type       Size     Address  Pointer

       cv           5a        1a2c     1a2c     Format: RSDS, guid, 1, c:\programs\revolution6\innerdrv\objfre_w2k_x86\i386\InnerDrv.pdb

 

 

Aha! When you build a module, the compiler puts information in the header to help debuggers find the appropriate symbol file. In this case, we can see that this program seems to be called “InnerDrv.” Now we have something to go on!
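
Out of curiosity, here is roughly what a tool would do to pull that same RSDS record out of a module mapped in memory. This is only a sketch (user-mode C, no error checking, and it assumes a well-formed image laid out at its virtual addresses); the debugger’s !dh command is doing the equivalent walk for us:

#include <windows.h>
#include <stdio.h>
#include <string.h>

/* Sketch: walk a mapped image's headers to the debug directory and print
   the CodeView (RSDS) PDB path, the same string !dh displayed above. */
void PrintPdbPath(BYTE *moduleBase)
{
    IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)moduleBase;              /* "MZ" header */
    IMAGE_NT_HEADERS *nt  = (IMAGE_NT_HEADERS *)(moduleBase + dos->e_lfanew);

    IMAGE_DATA_DIRECTORY dir =
        nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_DEBUG];

    IMAGE_DEBUG_DIRECTORY *dbg =
        (IMAGE_DEBUG_DIRECTORY *)(moduleBase + dir.VirtualAddress);

    for (ULONG i = 0; i < dir.Size / sizeof(IMAGE_DEBUG_DIRECTORY); i++)
    {
        if (dbg[i].Type == IMAGE_DEBUG_TYPE_CODEVIEW)
        {
            BYTE *cv = moduleBase + dbg[i].AddressOfRawData;
            if (memcmp(cv, "RSDS", 4) == 0)
            {
                /* RSDS record: DWORD signature, GUID, DWORD age, then the path */
                printf("PDB path: %s\n",
                       (char *)(cv + sizeof(DWORD) + sizeof(GUID) + sizeof(DWORD)));
            }
        }
    }
}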

A quick Bing search for “InnerDrv.pdb” shows that this particular code is part of a rootkit known as “Pushdo”. My friend’s system had been infected by this malware. In the end, my friend opted to play it safe and simply reformat and reinstall this machine.

 


Understanding VSS in DPM


Data Protection Manager 2007 utilizes Windows’ Volume Shadow Copy Service to protect Data Sources on Production Servers, and to manage Recovery Points on the DPM server. This presentation provides step-by-step illustrations and explanations of how DPM uses VSS to backup a Protected Server’s file share and System State data sources, and then manage the resulting recovery points in DPM.

 


Storport ETW Logging to Measure Requests Made to a Disk Unit


Greetings NTDEBUGGING community, Bob here again, and today I would like to let everyone know about a new feature implemented in the STORPORT.SYS binaries released in the following KB articles:

  • KB 979764 (Windows Server 2008)
  • KB 978000 (Windows Server 2008 R2)

 

Beginning with these new versions, it is now possible to measure timing statistics for requests made to a system’s disk unit.  These measurements are taken at the lowest possible level of OS interaction with the storage adapter hardware, making it much easier to diagnose storage performance issues. The measurements are taken from the port driver, STORPORT.SYS, via ETW (Event Tracing for Windows).  Windows drivers are implemented in a layered architecture, so in our case the STORPORT driver interfaces directly with the adapter or miniport driver.  The STORPORT driver is responsible for managing requests and queues, and providing other upper level functions.  Each miniport driver is specific to a particular storage adapter card, and knows how to deliver requests to the disk unit over the transport medium, like fibre channel for instance.

Configuring and Starting STORPORT ETW Logging

To measure storage performance, the user enters a threshold value via the GUI interface. (This can be scripted as well; see KB 978000 for details.)  When an I/O time is equal to or greater than the entered threshold, an event 13 is written to the ETW log.  

Here is a walkthrough to enable this feature via the GUI:

Click Start, type “perfmon”, and press ENTER to start Performance Monitor.

image

 

Expand “Data Collector Sets” and select “Event Trace Sessions”.

image

 

Right-Click on “Event Trace Sessions” and select “New -> Data Collector Set”. 

image

 

The following dialog will appear:

image

 

Give the new data collector set a name in the dialog box.  In this example I called it “Storport Performance Logging”. Choose the “Create manually (Advanced)” option then click Next to see the following dialog.

image

 

After clicking “Add...” on the dialog above the following list of providers will appear. Select “Microsoft-Windows-StorPort” and click OK.

image

 

When the dialog below opens, select “Keywords (Any)” then click Edit.

image

 

Check the box for IOPerfNotification, and then click OK. Note:  For Windows Server 2008 SP1 and SP2, some properties may not be present (e.g. Reset, PortNotification, and PauseResume). 

image

 

 

Next, select “Filter” and click Edit…

image

 

This is where we enter our time threshold.  Remember that timing values greater than or equal to this filter value will be logged to the trace file.

 

Select “Binary” and check the “Filter enabled” box

image

In the “Filter data” field, enter the threshold value as a number of milliseconds. This must be entered in little-endian format. Please refer to this table for the most common entries.

Decimal   Hexadecimal   Binary (little endian)
1ms       1             01 00 00 00 00 00 00 00
5ms       5             05 00 00 00 00 00 00 00
10ms      A             0A 00 00 00 00 00 00 00
15ms      F             0F 00 00 00 00 00 00 00

 

Note: The “Filter type” value will always remain 0, as in the example above showing an entry for 1 ms. 

 

WARNING - The whole data line needs to be filled in when entering the threshold. For demonstration purposes here is the WRONG WAY to enter 256 ms. This will not work because of the missing 00 entries on the line.

image 

Below is the correct way to enter the value (256 ms). Note: The filter value may have to be re-entered for the next run because it can get reset.

image
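
If the threshold you want is not in the table, any method of printing a 64-bit integer as little-endian bytes will produce the string for the Filter data field. Here is a minimal sketch in C using 256 ms as the value:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t thresholdMs = 256;   /* desired threshold in milliseconds */

    /* The Filter data field takes the 64-bit value as little-endian bytes,
       so print the low-order byte first. */
    for (int i = 0; i < 8; i++)
        printf("%02X ", (unsigned)((thresholdMs >> (8 * i)) & 0xFF));
    printf("\n");                 /* 256 ms -> 00 01 00 00 00 00 00 00 */
    return 0;
}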

After clicking OK, you will see a new Event Trace session (screenshot below).  Right-Click the new Event Trace session and select “Properties”

image

 

Next we need to set a location for the trace file.  Ensure you have sufficient disk space in the target folder because logs can be many Megabytes in size depending on the number of hits. 

image

 

The target folder may be changed in the “Directory” tab

image

 

I recommend choosing a directory that is more easily located, like C:\PerfLogs, in which case you can click “Browse”

image

 

Note:  It is not recommended to store the trace file on the same storage you are troubleshooting.

image

The name for the file is selected on the “File” tab.

image

When you are done setting the file and directory information click OK.

Now right-click the log and select “Start” to begin tracing.

image

The “Status” should indicate that it is running.

image

How long to run the log depends on your issue.  You should let the log run until the issue or slowdown occurs, then check the log for delays.

Viewing the Log File

First right-click and select “Stop” to flush any cached data and stop the logging.

image

Then use Event Viewer to examine the log.

image

 

Next choose "Open the saved log…"

image

 

image

image

 

After clicking OK a popup dialog will be displayed to ask if you want to convert the log. Select “No”. Here is some sample event data that you will likely see in the log.

image

 

Example: Event ID 13

image

 

Let's examine the “Details” tab which shows information about the I/O that exceeded or equaled the threshold we had set.  If you do not set a threshold, then all I/O events are logged, but for performance reasons we recommend setting a "reasonable" threshold like 15ms.

image

Breaking down the Event Data we have …

 

Miniport Extension: (Kernel address of the miniport extension, not of use to performance logging, only debugging)

PortNumber:  This is the adapter port number (i.e. RaidPort1, etc.)

PathID:  Path number if multi-pathing is in place

TargetID:  Target LUN exposed to the Operating System

LUN:  The Logical Unit Number of the physical storage where the particular I/O is directed

*RequestDuration:  How long the request took in milliseconds (the value of note here for storage latency diagnostics)

Command:  the SCSI command sent to the unit. (Usually read or write)  This is listed in decimal.  Convert this value to hexadecimal with Calc.exe for referencing SCSI commands http://en.wikipedia.org/wiki/SCSI_command

ScsiStatus:  the status that came back from the drive

SRBStatus:  the status as returned from adapter  (see srb.h and scsi.h in the Microsoft WDK or http://en.wikipedia.org/wiki/SCSI_Status_Code)

 

Example: Command 42 (decimal) equals 2A (hex), which is WRITE(10), the 10-byte write command.

image

image

Here’s the respective SCSI command:

image

When viewing the log, the commands will be mostly reads and writes.  If I/O is determined to be slow at this level, then the issue is most likely disk or SAN related.

Hope you enjoyed the walkthrough!

-Bob

 

 

 

 

 

 

 

Uncovering How Workspaces Work in WinDbg


Author - Jason Epperly

Workspaces have always been a little confusing to me. I knew how to bend them to do what I needed to get the job done; however, they still remained a bit mysterious. Recently I decided to sort this out, just so I knew how they worked under the hood. But before I show you my investigation let's discuss the different types of workspaces. Windbg uses several built-in types including Base, User, Kernel, Remote, Processor Architecture, Per Dump, and Per Executable. It also uses named workspaces (or user-defined workspaces). When you perform a particular type of debugging (e.g. live user-mode, post-mortem dump analysis, etc.) these workspaces are combined into the final environment. Here's a diagram to illustrate the possible combinations of workspaces.

image

  • The green line is the case where WinDbg is used to open a dump file. In this scenario the Base workspace + per dump workspace is used. Note: per dump simply means each individual dump file opened gets its own workspace.
  • The blue line is the scenario where WinDbg is used to live debug a running application, using the Base+User-mode workspaces.
  • The orange line is an example of WinDbg used to perform a live kernel debug on an x86 machine. In this case windbg is using Base+Kernel+x86 workspaces.

From the diagram you can see windbg typically uses a combination of two workspaces; while live kernel debugging it uses three.

So what is in a workspace?

  • Session Information
    • All break point (bp's) and handling information for exceptions and events (sx settings).
    • All open source files.
    • All user-defined aliases.
  • Configuration settings
    • The symbol path.
    • The executable image path.
    • The source path.
    • The current source options that were set with l+, l- (Set Source Options).
    • Log file settings.
    • The COM or 1394 kernel connection settings, if the connection was started by using the graphical interface.
    • The most recent paths in each Open dialog box (except for the workspace file and text file paths, which are not saved).
    • The current .enable_unicode, .force_radix_output, and .enable_long_status settings.
  • WinDbg graphical interface.
    • The title of the WinDbg window
    • The Automatically Open Disassembly setting
    • The default font
    • The size and position of the WinDbg window on the desktop.
    • Which debugging information windows are open.
    • The size and position of each open window, including the window's size, its floating or docked status, whether it is tabbed with other windows, and all of the related settings in its shortcut menu.
    • The location of the pane boundary in the Debugger Command window and the word wrap setting in that window.
    • Whether the toolbar and status bar, and the individual toolbars on each debugging information window, are visible.
    • The customization of the Registers window.
    • The flags in the Calls window, Locals window, and Watch window.
    • The items that were viewed in the Watch window.
    • The cursor location in each Source window.

All of these settings (except for the blue ones) are applied cumulatively (Base first, then the next workspace, etc.). The blue items above are only loaded from the last workspace in the chain. To show this in action I created a simple walkthrough to illustrate the use of workspaces in the debugger.

First I opened windbg without the use of any command line options. When it opens in this dormant state (not attached to anything and with nothing opened) it is using the Base workspace. If I don't change anything (e.g. window placement) I am not prompted with any workspace dialogs when I start debugging. However, if I move the debugger's main window to any location (we will call this position 1) and then execute any of the highlighted operations below -

image

I am prompted with this dialog-

image

Choosing "Yes" on the dialog above integrates my changes into the "Base" workspace so window position 1 is now part of my Base workspace.

Now I'm going to select "Open Executable" and browse to our old faithful target binary notepad.exe. Once the binary is opened, windbg uses Base+Notepad (per Executable file). Now I'll move the debugger's main window again (we will call this position 2) and choose the option Debug > Stop debugging. Because of the window location change, I am prompted with the following-

image

If I choose 'Yes', windbg will use window position 2 any time I open the notepad executable in the future. After closing the notepad.exe executable, windbg reverts to using the Base workspace.

This time I'll actually launch notepad (not from the debugger) and attach to the running notepad.exe process with the debugger. We are now at Base+User-mode. I moved the debugger window (new position 3), selected Debug > Stop Debugging and get prompted with this dialog-

image

Choosing "Yes" will store WinDbg window position 3 in the User-mode workspace. Once I have completed this step, Windbg is again using the Base workspace because we stopped debugging.

To further illustrate workspaces I'll attach to a target Virtual Machine for Kernel Debugging but not break in. Windbg is using Base+Kernel now. I moved the window again and as soon as I break in I get this dialog-

image

I chose 'No' on the dialog because I'm getting the hang of things. If I move the window again and type qd (quit and detach) to end the current kernel debug session, I will see this dialog-

image

So before we ended the session, we were at Base+Kernel+AMD64.

Running through this exercise helped me understand why I usually create a named workspace, change all my settings, and use the command line option -W to open my workspace. It also explains why the debugger help file recommends making all the changes you need at the lowest possible level (i.e. Base first, then the others). Hopefully this will clear up some of the complexities involved with workspaces.
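
For example, after saving a named workspace (call it MyKernelKD, a name made up here purely for illustration) with my preferred window layout, paths, and breakpoints, I can start a kernel debug session from those settings with something like: windbg.exe -W MyKernelKD -k com:pipe,port=\\.\pipe\debugpipe,resync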

Hope this helps...

x64 Manual Stack Reconstruction and Stack Walking


My name is Trey Nash and I am an Escalation Engineer on the Core OS team.  My experience is as a software developer, and therefore my blog posts tend to be slanted in the direction of helping developers during the feature development, testing and the support phases.

In this installment I would like to expand a bit on a previous post of mine called Challenges of Debugging Optimized x64 Code.  In that post I discussed the nuances of the x64 calling convention (thankfully of which there is only one) and how it is used in optimized builds of software.  The calling convention is sometimes referred to as the Application Binary Interface (ABI).  In this post, I would like to discuss the x64 unwind metadata and how you can use it in the debugger to manually walk a stack.

In some cases, you may have a corrupted stack that the debugger simply cannot effectively walk for you.  This often happens because the debugger walks a stack from the top down (assuming the stack grows upwards as if it were a stack of plates on a table), and if the stack is sufficiently trashed then the debugger cannot find its bearings.  In the x86 world, a large percentage of the time, you can spot the stack frames by following the chain of base pointers and then build a crafty stack backtrace command to display the stack at some point in time.  But in the x64 calling convention there is no base pointer.  In fact, once a function’s prolog code has executed the rsp register generally never changes until the epilog code.  To read more about x64 prolog and epilog code conventions, go here.

Moreover, the syntax for creating a crafty stack backtrace command in the x64 environment is currently undocumented, and I aim to shed some light on that near the end of this blog post.

The Example Code

For this blog post I have used the following example C# code that requires the .NET 4.0 framework and can be easily built from a Visual Studio 2010 command prompt.  You can find the code below:

using System;
using System.Numerics;
using System.Threading;
using System.Threading.Tasks;
using System.Collections.Concurrent;

class EntryPoint
{
    const int FactorialsToCompute = 2000;

    static void Main() {
        var numbers = new ConcurrentDictionary<BigInteger, BigInteger>(4, FactorialsToCompute);

        // Create a factorial delegate.
        Func<BigInteger, BigInteger> factorial = null;
        factorial = (n) => ( n == 0 ) ? 1 : n * factorial(n-1);

        // Now compute the factorial of the list
        // concurrently.
        Parallel.For( 0,
                      FactorialsToCompute,
                      (i) => {
                        numbers[i] = factorial(i);
                      } );
    }
}

The spirit of this code is to concurrently compute the first 2000 factorials and store the results in a dictionary.  This code uses the new Task Parallel Library to distribute this work evenly across the multiple cores on the system.  To compile the example (assuming the code is stored in test.cs), you can execute the following command from a Visual Studio 2010 command prompt:

csc /r:system.numerics.dll test.cs

Note:    If you are using a 64-bit platform, be sure to use the x64 command prompt shortcut installed by the Visual Studio 2010 installer.

You can download a free evaluation of Visual Studio 2010 here.

x64 Unwind Metadata

So how do the debugger and functions such as RtlVirtualUnwind know how to walk the x64 stack if they cannot find a base pointer?  The secret is that they use unwind metadata that is typically baked into the Portable Executable (PE) file at link time. You can inspect this information using the /UNWINDINFO option of the command line tool dumpbin.  For example, I went to the directory on my machine which contains clr.dll (c:\Windows\Microsoft.NET\Framework\v4.0.30319) and dumped the unwind info looking for CLREvent::WaitEx, which I have pasted below:

  00013F20 000DFDB0 000DFE3C 007267D8  ?WaitEx@CLREvent@@QEAAKKW4WaitMode@@PEAUPendingSync@@@Z (public: unsigned long __cdecl CLREvent::WaitEx(unsigned long,enum WaitMode,struct PendingSync *))
    Unwind version: 1
    Unwind flags: UHANDLER
    Size of prologue: 0x20
    Count of codes: 10
    Unwind codes:
      20: SAVE_NONVOL, register=rbp offset=0xB0
      1C: SAVE_NONVOL, register=rbx offset=0xA8
      0F: ALLOC_SMALL, size=0x70
      0B: PUSH_NONVOL, register=r14
      09: PUSH_NONVOL, register=r13
      07: PUSH_NONVOL, register=r12
      05: PUSH_NONVOL, register=rdi
      04: PUSH_NONVOL, register=rsi
    Handler: 0020ADF0 __CxxFrameHandler3
    EH Handler Data: 007B3F54

I’ll get into what all of this means shortly.

Note:    The dumpbin.exe functionality is also exposed via the linker.  For example, the command “dumpbin.exe /?” is identical to “link.exe /dump /?”.

 

Within the debugger, you can find this same information for a particular function using the .fnent command.  To demonstrate, I executed the example code within a windbg instance and broke in at some random point and chose one of the threads to look at which has a stack looking like the following:

  12  Id: f80.7f0 Suspend: 1 Teb: 000007ff`fffa0000 Unfrozen
 # Child-SP          RetAddr           Call Site
00 00000000`04a51e18 000007fe`fd4e10ac ntdll!NtWaitForSingleObject+0xa
01 00000000`04a51e20 000007fe`f48bffc7 KERNELBASE!WaitForSingleObjectEx+0x79
02 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
03 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
04 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
05 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
06 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
07 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
08 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
09 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
0a 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
0b 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
0c 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
0d 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
0e 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
0f 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
10 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
11 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c
12 00000000`04a52fa0 000007ff`0015098c 0x7ff`0015098c
13 00000000`04a53150 000007ff`0015098c 0x7ff`0015098c

At first glance, it may appear that this stack is already trashed since there is no symbol information for the bottom frames in the display.  Before jumping to this conclusion, recall that this is a managed application and therefore contains JIT compiled code.  To verify that the addresses without symbol information are JIT’ed code, you can do a couple of things.

First, use the !EEHeap extension in the SOS extension to determine if these addresses reside in the JIT code heap.  Below, you can see the commands I used to both load the SOS extension and then display the Execution Engine (EE) Heap information:

0:014> .loadby sos clr

0:014> !EEHeap -loader
Loader Heap:
--------------------------------------
System Domain:     000007fef50955a0
LowFrequencyHeap:  000007ff00020000(2000:1000) Size: 0x1000 (4096) bytes.
HighFrequencyHeap: 000007ff00022000(8000:1000) Size: 0x1000 (4096) bytes.
StubHeap:          000007ff0002a000(2000:2000) Size: 0x2000 (8192) bytes.
Virtual Call Stub Heap:
  IndcellHeap:     000007ff000d0000(6000:1000) Size: 0x1000 (4096) bytes.
  LookupHeap:      000007ff000dc000(4000:1000) Size: 0x1000 (4096) bytes.
  ResolveHeap:     000007ff00106000(3a000:1000) Size: 0x1000 (4096) bytes.
  DispatchHeap:    000007ff000e0000(26000:1000) Size: 0x1000 (4096) bytes.
  CacheEntryHeap:  Size: 0x0 (0) bytes.
Total size:        Size: 0x8000 (32768) bytes.
--------------------------------------
Shared Domain:     000007fef5095040
LowFrequencyHeap:  000007ff00020000(2000:1000) Size: 0x1000 (4096) bytes.
HighFrequencyHeap: 000007ff00022000(8000:1000) Size: 0x1000 (4096) bytes.
StubHeap:          000007ff0002a000(2000:2000) Size: 0x2000 (8192) bytes.
Virtual Call Stub Heap:
  IndcellHeap:     000007ff000d0000(6000:1000) Size: 0x1000 (4096) bytes.
  LookupHeap:      000007ff000dc000(4000:1000) Size: 0x1000 (4096) bytes.
  ResolveHeap:     000007ff00106000(3a000:1000) Size: 0x1000 (4096) bytes.
  DispatchHeap:    000007ff000e0000(26000:1000) Size: 0x1000 (4096) bytes.
  CacheEntryHeap:  Size: 0x0 (0) bytes.
Total size:        Size: 0x8000 (32768) bytes.
--------------------------------------
Domain 1:          00000000003e73c0
LowFrequencyHeap:  000007ff00030000(2000:1000) 000007ff00140000(10000:5000) Size: 0x6000 (24576) bytes total, 0x1000 (4096) bytes wasted.
HighFrequencyHeap: 000007ff00032000(8000:5000) Size: 0x5000 (20480) bytes.
StubHeap:          Size: 0x0 (0) bytes.
Virtual Call Stub Heap:
  IndcellHeap:     000007ff00040000(4000:1000) Size: 0x1000 (4096) bytes.
  LookupHeap:      000007ff0004b000(2000:1000) Size: 0x1000 (4096) bytes.
  ResolveHeap:     000007ff0007c000(54000:1000) Size: 0x1000 (4096) bytes.
  DispatchHeap:    000007ff0004d000(2f000:1000) Size: 0x1000 (4096) bytes.
  CacheEntryHeap:  Size: 0x0 (0) bytes.
Total size:        Size: 0xf000 (61440) bytes total, 0x1000 (4096) bytes wasted.
--------------------------------------
Jit code heap:
LoaderCodeHeap:    000007ff00150000(40000:2000) Size: 0x2000 (8192) bytes.
Total size:        Size: 0x2000 (8192) bytes.

--------------------------------------
Module Thunk heaps:
Module 000007fee5581000: Size: 0x0 (0) bytes.
Module 000007ff000330d8: Size: 0x0 (0) bytes.
Module 000007fef06f1000: Size: 0x0 (0) bytes.
Total size:      Size: 0x0 (0) bytes.
--------------------------------------
Module Lookup Table heaps:
Module 000007fee5581000: Size: 0x0 (0) bytes.
Module 000007ff000330d8: Size: 0x0 (0) bytes.
Module 000007fef06f1000: Size: 0x0 (0) bytes.
Total size:      Size: 0x0 (0) bytes.
--------------------------------------
Total LoaderHeap size:   Size: 0x21000 (135168) bytes total, 0x1000 (4096) bytes wasted.

=======================================

I have highlighted the JIT heap information and you can see that the JIT’ed code instruction pointers in the stack fall within this range.

The second sanity check you can perform is to use a variant of the u instruction to confirm that there is a call instruction just prior to that address as shown below:

0:012> ub 0x7ff`0015098c
000007ff`0015095e 488b01          mov     rax,qword ptr [rcx]
000007ff`00150961 48898424b0000000 mov     qword ptr [rsp+0B0h],rax
000007ff`00150969 488b4108        mov     rax,qword ptr [rcx+8]
000007ff`0015096d 48898424b8000000 mov     qword ptr [rsp+0B8h],rax
000007ff`00150975 4c8d8424b0000000 lea     r8,[rsp+0B0h]
000007ff`0015097d 488b5308        mov     rdx,qword ptr [rbx+8]
000007ff`00150981 488d8c24c0000000 lea     rcx,[rsp+0C0h]
000007ff`00150989 ff5318          call    qword ptr [rbx+18h]

So at this point we have verified that we probably have a valid stack.  But how does the debugger so effectively walk this stack for us if there is no stack frame pointer?  The answer, of course, is that it uses the unwind information.

To explore the answer to that question, let’s focus on a particular frame within the stack such as frame 4 in the stack above.  The code at that frame is inside the function clr!CLREvent::WaitEx, and if we pass that to .fnent, we get the following output:

0:012> .fnent clr!CLREvent::WaitEx
Debugger function entry 00000000`04075e40 for:
(000007fe`f48bfdb0)   clr!CLREvent::WaitEx   |  (000007fe`f48bfe3c)   clr!CLREvent::Set
Exact matches:
    clr!CLREvent::WaitEx = <no type information>

BeginAddress      = 00000000`000dfdb0
EndAddress        = 00000000`000dfe3c
UnwindInfoAddress = 00000000`007267d8

Unwind info at 000007fe`f4f067d8, 20 bytes
  version 1, flags 2, prolog 20, codes a
  frame reg 0, frame offs 0
  handler routine: clr!_CxxFrameHandler3 (000007fe`f49eadf0), data 7b3f54
  00: offs 20, unwind op 4, op info 5    UWOP_SAVE_NONVOL FrameOffset: b0
  02: offs 1c, unwind op 4, op info 3    UWOP_SAVE_NONVOL FrameOffset: a8
  04: offs f, unwind op 2, op info d     UWOP_ALLOC_SMALL
  05: offs b, unwind op 0, op info e     UWOP_PUSH_NONVOL
  06: offs 9, unwind op 0, op info d     UWOP_PUSH_NONVOL
  07: offs 7, unwind op 0, op info c     UWOP_PUSH_NONVOL
  08: offs 5, unwind op 0, op info 7     UWOP_PUSH_NONVOL
  09: offs 4, unwind op 0, op info 6     UWOP_PUSH_NONVOL

 Notice that this output is virtually identical to the same information provided by dumpbin using the /UNWINDINFO option.

There are two interesting values above.  The UnwindInfoAddress value is a relative virtual address (RVA) to the unwind info that is baked into the PE file by the linker.  The address shown on the “Unwind info at” line is the actual virtual address of the unwind info and can be computed by adding the module base address shown below to the RVA for UnwindInfoAddress.

0:012> lmnm clr

start             end                 module name

000007fe`f47e0000 000007fe`f5145000   clr

By examining the PE header using !dh you can confirm that the unwind information resides in the .rdata section of the module, which I have shown below:

0:012> !dh clr

File Type: DLL
FILE HEADER VALUES
    8664 machine (X64)
       6 number of sections
4BA21EEB time date stamp Thu Mar 18 07:39:07 2010

<snip>
SECTION HEADER #2
  .rdata name
  1FC8EC virtual size
  67F000 virtual address
  1FCA00 size of raw data
  67E200 file pointer to raw data
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
40000040 flags
         Initialized Data
         (no align specified)
         Read Only
<snip>

Using the Unwind Info

Now let’s take a look at the unwind info and compare it to the prolog code of the function with which it is associated.  For convenience, I have reprinted the .fnent output for the function:

0:012> .fnent clr!CLREvent::WaitEx
Debugger function entry 00000000`04075e40 for:
(000007fe`f48bfdb0)   clr!CLREvent::WaitEx   |  (000007fe`f48bfe3c)   clr!CLREvent::Set
Exact matches:
    clr!CLREvent::WaitEx = <no type information>

BeginAddress      = 00000000`000dfdb0
EndAddress        = 00000000`000dfe3c
UnwindInfoAddress = 00000000`007267d8

Unwind info at 000007fe`f4f067d8, 20 bytes
  version 1, flags 2, prolog 20, codes a
  frame reg 0, frame offs 0
  handler routine: clr!_CxxFrameHandler3 (000007fe`f49eadf0), data 7b3f54
  00: offs 20, unwind op 4, op info 5    UWOP_SAVE_NONVOL FrameOffset: b0
  02: offs 1c, unwind op 4, op info 3    UWOP_SAVE_NONVOL FrameOffset: a8
  04: offs f, unwind op 2, op info d     UWOP_ALLOC_SMALL
  05: offs b, unwind op 0, op info e     UWOP_PUSH_NONVOL
  06: offs 9, unwind op 0, op info d     UWOP_PUSH_NONVOL
  07: offs 7, unwind op 0, op info c     UWOP_PUSH_NONVOL
  08: offs 5, unwind op 0, op info 7     UWOP_PUSH_NONVOL
  09: offs 4, unwind op 0, op info 6     UWOP_PUSH_NONVOL

The “prolog 20” value in the output above tells us that the prolog code for the function is 0x20 bytes in length. Using that information we can dump out the prolog code for the function:

0:012> u clr!CLREvent::WaitEx clr!CLREvent::WaitEx+20
clr!CLREvent::WaitEx:
000007fe`f48bfdb0 488bc4          mov     rax,rsp
000007fe`f48bfdb3 56              push    rsi
000007fe`f48bfdb4 57              push    rdi
000007fe`f48bfdb5 4154            push    r12
000007fe`f48bfdb7 4155            push    r13
000007fe`f48bfdb9 4156            push    r14
000007fe`f48bfdbb 4883ec70        sub     rsp,70h
000007fe`f48bfdbf 48c7442440feffffff mov   qword ptr [rsp+40h],0FFFFFFFFFFFFFFFEh
000007fe`f48bfdc8 48895810        mov     qword ptr [rax+10h],rbx
000007fe`f48bfdcc 48896818        mov     qword ptr [rax+18h],rbp

The list of operations in the unwind info is in the reverse order of the operations in the assembly code.  Each UWOP_PUSH_NONVOL operation in the unwind info maps to a nonvolatile register that is pushed onto the stack for safekeeping in the prolog code, and the two UWOP_SAVE_NONVOL operations map to the mov instructions that store rbx and rbp in the caller’s home space.  Now, let’s take a look at the raw stack and tie all of this information together. 

Below is the stack again; the frame we are focusing on is frame 04, whose Child-SP is 00000000`04a51f60:

0:012> kn
 # Child-SP          RetAddr           Call Site
00 00000000`04a51e18 000007fe`fd4e10ac ntdll!NtWaitForSingleObject+0xa
01 00000000`04a51e20 000007fe`f48bffc7 KERNELBASE!WaitForSingleObjectEx+0x79
02 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
03 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
04 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
05 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
06 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
07 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
08 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
09 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
0a 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
0b 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
0c 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
0d 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
0e 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
0f 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
10 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
11 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c
12 00000000`04a52fa0 000007ff`0015098c 0x7ff`0015098c
13 00000000`04a53150 000007ff`0015098c 0x7ff`0015098c

Note:    The symbols above look a little weird and may lead you to believe that WaitEx is calling itself recursively, but it is not.  It only appears that way because you need the private symbols for clr.dll to be able to see the real function name.  Only public symbols are available outside of Microsoft.

 

And below is the raw stack relevant to this frame with some annotations that I have added:

0:012> dps 00000000`04a51f60-10 L20
00000000`04a51f50  00000000`00000001
00000000`04a51f58  000007fe`f48bfe23 clr!CLREvent::WaitEx+0x5e
00000000`04a51f60  00000000`c0402388
00000000`04a51f68  00000000`c0402500
00000000`04a51f70  000007fe`f48afaf0 clr!SystemNative::ArrayCopy
00000000`04a51f78  00000000`00000182
00000000`04a51f80  00000000`04a521d0
00000000`04a51f88  000007fe`00000001
00000000`04a51f90  00000000`00000057
00000000`04a51f98  00000000`c0402398
00000000`04a51fa0  ffffffff`fffffffe
00000000`04a51fa8  007f0000`04a521d0
00000000`04a51fb0  fffff880`009ca540
00000000`04a51fb8  000007fe`f483da5b clr!SVR::heap_select::select_heap+0x1c
00000000`04a51fc0  fffff880`009ca540
00000000`04a51fc8  000007fe`fd4e18aa KERNELBASE!ResetEvent+0xa
00000000`04a51fd0  00000000`0043dc60
00000000`04a51fd8  00000000`00000178
00000000`04a51fe0  00000000`00493c10
00000000`04a51fe8  00000000`0043dc60     <-- saved rdi
00000000`04a51ff0  00000000`00000001

   *** call into clr!CLREvent::WaitEx ***

00000000`04a51ff8  000007fe`f48d51d8 clr!SVR::gc_heap::wait_for_gc_done+0x98
00000000`04a52000  00000000`00493ba0
00000000`04a52008  00000000`00493ba0     <-- saved rbx
00000000`04a52010  00000000`00000058     <-- saved rbp
00000000`04a52018  000007fe`f0711e0f System_Numerics_ni+0x21e0f
00000000`04a52020  00000000`00000178
00000000`04a52028  000007fe`f4995249 clr!SVR::GCHeap::Alloc+0xb4
00000000`04a52030  00000000`0043a140
00000000`04a52038  00000000`0043dc60
00000000`04a52040  00000000`00000000
00000000`04a52048  00000000`04a522e0

In the stack listing I have annotated how the data on the raw stack correlates to the unwind data, and you can see how the Child-SP value (00000000`04a51f60) correlates to the stack frame.

The five qwords from 00000000`04a51fd0 through 00000000`04a51ff0, including the saved rdi annotated above, are the nonvolatile registers that were pushed onto the stack in the prolog code (the UWOP_PUSH_NONVOL operations).  The 0x70 bytes starting at the Child-SP are the stack space reserved for locals and for register home space allocated for calling subroutines; in the unwind data that reservation is represented by the UWOP_ALLOC_SMALL operation.  And the saved rbx and rbp annotated above are nonvolatile registers that are stored in the home space of the previous stack frame and represented by the UWOP_SAVE_NONVOL operations stored in the unwind information.

As you can see, we have all of the information we need in the unwind data to determine which slots on the stack are used for what.  The only thing we don’t know is the partitioning of the reserved stack space for locals, which is described by the private symbol information for the clr.dll module.

Tying it all Together

.fnent produces its output directly from parsing the definition of the UNWIND_INFO structure and it even gives you the address of where that structure lives in memory.  The UNWIND_INFO structure also contains a variable number of UNWIND_CODE structures.  You can find details of the structure definitions for UNWIND_INFO and UNWIND_CODE here.  Each parsed line of unwind information in the .fnent output is backed by at least one of these structures.  In fact, you can see the correlation between the structure fields for UNWIND_CODE and the data in the .fnent output as shown below:

From UNWIND_CODE:

UBYTE        Offset in prolog
UBYTE : 4    Unwind operation code
UBYTE : 4    Operation info

From .fnent:

  05: offs b, unwind op 0, op info e     UWOP_PUSH_NONVOL

The meaning of the OpInfo (operation info) field is dependent on the UnwindOp (unwind operation code) field and is spelled out in the documentation for UNWIND_CODE.  For example, for the UWOP_PUSH_NONVOL operation shown above, the OpInfo field is an index into the following table, which indicates which nonvolatile register this push is associated with.  Note that the values in the below table are in decimal, while the .fnent values are in hex:

0         RAX
1         RCX
2         RDX
3         RBX
4         RSP
5         RBP
6         RSI
7         RDI
8 to 15   R8 to R15

Therefore, the previous line from the .fnent output represents a push operation for the r14 register (05: offs b, unwind op 0, op info e   UWOP_PUSH_NONVOL).  Looking at the assembly above, we see that the topmost UWOP_PUSH_NONVOL operation in the .fnent output correlates to the last nonvolatile register push in the prolog code (push    r14).

Note:    Remember, the push operations in the .fnent output are listed in the reverse order of where they are in the actual prolog code.  This helps the unwind code easily calculate offsets of where they should live in the stack.

 

One thing that you will notice in the x64 calling convention is that once the prolog code has executed, the value for rsp will very rarely change.  The Child-SP value in the stack displayed by the k commands is the value of rsp for that frame after the prolog code has executed.  So the offsets to access these nonvolatile registers are then applied to the Child-SP value (previously highlighted in green) to find where they live on the stack.  So, in a way, the Child-SP value acts like the base pointer we are used to on the x86 platform.

In the .fnent output above, you will also see the following:

  00: offs 20, unwind op 4, op info 5    UWOP_SAVE_NONVOL FrameOffset: b0

For UWOP_SAVE_NONVOL, you see that the .fnent output shows us the offset where we can find this register, and the register in question is represented by the OpInfo value that equates to rbp.  The offset above is applied to the Child-SP value (00000000`04a51f60 in this case) to produce the address 00000000`04a52010, which indicates that’s where we can find a saved copy of rbp.  I have also annotated where it lives in the raw stack output shown previously.

Note:    If you’re wondering why rbp is stored in the previous stack frame, check out my previous post on this topic where I describe how in optimized builds, the compiler can use the home space from the previous stack frame to save nonvolatile registers thus saving them with a MOV operation as opposed to a PUSH operation.  This is possible because in optimized builds the home space is not necessarily used to store parameters.

 

So How Does All of This Work for CLR JIT Code?

If you have asked this question, then you are definitely paying attention!  As we have shown, the compiler and linker are responsible for placing unwind info in the Portable Executable file at build time.  But what about dynamic code that is generated at runtime?  Certainly there must be unwind information for dynamically compiled code as well, otherwise there would be no way to walk the stack or unwind the stack after an exception.

As it turns out, APIs exist for this very situation, including RtlAddFunctionTable  and RtlInstallFunctionTableCallback.  In fact, the CLR uses RtlInstallFunctionTableCallback.  The generated unwind information is then rooted in a linked list where the head is at ntdll!RtlpDynamicFunctionTable.  The format of the linked list items is undocumented as it is an implementation detail, but using dbghelp.dll you can find the unwind information for a given instruction pointer if you so desire by calling SymFunctionTableAccess64.
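
To make that a little more concrete, here is a minimal sketch of how a JIT engine might register such a callback.  This is not the CLR’s actual code; the table contents, the callback logic, and the out-of-process dll name are all placeholders, and error handling is omitted:

#include <windows.h>

/* Unwind data our hypothetical JIT built for its generated code region. */
static RUNTIME_FUNCTION g_jitFunctionTable[1];

/* Called by the system (for example during RtlVirtualUnwind) to find the
   RUNTIME_FUNCTION entry that covers ControlPc. */
static PRUNTIME_FUNCTION JitUnwindCallback(DWORD64 ControlPc, PVOID Context)
{
    UNREFERENCED_PARAMETER(ControlPc);
    UNREFERENCED_PARAMETER(Context);
    return &g_jitFunctionTable[0];   /* a real JIT would search its tables here */
}

void RegisterJitUnwindInfo(DWORD64 codeBase, DWORD codeSize)
{
    /* The low two bits of the table identifier must be set for callback-based
       tables; the rest is typically derived from the code region's address. */
    RtlInstallFunctionTableCallback(
        codeBase | 0x3,          /* TableIdentifier */
        codeBase,                /* BaseAddress of the JIT'ed code region */
        codeSize,                /* Length of the region */
        JitUnwindCallback,       /* in-process callback */
        NULL,                    /* Context passed to the callback */
        L"MyJitDac.dll");        /* out-of-process callback dll (placeholder name) */
}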

In fact, if you want to see the CLR adding dynamic unwind info in action you can run the test code above under the debugger, and then at the initial breakpoint, before the application starts running, set the following breakpoint:

bu ntdll!RtlInstallFunctionTableCallback

When you let the application run you should then end up with a call stack at the breakpoint that looks like the following, which clearly shows the JIT compiler adding the unwind info to the table dynamically:

0:000> kn
 # Child-SP          RetAddr           Call Site
00 00000000`0017dca8 000007fe`f4832cc6 ntdll!RtlInstallFunctionTableCallback
01 00000000`0017dcb0 000007fe`f4831422 clr!InstallEEFunctionTable+0x77
02 00000000`0017df60 000007fe`f4828ca8 clr!StubLinker::EmitUnwindInfo+0x492
03 00000000`0017e050 000007fe`f4832c1a clr!StubLinker::EmitStub+0xe8
04 00000000`0017e0b0 000007fe`f48328e5 clr!StubLinker::LinkInterceptor+0x1ea
05 00000000`0017e160 000007fe`f4831e40 clr!CTPMethodTable::CreateStubForNonVirtualMethod+0xa35
06 00000000`0017e300 000007fe`f4832926 clr!CRemotingServices::GetStubForNonVirtualMethod+0x50
07 00000000`0017e3c0 000007fe`f48223f3 clr!MethodDesc::DoPrestub+0x38b
08 00000000`0017e4d0 000007fe`f47e2d07 clr!PreStubWorker+0x1df
09 00000000`0017e590 000007fe`f48210b4 clr!ThePreStubAMD64+0x87
0a 00000000`0017e660 000007fe`f48211c9 clr!CallDescrWorker+0x84
0b 00000000`0017e6d0 000007fe`f4821245 clr!CallDescrWorkerWithHandler+0xa9
0c 00000000`0017e750 000007fe`f4823cf1 clr!MethodDesc::CallDescr+0x2a1
0d 00000000`0017e9b0 000007fe`f49cdc3d clr!MethodDescCallSite::Call+0x35
0e 00000000`0017e9f0 000007fe`f4999f0d clr!AppDomain::InitializeDomainContext+0x1ac
0f 00000000`0017ebf0 000007fe`f49212a1 clr!SystemDomain::InitializeDefaultDomain+0x13d
10 00000000`0017f0c0 000007fe`f4923dd6 clr!SystemDomain::ExecuteMainMethod+0x191
11 00000000`0017f670 000007fe`f4923cf3 clr!ExecuteEXE+0x43
12 00000000`0017f6d0 000007fe`f49a7365 clr!CorExeMainInternal+0xc4
13 00000000`0017f740 000007fe`f8ad3309 clr!CorExeMain+0x15

But there is one more wrinkle to this picture.  We now know that by using RtlInstallFunctionTableCallback the CLR, or any other JIT engine, can register a callback that provides the unwind information at runtime.  But how does the debugger access this information?  When the debugger is broken into the process or if you are debugging a dump, it cannot execute the callback function registered with RtlInstallFunctionTableCallback.

This is where the sixth and final parameter to RtlInstallFunctionTableCallback comes into play.  By providing the OutOfProcessCallbackDll parameter, the CLR is providing a dll which the debugger can use to effectively parse through the JITer’s unwind information statically.  When inspecting which path the CLR passes for OutOfProcessCallbackDll on my machine, I see the following string:

0:000> du /c 80 000007fe`f5916160
000007fe`f5916160  "C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll"

So, the debugger uses mscordacwks.dll to statically examine the unwind info while the process is broken in the debugger or while inspecting a dump.

Note:    This is one of the many reasons why you must have a complete process dump to effectively post-mortem debug managed applications.

Using the ‘k =’ Command to Dump the Stack

If you look at the documentation for the k command, you’ll see that there is a way to override the base pointer when walking the stack.  However, the documentation leaves it a complete mystery as to how to apply this in the x64 world.  To demonstrate what I mean, consider the following stack from earlier:

0:012> kn
 # Child-SP          RetAddr           Call Site
00 00000000`04a51e18 000007fe`fd4e10ac ntdll!NtWaitForSingleObject+0xa
01 00000000`04a51e20 000007fe`f48bffc7 KERNELBASE!WaitForSingleObjectEx+0x79

02 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
03 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
04 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
05 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
06 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
07 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
08 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
09 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
0a 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
0b 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
0c 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
0d 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
0e 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
0f 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
10 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
11 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c
12 00000000`04a52fa0 000007ff`0015098c 0x7ff`0015098c
13 00000000`04a53150 000007ff`0015098c 0x7ff`0015098c

Now, imagine the top of the stack is corrupted, which I will “simulate” by treating the top two frames (00 and 01) in the stack dump above as unreadable.  Furthermore, let’s assume that we identified a frame where the stack starts to look sane again by looking at the raw stack below:

0:012> dps 00000000`04a51e90 
00000000`04a51e90  00000000`00000000
00000000`04a51e98  00000000`04a52130
00000000`04a51ea0  00000000`ffffffff
00000000`04a51ea8  00000000`ffffffff
00000000`04a51eb0  00000000`00000108
00000000`04a51eb8  000007fe`f48bffc7 clr!CLREvent::WaitEx+0x170
00000000`04a51ec0  00000000`00000000
00000000`04a51ec8  00000000`00000108
00000000`04a51ed0  000007fe`00000000
00000000`04a51ed8  00000000`00000108
00000000`04a51ee0  ffffffff`fffffffe
00000000`04a51ee8  00000000`00000001
00000000`04a51ef0  00000000`00000000
00000000`04a51ef8  000007fe`f48bff70 clr!CLREvent::WaitEx+0xf8
00000000`04a51f00  00000000`00000000
00000000`04a51f08  00000000`00493ba0

From looking at this stack, we can see the typical pattern of stack frames because the return addresses resolve to symbols of sorts.

To dump out the corrupted stack, here is the undocumented syntax for the x64 platform:

k = <rsp> <rip> <frame_count>

<rsp> is the stack pointer to start with.  You want to use the stack pointer that would have been in rsp when that function was active.  Remember, typically rsp does not change after the function prolog code completes.  Therefore, if you pick the stack pointer just below the return address, you should be good.

<rip> should be an instruction pointer from within the function that was executing at the time the <rsp> value above was in play.  In this case, the return address directly above <rsp> in the raw stack output (000007fe`f48bffc7, stored at 00000000`04a51eb8 and resolving to clr!CLREvent::WaitEx+0x170) comes from that function.  This piece of information is critical so that the debugger can find the unwind metadata for the function that was current at this point in the stack.  Without it, the debugger cannot walk the stack.

Armed with this information, you can construct a k command to dump the stack starting from this frame as shown below:

0:012> kn = 00000000`04a51ec0 000007fe`f48bffc7 10
 # Child-SP          RetAddr           Call Site
00 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
01 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
02 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
03 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
04 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
05 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
06 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
07 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
08 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
09 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
0a 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
0b 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
0c 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
0d 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
0e 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
0f 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c

Note:    The frame count in the above k expression is required.  That is the way the debugger engine distinguishes between this variant of the command (with an overridden rip) and the documented form of k that does not provide an overridden rip.

Conclusion

Since the x64 calling convention does not utilize a base pointer (among other things), we need some extra information to effectively walk the stack.  That extra information comes in the form of unwind metadata and is generated by the compiler and linker for static code and baked into the portable executable file.  If you happen to code in assembly language, there are various macros that you must use to decorate your assembly code so that the assembler can generate the proper unwind metadata.  For dynamically compiled code, that information is instead provided at runtime by registering a callback with the system.  Knowing this information is critical if you encounter a corrupted stack and must piece it together manually.  In such situations you’ll need to know how to dig out the unwind metadata manually and use it to effectively reconstruct the call stack.

That said, you could spare yourself some effort and use the undocumented variant of the k command described above to dump the stack starting at any frame. :)

Happy debugging everyone!

"The example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious.  No association with any real company, organization, product, domain name, email address, logo, person, places, or events is intended or should be inferred."

‘Hidden’ ETW Stack Trace Feature – Get Stacks All Over the Place!


I’ve heard Mark R. say that “sometimes Microsoft gave me a gift” when coming across an interesting feature not in mainstream documentation.  How true it is that the odd find can often prove very valuable.  Such was the case when I read Bruce Dawson’s Stack Walking in Xperf blog entry.

Here’s the excerpt that blew me away:

You can also record call stacks for manifest-based ETW (Event Tracing for Windows) events, but the syntax is quite different, and it only works on Windows 7 and above. When you specify your ETW provider to xperf after “-on” you can specify extra parameters after the provider name, separated by colons. These are flags, a level, and, for manifest-based providers, a list of extra data to record, which can include call stacks. You can leave the flags and level fields blank and just specify ‘stack’ (in single quotes) after three colons like this:

     xperf -on Latency -stackwalk profile -start browse -on Microsoft-IE:::'stack'  
     rem Your scenario goes here...     
     xperf -stop browse -stop -d mytrace.etl

This sounds rather ordinary at first, but looking at the syntax you may realize that this means that for just about every manifest-based provider (Microsoft-*) one can generate a stack trace on every ETW event therein.  This is really cool because if you are only interested in stacks, you don’t have to monkey around with any other trace level or flags, which is often the most frustrating part of manual ETW tracing.  Recall from Ivan’s blog that there are more than 600 inbox providers to explore, so we can go crazy here exploring components!  Let’s demonstrate.

First, remember to use the Win7 SDK version of XPerf.  The XPerf (WPT) installer *.msi files are found in the Bin directory after installing the Win32 Development Tools, so at present you have to do a two-step to get the tools installed.

Let’s list all the providers…

C:\utils\xperf7sdk>xperf -providers | findstr "Microsoft-"
      .
       .
       .
       93c05d69-51a3-485e-877f-1806a8731346:0x0001000000000000:0x4       : Microsoft-Windows-LUA
       9580d7dd-0379-4658-9870-d5be7d52d6de:0x0001000000000000:0x4       : Microsoft-Windows-WLAN-AutoConfig
       959f1fac-7ca8-4ed1-89dc-cdfa7e093cb0:0x0001000000000000:0x4       : Microsoft-Windows-HealthCenterCPL
       96ac7637-5950-4a30-b8f7-e07e8e5734c1:0x0001000000000000:0x4       : Microsoft-Windows-Kernel-BootDiagnostics
       96f4a050-7e31-453c-88be-9634f4e02139:0x0001000000000000:0x4       : Microsoft-Windows-UserPnp
       98583af0-fc93-4e71-96d5-9f8da716c6b8:0x0001000000000000:0x4       : Microsoft-Windows-Dwm-Udwm
       a50b09f8-93eb-4396-84c9-dc921259f952:0x0001000000000000:0x4       : Microsoft-Windows-Diagnosis-MSDE
       ad5162d8-daf0-4a25-88a7-01cbeb33902e:0x0001000000000000:0x4       : Microsoft-Windows-WPDClassInstaller
       ae4bd3be-f36f-45b6-8d21-bdd6fb832853:0x0001000000000000:0x4       : Microsoft-Windows-Audio
       af2e340c-0743-4f5a-b2d3-2f7225d215de:0x0001000000000000:0x4       : Microsoft-Windows-Netshell
       b03d4051-3564-4e93-93db-3c34f1b5b503:0x0001000000000000:0x4       : Microsoft-Windows-Diagnosis-TaskManager
       b1f90b27-4551-49d6-b2bd-dfc6453762a6:0x0001000000000000:0x4       : Microsoft-Windows-PowerCpl
       b2a40f1f-a05a-4dfd-886a-4c4f18c4334c:0x0001000000000000:0x4       : Microsoft-PerfTrack-IEFRAME
       c89b991e-3b48-49b2-80d3-ac000dfc9749:0x0001000000000000:0x4       : Microsoft-Windows-Documents
       c9bdb4eb-9287-4c8e-8378-6896f0d1c5ef:0x0001000000000000:0x4       : Microsoft-Windows-HomeGroup-ProviderService
       dbe9b383-7cf3-4331-91cc-a3cb16a3b538:0x0001000000000000:0x4       : Microsoft-Windows-Winlogon
       de513a55-c345-438b-9a74-e18cac5c5cc5:0x0001000000000000:0x4       : Microsoft-Windows-Help
       ded165cf-485d-4770-a3e7-9c5f0320e80c:0x0001000000000000:0x4       : Microsoft-Windows-DeviceUx
       e978f84e-582d-4167-977e-32af52706888:0x0001000000000000:0x4       : Microsoft-Windows-TabletPC-InputPanel
       ed6b3ba8-95b2-4cf5-a317-d4af7003884c:0x0001000000000000:0x4       : Microsoft-Windows-Sidebar
       f3f14ff3-7b80-4868-91d0-d77e497b025e:0x0001000000000000:0x4       : Microsoft-Windows-WMP
       fbcfac3f-8459-419f-8e48-1f0b49cdb85e:0x0001000000000000:0x4       : Microsoft-Windows-NetworkProfile
       ffdb9886-80f3-4540-aa8b-b85192217ddf:0x0001000000000000:0x4       : Microsoft-PerfTrack-MSHTML

We have a lot to play with as you can see, but let’s say you are trying to track down a socket leak.  Since we know sockets are implemented in afd.sys, let’s search for that provider…

C:\utils\xperf7sdk>xperf -providers | findstr "AFD"        
e53c6823-7bb8-44bb-90dc-3f86090d48a6                              : Microsoft-Windows-Winsock-AFD


Now we can form the XPerf syntax to trace socket operations during the leak.  “AFDSession” is just a session name we are making up; the rest is the prescribed syntax.  To explain the syntax: we are starting two sessions at the same time here, the kernel logger session plus AFDSession.  Similarly, we stop both sessions at the same time and then merge both into a single AFD.etl file.

xperf -on Latency -stackwalk profile -start AFDSession -on Microsoft-Windows-Winsock-AFD:::'stack'
<reproduce leak>
xperf -stop AFDSession -stop -d AFD.etl

Run xperfview AFD.etl and switch on the “Stack Counts by Type” graph.  If you don’t see this frame, make sure you are using the Win7 SDK version of XPerf.

[Screenshot: xperfview frame list with the Stack Counts by Type graph enabled]

The beauty here is the Stack Counts by Type frame, which allows us to explore which stack traces were collected.  Notice all the AFD stack count sources!

[Screenshot: Stack Counts by Type graph showing the various AFD stack count event types]

Let’s say we have the simple case of a massive number of Creates happening, and thus handle counts rocketing up (leaking) for a given process, perhaps exhausting all ephemeral ports on the system and ultimately resulting in any number of hang symptoms.

Simply filter the Events for AfdCreate as follows.

[Screenshot: filtering the stack count events down to AfdCreate]

Now, we have a filtered Frame of just those stack counts….

[Screenshot: Stack Counts by Type frame filtered to just the AfdCreate events]

You can imagine that if there were a lot more creates here, the frequency distribution shown above would be much higher, but for the sake of exploring the feature, let’s look at this small number of creates anyway.  In fact, let’s select them all by choosing Select View in the context menu, which selects all of these events across the trace sample.

[Screenshot: choosing Select View from the graph context menu]

The frame area is now highlighted in light purple, indicating the selected region.

[Screenshot: the graph with the selected area highlighted in light purple]

Because we are interested in stacks, make sure you have configured your symbol path and checked Load Symbols.

(be patient, this could take a while to build up your symbol cache)

[Screenshot: configuring the symbol path in xperfview]

[Screenshot: the Load Symbols option in xperfview]

Next, invoke the Summary Table

[Screenshot: opening the Summary Table]

Click the flyout control on the left to select only the Creates again.

[Screenshot: the summary table flyout control used to select only the Create events]

Next, make it easy on yourself by Filtering on the interesting process only.

[Screenshot: filtering the summary table to the process of interest]

After, much better.

[Screenshot: the summary table filtered down to the single process]

Now, here is the fruit of our labor in the full stack trace inside iexplore.exe allocating the socket handles.

[Screenshot: the full call stack inside iexplore.exe allocating the socket handles]

Continued…

[Screenshot: the call stack, continued]

 

Consider that what we see above is read from the top down.  Iexplore.exe is doing this work through wininet, jsproxy, jscript, etc., all the way down to the call to ws2_32.dll!socket, which ultimately creates the socket and returns our file handle.  I’ve seen so many cases of ephemeral port exhaustion (essentially a handle leak against sockets) that could have been solved in minutes with this tracing!  Keep in mind that this data isn’t perfectly suited for tracking handle leaks, especially small ones, since there’s no accounting for outstanding handles.  However, sometimes the leak is egregious enough that you just need to look at activity over time to be pretty certain of the culprit.  Really, what we are doing here is just probing a particular ETW-instrumented API, which is incredibly powerful.

Typically, on 64-bit computers, you cannot capture the kernel stack in certain contexts when page faults are not allowed. To enable walking the kernel stack on AMD64, set the DisablePagingExecutive Memory Management registry value to 1 and reboot. The DisablePagingExecutive registry value is located under the following registry key: HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management
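
For example, from an elevated command prompt (the value takes effect after the reboot):

reg add "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" /v DisablePagingExecutive /t REG_DWORD /d 1 /f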

In summary, this isn’t really a hidden feature since it’s documented in EnableTraceEx2’s Enable Parameters parameter, but I hope with this additional context you see what powerful analysis is possible.  Previously, this would have required live debugging a process, so now we have another example of the power of XPerf to decode and view key system activity, non-invasively!
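
For reference, here is roughly what enabling stack capture looks like programmatically with EnableTraceEx2.  This is a minimal sketch, not how xperf itself is implemented: it assumes you already have a session handle from StartTrace, omits error handling, and uses the Microsoft-Windows-Winsock-AFD provider GUID from the listing above.

#include <windows.h>
#include <evntrace.h>
#include <evntcons.h>

// Microsoft-Windows-Winsock-AFD, as shown in the xperf -providers output above.
static const GUID AfdProviderGuid =
    { 0xe53c6823, 0x7bb8, 0x44bb, { 0x90, 0xdc, 0x3f, 0x86, 0x09, 0x0d, 0x48, 0xa6 } };

// Enable the provider on an existing ETW session and ask for a call stack to
// be captured with every event (the programmatic equivalent of :::'stack').
ULONG EnableAfdWithStacks(TRACEHANDLE session)
{
    ENABLE_TRACE_PARAMETERS params;
    ZeroMemory(&params, sizeof(params));
    params.Version        = ENABLE_TRACE_PARAMETERS_VERSION;
    params.EnableProperty = EVENT_ENABLE_PROPERTY_STACK_TRACE;

    return EnableTraceEx2(session,
                          &AfdProviderGuid,
                          EVENT_CONTROL_CODE_ENABLE_PROVIDER,
                          TRACE_LEVEL_VERBOSE,
                          0,         // MatchAnyKeyword
                          0,         // MatchAllKeyword
                          0,         // Timeout
                          &params);
}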

-Tate
