Monday, August 4, 2014

User space access on Windows, Mac OS X and Linux.

I made these notes just for the record, to have everything in one place.

Windows

All access to user-mode address space must be done by code protected by SEH; for more information see "Structured Exception Handling" on MSDN.

__try
{
   AccessUserModeSpace();
}
__except( ExceptionFilter() )
{
    ExceptionHandler();
}

Below is a discussion of the internal SEH implementation in the compiler and the kernel.

Windows 32 bit

A plethora of information is available online, e.g. "A Crash Course on the Depths of Win32 Structured Exception Handling". The basic idea is to place SEH frame registrations, each holding filter and handler addresses, on the stack and to link them in a list whose head is at fs:PcExceptionList. Here fs is an IA-32 segment register that addresses per-thread information, i.e. it is reloaded on each thread switch. When an exception happens on a memory access, the kernel exception handler calls RtlDispatchException, which knows where to find the list of registered exception filters and handlers.
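
As a rough illustration of the frame-based scheme, the sketch below models the per-thread list in plain user-space C. The names Registration, exception_list, push_frame, pop_frame, dispatch and demo_dispatch are hypothetical stand-ins, not the real Windows structures or routines, and real SEH also involves filters, unwinding and stack validation that are left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Exception code used by the demo (the value of STATUS_ACCESS_VIOLATION). */
#define MODEL_ACCESS_VIOLATION 0xC0000005u

/* Hypothetical stand-in for an SEH registration record: a link to the
   enclosing (outer) record plus a handler that returns 1 if it takes
   the exception and 0 if it declines. */
typedef struct Registration {
    struct Registration *next;
    int (*handler)(unsigned int code);
} Registration;

/* Models the per-thread list head kept at fs:PcExceptionList. */
static Registration *exception_list = NULL;

/* Entering a protected region: the new frame becomes the list head. */
static void push_frame(Registration *r, int (*h)(unsigned int)) {
    r->handler = h;
    r->next = exception_list;
    exception_list = r;
}

/* Leaving a protected region: unlink the innermost frame. */
static void pop_frame(void) {
    assert(exception_list != NULL);
    exception_list = exception_list->next;
}

/* Models the dispatcher: walk from the innermost frame outwards and
   return the depth of the frame that took the exception, or -1. */
static int dispatch(unsigned int code) {
    int depth = 0;
    for (Registration *r = exception_list; r != NULL; r = r->next, depth++) {
        if (r->handler(code))
            return depth;
    }
    return -1;
}

static int takes_access_violation(unsigned int code) {
    return code == MODEL_ACCESS_VIOLATION;
}

static int takes_nothing(unsigned int code) {
    (void)code;
    return 0;
}

/* Two nested frames living on the stack, as in real frame-based SEH:
   the inner one declines everything, the outer one takes access violations. */
int demo_dispatch(unsigned int code) {
    Registration outer, inner;
    push_frame(&outer, takes_access_violation);
    push_frame(&inner, takes_nothing);
    int depth = dispatch(code);
    pop_frame();
    pop_frame();
    return depth;
}
```

In this model demo_dispatch(MODEL_ACCESS_VIOLATION) returns 1, the depth of the outer frame, while any other code walks off the end of the list and returns -1.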

Windows 64 bit

When the compiler meets a __try / __except construct it adds an entry to the exception table that contains the start and end addresses of the protected region and the offsets of the filter and the handler; this is similar to the method used on Linux. The details can be found in "Exceptional Behavior - x64 Structured Exception Handling"; below I describe what is not mentioned in that article.

The linker adds an entry for every function. This entry contains the information for RtlVirtualUnwind, which is called by RtlDispatchException. This makes it possible to call any function inside a __try block while the kernel is still able to find its way up the call stack to the place where an exception filter and handler are registered. A 32-bit kernel does this by looking at fs:PcExceptionList, but the kernel doesn't have such a luxury in 64-bit mode.

You can look at these entries with the .fnent command.

For example, here is the entry for memcpy, which is used to copy memory to and from user space. This function does not register any exception handlers, but the kernel must be able to unwind the call stack to find one, so only unwinding information is present

 3: kd> .fnent nt!memcpy
Debugger function entry 00000000`003da6e8 for:
(fffff800`9046b340)   nt!memcpy   |  (fffff800`904d7c50)   nt! ?? ::FNODOBFM::`string'
Exact matches:
    nt!memcpy (<no parameter info>)
    nt!memmove (<no parameter info>)

BeginAddress      = 00000000`00055340
EndAddress        = 00000000`00055679
UnwindInfoAddress = 00000000`00201380

Unwind info at fffff800`90617380, 4 bytes
  version 1, flags 0, prolog 0, codes 0


Now look at a bigger function, MmLockAndCopyMemory. As you can see, the unwinding information is pretty extensive, and there is still no exception filter or handler

2: kd> .fnent nt!MmLockAndCopyMemory
Debugger function entry 00000000`003da6e8 for:
(fffff800`9098f920)   nt!MmLockAndCopyMemory   |  (fffff800`9098fb30)   nt!MmStoreRegister
Exact matches:
    nt!MmLockAndCopyMemory (<no parameter info>)

BeginAddress      = 00000000`00579920
EndAddress        = 00000000`00579b30
UnwindInfoAddress = 00000000`0025406c

Unwind info at fffff800`9066a06c, 20 bytes
  version 2, flags 0, prolog 1c, codes e
  00: offs a, unwind op 6, op info 0 UWOP_EPILOG Length: a. Flags: 0
  01: offs 10, unwind op 6, op info 0 UWOP_EPILOG Offset from end: 10 (FFFFF8009098FB20)
  02: offs 1c, unwind op 4, op info 6 UWOP_SAVE_NONVOL FrameOffset: 70 reg: rsi.
  04: offs 1c, unwind op 4, op info 5 UWOP_SAVE_NONVOL FrameOffset: 68 reg: rbp.
  06: offs 1c, unwind op 4, op info 3 UWOP_SAVE_NONVOL FrameOffset: 60 reg: rbx.
  08: offs 1c, unwind op 2, op info 5 UWOP_ALLOC_SMALL.
  09: offs 18, unwind op 0, op info f UWOP_PUSH_NONVOL reg: r15.
  0a: offs 16, unwind op 0, op info e UWOP_PUSH_NONVOL reg: r14.
  0b: offs 14, unwind op 0, op info d UWOP_PUSH_NONVOL reg: r13.
  0c: offs 12, unwind op 0, op info c UWOP_PUSH_NONVOL reg: r12.
  0d: offs 10, unwind op 0, op info 7 UWOP_PUSH_NONVOL reg: rdi.

Now look at a function that definitely contains code that accesses user space, e.g. NtReadFile. Here you can see _C_specific_handler as the handler routine that takes care of calling the filter and the handler

3: kd> .fnent nt!NtReadFile
Debugger function entry 00000000`003da6e8 for:
(fffff800`908b6690)   nt!NtReadFile   |  (fffff800`908b6f50)   nt!FsRtlCancellableWaitForMultipleObjects
Exact matches:
    nt!NtReadFile (<no parameter info>)

BeginAddress      = 00000000`004a0690
EndAddress        = 00000000`004a0f50
UnwindInfoAddress = 00000000`00242114

Unwind info at fffff800`90658114, 24 bytes
  version 2, flags 1, prolog 21, codes b
  handler routine: nt!_C_specific_handler (fffff800`905039a0), data 4
  00: offs c, unwind op 6, op info 0 UWOP_EPILOG Length: c. Flags: 0
  01: offs b6, unwind op 6, op info 3 UWOP_EPILOG Offset from end: 3b6 (FFFFF800908B6B9A)
  02: offs 21, unwind op 1, op info 0 UWOP_ALLOC_LARGE FrameOffset: c0.
  04: offs 1a, unwind op 0, op info f UWOP_PUSH_NONVOL reg: r15.
  05: offs 18, unwind op 0, op info e UWOP_PUSH_NONVOL reg: r14.
  06: offs 16, unwind op 0, op info d UWOP_PUSH_NONVOL reg: r13.
  07: offs 14, unwind op 0, op info c UWOP_PUSH_NONVOL reg: r12.
  08: offs 12, unwind op 0, op info 7 UWOP_PUSH_NONVOL reg: rdi.
  09: offs 11, unwind op 0, op info 6 UWOP_PUSH_NONVOL reg: rsi.
  0a: offs 10, unwind op 0, op info 3 UWOP_PUSH_NONVOL reg: rbx.
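
The table-based lookup can be sketched in user-space C using the three entries dumped above. FunctionEntry and lookup_function_entry are hypothetical names; the assumption that the table is sorted by begin address and searched by binary search is mine, and this is only a model of the idea, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The same three fields that .fnent prints; all values are offsets
   from the image base (RVAs). */
typedef struct {
    uint32_t begin_address;
    uint32_t end_address;          /* first byte after the function */
    uint32_t unwind_info_address;
} FunctionEntry;

static_assert(sizeof(FunctionEntry) == 12, "three 32-bit fields");

/* The entries from the dumps above, sorted by begin_address. */
static const FunctionEntry function_table[] = {
    { 0x00055340, 0x00055679, 0x00201380 },  /* nt!memcpy */
    { 0x004a0690, 0x004a0f50, 0x00242114 },  /* nt!NtReadFile */
    { 0x00579920, 0x00579b30, 0x0025406c },  /* nt!MmLockAndCopyMemory */
};

/* Binary search for the entry whose [begin, end) range contains rva.
   Returns NULL when no function covers the address. */
const FunctionEntry *lookup_function_entry(uint32_t rva) {
    size_t lo = 0;
    size_t hi = sizeof function_table / sizeof function_table[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const FunctionEntry *e = &function_table[mid];
        if (rva < e->begin_address)
            hi = mid;
        else if (rva >= e->end_address)
            lo = mid + 1;
        else
            return e;
    }
    return NULL;
}
```

For instance, the epilog address FFFFF800908B6B9A from the NtReadFile dump is RVA 0x4a0b9a (the image base implied by the dumps is fffff800`90416000), and looking it up yields the NtReadFile entry, whose unwind information names the handler routine.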


To be continued (Mac OS X and Linux)...

Wednesday, July 9, 2014

If your GWInstek GSP-730 becomes unresponsive.

If a GWInstek GSP-730 spectrum analyzer becomes unresponsive, this is probably because it is trying to restore a USB connection that was interrupted by the last shutdown while it was connected to a PC. To bring it back to life, just connect it to a PC before powering it on and disconnect it from the PC before switching it off.

Thursday, April 10, 2014

The trouble with DPC


A strange thing attracted my attention when I was going through a disassembled KiExecuteAllDpcs. There was no memory fence or barrier on the path that synchronizes DPC insertion into a CPU's DPC list with DPC execution by KiExecuteAllDpcs, which might create problems on high-performance CPUs with out-of-order memory reads.

The DPC insertion is synchronized via the DpcData member of the KDPC structure. Before a DPC is inserted in the list, DpcData is set to a nonzero value by an interlocked operation, and when the DPC is removed from the list it is set to zero again. If DpcData is nonzero, the DPC has already been inserted in a CPU core's DPC list and KeInsertQueueDpc does nothing. You should understand that this might be another CPU core's DPC list, as each CPU core has its own DPC list in the per-core KiProcessorBlock.

Let's look at a disassembled KeInsertQueueDpc and KiExecuteAllDpcs for 64-bit Windows 8. We need to find where DpcData is changed. I took the easy way out by setting a breakpoint on KeInsertQueueDpc and then setting a breakpoint on memory writes to DpcData for the DPC that is passed as a parameter in the RCX register.

So, I started with a breakpoint on KeInsertQueueDpc

kd> bp nt!KeInsertQueueDpc

The breakpoint was hit

kd> g
Breakpoint 0 hit
nt!KeInsertQueueDpc:
fffff802`d34f6420 4889542410      mov     qword ptr [rsp+10h],rdx

The DPC address is the first parameter of KeInsertQueueDpc, so it was in the RCX register

kd> r rcx
rcx=ffffe00002385830

Let's look at the _KDPC structure stored in the kernel symbol file

kd> dt nt!_KDPC
   +0x000 TargetInfoAsUlong : Uint4B
   +0x000 Type             : UChar
   +0x001 Importance       : UChar
   +0x002 Number           : Uint2B
   +0x008 DpcListEntry     : _SINGLE_LIST_ENTRY
   +0x010 ProcessorHistory : Uint8B
   +0x018 DeferredRoutine  : Ptr64     void
   +0x020 DeferredContext  : Ptr64 Void
   +0x028 SystemArgument1  : Ptr64 Void
   +0x030 SystemArgument2  : Ptr64 Void
   +0x038 DpcData          : Ptr64 Void

DpcData is at offset 7*8 == 0x38 bytes from the beginning of the structure. Below is a DPC dump

kd> dq ffffe00002385830
ffffe000`02385830  00000000`02850113 00000000`00000000
ffffe000`02385840  00000000`00000020 fffff800`00cb5700
ffffe000`02385850  00000000`00000005 00000000`b0f66d37
ffffe000`02385860  00000000`01cf5513 00000000`00000000
ffffe000`02385870  00000001`40ee0088 ffffe000`02385878
ffffe000`02385880  ffffe000`02385878 00000000`23ba44fc
ffffe000`02385890  ffffe000`04e74950 fffff802`d3773f48
ffffe000`023858a0  11360faa`f68c8fd6 0000000a`00000000
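
The offset arithmetic can be double-checked with a user-space mirror of the layout printed by dt. KdpcModel and dpc_data_address are hypothetical names, the pointer-typed members are simplified to void *, and the sketch assumes a 64-bit target with 8-byte pointers; it is not the WDK definition of KDPC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the `dt nt!_KDPC` output; offsets are noted per member. */
typedef struct KdpcModel {
    union {
        uint32_t TargetInfoAsUlong;     /* +0x000 */
        struct {
            uint8_t  Type;              /* +0x000 */
            uint8_t  Importance;        /* +0x001 */
            uint16_t Number;            /* +0x002 */
        };
    };
    void    *DpcListEntry;              /* +0x008, a single "next" pointer */
    uint64_t ProcessorHistory;          /* +0x010 */
    void    *DeferredRoutine;           /* +0x018 */
    void    *DeferredContext;           /* +0x020 */
    void    *SystemArgument1;           /* +0x028 */
    void    *SystemArgument2;           /* +0x030 */
    void    *DpcData;                   /* +0x038 */
} KdpcModel;

/* Compile-time checks against the offsets reported by the debugger. */
static_assert(offsetof(KdpcModel, DeferredRoutine) == 0x18, "DeferredRoutine offset");
static_assert(offsetof(KdpcModel, SystemArgument1) == 0x28, "SystemArgument1 offset");
static_assert(offsetof(KdpcModel, DpcData) == 7 * 8, "DpcData offset");

/* Address of the DpcData member for a DPC located at the given address. */
uint64_t dpc_data_address(uint64_t dpc_address) {
    return dpc_address + offsetof(KdpcModel, DpcData);
}
```

For the DPC at ffffe00002385830 this gives ffffe00002385868, and the value at offset 0x18 in the dump (fffff800`00cb5700) is the deferred routine.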

Just FYI, this was a DPC from the TCP subsystem

kd> u fffff800`00cb5700
tcpip!TcpPeriodicTimeoutHandler:
fffff800`00cb5700 48895c2408      mov     qword ptr [rsp+8],rbx
fffff800`00cb5705 4889742418      mov     qword ptr [rsp+18h],rsi
fffff800`00cb570a 48897c2420      mov     qword ptr [rsp+20h],rdi
fffff800`00cb570f 4889542410      mov     qword ptr [rsp+10h],rdx
fffff800`00cb5714 55              push    rbp
fffff800`00cb5715 4154            push    r12
fffff800`00cb5717 4155            push    r13
fffff800`00cb5719 4156            push    r14

Next I wanted to track where DpcData would be changed, so a breakpoint on memory access was set

kd> ba w8 ffffe00002385830+8*7

kd> bl
 0 d fffff802`d34f6420     0001 (0001) nt!KeInsertQueueDpc
 1 e ffffe000`02385868 w 8 0001 (0001) 

I did not need the breakpoint on KeInsertQueueDpc, so it was disabled.
Eventually the memory watch breakpoint was hit

kd> g
Breakpoint 1 hit
nt!KeInsertQueueDpc+0xf4:
fffff802`d34f6514 753a            jne     nt!KeInsertQueueDpc+0x130 (fffff802`d34f6550)

Let's check the IRQL

kd> !irql
Debugger saved IRQL for processor 0x0 -- 15 (HIGH_LEVEL)

This is the highest level on the 64-bit x86 architecture, so the scheduler and all interrupts were disabled on the current CPU core.

OK, I found the place where DpcData was changed to a nonzero value by InterlockedCompareExchangePointer. Let's look at a couple of instructions before and after this point. InterlockedCompareExchangePointer is implemented by the lock cmpxchg instruction, which is a memory barrier by itself, so the compiler does not rearrange instructions around this call, and the CPU retires all instructions before the barrier and does not allow out-of-order execution across it.

nt!KeInsertQueueDpc+0xf4

fffff802`d34f6504 443b4d24        cmp     r9d,dword ptr [rbp+24h]
fffff802`d34f6508 480f45c8        cmovne  rcx,rax
fffff802`d34f650c 33c0            xor     eax,eax
fffff802`d34f650e f0480fb14e38    lock cmpxchg qword ptr [rsi+38h],rcx <======= here!
fffff802`d34f6514 753a            jne     nt!KeInsertQueueDpc+0x130 (fffff802`d34f6550)
fffff802`d34f6516 ff4718          inc     dword ptr [rdi+18h]
fffff802`d34f6519 ff471c          inc     dword ptr [rdi+1Ch]
fffff802`d34f651c 48895628        mov     qword ptr [rsi+28h],rdx
fffff802`d34f6520 4c896e30        mov     qword ptr [rsi+30h],r13
fffff802`d34f6524 4584e4          test    r12b,r12b

Execution was continued to find a place where the DpcData would be changed back to zero, here it is,

kd> g
Breakpoint 1 hit
nt!KiExecuteAllDpcs+0x10d:
fffff802`d34dbc6d ff4b18          dec     dword ptr [rbx+18h]

As you might know, WinDbg points to the instruction that follows the one that triggered the breakpoint. Let's look around this point.

nt!KiExecuteAllDpcs+0x10d: 

fffff802`d34dbc59 4c8b5720        mov     r10,qword ptr [rdi+20h]
fffff802`d34dbc5d 4c8b4728        mov     r8,qword ptr [rdi+28h]
fffff802`d34dbc61 4c8b4f30        mov     r9,qword ptr [rdi+30h]
fffff802`d34dbc65 4c8b6f38        mov     r13,qword ptr [rdi+38h]
fffff802`d34dbc69 48897738        mov     qword ptr [rdi+38h],rsi  <======== here!
fffff802`d34dbc6d ff4b18          dec     dword ptr [rbx+18h] ds:002b:ffffd000`207bbf18=00000001

Check that the RSI register was zero, as it was written to DpcData

kd> r rsi
rsi=0000000000000000

and again check the IRQL

kd> !irql
Debugger saved IRQL for processor 0x5 -- 2 (DISPATCH_LEVEL)

This was a DPC retirement by an idle thread, which is pretty common on a system that is not busy

kd> kn
 # Child-SP          RetAddr           Call Site
00 ffffd000`207e39a0 fffff802`d34db9f0 nt!KiExecuteAllDpcs+0x10d
01 ffffd000`207e3af0 fffff802`d35d27ea nt!KiRetireDpcList+0xd0
02 ffffd000`207e3c60 00000000`00000000 nt!KiIdleLoop+0x5a

Now it is time to digest the information. If the above assembly instructions are translated into a high-level language like C, we get

BOOLEAN
KeInsertQueueDpc (
    __inout PRKDPC Dpc,
    __in_opt PVOID SystemArgument1,
    __in_opt PVOID SystemArgument2
    )
{
    // prevent any interrupt on this CPU core
    KeRaiseIrql(HIGH_LEVEL, &OldIrql);
    .....
    // a per CPU core lock for DPC queue
    KiAcquireSpinLock(&PerCpuDpcData->DpcLock);
    .....
    // acquire the DPC by changing DpcData, interlocked functions pose a memory barrier
    if (InterlockedCompareExchangePointer(&Dpc->DpcData, PerCpuDpcData, NULL) == NULL) {

        .......
        /*
        fffff802`d34f651c 48895628        mov     qword ptr [rsi+28h],rdx
        fffff802`d34f6520 4c896e30        mov     qword ptr [rsi+30h],r13
        */
        Dpc->SystemArgument1 = SystemArgument1;
        Dpc->SystemArgument2 = SystemArgument2;

        // insert in the list
        InsertHeadList(&PerCpuDpcData->DpcListHead, &Dpc->DpcListEntry);
   } else {
        // do nothing
   }

   KiReleaseSpinLock(&PerCpuDpcData->DpcLock);
   KeLowerIrql(OldIrql);
}

VOID
KiExecuteAllDpcs()
{
    // a per CPU core lock for DPC queue
    KeAcquireSpinLockAtDpcLevel(&PerCpuDpcData->DpcLock);

     ......
    RemoveEntryList(Entry);
    Dpc = CONTAINING_RECORD(Entry, KDPC, DpcListEntry);
    ......
/*
fffff802`d34dbc5d 4c8b4728        mov     r8,qword ptr [rdi+28h]
fffff802`d34dbc61 4c8b4f30        mov     r9,qword ptr [rdi+30h]
*/
    SystemArgument1 = Dpc->SystemArgument1;
    SystemArgument2 = Dpc->SystemArgument2;

/*
fffff802`d34dbc65 4c8b6f38        mov     r13,qword ptr [rdi+38h]
fffff802`d34dbc69 48897738        mov     qword ptr [rdi+38h],rsi
*/
    Dpc->DpcData = NULL;

    KeReleaseSpinLockFromDpcLevel(&PerCpuDpcData->DpcLock);
}

Let's discuss what we have at this point. Actually, the synchronisation between KeInsertQueueDpc and KiExecuteAllDpcs does not look flawless. The problem is out-of-order data access and compiler optimisation. In the current build the compiler did not rearrange the memory accesses in KiExecuteAllDpcs, but nothing prevents it from scheduling Dpc->DpcData = NULL before SystemArgument1 = Dpc->SystemArgument1 or SystemArgument2 = Dpc->SystemArgument2, or both, in the future, as there is no data or control dependency between them. The situation is exacerbated by out-of-order memory access in modern CPUs. On an architecture that allows a load to be reordered after a later store, it is legitimate for a CPU to write NULL to Dpc->DpcData before reading Dpc->SystemArgument1 or Dpc->SystemArgument2, as again there is no data or control dependency here and the memory operations may become visible out of program order. In that case, if there is a concurrent KeInsertQueueDpc on another CPU core, there is a probability ( though extremely low ) that KiExecuteAllDpcs picks up a wrong SystemArgument1 or SystemArgument2 that has just been rewritten by KeInsertQueueDpc after InterlockedCompareExchangePointer successfully changed DpcData to a nonzero value. Below is a table of memory reordering in some architectures. IA-64 and ARM look pretty scary for this case; Microsoft is lucky at least to have dropped IA-64 support.

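Memory ordering in some architectures
Type                                  | Alpha | ARMv7 | PA-RISC | POWER | SPARC RMO | SPARC PSO | SPARC TSO | x86 | x86 oostore | AMD64 | IA-64 | zSeries
Loads reordered after loads           | Y     | Y     | Y       | Y     | Y         |           |           |     | Y           |       | Y     |
Loads reordered after stores          | Y     | Y     | Y       | Y     | Y         |           |           |     | Y           |       | Y     |
Stores reordered after stores         | Y     | Y     | Y       | Y     | Y         | Y         |           |     | Y           |       | Y     |
Stores reordered after loads          | Y     | Y     | Y       | Y     | Y         | Y         | Y         | Y   | Y           | Y     | Y     | Y
Atomic reordered with loads           | Y     | Y     |         | Y     | Y         |           |           |     |             |       | Y     |
Atomic reordered with stores          | Y     | Y     |         | Y     | Y         | Y         |           |     |             |       | Y     |
Dependent loads reordered             | Y     |       |         |       |           |           |           |     |             |       |       |
Incoherent instruction cache pipeline | Y     | Y     |         | Y     | Y         | Y         | Y         | Y   | Y           |       | Y     | Y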

The corrected code should use some type of memory fence or barrier, like the code below. I hope Microsoft did this for ARM, but in any case this should have been done for x86, x86-64 and IA 64, especially the last one.

VOID
KiExecuteAllDpcs()
{
    // a per CPU core lock for DPC queue
    KeAcquireSpinLockAtDpcLevel(&PerCpuDpcData->DpcLock);

     ......
    RemoveEntryList(Entry);
    Dpc = CONTAINING_RECORD(Entry, KDPC, DpcListEntry);
    ......
/*
fffff802`d34dbc5d 4c8b4728        mov     r8,qword ptr [rdi+28h]
fffff802`d34dbc61 4c8b4f30        mov     r9,qword ptr [rdi+30h]
*/
    SystemArgument1 = Dpc->SystemArgument1;
    SystemArgument2 = Dpc->SystemArgument2;

    KeMemoryBarrier();

/*
fffff802`d34dbc65 4c8b6f38        mov     r13,qword ptr [rdi+38h]
fffff802`d34dbc69 48897738        mov     qword ptr [rdi+38h],rsi
*/
    Dpc->DpcData = NULL;

    KeReleaseSpinLockFromDpcLevel(&PerCpuDpcData->DpcLock);
}

The rule for such synchronization is to employ a memory barrier when releasing a resource if the resource was acquired by an interlocked operation.

Acquire( SomeValue )
{
    // an interlocked operation is a barrier for both the CPU and the compiler
    if( InterlockedCompareExchange( &Data->Resource, 0x1, 0x0 ) == 0 ){

        // the resource has been acquired
        Data->Value = SomeValue;
    }
}
.........

Release()
{
    // get the value when the resource is held
    StackedValue = Data->Value;

    // tell a CPU to retire all preceding operations, this is also a compiler barrier
    KeMemoryBarrier();

    // release the resource
    Data->Resource = 0x0;

    // Do something with the local StackedValue
    foo( StackedValue );
}
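
For experimenting outside the kernel, the same rule can be sketched with C11 atomics. This is a user-space analogue under my own assumptions, not the Windows implementation: the Data type and the acquire and release functions are hypothetical, a sequentially consistent compare-exchange stands in for InterlockedCompareExchange, and a release store stands in for KeMemoryBarrier followed by a plain store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int resource;   /* 0 = free, 1 = held */
    int        value;      /* protected by resource */
} Data;

/* Try to take the resource and, on success, publish a value.
   Returns false if somebody else already holds the resource. */
bool acquire(Data *data, int some_value) {
    int expected = 0;
    /* Sequentially consistent read-modify-write: a barrier for both
       the CPU and the compiler. */
    if (atomic_compare_exchange_strong(&data->resource, &expected, 1)) {
        data->value = some_value;
        return true;
    }
    return false;
}

/* Read the protected value, then give the resource back. */
int release(Data *data) {
    assert(atomic_load(&data->resource) == 1);

    /* Get the value while the resource is still held. */
    int stacked_value = data->value;

    /* Release store: neither the compiler nor the CPU may move the
       load above past this store. */
    atomic_store_explicit(&data->resource, 0, memory_order_release);

    return stacked_value;
}
```

A release store is weaker than a full barrier but is enough for this pattern, because the only requirement is that the read of the protected value completes before other cores can observe the resource as free.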

P.S. If you look at ReactOS, the open source Windows NT clone, the situation there is no better. Look at KiRetireDpcList ( there is no KiExecuteAllDpcs, which was introduced somewhere around Vista )
 /* Clear its DPC data and save its parameters */
                Dpc->DpcData = NULL;
                DeferredRoutine = Dpc->DeferredRoutine;
                DeferredContext = Dpc->DeferredContext;
                SystemArgument1 = Dpc->SystemArgument1;
                SystemArgument2 = Dpc->SystemArgument2;
After Dpc->DpcData = NULL, a concurrent KeInsertQueueDpc can start writing new values to Dpc->SystemArgument1 and Dpc->SystemArgument2 before KiRetireDpcList has fetched their consistent values. Without a memory barrier there is also room for compiler optimisation by rearranging the store and the loads, and for CPU reordering.

Saturday, April 5, 2014

An ideal filter is hard to build

Why is it hard to build an ideal digital filter, e.g. an ideal lowpass filter?




From the mathematical point of view such a filter has an impulse response that extends infinitely in the time domain, i.e. it is not causal. What does this mean for a software or hardware implementation? Consider two close frequencies, one below the cutoff frequency f and one above it, f-d and f+d respectively, where d is infinitesimally small, and the filter must remove the second one from its output. If a digital filter samples the input signal, then for such close frequencies it sees the same values for an infinitely long time, as the difference is below its resolution threshold defined by its internal ALU implementation. The filter sees a single frequency, and as d approaches zero it takes an infinite time before the two signals drift far enough apart for the filter to recognise the difference, i.e. you must have an infinite delay in the filter.
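
The argument can be put into formulas. The first line is the standard impulse response of an ideal lowpass filter with cutoff f; it is nonzero for t < 0 and never dies out, which is the non-causality mentioned above. The rest is a back-of-the-envelope estimate under my own assumptions: two unit-amplitude tones starting in phase at t = 0, and a resolution threshold called \varepsilon here.

```latex
% Ideal lowpass filter with cutoff f: frequency response and impulse response
H(\nu) = \begin{cases} 1, & |\nu| \le f \\ 0, & |\nu| > f \end{cases}
\qquad\Longrightarrow\qquad
h(t) = \int_{-f}^{f} e^{j 2\pi \nu t}\, d\nu = \frac{\sin(2\pi f t)}{\pi t}

% Difference between the tone just above and the tone just below the cutoff
\sin\bigl(2\pi (f+d)\, t\bigr) - \sin\bigl(2\pi (f-d)\, t\bigr)
  = 2 \cos(2\pi f t)\, \sin(2\pi d t)

% For t >= 0 the difference is bounded by 4 \pi d t, so it can exceed
% the resolution threshold \varepsilon only after
\bigl| 2 \cos(2\pi f t)\, \sin(2\pi d t) \bigr| \le 4\pi d\, t
\qquad\Longrightarrow\qquad
t \ge \frac{\varepsilon}{4\pi d} \;\longrightarrow\; \infty
\quad \text{as } d \to 0
```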


Saturday, March 29, 2014

A promiscuous MmProbeAndLockPages

Consider a scenario in which a Windows driver, sweeping through a process address space, somehow gets a pointer to a valid address range in the system address space and wants it to be accessible at IRQL greater than or equal to DISPATCH_LEVEL, i.e. when the scheduler is not available and swapped-out pages can't be retrieved from a backing store. The solution is to lock the pages by calling MmProbeAndLockPages. Is this a bulletproof solution? The answer is NO. The driver will cause intermittent system crashes with a stack like the one shown below

nt!KeBugCheckEx
nt!MiBadRefCount
nt!MiFreePoolPages
nt!ExFreePoolWithTag
<.....>

The reason is that the system pool returns a page to the list of free pages and the system does not expect the page to be locked. This happens when the last allocation from a page has been released, so the page no longer contains valid allocations and can be returned to the system's list of free pages. The system assumes that all pool allocations that have been locked are unlocked by calling MmUnlockPages before being freed by calling ExFreePool.

Tuesday, March 25, 2014

MacOS X hibernate path and preemption

Below is a Mac OS X 10.9 call stack captured while processing a request to hibernate

mach_kernel`IOPMrootDomain::pmStatsRecordEvent() + 227 at IOPMrootDomain.cpp:6969
mach_kernel`hibernate_machine_init + 3160 at IOHibernateIO.cpp:3112
mach_kernel`acpi_sleep_kernel() + 503 at acpi.c:320
AppleACPIPlatform`AppleACPIPlatformExpert::sleepPlatform() + 443
AppleACPIPlatform`AppleACPICPU::haltCPU() + 117
mach_kernel`IOCPUSleepKernel() + 764 at IOCPU.cpp:403
mach_kernel`IOPMrootDomain::powerChangeDone() + 531 at IOPMrootDomain.cpp:2256
mach_kernel`IOService::all_done() + 1221 at IOServicePM.cpp:4269
mach_kernel`IOService::servicePMRequest(IOPMRequest*, IOPMWorkQueue*) [inlined] IOService::OurChangeFinish() + 8 at IOServicePM.cpp:4736
mach_kernel`IOService::servicePMRequest() + 2773 at IOServicePM.cpp:7337
mach_kernel`IOPMWorkQueue::checkRequestQueue() + 52 at IOServicePM.cpp:8236
mach_kernel`IOPMWorkQueue::checkForWork() + 127 at IOServicePM.cpp:8296
mach_kernel`IOWorkLoop::runEventSources() + 258 at IOWorkLoop.cpp:367
mach_kernel`IOWorkLoop::threadMain() + 195 at IOWorkLoop.cpp:395

The call to IOPMrootDomain::pmStatsRecordEvent is made with preemption disabled, i.e. thread scheduling is not allowed, but IOPMrootDomain::pmStatsRecordEvent calls IORegistryEntry::setProperty, which acquires a mutex, and mutex acquisition can block the thread and call the scheduler in case of contention for the mutex. So, is this a bug in the Apple code? I do not know, but there is a workaround in the Apple code that keeps a debug kernel build from panicking when IORegistryEntry::setProperty is called with preemption disabled: the check cmpl $0,%gs:CPU_HIBERNATE bypasses the whole CHECK_PREEMPTION_LEVEL macro if the CPU is in hibernation mode. It makes sense if there are no other active CPUs in the system.

An excerpt from i386_locks.s
/*
 * If one or more simplelocks are currently held by a thread,
 * an attempt to acquire a mutex will cause this check to fail
 * (since a mutex lock may context switch, holding a simplelock
 * is not a good thing).
 */
#if MACH_RT
#define CHECK_PREEMPTION_LEVEL() \
cmpl $0,%gs:CPU_HIBERNATE ; \
jne 1f ; \
cmpl $0,%gs:CPU_PREEMPTION_LEVEL ; \
je 1f ; \
ALIGN_STACK() ; \
movl %gs:CPU_PREEMPTION_LEVEL, %eax ; \
LOAD_ARG1(%eax) ; \
LOAD_STRING_ARG0(2f) ; \
CALL_PANIC() ; \
hlt ; \
.data ; \
2: String "preemption_level(%d) != 0!" ; \
.text ; \
1:
#else /* MACH_RT */
#define CHECK_PREEMPTION_LEVEL()
#endif /* MACH_RT */
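
The control flow of the macro reads more easily in C. The sketch below is a user-space paraphrase: CpuData and check_preemption_level_panics are hypothetical names, the two fields stand for the per-CPU values read through %gs, and the function merely reports whether the macro would reach CALL_PANIC().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state standing in for the %gs-relative fields. */
typedef struct {
    int cpu_hibernate;      /* %gs:CPU_HIBERNATE */
    int preemption_level;   /* %gs:CPU_PREEMPTION_LEVEL */
} CpuData;

/* Returns true when CHECK_PREEMPTION_LEVEL() would panic. */
bool check_preemption_level_panics(const CpuData *cpu) {
    assert(cpu != NULL);

    /* cmpl $0,%gs:CPU_HIBERNATE ; jne 1f
       A hibernating CPU skips the whole check. */
    if (cpu->cpu_hibernate != 0)
        return false;

    /* cmpl $0,%gs:CPU_PREEMPTION_LEVEL ; je 1f
       Preemption enabled: acquiring a mutex is fine. */
    if (cpu->preemption_level == 0)
        return false;

    /* CALL_PANIC() with "preemption_level(%d) != 0!" */
    return true;
}
```

So a nonzero preemption level panics on a normally running CPU but is silently tolerated on the hibernate path, which is the case shown in the call stack above.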

Monday, March 24, 2014

Windows Object Manager, Paged Pool and elevated IRQL

Surprisingly, the Windows 8 Object Manager allocates some objects from the paged pool. This means that ObReferenceObject and ObDereferenceObject can't be safely called at DISPATCH_LEVEL, as the actual maximum IRQL becomes APC_LEVEL if an object is allocated from the paged pool. For example, a token object might be from the paged pool, as the !pool command shows

1: kd> !pool ffffc00002b73770
Pool page ffffc00002b73770 region is Paged pool
.....
*ffffc00002b73740 size:  8c0 previous size:  1c0  (Allocated) *Toke
Pooltag Toke : Token objects, Binary : nt!se

The object itself ( a pretty large pointer count, but nevertheless this is a valid object )

1: kd> !object ffffc00002b737a0
Object: ffffc00002b737a0  Type: (ffffe00000153db0) Token
    ObjectHeader: ffffc00002b73770 (new version)
    HandleCount: 33  PointerCount: 131067

Driver Verifier was active and cleared the valid bit from a PTE mapping the paged pool's page on which the object was allocated

1: kd> !pte ffffc00002b737a0
                                           VA ffffc00002b737a0
PXE at FFFFF6FB7DBEDC00    PPE at FFFFF6FB7DB80000    PDE at FFFFF6FB700000A8    PTE at FFFFF6E000015B98
contains 000000000134F863  contains 0000000001DCE863  contains 00000001257C2863  contains FB40000129FE9882
pfn 134f      ---DA--KWEV  pfn 1dce      ---DA--KWEV  pfn 1257c2    ---DA--KWEV  not valid
                                                                                  Transition: 129fe9
                                                                                  Protect: 4 - ReadWrite

The PTE was marked as invalid though the physical page actually contains valid data and has not been reused or swapped out. The valid bit will be restored by the page fault handler when it processes a page fault ( this is called a soft page fault, as there is no I/O from the backing store ), but calling ObDereferenceObject on this object at DISPATCH_LEVEL would crash the system

TRAP_FRAME:  ffffd000201fc800 -- (.trap 0xffffd000201fc800)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000005 rbx=0000000000000000 rcx=ffffc00002b737a0
rdx=0000000000000005 rsi=0000000000000000 rdi=0000000000000000
rip=fffff803b20565a3 rsp=ffffd000201fc990 rbp=fffff800017bf594
 r8=0000000000000007  r9=fffff800017debac r10=0000000000000000
r11=ffffd000201fcc70 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po nc
nt!ObfDereferenceObject+0x23:
fffff803`b20565a3 f0480fc15ed0    lock xadd qword ptr [rsi-30h],rbx ds:ffffffff`ffffffd0=????????????????
Resetting default scope

LAST_CONTROL_TRANSFER:  from fffff803b21f10ea to fffff803b216f890

STACK_TEXT:  
 nt!DbgBreakPointWithStatus
nt!KiBugCheckDebugBreak+0x12
nt!KeBugCheck2+0x8ab
nt!KeBugCheckEx+0x104
nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x23a
nt!ObfDereferenceObject+0x23
<here is an offending driver ))))>

Monday, March 17, 2014

A spectrum of a 2.4 GHz WiFi

Just for fun, below is a picture of a 2.4 GHz WiFi spectrum captured by a GWInstek GSP-730 spectrum analyzer at my house



The bar lines don't represent WiFi channels; the channels are much wider, at least 20 MHz. These bar lines are just an artifact of a sweep-tuned spectrum analyzer capturing a fast-changing signal. I believe that the GWInstek GSP-730 is an analogue sweep-tuned analyzer that does not perform a DFT / FFT of the signal.




Thursday, February 27, 2014

What is in a name? ( of a process )

What does PsGetCurrentProcess return?

The answer: it returns the thread's process. But do you know that a thread might have TWO processes? The first one is the parent process that created the thread and the other one is a process to which the thread has been attached by KeStackAttachProcess. Which one does PsGetCurrentProcess return? It returns the attached process if there is one, or the parent process otherwise.

So this raises a question: how to get the parent process? The answer is IoThreadToProcess.

The other question: what does "attach to a process" mean? It means that the thread operates in the address space of the attached process ( i.e. the PDE and CR3 are changed ). Any function that operates on the UserMode part of the address space will therefore change or fetch data in the attached process. The notion of an "attached process" is meaningful only when a thread is executing in KernelMode, as the system space is nearly completely shared between all processes and changing the page tables does not have a serious impact on accessing the system space.
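
The selection rule can be modeled in a few lines of user-space C. Process, Thread and the four functions below are hypothetical stand-ins that only mimic the behaviour described in this post for PsGetCurrentProcess, IoThreadToProcess, KeStackAttachProcess and KeUnstackDetachProcess; they say nothing about the real kernel structures.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Process {
    const char *name;
} Process;

typedef struct Thread {
    Process *parent;     /* the process that created the thread */
    Process *attached;   /* non-NULL only while the thread is attached */
} Thread;

/* Models PsGetCurrentProcess: the attached process wins if present. */
Process *current_process(const Thread *t) {
    return t->attached != NULL ? t->attached : t->parent;
}

/* Models IoThreadToProcess: always the parent process. */
Process *thread_to_process(const Thread *t) {
    return t->parent;
}

/* Models KeStackAttachProcess. */
void stack_attach(Thread *t, Process *target) {
    assert(target != NULL);
    t->attached = target;
}

/* Models KeUnstackDetachProcess. */
void unstack_detach(Thread *t) {
    t->attached = NULL;
}
```

While a thread is attached, current_process and thread_to_process return different processes; after the detach they agree again.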

The notion of attaching is much more profound in 32-bit Mac OS X or iOS, where every process has access to the full 4 GB virtual address space and there is no division into system and user space. When a thread switches to kernel mode the CR3 register is reloaded, so the 32-bit Mac OS X kernel cannot access user space through a pointer; to access user space the kernel ( or a kernel module ) calls functions that do so by switching CR3. In 64-bit Mac OS X or iOS the address space is divided into user space and kernel space, so access by pointer becomes possible, though it is discouraged by Apple and will crash the system in debug mode, where CR3 is reloaded when a thread enters kernel mode.

Wednesday, February 26, 2014

Outswapped kernel stack

You definitely know that a kernel stack can be swapped out if some conditions are met. One such condition is waiting with WaitMode set to UserMode.

If an event is allocated on a kernel stack, it can be swapped out when a driver does something like this

GetOperationCompletionStatus( ... )
{
    KEVENT    Event;
    KIRQL     OldIrql;
    NTSTATUS  WaitStatus;
    BOOLEAN   Wait = FALSE;

    KeInitializeEvent( &Event, SynchronizationEvent, FALSE );

    KeAcquireSpinLock( &Lock, &OldIrql );
    {
       if( FALSE == Operation->Completed  ){
          Operation->CompletionEvent = &Event;
          Wait = TRUE;
       }
    }
    KeReleaseSpinLock( &Lock, OldIrql );

    while( Wait ){

       // allow a user to wake up the thread when terminating the
       // process, but note that the stack might be outswapped
       // when the thread is blocked waiting for the event
       WaitStatus = KeWaitForSingleObject( &Event,
                                           Executive,
                                           UserMode,
                                           FALSE,
                                           NULL );

       if( STATUS_SUCCESS != WaitStatus )
       {
          KeAcquireSpinLock( &Lock, &OldIrql );
          {
             // if NULL then go back to waiting as
             // there is an ongoing completion
             if( Operation->CompletionEvent ){
                Operation->CompletionEvent = NULL;
                Wait = FALSE;
             }
          }
          KeReleaseSpinLock( &Lock, OldIrql );

       } else {
          // the event was signaled by NotifyOfCompletion
          Wait = FALSE;
       }
    }
}


NotifyOfCompletion()
{
    KIRQL  OldIrql;

    KeAcquireSpinLock( &Lock, &OldIrql );
    {
        Operation->Completed = TRUE;
        if( Operation->CompletionEvent  ){

           // the following call sometimes crashes the system
           // when it tries to access an outswapped page
           KeSetEvent( Operation->CompletionEvent,
                       IO_NO_INCREMENT,
                       FALSE );
           Operation->CompletionEvent = NULL;
       }
    }
    KeReleaseSpinLock( &Lock, OldIrql );
}

The reason to do this is that you want to be slightly more gentle and allow a user to terminate a waiting thread; this is a common scenario for distributed file systems, where the response time might be up to minutes.

The problem with the above code is that a kernel stack can be swapped out while waiting in KeWaitForSingleObject with the wait mode set to UserMode. The call to KeSetEvent tries to access the event on the outswapped stack when the IRQL is DISPATCH_LEVEL. This has nothing to do with the call to KeAcquireSpinLock; the same happens even if you call KeSetEvent without raising the IRQL, as KeSetEvent elevates the IRQL itself when working with the event.

If you check the event address on an outswapped stack with WinDbg you will see

1: kd> !pte 0xaf792cac
                    VA af792cac
PDE at C0602BD8            PTE at C057BC90
contains 000000005402F863  contains 00000000A5129BE2
pfn 5402f     ---DA--KWEV  not valid
                            Transition: a5129

                            Protect: 1f - Outswapped kernel stack

The solution for the above example is to allocate the event from the nonpaged pool.

Monday, February 10, 2014

How IoCancelFileOpen works

WDK says

"IoCancelFileOpen sets the FO_FILE_OPEN_CANCELLED flag in the Flags member of the file object that FileObject points to. This flag indicates that the IRP_MJ_CREATE request has been canceled, and an IRP_MJ_CLOSE request will be issued for this file object."

But this does not tell the full story. First of all, IoCancelFileOpen issues IRP_MJ_CLEANUP and then sets the FO_FILE_OPEN_CANCELLED flag. Also, IoCancelFileOpen checks that no handles have been created for the file object; if this check fails the system crashes itself with KeBugCheck. Here you should say

  ... Wait a minute! What about IRP_MJ_CLEANUP being sent? Should it be sent only for objects with handles?

The answer is NO. The system always sends IRP_MJ_CLEANUP for all file objects. If no handles were created for a file object, the IRP_MJ_CLEANUP request is sent by IopDeleteFile ( called by ObDereferenceObject ) before it issues IRP_MJ_CLOSE. Now you can see why IoCancelFileOpen does not send IRP_MJ_CLOSE: it is sent by IopDeleteFile, called by ObDereferenceObject.

Let's now turn our focus to FO_FILE_OPEN_CANCELLED. What is this flag for? This flag is used by IoCreateFile when it decides how to reclaim file object resources after an error is returned by IoCallDriver. If the flag is set, ObDereferenceObject or IopDeleteFile is called for the file object, so the file system and attached filters receive a close request. If the flag is not set, the DeviceObject member of the file object is set to NULL, so the close and cleanup requests are not sent when ObDereferenceObject or IopDeleteFile is called to reclaim the memory occupied by the file object. The latter is the case of an error returned by the lowest driver in the stack, which is the file system driver.

Below is a call stack for create request processing when an attached filter called IoCancelFileOpen, which resulted in a close request being sent from ObfDereferenceObject

nt!IofCallDriver+0x3f
nt!IopDeleteFile+0xef
nt!ObpRemoveObjectRoutine+0x43
nt!ObfDereferenceObjectWithTag+0x5c
nt!ObfDereferenceObject+0xd
nt!IopParseDevice+0x167a
nt!ObpLookupObjectName+0x251
nt!ObOpenObjectByName+0xfe
nt!IopCreateFile+0x2a5
nt!IoCreateFileEx+0x88
nt!IoCreateFileSpecifyDeviceObjectHint+0x59