Thursday, April 10, 2014

The trouble with DPC


   A strange thing attracted my attention when I was going through a disassembled  KiExecuteAllDpcs. There was no memory fence or barrier on a path that synchronizes DPC insertion in the CPU's DPC list and DPC execution by KiExecuteAllDpcs which might create some problems on high performance CPUs with out-of-order memory read access.

   The DPC insertion is synchronized via DpcData member of the KDPC structure. Before a DPC is inserted in the list it is set to non zero value by interlocked operation, so when a DPC is removed from a list it is set to zero again. If DpcData is non zero a DPC had been already inserted in a CPU core's DPC list and KeInsertQueueDpc does nothing,. You should understand that this might be another CPU core's DPC list as each CPU core has its own DPC list in per CPU core's KiProcessorBlock.

   Lets look on a disassembled KeInsertQueueDpc and KiExecuteAllDpcs for 64 bit Windows 8. We need to find where DpcData is changed. I took an easy way out by setting a breakpoint on KeInsertQueueDpc, then set a breakpoint on memory write to DpcData for a DPC that is provided as a parameter in RCX register.

So, I started with a breakpoint on KeInsertQueueDpc

kd> bp nt!KeInsertQueueDpc

The breakpoint was hit

kd> g
Breakpoint 0 hit
nt!KeInsertQueueDpc:
fffff802`d34f6420 4889542410      mov     qword ptr [rsp+10h],rdx

The DPC address is the first parameter for KeInsertQueueDpc, so it was in RCX register

kd> r rcx
rcx=ffffe00002385830

Lets look on the _KDPC structure stored in a kernel symbols file

kd> dt nt!_KDPC
   +0x000 TargetInfoAsUlong : Uint4B
   +0x000 Type             : UChar
   +0x001 Importance       : UChar
   +0x002 Number           : Uint2B
   +0x008 DpcListEntry     : _SINGLE_LIST_ENTRY
   +0x010 ProcessorHistory : Uint8B
   +0x018 DeferredRoutine  : Ptr64     void
   +0x020 DeferredContext  : Ptr64 Void
   +0x028 SystemArgument1  : Ptr64 Void
   +0x030 SystemArgument2  : Ptr64 Void
   +0x038 DpcData          : Ptr64 Void

The DpcData is on 7*8 == 0x38 bytes offset from the structure's beginning. Below is a DPC dump

kd> dq ffffe00002385830
ffffe000`02385830  00000000`02850113 00000000`00000000
ffffe000`02385840  00000000`00000020 fffff800`00cb5700
ffffe000`02385850  00000000`00000005 00000000`b0f66d37
ffffe000`02385860  00000000`01cf5513 00000000`00000000
ffffe000`02385870  00000001`40ee0088 ffffe000`02385878
ffffe000`02385880  ffffe000`02385878 00000000`23ba44fc
ffffe000`02385890  ffffe000`04e74950 fffff802`d3773f48
ffffe000`023858a0  11360faa`f68c8fd6 0000000a`00000000

Just FYI, this was a DPC from the TCP subsystem

kd> u fffff800`00cb5700
tcpip!TcpPeriodicTimeoutHandler:
fffff800`00cb5700 48895c2408      mov     qword ptr [rsp+8],rbx
fffff800`00cb5705 4889742418      mov     qword ptr [rsp+18h],rsi
fffff800`00cb570a 48897c2420      mov     qword ptr [rsp+20h],rdi
fffff800`00cb570f 4889542410      mov     qword ptr [rsp+10h],rdx
fffff800`00cb5714 55              push    rbp
fffff800`00cb5715 4154            push    r12
fffff800`00cb5717 4155            push    r13
fffff800`00cb5719 4156            push    r14

Next I wanted to track where DpcData would be changed, so a breakpoint on memory access was set

kd> ba w8 ffffe00002385830+8*7

kd> bl
 0 d fffff802`d34f6420     0001 (0001) nt!KeInsertQueueDpc
 1 e ffffe000`02385868 w 8 0001 (0001) 

I did not need the breakpoint on KeInsertQueueDpc, so it was disabled.
Eventually the memory watch breakpoint was hit

kd> g
Breakpoint 1 hit
nt!KeInsertQueueDpc+0xf4:
fffff802`d34f6514 753a            jne     nt!KeInsertQueueDpc+0x130 (fffff802`d34f6550)

Lets check the IRQL 

kd> !irql
Debugger saved IRQL for processor 0x0 -- 15 (HIGH_LEVEL)

This is the highest level for 64 bit x86 architecture, so the scheduler and all interrupts were disabled for the current CPU core.

Ok, I found a place where DpcData was changed by InterlockedCompareExchangePointer to a nonzero value. Lets look at a couple of instructions before and after this point. The InterlockedCompareExchangePointer is implemented by the lock cmpxchg instruction that represents a memory barrier by itself, so a compiler do not rearrange instructions around this call and a CPU retires all instruction before a barrier and do not allow out-of-order execution crossing the barrier.

nt!KeInsertQueueDpc+0xf4

fffff802`d34f6504 443b4d24        cmp     r9d,dword ptr [rbp+24h]
fffff802`d34f6508 480f45c8        cmovne  rcx,rax
fffff802`d34f650c 33c0            xor     eax,eax
fffff802`d34f650e f0480fb14e38    lock cmpxchg qword ptr [rsi+38h],rcx <======= here!
fffff802`d34f6514 753a            jne     nt!KeInsertQueueDpc+0x130 (fffff802`d34f6550)
fffff802`d34f6516 ff4718          inc     dword ptr [rdi+18h]
fffff802`d34f6519 ff471c          inc     dword ptr [rdi+1Ch]
fffff802`d34f651c 48895628        mov     qword ptr [rsi+28h],rdx
fffff802`d34f6520 4c896e30        mov     qword ptr [rsi+30h],r13
fffff802`d34f6524 4584e4          test    r12b,r12b

Execution was continued to find a place where the DpcData would be changed back to zero, here it is,

kd> g
Breakpoint 1 hit
nt!KiExecuteAllDpcs+0x10d:
fffff802`d34dbc6d ff4b18          dec     dword ptr [rbx+18h]

As you might know WinDBG points to the next instruction. Lets look around this point.

nt!KiExecuteAllDpcs+0x10d: 

fffff802`d34dbc59 4c8b5720        mov     r10,qword ptr [rdi+20h]
fffff802`d34dbc5d 4c8b4728        mov     r8,qword ptr [rdi+28h]
fffff802`d34dbc61 4c8b4f30        mov     r9,qword ptr [rdi+30h]
fffff802`d34dbc65 4c8b6f38        mov     r13,qword ptr [rdi+38h]
fffff802`d34dbc69 48897738        mov     qword ptr [rdi+38h],rsi  <======== here!
fffff802`d34dbc6d ff4b18          dec     dword ptr [rbx+18h] ds:002b:ffffd000`207bbf18=00000001

check that the RSI registry was zero as it was written in DpcData 

kd> r rsi
rsi=0000000000000000

and again check the IRQL

kd> !irql
Debugger saved IRQL for processor 0x5 -- 2 (DISPATCH_LEVEL)

This was a DPC retirement by an idle thread, pretty common for a not busy system

kd> kn
 # Child-SP          RetAddr           Call Site
00 ffffd000`207e39a0 fffff802`d34db9f0 nt!KiExecuteAllDpcs+0x10d
01 ffffd000`207e3af0 fffff802`d35d27ea nt!KiRetireDpcList+0xd0
02 ffffd000`207e3c60 00000000`00000000 nt!KiIdleLoop+0x5a

 Now it is time to digest the information. If the above assembler instructions are being translated to a high level language like C we would have

BOOLEAN
KeInsertQueueDpc (
    __inout PRKDPC Dpc,
    __in_opt PVOID SystemArgument1,
    __in_opt PVOID SystemArgument2
    )
{
    // prevent any interrupt on this CPU core
    KeRaiseIrql(HIGH_LEVEL, &OldIrql);
    .....
    // a per CPU core lock for DPC queue
    KiAcquireSpinLock(&PerCpuDpcData->DpcLock);
    .....
    // acquire the DPC by changing DpcData, interlocked functions pose a memory barrier
    if (InterlockedCompareExchangePointer(&Dpc->DpcData, PerCpuDpcData, NULL) == NULL) {

        .......
        /*
        fffff802`d34f651c 48895628        mov     qword ptr [rsi+28h],rdx
        fffff802`d34f6520 4c896e30        mov     qword ptr [rsi+30h],r13
        */
        Dpc->SystemArgument1 = SystemArgument1;
        Dpc->SystemArgument2 = SystemArgument2;

        // insert in the list
        InsertHeadList(&PerCpuDpcData->DpcListHead, &Dpc->DpcListEntry);
   } else {
        // do nothing
   }

   KiReleaseSpinLock(&PerCpuDpcData->DpcLock);
   KeLowerIrql(OldIrql);
}

VOID
KiExecuteAllDpcs()
{
    // a per CPU core lock for DPC queue
    KeAcquireSpinLockAtDpcLevel(&PerCpuDpcData->DpcLock);

     ......
    RemoveEntryList(Entry);
    Dpc = CONTAINING_RECORD(Entry, KDPC, DpcListEntry);
    ......
/*
fffff802`d34dbc59 4c8b5720        mov     r10,qword ptr [rdi+20h]
fffff802`d34dbc5d 4c8b4728        mov     r8,qword ptr [rdi+28h]
*/
    SystemArgument1 = Dpc->SystemArgument1;
    SystemArgument2 = Dpc->SystemArgument2;

/*
fffff802`d34dbc65 4c8b6f38        mov     r13,qword ptr [rdi+38h]
fffff802`d34dbc69 48897738        mov     qword ptr [rdi+38h],rsi
*/
    Dpc->DpcData = NULL;

    KeReleaseSpinLockFromDpcLevel(&PerCpuDpcData->DpcLock);
}

Lets discuss what we have at this point. Actually, the synchronisation between KeInsertQueueDpc  and KiExecuteAllDpcs does not look flawless. The problem is in out-of-order data access and compiler optimisation. In the current case the compiler did not rearrange memory access in KiExecuteAllDpcs but nothing prevents it in the future to schedule Dpc->DpcData = NULL before SystemArgument1 = Dpc->SystemArgument1; or SystemArgument2 = Dpc->SystemArgument2; or both as there is no data or control dependency between them. The situation is exacerbated by modern CPUs out-of-order memory access for reading, it is legitimate for a CPU to write NULL in Dpc->DpcData before reading Dpc->SystemArgument1 or Dpc->SystemArgument2 as again there is no data or control dependency here and it is possible that the instructions are retired out of order. In that case if there is a concurrent KeInsertQueueDpc on another CPU core there is a probability ( though extremely low ) that KiExecuteAllDpcs picks a wrong SystemArgument1 or SystemArgument2  that has been just rewritten by  KeInsertQueueDpc after InterlockedCompareExchangePointer successfully changed DpcData to a non zero value. Below is a table of memory reordering in some architectures, IA 64 and ARM look pretty scary for this case, Microsoft is lucky at least by dropping IA 64 support.

Memory ordering in some architectures
TypeAlphaARMv7PA-RISCPOWERSPARC RMOSPARC PSOSPARC TSOx86x86 oostoreAMD64IA-64zSeries
Loads reordered after loadsYYYYYYY
Loads reordered after storesYYYYYYY
Stores reordered after storesYYYYYYYY
Stores reordered after loadsYYYYYYYYYYYY
Atomic reordered with loadsYYYYY
Atomic reordered with storesYYYYYY
Dependent loads reorderedY
Incoherent instruction cache pipelineYYYYYYYYYY

The corrected code should use some type of memory fence or barrier, like the code below. I hope Microsoft did this for ARM, but in any case this should have been done for x86, x86-64 and IA 64, especially the last one.

VOID
KiExecuteAllDpcs()
{
    // a per CPU core lock for DPC queue
    KeAcquireSpinLockAtDpcLevel(&PerCpuDpcData->DpcLock);

     ......
    RemoveEntryList(Entry);
    Dpc = CONTAINING_RECORD(Entry, KDPC, DpcListEntry);
    ......
/*
fffff802`d34dbc59 4c8b5720        mov     r10,qword ptr [rdi+20h]
fffff802`d34dbc5d 4c8b4728        mov     r8,qword ptr [rdi+28h]
*/
    SystemArgument1 = Dpc->SystemArgument1;
    SystemArgument2 = Dpc->SystemArgument2;

    KeMemoryBarrier();

/*
fffff802`d34dbc65 4c8b6f38        mov     r13,qword ptr [rdi+38h]
fffff802`d34dbc69 48897738        mov     qword ptr [rdi+38h],rsi
*/
    Dpc->DpcData = NULL;

    KeReleaseSpinLockFromDpcLevel(&PerCpuDpcData->DpcLock);
}

The rule for such synchronization is in employing a memory barrier when a resource is being released if a resource acquisition was made by an interlocked operation.

Acquire( SomeValue )
{
    // interlocked operations is a barrier for both a CPU and a compiler
    if( InterlockedCompareExchange( &Data->Resource, 0x1, 0x0 ) == 0 ){

        // the resource has been acquired
        Data->Value = SomeValue;
    }
}
.........

Release()
{
    // get the value when the resource is held
    StackedValue = Data->Value;

    // tell a CPU to retire all preceding operations, this is also a compiler barrier
    KeMemoryBarrier();

    // release the resource
    Data->Resource = 0x0;

    // Do something with the local StackedValue
    foo( StackedValue );
}

P.S. If you look at the Windows NT open source clone - ReactOS the situation there is not better, look at KiRetireDpcList ( there is no KiExecuteAllDpcs which was introduced somewhere in Vista or )
 /* Clear its DPC data and save its parameters */
                Dpc->DpcData = NULL;
                DeferredRoutine = Dpc->DeferredRoutine;
                DeferredContext = Dpc->DeferredContext;
                SystemArgument1 = Dpc->SystemArgument1;
                SystemArgument2 = Dpc->SystemArgument2;
after Dpc->DpcData = NULL a concurent KeInsertQueueDpc starts writing new values to Dpc->SystemArgument1 and Dpc->SystemArgument2 before KiRetireDpcList has fetched their consistent values. Without memory barrier there is a room for compiler optimisation by rearranging store and loads and for CPU optimisation.

Saturday, April 5, 2014

An ideal filter is hard to build

Why is it hard to build an ideal digital filter, e.g. an ideal lowpass band filter




From the mathematical point of view such a filter has an infinite spectrum in the time domain, i.e. it is not casual. What does it mean for software or hardware implementation? Consider two close frequencies one below cutoff frequency f and one above it, f-d and f+d respectively, where d is an infinitesimal and the filter must remove the second from its output. If a digital filter performs sampling of an input signal then for such close frequencies it would see the same values for an infinite time as a difference is below its resolution threshold defined by its internal ALU implementation, the filter will see one frequency

and it will take an infinite time as d approaches zero before both signals split up far enough so the filter recognises the difference, i.e. you must have an infinite delay in a filter.