Wednesday, December 28, 2016

ExInterlockedPopEntrySList processing by scheduler.

I believe this topic on ExInterlockedPopEntrySList might be interesting for Windows driver developers.

Safety of using ExInterlockedPopEntrySList

The question was

To my knowledge, pre-Windows 8 x64 implementations of SList use 9-bit sequence numbers in the SLIST_HEADER. This means that 512 operations can complete concurrently (without progress from particular thread) until an ABA problem potentially manifests. I wonder whether, depending on the number of threads and physical cores, this couldn't plausibly occur. To further complicate, the kernel could run on a vcpu, creating time discontinuities. I would like to ask: 1. Does the Windows scheduler protect against ABA by, e.g., restarting interlocked operation upon preemption? 2. Is there some protection against hypervisor interference? 3. In the light of the above concerns, is SList on a pre-Windows 8 x64 deployment really safe for all workloads? I would have speculated that per-thread kernel allocator behavior was factored in for the ABA avoidance, but the primitives are in the Win32 API as well and any driver can employ custom pool allocator.
My answer was

I looked at the code again and found that the interrupt processing code has a fixup for SList. There is a routine, KiCheckForSListAddress, which is called at DISPATCH_LEVEL before returning from an interrupt. If the interrupt happened inside ExInterlockedPopEntrySList, the routine rewrites the saved EIP (RIP on x64) in the trap frame so that the SList pop operation is restarted. When the interrupt processing code returns to the interrupted code, execution resumes at the beginning of ExInterlockedPopEntrySList (namely ExpInterlockedPopEntrySListResume).

kd> uf KiCheckForSListAddress
nt!KiCheckForSListAddress:
82acbdf1 0fb7416c        movzx   eax,word ptr [ecx+6Ch]
82acbdf5 8b5168          mov     edx,dword ptr [ecx+68h]
82acbdf8 6683f808        cmp     ax,8
82acbdfc 7511            jne     nt!KiCheckForSListAddress+0x1e (82acbe0f)  Branch

nt!KiCheckForSListAddress+0xd:
82acbdfe b8f4dda882      mov     eax,offset nt!ExpInterlockedPopEntrySListResume (82a8ddf4)
82acbe03 3bd0            cmp     edx,eax
82acbe05 7222            jb      nt!KiCheckForSListAddress+0x38 (82acbe29)  Branch

nt!KiCheckForSListAddress+0x16:
82acbe07 81fa1fdea882    cmp     edx,offset nt!ExpInterlockedPopEntrySListEnd (82a8de1f)
82acbe0d eb15            jmp     nt!KiCheckForSListAddress+0x33 (82acbe24)  Branch

nt!KiCheckForSListAddress+0x1e:
82acbe0f 6683f81b        cmp     ax,1Bh
82acbe13 7514            jne     nt!KiCheckForSListAddress+0x38 (82acbe29)  Branch

nt!KiCheckForSListAddress+0x24:
82acbe15 a1ac69bb82      mov     eax,dword ptr [nt!KeUserPopEntrySListResume (82bb69ac)]
82acbe1a 3bd0            cmp     edx,eax
82acbe1c 720b            jb      nt!KiCheckForSListAddress+0x38 (82acbe29)  Branch

nt!KiCheckForSListAddress+0x2d:
82acbe1e 3b15a469bb82    cmp     edx,dword ptr [nt!KeUserPopEntrySListEnd (82bb69a4)]

nt!KiCheckForSListAddress+0x33:
82acbe24 7703            ja      nt!KiCheckForSListAddress+0x38 (82acbe29)  Branch

nt!KiCheckForSListAddress+0x35:
82acbe26 894168          mov     dword ptr [ecx+68h],eax

nt!KiCheckForSListAddress+0x38:
82acbe29 c3              ret
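The checks in the disassembly can be restated in C. This is only a sketch under stated assumptions: the structure and function names are hypothetical, the fields at +0x68 and +0x6C are assumed to be the saved EIP and CS selector of the x86 trap frame, and the four range boundaries are passed in rather than read from the kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the two trap frame fields the routine
   reads: a 32-bit value at +0x68 (assumed EIP) and a 16-bit value at +0x6C
   (assumed CS selector). */
struct trap_frame {
    uint32_t eip;
    uint16_t seg_cs;
};

/* Boundaries of the kernel and user mode pop sequences. In the listing they
   come from ExpInterlockedPopEntrySListResume/End and from the variables
   KeUserPopEntrySListResume/End; here they are plain inputs. */
struct slist_ranges {
    uint32_t kernel_resume, kernel_end;
    uint32_t user_resume, user_end;
};

/* Returns 1 if the saved EIP was rewound to the resume label, 0 otherwise. */
static int check_for_slist_address(struct trap_frame *tf,
                                   const struct slist_ranges *r)
{
    uint32_t resume, end;

    assert(tf != NULL && r != NULL);

    if (tf->seg_cs == 0x08) {            /* cmp ax,8   : kernel code selector */
        resume = r->kernel_resume;
        end = r->kernel_end;
    } else if (tf->seg_cs == 0x1B) {     /* cmp ax,1Bh : user code selector */
        resume = r->user_resume;
        end = r->user_end;
    } else {
        return 0;
    }

    if (tf->eip < resume || tf->eip > end)   /* jb / ja : outside the sequence */
        return 0;

    tf->eip = resume;                    /* mov dword ptr [ecx+68h],eax */
    return 1;
}
```

With the kernel range from the listing (82a8ddf4 to 82a8de1f), a frame interrupted at 82a8de00 is rewound to 82a8ddf4, while a frame at 82a8de20 is left untouched.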

Sunday, December 25, 2016

MacOS network filter

I have added a MacOS network sockets filter to my GitHub repository - MacOSX-Network-Sockets-Filter. The filter allows network data to be inspected and modified in a user mode application.

Thursday, December 22, 2016

MacOS VFS file system isolation filter.

I have committed a MacOS VFS isolation filter project to my GitHub repository - MacOSX-VFS-Isolation-Filter. The filter allows file I/O operations to be isolated. Possible applications for the filter are content analysis, encryption or any advanced data flow modification.

Thursday, December 15, 2016

Linux process. The beginning.

Do you ever wonder what a Linux process address space looks like when the first user mode instruction is executed? The answer is below (the executable file is /bin/grep).

00400000-0042d000 r-xp 00000000 08:11 27316285             /bin/grep
0062d000-0062f000 rw-p 0002d000 08:11 27316285             /bin/grep
0062f000-00630000 rw-p 00000000 00:00 0                    [heap]
7ffff7dda000-7ffff7dfd000 r-xp 00000000 08:11 11172260     /lib/x86_64-linux-gnu/ld-2.19.so
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 00:00 0            [vdso]
7ffff7ffc000-7ffff7ffe000 rw-p 00022000 08:11 11172260     /lib/x86_64-linux-gnu/ld-2.19.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0 
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0            [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0    [vsyscall]

The first user mode instruction is

0x00007ffff7ddb2d0 in _start () from /lib64/ld-linux-x86-64.so.2
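The address space listing above appears to be in the format of /proc/<pid>/maps. As a minimal sketch, one line of it can be parsed with sscanf; the structure and function names below are hypothetical and the path is limited to 255 characters without spaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of a maps-style listing. */
struct map_entry {
    unsigned long long start, end, offset;
    unsigned int dev_major, dev_minor;
    unsigned long inode;
    char perms[5];      /* e.g. "r-xp" */
    char path[256];     /* empty for anonymous mappings */
};

/* Returns 0 on success, -1 if the line does not have the expected fields. */
static int parse_maps_line(const char *line, struct map_entry *e)
{
    int n;

    assert(line != NULL && e != NULL);
    e->path[0] = '\0';

    n = sscanf(line, "%llx-%llx %4s %llx %x:%x %lu %255s",
               &e->start, &e->end, e->perms, &e->offset,
               &e->dev_major, &e->dev_minor, &e->inode, e->path);

    /* The path is optional, the seven fields before it are not. */
    return n >= 7 ? 0 : -1;
}
```

For the first line of the listing this yields start 0x400000, end 0x42d000, permissions "r-xp" and path "/bin/grep", that is a 0x2d000 byte executable mapping of the grep image.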

Sunday, November 6, 2016

VFS structures have changed in Sierra

After Sierra (10.12) was released I received two tickets for my Mac OS X file system filter project on GitHub, MacOSX-FileSystem-Filter. It turned out that the vnode structure had changed: the offset of the v_op field moved from 0xC8 to 0xD0. Instead of hardcoding a new definition I activated the dynamic inference of structure layouts, which had been in the project for a long time but had not been used because the structures had not changed for years.
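The post does not describe how the project infers layouts, so the following is only an illustration of one possible approach, not the project's code: if the value a pointer field must hold is known, the field offset can be found by scanning the object at pointer-aligned offsets. The function name and the fake object in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scan an opaque object for a pointer-sized slot that holds `expected`.
   Returns the byte offset of the first match, or -1 if there is none. */
static long find_pointer_offset(const void *obj, size_t obj_size,
                                const void *expected)
{
    const unsigned char *bytes = obj;
    size_t off;

    assert(obj != NULL);

    for (off = 0; off + sizeof(void *) <= obj_size; off += sizeof(void *)) {
        void *candidate;

        /* memcpy avoids alignment and aliasing assumptions about obj. */
        memcpy(&candidate, bytes + off, sizeof candidate);
        if (candidate == expected)
            return (long)off;
    }
    return -1;
}
```

For a fake 0x100 byte object with a known operations table pointer stored at 0xD0, the scan returns 0xD0. A real implementation would also have to deal with false positives, for example by checking several objects.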

Update: Apple released 10.12 source code on 25 November 2016. To download xnu/Darwin source code follow the link https://opensource.apple.com/release/os-x-1012.html 

Wednesday, October 12, 2016

Windows and Linux kernels exception handling and stack unwinding

The interesting difference between the Windows and Linux kernels is the Windows mechanism for unwinding a call stack, aka Frame Unwind. Both the 64-bit Windows kernel and the Linux kernel use table-based exception processing to locate a handler for an instruction that caused an exception. The Windows kernel can unwind the call stack to locate a caller's handler, while Linux requires a table entry for each executable address range that can cause an exception.
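Table-based lookup can be sketched as a binary search over address ranges sorted by start address. The structure below is a hypothetical simplification: real tables differ in format (for example, the Windows RUNTIME_FUNCTION entries hold image-relative BeginAddress, EndAddress and UnwindData values), and no unwinding is modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: code range [begin, end) and its handler. */
struct range_entry {
    uintptr_t begin;
    uintptr_t end;
    uintptr_t handler;
};

/* Binary search in a table sorted by `begin` with non-overlapping ranges.
   Returns the entry covering pc, or NULL if no entry covers it. */
static const struct range_entry *lookup_range(const struct range_entry *table,
                                              size_t count, uintptr_t pc)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pc < table[mid].begin)
            hi = mid;
        else if (pc >= table[mid].end)
            lo = mid + 1;
        else
            return &table[mid];
    }
    return NULL;
}
```

When the lookup returns NULL, a kernel without stack unwinding has no handler for the faulting address, while an unwinding kernel can move to the caller's frame and repeat the lookup with the caller's address.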

You can look at pseudo-code for Windows 64 bit RtlUnwind here StackWalk64.cpp .

Some resources on Windows 64 bit SEH implementation.

1. Exceptional behavior: the Windows 8.1 X64 SEH Implementation  http://blog.talosintel.com/2014/06/exceptional-behavior-windows-81-x64-seh.html

2. Exceptional Behavior - x64 Structured Exception Handling - OSR Online. http://www.osronline.com/article.cfm?article=469

3. Johnson, Ken. " Programming against the x64 exception handling support ."  http://www.nynaeve.net/?p=113

The code was borrowed from http://www.nynaeve.net/Code/StackWalk64.cpp

__declspec(noinline)
VOID
StackTrace64(
 VOID
 )
{
 CONTEXT                       Context;
 KNONVOLATILE_CONTEXT_POINTERS NvContext;
 UNWIND_HISTORY_TABLE          UnwindHistoryTable;
 PRUNTIME_FUNCTION             RuntimeFunction;
 PVOID                         HandlerData;
 ULONG64                       EstablisherFrame;
 ULONG64                       ImageBase;

 DbgPrint("StackTrace64: Executing stack trace...\n");

 //
 // First, we'll get the caller's context.
 //

 RtlCaptureContext(&Context);

 //
 // Initialize the (optional) unwind history table.
 //

 RtlZeroMemory(
  &UnwindHistoryTable,
  sizeof(UNWIND_HISTORY_TABLE));

 UnwindHistoryTable.Unwind = TRUE;

 //
 // This unwind loop intentionally skips the first call frame, as it shall
 // correspond to the call to StackTrace64, which we aren't interested in.
 //

 for (ULONG Frame = 0;
   ;
   Frame++)
 {
  //
  // Try to look up unwind metadata for the current function.
  //

  RuntimeFunction = RtlLookupFunctionEntry(
   Context.Rip,
   &ImageBase,
   &UnwindHistoryTable
   );

  RtlZeroMemory(
   &NvContext,
   sizeof(KNONVOLATILE_CONTEXT_POINTERS));

  if (!RuntimeFunction)
  {
   //
   // If we don't have a RUNTIME_FUNCTION, then we've encountered
   // a leaf function.  Adjust the stack appropriately.
   //

   Context.Rip  = (ULONG64)(*(PULONG64)Context.Rsp);
   Context.Rsp += 8;
  }
  else
  {
   //
   // Otherwise, call upon RtlVirtualUnwind to execute the unwind for
   // us.
   //

   RtlVirtualUnwind(
    UNW_FLAG_NHANDLER,
    ImageBase,
    Context.Rip,
    RuntimeFunction,
    &Context,
    &HandlerData,
    &EstablisherFrame,
    &NvContext);
  }

  //
  // If we reach an RIP of zero, this means that we've walked off the end
  // of the call stack and are done.
  //

  if (!Context.Rip)
   break;

  //
  // Display the context.  Note that we don't bother showing the XMM
  // context, although we have the nonvolatile portion of it.
  //

  DbgPrint(
   "FRAME %02x: Rip=%p Rsp=%p Rbp=%p\n",
   Frame,
   Context.Rip,
   Context.Rsp,
   Context.Rbp);
  DbgPrint(
   "r12=%p r13=%p r14=%p\n"
   "rdi=%p rsi=%p rbx=%p\n"
   "rbp=%p rsp=%p\n",
   Context.R12,
   Context.R13,
   Context.R14,
   Context.Rdi,
   Context.Rsi,
   Context.Rbx,
   Context.Rbp,
   Context.Rsp
   );

  static const CHAR* RegNames[ 16 ] =
  { "Rax", "Rcx", "Rdx", "Rbx", "Rsp", "Rbp", "Rsi", "Rdi", "R8", "R9",
    "R10", "R11", "R12", "R13", "R14", "R15" };

  //
  // If we have stack-based register stores, then display them here.
  //

  for (ULONG i = 0;
    i < 16;
    i++)
  {
   if (NvContext.IntegerContext[ i ])
   {
    DbgPrint(
     " -> Saved register '%s' on stack at %p (=> %p)\n",
     RegNames[ i ],
     NvContext.IntegerContext[ i ],
     *NvContext.IntegerContext[ i ]);
   }
  }

  DbgPrint("\n");
 }

 DbgBreakPoint();

 return;
}

Wednesday, September 28, 2016

What happens if Linux kernel module unloads with a running kernel thread?


The kernel will oops in the page fault handler in the kernel thread context and terminate this thread. Click on the image to see a backtrace.


Tuesday, September 6, 2016

Waiting for concurrent page fault completion

An interesting call stack when a thread waits in a page fault for another thread to finish paging in data from a file

00 nt!KiSwapContext
01 nt!KiSwapThread
02 nt!KiCommitThreadWait
03 nt!KeWaitForSingleObject
04 nt!MiWaitForCollidedFaultComplete
05 nt!MiResolveTransitionFault
06 nt!MiResolveProtoPteFault
07 nt!MiDispatchFault
08 nt!MmAccessFault
09 nt!KiPageFault
0a nt!memcpy
0b nt!CcCopyBytesToUserBuffer
0c nt!CcMapAndCopyFromCache
0d nt!CcCopyReadEx
0e nt!CcCopyRead
0f nt!FsRtlCopyRead
10 ***
11 ***
12 ***
13 nt!NtReadFile
14 nt!KiSystemServiceCopyEnd

Friday, September 2, 2016

FileObjects and SectionObjectPointer in Windows.

Just for the record.

FileObject->SectionObjectPointer is allocated and set by a file system driver but the structure is managed by the Memory Manager (Mm). SectionObjectPointer is shared between all file objects for the same data stream.

FileObject->SectionObjectPointer->DataSectionObject and FileObject->SectionObjectPointer->ImageSectionObject contain the addresses of the ControlArea structures for the data and image sections.

ControlArea deletion is synchronized by ControlArea->WaitingForDeletion and ControlArea->u.Flags.BeingDeleted. WaitingForDeletion points to a structure with a notification event and a reference counter.

All functions that might destroy a control area take SectionObjectPointer as a parameter. These functions acquire a global lock and then check that ControlArea is not NULL. If the control area exists, ControlArea->u.Flags.BeingDeleted is checked, and if it is set the function increments the reference counter and waits on the WaitingForDeletion event. The event is deleted when the last waiting thread leaves the wait state and the reference counter drops to zero. A call to MiCleanSection sets SectionObjectPointer->DataSectionObject and SectionObjectPointer->ImageSectionObject to NULL. This call is synchronized with ControlArea->u.Flags.BeingDeleted.

The functions that might delete a control area include MmFlushImageSection and CcPurgeCacheSection. That means it is safe to pass SectionObjectPointer to these functions without synchronizing with file object deletion. It is even possible to call these functions with a SectionObjectPointer when all related file objects have been deleted or have IopDeleteFile being called for them, which might happen in the IRP_MJ_PNP processing path.

Friday, August 26, 2016

File mapping and FILE_OBJECT in Windows

There is a WinDBG command !ca that shows file mapping related information. I will show how to get this file mapping information for a file object ( FILE_OBJECT type) by a direct access to structures.

The core of file mapping (and of file data caching, which uses file mapping) is the SEGMENT object and the CONTROL_AREA structure. The SEGMENT object contains a pointer to an array of Prototype PTEs (ProtoPTE) of _MMPTE_PROTOTYPE type. Each ProtoPTE points to a related physical page if the page is valid. When a file mapping is created, the PTEs (Page Table Entries) of the related virtual memory range have the invalid bit set and point to Prototype PTEs. When a corresponding virtual address is accessed a page fault happens; the page fault handler follows the link to the ProtoPTE and fixes the process PTE to point to a real page. That allows all processes to share the same physical pages for the same file memory mapping. The physical page might need to be allocated and its data read in from the file if this has not been done before; after that the page is shared between all processes mapping the file.

FILE_OBJECT has a SectionObjectPointer field which is set by a file system driver (FSD), but all its fields are initialized by the Memory Manager (Mm) and the Cache Manager (Cc). SectionObjectPointer is of _SECTION_OBJECT_POINTERS type, with the DataSectionObject field pointing to a CONTROL_AREA structure that in turn points to a SEGMENT object. CONTROL_AREA has a _SUBSECTION structure following it at the tail; all subsequent _SUBSECTION structures are linked by the NextSubsection pointer. Each _SUBSECTION has a SubsectionBase field that points to a related ProtoPTEs array.
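The "subsection follows the control area" layout can be modelled in C as follows. This is a sketch with hypothetical, heavily reduced structures that keep only the fields needed for the walk; the real CONTROL_AREA and _SUBSECTION are larger and version dependent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct control_area;

/* Reduced stand-in for _SUBSECTION. */
struct subsection {
    struct control_area *control_area;
    void *subsection_base;          /* first prototype PTE of the subsection */
    struct subsection *next;        /* NextSubsection */
    uint32_t ptes_in_subsection;
};

/* Reduced stand-in for CONTROL_AREA. */
struct control_area {
    void *segment;
    uint32_t number_of_mapped_views;
};

/* The first subsection is not linked by a pointer: it starts right after
   the control area in memory. */
static struct subsection *first_subsection(struct control_area *ca)
{
    assert(ca != NULL);
    return (struct subsection *)((unsigned char *)ca + sizeof *ca);
}

/* Walk the first subsection and the NextSubsection chain. */
static uint32_t total_ptes(struct control_area *ca)
{
    uint32_t total = 0;
    struct subsection *s;

    for (s = first_subsection(ca); s != NULL; s = s->next)
        total += s->ptes_in_subsection;
    return total;
}
```

The same arithmetic is what the debugger session relies on: with the control area at 0x863c1758 and a structure size of 0x50, the first subsection is at 0x863c17a8.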

Below all these structures for a real file object are printed from WinDBG.

0: kd> ??FileObject
struct _FILE_OBJECT * 0x8750ef80
   +0x000 Type             : 0n5
   +0x002 Size             : 0n128
   +0x004 DeviceObject     : 0x879c9030 _DEVICE_OBJECT
   +0x008 Vpb              : 0x879d5888 _VPB
   +0x00c FsContext        : 0x87bdde68 Void
   +0x010 FsContext2       : 0x863cc188 Void
   +0x014 SectionObjectPointer : 0x87bddea8 _SECTION_OBJECT_POINTERS
   +0x018 PrivateCacheMap  : 0x869acf90 Void
   +0x01c FinalStatus      : 0n0
   +0x020 RelatedFileObject : (null) 
   +0x024 LockOperation    : 0 ''
   +0x025 DeletePending    : 0 ''
   +0x026 ReadAccess       : 0 ''
   +0x027 WriteAccess      : 0 ''
   +0x028 DeleteAccess     : 0 ''
   +0x029 SharedRead       : 0 ''
   +0x02a SharedWrite      : 0 ''
   +0x02b SharedDelete     : 0 ''
   +0x02c Flags            : 0xc0012
   +0x030 FileName         : _UNICODE_STRING "\Sample Pictures\Chrysanthemum.jpg"
   +0x038 CurrentByteOffset : _LARGE_INTEGER 0x11000
   +0x040 Waiters          : 0
   +0x044 Busy             : 1
   +0x048 LastLock         : (null) 
   +0x04c Lock             : _KEVENT
   +0x05c Event            : _KEVENT
   +0x06c CompletionContext : (null) 
   +0x070 IrpListLock      : 0
   +0x074 IrpList          : _LIST_ENTRY [ 0x8750eff4 - 0x8750eff4 ]
   +0x07c FileObjectExtension : 0x8774e950 Void

0: kd> ??FileObject->SectionObjectPointer
struct _SECTION_OBJECT_POINTERS * 0x87bddea8
   +0x000 DataSectionObject : 0x863c1758 Void
   +0x004 SharedCacheMap   : 0x869acea0 Void
   +0x008 ImageSectionObject : (null) 

0: kd> dt nt!_CONTROL_AREA 0x863c1758 
   +0x000 Segment          : 0xaeb311a8 _SEGMENT
   +0x004 DereferenceList  : _LIST_ENTRY [ 0x0 - 0x0 ]
   +0x00c NumberOfSectionReferences : 1
   +0x010 NumberOfPfnReferences : 0x40
   +0x014 NumberOfMappedViews : 1
   +0x018 NumberOfUserReferences : 0
   +0x01c u                : <unnamed-tag>
   +0x020 FlushInProgressCount : 0
   +0x024 FilePointer      : _EX_FAST_REF
   +0x028 ControlAreaLock  : 0n0
   +0x02c ModifiedWriteCount : 0
   +0x02c StartingFrame    : 0
   +0x030 WaitingForDeletion : (null) 
   +0x034 u2               : <unnamed-tag>
   +0x040 LockedPages      : 0n1
   +0x048 ViewList         : _LIST_ENTRY [ 0x86a3a898 - 0x86a3a898 ]

0: kd> ??sizeof(nt!_CONTROL_AREA)
unsigned int 0x50

0: kd> dt nt!_SUBSECTION 0x863c1758+0x50
   +0x000 ControlArea      : 0x863c1758 _CONTROL_AREA
   +0x004 SubsectionBase   : 0xa8e9b008 _MMPTE
   +0x008 NextSubsection   : (null) 
   +0x00c PtesInSubsection : 0x40
   +0x010 UnusedPtes       : 0
   +0x010 GlobalPerSessionHead : (null) 
   +0x014 u                : <unnamed-tag>
   +0x018 StartingSector   : 0
   +0x01c NumberOfFullSectors : 0x40

0: kd> dt nt!_SEGMENT  0xaeb311a8 
   +0x000 ControlArea      : 0x863c1758 _CONTROL_AREA
   +0x004 TotalNumberOfPtes : 0x40
   +0x008 SegmentFlags     : _SEGMENT_FLAGS
   +0x00c NumberOfCommittedPages : 0
   +0x010 SizeOfSegment    : 0x40000
   +0x018 ExtendInfo       : (null) 
   +0x018 BasedAddress     : (null) 
   +0x01c SegmentLock      : _EX_PUSH_LOCK
   +0x020 u1               : <unnamed-tag>
   +0x024 u2               : <unnamed-tag>
   +0x028 PrototypePte     : 0xa8fe97e8 _MMPTE
   +0x030 ThePtes          : [1] _MMPTE

1: kd> dt nt!_MMPTE .
   +0x000 u                :
      +0x000 Long             : Uint8B
      +0x000 VolatileLong     : Uint8B
      +0x000 HighLow          : _MMPTE_HIGHLOW
      +0x000 Flush            : _HARDWARE_PTE
      +0x000 Hard             : _MMPTE_HARDWARE
      +0x000 Proto            : _MMPTE_PROTOTYPE
      +0x000 Soft             : _MMPTE_SOFTWARE
      +0x000 TimeStamp        : _MMPTE_TIMESTAMP
      +0x000 Trans            : _MMPTE_TRANSITION
      +0x000 Subsect          : _MMPTE_SUBSECTION
      +0x000 List             : _MMPTE_LIST

1: kd> dt nt!_MMPTE_PROTOTYPE
   +0x000 Valid            : Pos 0, 1 Bit
   +0x000 Unused0          : Pos 1, 7 Bits
   +0x000 ReadOnly         : Pos 8, 1 Bit
   +0x000 Unused1          : Pos 9, 1 Bit
   +0x000 Prototype        : Pos 10, 1 Bit
   +0x000 Protection       : Pos 11, 5 Bits
   +0x000 Unused           : Pos 16, 16 Bits
   +0x000 ProtoAddress     : Pos 32, 32 Bits
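The bit positions printed for _MMPTE_PROTOTYPE can be used to decode a raw 64-bit PTE value by hand. The sketch below takes the positions from the dump above; the structure and function names are hypothetical, and the layout applies only to the system the dump was taken from.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a prototype-format PTE, as laid out in the dump above. */
struct proto_pte_fields {
    unsigned valid;          /* Pos 0, 1 bit   */
    unsigned read_only;      /* Pos 8, 1 bit   */
    unsigned prototype;      /* Pos 10, 1 bit  */
    unsigned protection;     /* Pos 11, 5 bits */
    uint32_t proto_address;  /* Pos 32, 32 bits */
};

static struct proto_pte_fields decode_proto_pte(uint64_t pte)
{
    struct proto_pte_fields f;

    f.valid = (unsigned)(pte & 1);
    f.read_only = (unsigned)((pte >> 8) & 1);
    f.prototype = (unsigned)((pte >> 10) & 1);
    f.protection = (unsigned)((pte >> 11) & 0x1F);
    f.proto_address = (uint32_t)(pte >> 32);
    return f;
}
```

For example, a made-up value 0xa8e9b00800002400 decodes to an invalid PTE with the Prototype bit set, Protection 4 and ProtoAddress 0xa8e9b008.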


Tuesday, August 23, 2016

Mac OS X file system redirector

I committed a new project to my GitHub repository: a file system request redirection filter, MacOSX-VFS-redirector. The project is based on MacOSX-FileSystem-Filter.

The filter redirects file creation, open requests, rename and data IO (read, write) from an application to a shadow directory where shadow copies of files are created. The shadow directory path can cross mount points. An application under control is not aware of the redirection and believes it works with the original files by using unmodified paths. Applications under control are registered in the gApplicationsData array. The array is declared in ApplicationsData.cpp .

The filter employs a user mode client for data modification and shadow file creation. See processing for VFSDataType_PreOperationCallback in user mode client's main.cpp .

The filter's core is VFSHooks.cpp . It contains VFS hooks to intercept file creation and open, redirect IO and call a user client.

The filter was tested on Mac OS X Yosemite (10.10) and Mac OS X El Capitan (10.11).

Tuesday, June 21, 2016

Two approaches to register renaming

The figure is borrowed from "The Berkeley Out-of-Order Machine (BOOM) Design Specification"

On the left is the physical register file approach, in which the register file contains more registers than the ISA defines. On the right is the case in which the register file is identical to the ISA register set and renaming is implemented in the ROB (ReOrder Buffer), which commits to the register file.

A quote from "The Berkeley Out-of-Order Machine (BOOM) Design Specification"


Monday, June 20, 2016

I/O Kit filtering by hooking technique.

Apple I/O Kit is a set of classes to develop kernel modules for Mac OS X and iOS. Its analog in the Windows world is the KMDF/UMDF framework. I/O Kit is built atop the Mach and BSD subsystems, just as Windows KMDF is built atop WDM and the kernel API.

The official way to develop a kernel module filter for Mac OS X and iOS is to inherit the filter C++ class from the C++ class it filters. This requires access to the class declaration, which is not always possible as some classes are Apple private or not published by third party developers. In most cases these private classes are inherited from classes that are in the public domain. That means they extend an existing interface/functionality, and the goal of filtering device I/O can be achieved by filtering only the public interface. This is nearly always true because the I/O Kit object attached to such a private C++ class object is usually either an Apple I/O Kit class that knows nothing about the third party extended interface, or a class from a module developed by third party developers that knows only about the public interface. In both cases the attached object issues requests to the public interface.

Let's consider some imaginary private class IOPrivateInterface that inherits from some IOAppleDeviceClass whose declaration is available, and an attached I/O Kit object that issues requests to the IOAppleDeviceClass interface

class IOPrivateInterface: public IOAppleDeviceClass {
};

You want to filter requests to a device managed by IOPrivateInterface, which means that you need to declare your filter like

class IOMyFilter: public IOPrivateInterface{
};

This will never compile as you do not have access to the IOPrivateInterface class declaration. You can't declare your filter as

class IOMyFilter: public IOAppleDeviceClass {
};

as this will jettison all IOPrivateInterface code and the device will not function correctly.

There might be another reason to avoid standard I/O Kit filtering by inheritance. A filtering class object replaces the original class object in the I/O Kit device stack. That means a module with a filter should be available on device discovery and initialization. In nearly all cases this means that an instance of the filter class will be created during system startup. This puts a great responsibility on the module developer, as an error might render the system unbootable without an easy way for a customer to fix the problem.

To overcome these limitations I developed a hooking technique for I/O Kit classes. I/O Kit uses virtual C++ functions so that a class can be extended while its clients are still able to use the base class declaration. That means that all functions that are used for I/O are placed in an array of function pointers known as the vtable.

The hooking technique supports two types of hooking.
  - replacing vtable array pointer in class object
  - replacing selected functions in vtable array without changing vtable pointer

The former method allows filtering access to a particular object of a class but requires knowing the vtable size. The latter method allows filtering requests without knowing the vtable size, but as the vtable is shared by all objects of a class the filter will see requests to all objects of that class. To get the vtable size you need the class declaration, or you have to obtain the size by reverse engineering.
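The difference between the two hooking types can be modelled in plain C with an explicit table of function pointers. This is a toy model, not the code from the repository: all names are hypothetical, the "vtable" is an ordinary writable array, and real C++ vtables in a kernel (layout, write protection, multiple inheritance) are not addressed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*method_fn)(void *self, int arg);

/* An "object" whose first word points at its class's function table. */
struct object {
    method_fn *vtable;
};

enum { SLOT_OPEN = 0, SLOT_READ = 1, VTABLE_SLOTS = 2 };

static int original_open(void *self, int arg) { (void)self; return arg + 100; }
static int original_read(void *self, int arg) { (void)self; return arg; }
static int hooked_read(void *self, int arg)   { (void)self; return -arg; }

/* Type 1: replace the vtable pointer of one object. The table must be
   copied first, which is why its size (slot count) has to be known. */
static void hook_object(struct object *obj, method_fn *private_copy,
                        size_t slots, size_t slot, method_fn hook)
{
    assert(obj != NULL && private_copy != NULL && slot < slots);
    memcpy(private_copy, obj->vtable, slots * sizeof(method_fn));
    private_copy[slot] = hook;
    obj->vtable = private_copy;
}

/* Type 2: patch one slot of the shared table in place. No size is needed,
   but every object of the class is affected. Returns the original entry. */
static method_fn hook_vtable(method_fn *shared, size_t slot, method_fn hook)
{
    method_fn old = shared[slot];

    shared[slot] = hook;
    return old;
}
```

After hook_object is applied to one of two objects sharing a table, only that object calls hooked_read; after hook_vtable is applied to the shared table, every object still using it does.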

The hooker code can be found in my GitHub repository https://github.com/slavaim/MacOSX-Kernel-Filter. Below is an excerpt from  DldHookerCommonClass.cpp  for a function that hooks a particular I/O kit object.


As you can see there are two types of hooking
   - DldHookTypeObject which replaces a vtable pointer
   - DldHookTypeVtable which replaces selected function in vtable array

An example is a function that is used to filter read requests to the USB HCI controller class



The function is a template, which allows hooking of inherited classes without code duplication. Although Apple documentation states that C++ templates are not supported by I/O Kit, this is true only if a template is used to declare an I/O Kit object. You can compile an I/O Kit module with C++ template classes as long as they are not I/O Kit classes themselves; template parameters can be I/O Kit classes. After instantiation a template is just an ordinary C++ class, so no support for template classes is required from the run time environment. You can't declare an I/O Kit class as a template only because of the way Apple declares them, using C style preprocessor definitions.

Below is a call stack when a hooked I/O Kit object virtual function is called by IOStorage::open

DldIOService::open at DldIOService.cpp:364
DldHookerCommonClass::open at DldHookerCommonClass.cpp:621
IOServiceVtableDldHookDldInheritanceDepth_0::open_hook at IOServiceDldHook.cpp:20
IOStorage::open at IOStorage.cpp:216
IOApplePartitionScheme::scan at IOApplePartitionScheme.cpp:258
IOApplePartitionScheme::probe at IOApplePartitionScheme.cpp:101
IOService::probeCandidates at IOService.cpp:2702
IOService::doServiceMatch at IOService.cpp:3088
_IOConfigThread::main at IOService.cpp:3350

Friday, June 17, 2016

Caching and file object reference in Windows.

This is how the last reference to a file object backing cached file data is being released by the kernel. In that case this was a network filesystem

19 nt!IofCallDriver
1a mup!MupiCallUncProvider
1b mup!MupStateMachine
1c mup!MupClose
1d nt!IofCallDriver
1e nt!IopDeleteFile
1f nt!ObpRemoveObjectRoutine
20 nt!ObfDereferenceObjectWithTag
21 nt!ObfDereferenceObject
22 nt!CcDeleteSharedCacheMap
23 nt!CcWriteBehind
24 nt!CcWorkerThread
25 nt!ExpWorkerThread
26 nt!PspSystemThreadStartup
27 nt!KiThreadStartup

Friday, June 10, 2016

How handles are closed on process termination in Windows

Just for curiosity. A call stack when handles are closed on process termination

00 nt!ObpDecrementHandleCount
01 nt!ObpCloseHandleTableEntry
02 nt!ExSweepHandleTable
03 nt!ObKillProcess
04 nt!PspExitThread
05 nt!PsExitSpecialApc
06 nt!KiDeliverApc
07 nt!KiServiceExit
08 ntdll!KiFastSystemCallRet
09 ntdll!ZwWaitForWorkViaWorkerFactory
0a ntdll!TppWorkerThread
0b KERNEL32!BaseThreadInitThunk
0c ntdll!__RtlUserThreadStart
0d ntdll!_RtlUserThreadStart

Monday, May 23, 2016

ZwQuerySystemInformation fails for SystemSessionProcessesInformation(53) when called from a driver

The following kernel mode code will always fail with the STATUS_ACCESS_VIOLATION ( C0000005 ) error if used with the well known definition of SYSTEM_SESSION_PROCESS_INFORMATION.

typedef struct _SYSTEM_SESSION_PROCESS_INFORMATION {
    ULONG SessionId;
    ULONG SizeOfBuf;
    PVOID Buffer;
} SYSTEM_SESSION_PROCESS_INFORMATION, *PSYSTEM_SESSION_PROCESS_INFORMATION;

SYSTEM_SESSION_PROCESS_INFORMATION     Info;

Info.SessionId = SessionId;
Info.Buffer = Buffer; // a buffer allocated in the system space
Info.SizeOfBuf = SizeOfBuf;

RC = ZwQuerySystemInformation( SystemSessionProcessesInformation, &Info, sizeof(Info), &ReturnedLength );


I disassembled the sequence of calls until an error was returned. The reason for the failure is that the definition of SYSTEM_SESSION_PROCESS_INFORMATION has probably changed starting from Vista. The kernel checks the size of the structure, which is the third parameter of ZwQuerySystemInformation. If the size is 0x10 (on a 64 bit system), ExpQuerySystemInformation calls ProbeForWrite for Info.Buffer regardless of the previous mode (in this case the previous mode was KernelMode). Evidently the system allows the old definition to be used only by user mode code, as ProbeForWrite always raises an exception (SEH) when called with a kernel mode address as a parameter.
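The two facts used above, the 0x10 structure size and the rejection of kernel addresses by the probe, can be checked with a small user mode model. This is a sketch, not the kernel code: the names are hypothetical, fixed-width types stand in for ULONG and PVOID on a 64-bit build, the user address limit is an assumed x64 value, and the alignment check and page touching done by the real ProbeForWrite are left out.

```c
#include <assert.h>
#include <stdint.h>

/* The well known definition, with fixed-width fields for a 64-bit build. */
struct session_process_info64 {
    uint32_t session_id;    /* ULONG SessionId */
    uint32_t size_of_buf;   /* ULONG SizeOfBuf */
    uint64_t buffer;        /* PVOID Buffer    */
};

/* Assumed upper limit of user mode addresses on x64. */
#define USER_PROBE_ADDRESS 0x00007FFFFFFF0000ULL

/* Returns 1 if a probe of [address, address + length) would raise an
   access violation, 0 if it would pass the range check. */
static int probe_for_write_would_raise(uint64_t address, uint64_t length)
{
    uint64_t end;

    if (length == 0)            /* nothing to probe */
        return 0;

    end = address + length;
    return end < address || end > USER_PROBE_ADDRESS;
}
```

A buffer allocated in the system space, for example at 0xFFFFF80000001000, fails the check, while a user mode address such as 0x401000 passes.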

Below is a call stack when ProbeForWrite is called

nt!ProbeForWrite
nt!ExpQuerySystemInformation
nt!NtQuerySystemInformation
nt!KiSystemServiceCopyEnd
nt!KiServiceLinkage
<a call to ZwQuerySystemInformation from a kernel mode driver>

Wednesday, April 13, 2016

Linux kernel stacks. Page fault processing.

A call to filemap_fault to process a page fault in user space. The function is registered as

const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
.page_mkwrite = filemap_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};

The related kernel source is Quark X1000 Linux kernel 
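Registering a handler in an operations structure means the generic fault code reaches it through a function pointer. The following is a reduced user mode model of that dispatch with hypothetical structures; only the two-argument shape of the fault callback is taken from the stack below.

```c
#include <assert.h>
#include <stddef.h>

struct vma;

/* Reduced stand-in for the fault description passed to the handler. */
struct fault_info {
    unsigned long pgoff;    /* page offset within the file */
};

/* Reduced stand-in for vm_operations_struct. */
struct vm_ops {
    int (*fault)(struct vma *vma, struct fault_info *vmf);
};

/* Reduced stand-in for a memory region descriptor. */
struct vma {
    const struct vm_ops *ops;
};

/* Stand-in handler: just reports which page offset it was asked for. */
static int file_fault(struct vma *vma, struct fault_info *vmf)
{
    (void)vma;
    return (int)vmf->pgoff;
}

static const struct vm_ops file_vm_ops = {
    .fault = file_fault,
};

/* Generic code does not know the handler; it calls through the table. */
static int dispatch_fault(struct vma *vma, struct fault_info *vmf)
{
    assert(vma != NULL && vmf != NULL);
    if (vma->ops == NULL || vma->ops->fault == NULL)
        return -1;          /* no handler registered for this region */
    return vma->ops->fault(vma, vmf);
}
```

A region whose ops point to file_vm_ops ends up in file_fault, which corresponds to frame #1 calling frame #0 in the stack below.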



#0  filemap_fault (vma=0xcc4dfb40, vmf=0xc006fe88) at mm/filemap.c:1591
#1  __do_fault (mm=mm@entry=0xcc4a0820, vma=vma@entry=0xcc4dfb40, address=address@entry=134768518, pmd=pmd@entry=0xcc4c6200, pgoff=63, flags=flags@entry=40, orig_pte=...) at mm/memory.c:3250
#2  do_linear_fault (page_table=0xcc4d2430, orig_pte=..., flags=40, pmd=0xcc4c6200, address=134768518, vma=0xcc4dfb40, mm=0xcc4a0820) at mm/memory.c:3404
#3  handle_pte_fault (mm=mm@entry=0xcc4a0820, vma=vma@entry=0xcc4dfb40, address=address@entry=134768518, pte=0xcc4d2430, pmd=pmd@entry=0xcc4c6200, flags=flags@entry=40) at mm/memory.c:3628
#4  handle_mm_fault (mm=mm@entry=0xcc4a0820, vma=vma@entry=0xcc4dfb40, address=address@entry=134768518, flags=flags@entry=40) at mm/memory.c:3768
#5  __do_page_fault (regs=0xc006ffb4, error_code=4294901780) at arch/x86/mm/fault.c:1192
#6  do_page_fault (regs=<optimised out>, error_code=<optimised out>) at arch/x86/mm/fault.c:1232
#7  <signal handler called>

Linux kernel stacks. Executable file mapping.

A call to generic_file_mmap registered as mmap file operation for ext3, i.e. the mmap member of file_operations structure and called by file->f_op->mmap . The related kernel source is Quark X1000 Linux kernel .



#0  generic_file_mmap (file=0xcea60300, vma=0xcc4df0c0) at mm/filemap.c:1746
#1  mmap_region (file=file@entry=0xcea60300, addr=1339731968, len=len@entry=8192, flags=flags@entry=2066, vm_flags=1050739, vm_flags@entry=2163, pgoff=pgoff@entry=32) at mm/mmap.c:1483
#2  do_mmap_pgoff (file=file@entry=0xcea60300, addr=<optimised out>, addr@entry=1339731968, len=len@entry=8192, prot=prot@entry=3, flags=flags@entry=2066, pgoff=pgoff@entry=32) at mm/mmap.c:1282
#3  vm_mmap_pgoff (file=file@entry=0xcea60300, addr=addr@entry=1339731968, len=len@entry=8192, prot=prot@entry=3, flag=flag@entry=2066, pgoff=pgoff@entry=32) at mm/util.c:362
#4  vm_mmap (file=file@entry=0xcea60300, addr=addr@entry=1339731968, len=len@entry=8192, prot=3, flag=2066, offset=offset@entry=131072) at mm/util.c:377
#5  elf_map (total_size=0, type=<optimised out>, prot=<optimised out>, eppnt=0xcebba1a0, addr=1339731968, filep=0xcea60300) at fs/binfmt_elf.c:353
#6  load_elf_interp (no_base=0, interp_map_addr=<synthetic pointer>, interpreter=0xcea60300, interp_elf_ex=0xcea604b4) at fs/binfmt_elf.c:457
#7  load_elf_binary (bprm=bprm@entry=0xce165500) at fs/binfmt_elf.c:894
#8  search_binary_handler (bprm=bprm@entry=0xce165500) at fs/exec.c:1401
#9  do_execve_common (filename=<optimised out>, argv=..., argv@entry=..., envp=..., envp@entry=...) at fs/exec.c:1539
#10 do_execve (__envp=0x9d7c608, __argv=0x9d81668, filename=<optimised out>) at fs/exec.c:1585
#11 sys_execve (filename=0x9d816c8 "\320\061\025\001", argv=0x9d81668, envp=0x9d7c608) at fs/exec.c:1681
#12 <signal handler called>

Linux kernel stacks. Lookup operation.

Some stacks for ext3 lookup operation. The related kernel source is Quark X1000 Linux kernel

Stack 1:


#0  ext3_lookup (dir=0xcd5df248, dentry=0xcb81df80, flags=1) at fs/ext3/namei.c:1019
#1  lookup_real (dir=dir@entry=0xcd5df248, dentry=0xcb81df80, flags=flags@entry=1) at fs/namei.c:1317
#2  __lookup_hash (name=name@entry=0xcc4e3ebc, base=base@entry=0xcd5dd400, flags=1) at fs/namei.c:1335
#3  lookup_slow (nd=nd@entry=0xcc4e3eb4, name=name@entry=0xcc4e3ebc, path=path@entry=0xcc4e3e74) at fs/namei.c:1447
#4  walk_component (follow=1, type=0, name=0xcc4e3ebc, path=0xcc4e3e74, nd=0xcc4e3eb4) at fs/namei.c:1536
#5  lookup_last (path=0xcc4e3e74, nd=0xcc4e3eb4) at fs/namei.c:1933
#6  path_lookupat (dfd=dfd@entry=-100, name=<optimised out>, flags=flags@entry=65, nd=nd@entry=0xcc4e3eb4) at fs/namei.c:1968
#7  filename_lookup (dfd=dfd@entry=-100, flags=flags@entry=1, nd=nd@entry=0xcc4e3eb4, name=0xcc4c6000) at fs/namei.c:2007
#8  user_path_at_empty (dfd=dfd@entry=-100, name=name@entry=0x9751188 "/work", flags=flags@entry=1, path=path@entry=0xcc4e3f40, empty=empty@entry=0x0) at fs/namei.c:2155
#9  user_path_at (dfd=dfd@entry=-100, name=name@entry=0x9751188 "/work", flags=flags@entry=1, path=path@entry=0xcc4e3f40) at fs/namei.c:2166
#10 vfs_fstatat (dfd=dfd@entry=-100, filename=filename@entry=0x9751188 "/work", stat=stat@entry=0xcc4e3f60, flag=flag@entry=0) at fs/stat.c:88
#11 vfs_stat (stat=0xcc4e3f60, name=0x9751188 "/work") at fs/stat.c:384
#12 sys_stat64 (filename=0x9751188 "/work", statbuf=0xbf9b40f0) at fs/stat.c:386
#13 <signal handler called>

Stack 2:



#0  ext3_lookup (dir=0xcd61ee50, dentry=0xcb83d380, flags=257) at fs/ext3/namei.c:1019
#1  lookup_real (dir=dir@entry=0xcd61ee50, dentry=dentry@entry=0xcb83d380, flags=<optimised out>) at fs/namei.c:1317
#2  lookup_open (opened=0xcc4c9eac, got_write=false, op=0xc135aa6c <open_exec_flags>, file=0xce26f500, path=0xcc4c9eb0, nd=0xcc4c9edc) at fs/namei.c:2641
#3  do_last (nd=nd@entry=0xcc4c9edc, path=path@entry=0xcc4c9eb0, file=file@entry=0xce26f500, op=op@entry=0xc135aa6c <open_exec_flags>, 
    opened=opened@entry=0xcc4c9eac, name=0xcc4c9f50) at fs/namei.c:2771
#4  path_openat (dfd=dfd@entry=-100, nd=nd@entry=0xcc4c9edc, op=0xc135aa6c <open_exec_flags>, flags=flags@entry=65, pathname=0xcc4c9f50) at fs/namei.c:2956
#5  do_filp_open (dfd=dfd@entry=-100, pathname=pathname@entry=0xcc4c9f50, op=op@entry=0xc135aa6c <open_exec_flags>, flags=flags@entry=1) at fs/namei.c:3004
#6  open_exec (name=name@entry=0xcc4b7010 "/bin/login") at fs/exec.c:762
#7  do_execve_common (filename=<optimised out>, argv=..., argv@entry=..., envp=..., envp@entry=...) at fs/exec.c:1499
#8  do_execve (__envp=0xbfb7cba0, __argv=0xbfb7a758, filename=<optimised out>) at fs/exec.c:1585
#9  sys_execve (filename=0x804d172 "/bin/login", argv=0xbfb7a758, envp=0xbfb7cba0) at fs/exec.c:1681
#10 <signal handler called>

Stack 3:


#0  ext3_lookup (dir=0xcb83fae0, dentry=0xcb844480, flags=257) at fs/ext3/namei.c:1019
#1  lookup_real (dir=dir@entry=0xcb83fae0, dentry=dentry@entry=0xcb844480, flags=<optimised out>) at fs/namei.c:1317
#2  0xc10e4e31 in lookup_open (opened=0xce0f5ec8, got_write=false, op=0xce0f5f78, file=0xce87a980, path=0xce0f5ecc, nd=0xce0f5ef8) at fs/namei.c:2641
#3  do_last (nd=nd@entry=0xce0f5ef8, path=path@entry=0xce0f5ecc, file=file@entry=0xce87a980, op=op@entry=0xce0f5f78, opened=opened@entry=0xce0f5ec8, name=0xcc4b7000) at fs/namei.c:2771
#4  path_openat (dfd=dfd@entry=-100, nd=nd@entry=0xce0f5ef8, op=0xce0f5f78, flags=flags@entry=65, pathname=0xcc4b7000) at fs/namei.c:2956
#5  do_filp_open (dfd=dfd@entry=-100, pathname=pathname@entry=0xcc4b7000, op=op@entry=0xce0f5f78, flags=flags@entry=1) at fs/namei.c:3004
#6  do_sys_open (dfd=dfd@entry=-100, filename=filename@entry=0xbf877ef0 "/work/1", flags=flags@entry=32768, mode=mode@entry=0) at fs/open.c:956
#7  0xc10d900b in sys_open (filename=0xbf877ef0 "/work/1", flags=32768, mode=0) at fs/open.c:977
#8  <signal handler called>


Monday, April 11, 2016

Linux kernel stacks. Creating an inode.

A call stack captured when the ext3 file system driver creates an inode. The related kernel source is the Quark X1000 Linux kernel.




#0  ext3_new_inode (handle=handle@entry=0xcd5ea000, dir=dir@entry=0xcb83f928, qstr=qstr@entry=0xcb81df94, mode=mode@entry=33206) at fs/ext3/ialloc.c:348
#1  ext3_create (dir=0xcb83f928, dentry=0xcb81df80, mode=33206, excl=false) at fs/ext3/namei.c:1715
#2  vfs_create (dir=0xcb83f928, dentry=0xcb81df80, mode=<optimised out>, want_excl=false) at fs/namei.c:2338
#3  lookup_open (opened=0xce265ec8, got_write=true, op=0xce265f78, file=0xce182400, path=0xce265ecc, nd=0xce265ef8) at fs/namei.c:2666
#4  do_last (nd=nd@entry=0xce265ef8, path=path@entry=0xce265ecc, file=file@entry=0xce182400, op=op@entry=0xce265f78, opened=opened@entry=0xce265ec8, name=0xcc4b2000) at fs/namei.c:2771
#5  path_openat (dfd=dfd@entry=-100, nd=nd@entry=0xce265ef8, op=0xce265f78, flags=flags@entry=65, pathname=0xcc4b2000) at fs/namei.c:2956
#6  do_filp_open (dfd=dfd@entry=-100, pathname=pathname@entry=0xcc4b2000, op=op@entry=0xce265f78, flags=flags@entry=1) at fs/namei.c:3004
#7  do_sys_open (dfd=dfd@entry=-100, filename=filename@entry=0x83b3568 "/work/1", flags=flags@entry=33345, mode=mode@entry=438) at fs/open.c:956
#8  sys_open (filename=0x83b3568 "/work/1", flags=33345, mode=438) at fs/open.c:977
#9  <signal handler called>

Linux kernel stacks. x86 registers.

The contents of the IA-32 registers when running Linux on Intel Quark X1000. Note that CR3 holds the physical address of the PD.

(0) eax (/32): 0x00000000
(1) ecx (/32): 0x00000000
(2) edx (/32): 0xC14B4000
(3) ebx (/32): 0xC14B4000
(4) esp (/32): 0xC14B5F94
(5) ebp (/32): 0xC14B5FA0
(6) esi (/32): 0x00096800
(7) edi (/32): 0xC14B7800
(8) eip (/32): 0xC10093C5
(9) eflags (/32): 0x00000246
(10) cs (/32): 0x00000060
(11) ss (/32): 0x00000068
(12) ds (/32): 0x0000007B
(13) es (/32): 0x0000007B
(14) fs (/32): 0x00000000
(15) gs (/32): 0x00000000
(16) st0 (/32)
(17) st1 (/32)
(18) st2 (/32)
(19) st3 (/32)
(20) st4 (/32)
(21) st5 (/32)
(22) st6 (/32)
(23) st7 (/32)
(24) fctrl (/32)
(25) fstat (/32)
(26) ftag (/32)
(27) fiseg (/32)
(28) fioff (/32)
(29) foseg (/32)
(30) fooff (/32)
(31) fop (/32)
(32) cr0 (/32): 0x8005003B
(33) cr2 (/32): 0xB75E1000
(34) cr3 (/32): 0x0C4F9000
(35) cr4 (/32): 0x00100030
(36) dr0 (/32): 0x00000000
(37) dr1 (/32): 0x00000000
(38) dr2 (/32): 0x00000000
(39) dr3 (/32): 0x00000000
(40) dr6 (/32): 0xFFFF0FF0
(41) dr7 (/32): 0x00000000
(42) idtbase (/32): 0xFFFBA000
(43) idtlimit (/32): 0x000007FF
(44) idtar (/32): 0xFFFFFFFF
(45) gdtbase (/32): 0xC14B8000
(46) gdtlimit (/32): 0x000000FF
(47) gdtar (/32): 0xFFFFFFFF
(48) tr (/32): 0x00000080
(49) ldtr (/32): 0x00000000
(50) ldbase (/32): 0x00000000
(51) ldlimit (/32): 0x0000FFFF
(52) ldtar (/32): 0xFFFF7FFF
(53) csbase (/32): 0x00000000
(54) cslimit (/32): 0xFFFFFFFF
(55) csar (/32): 0xFFFF9BFF
(56) dsbase (/32): 0x00000000
(57) dslimit (/32): 0xFFFFFFFF
(58) dsar (/32): 0xFFFFF3FF
(59) esbase (/32): 0x00000000
(60) eslimit (/32): 0xFFFFFFFF
(61) esar (/32): 0xFFFFF3FF
(62) fsbase (/32): 0x00000000
(63) fslimit (/32): 0xFFFFFFFF
(64) fsar (/32): 0xFF3F11FF
(65) gsbase (/32): 0xB7741700
(66) gslimit (/32): 0xFFFFFFFF
(67) gsar (/32): 0xFF3F11FF
(68) ssbase (/32): 0x00000000
(69) sslimit (/32): 0xFFFFFFFF
(70) ssar (/32): 0xFFFF93FF
(71) tssbase (/32): 0xC14C17C0
(72) tsslimit (/32): 0x0000206B
(73) tssar (/32): 0xFFFFFFFF
(74) pmcr (/32): 0x00000000
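The control register and selector values in the dump can be decoded by hand. Below is a minimal sketch that does so; the bit names follow the Intel SDM and are not part of the original dump, and the helper names are made up for illustration. If I read the dump correctly, CR4 has the PAE bit set, in which case the architecture defines CR3 as the base of the page-directory-pointer table rather than of a page directory itself.

```python
# Bit positions as defined by the Intel SDM (only the commonly used flags).
CR0_FLAGS = {0: "PE", 1: "MP", 2: "EM", 3: "TS", 4: "ET", 5: "NE",
             16: "WP", 18: "AM", 29: "NW", 30: "CD", 31: "PG"}
CR4_FLAGS = {0: "VME", 1: "PVI", 2: "TSD", 3: "DE", 4: "PSE", 5: "PAE",
             6: "MCE", 7: "PGE", 8: "PCE", 9: "OSFXSR", 10: "OSXMMEXCPT",
             20: "SMEP"}


def decode_flags(value, table):
    """Return the names of all flags from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if value & (1 << bit)]


def decode_selector(selector):
    """Split a segment selector into descriptor index, table and RPL."""
    return {
        "index": selector >> 3,                      # descriptor index
        "table": "LDT" if selector & 4 else "GDT",   # TI bit
        "rpl": selector & 3,                         # requested privilege level
    }


if __name__ == "__main__":
    # Values copied from the register dump above.
    print("cr0:", decode_flags(0x8005003B, CR0_FLAGS))
    print("cr4:", decode_flags(0x00100030, CR4_FLAGS))
    print("cr3 is page aligned:", 0x0C4F9000 & 0xFFF == 0)
    print("cs :", decode_selector(0x60))
    print("ss :", decode_selector(0x68))
    print("ds :", decode_selector(0x7B))
```

PG and PE set in CR0 confirm that the CPU runs in protected mode with paging enabled, and the RPL of 0 in CS and SS shows the snapshot was taken in kernel mode.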

Sunday, April 10, 2016

Debugging Linux kernel on Intel Quark X1000.

For source-level debugging I use an OLIMEX JTAG adapter. This is a quick guide to configuring it on a Linux machine. A more detailed description can be found in the Intel document "Source Level Debug using OpenOCD/GDB/Eclipse on Intel® Quark™ SoC X1000".

Compile and install OpenOCD with Quark support.

$ git clone git://git.code.sf.net/p/openocd/code openocd-code
$ cd openocd-code
$ git branch quark v0.8.0
$ git checkout quark
$ ./bootstrap
$ ./configure --enable-ftdi
$ make
$ sudo make install

By default OpenOCD is installed in the following folders

/usr/local/bin/openocd
/usr/local/share/openocd/scripts

Connect an OLIMEX JTAG to a board and your machine

Intel Galileo with OLIMEX JTAG

Power on the board and start an OpenOCD session for Quark X1000. This starts a GDB server on port 3333.

$ openocd -f ./interface/ftdi/olimex-arm-usb-ocd-h.cfg -f target/quark_x10xx.cfg

If OpenOCD managed to locate a JTAG and connect to a board the output will be


Start GDB and attach to a GDB server started by OpenOCD. Provide GDB with a path to a kernel compiled with debug symbols.

$ gdb
(gdb) target remote localhost:3333
(gdb) monitor halt
(gdb) symbol-file <PATH TO LINUX SOURCE>/vmlinux
(gdb) c

Ctrl-C will break kernel execution into the debugger.
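The interactive GDB commands above can also be passed on the command line. The sketch below only assembles such a command line; it relies on GDB's `-ex` option, which executes one command per occurrence. The function name and the vmlinux path in the usage example are hypothetical.

```python
def gdb_attach_command(vmlinux, host="localhost", port=3333):
    """Build a gdb invocation that repeats the interactive steps above."""
    return [
        "gdb",
        "-ex", f"target remote {host}:{port}",   # attach to the OpenOCD GDB server
        "-ex", "monitor halt",                   # ask OpenOCD to halt the CPU
        "-ex", f"symbol-file {vmlinux}",         # load the kernel debug symbols
    ]


if __name__ == "__main__":
    # Hypothetical path; replace it with the location of your kernel build.
    print(" ".join(gdb_attach_command("/work/linux_v3.8.7/work/vmlinux")))
```

The final `c` command is left out on purpose, so GDB stays at the prompt with the kernel halted and breakpoints can be set before continuing.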

Linux kernel stacks. CFS scheduler dequeues a task.


A set of call stacks captured when the scheduler switches tasks. The related kernel source is the Quark X1000 Linux kernel.

Stack 1:

#0  __dequeue_entity (se=0xce2edfac, cfs_rq=0xc14d4700 <runqueues+64>) at kernel/sched/fair.c:531
#1  set_next_entity (cfs_rq=cfs_rq@entry=0xc14d4700 <runqueues+64>, se=se@entry=0xce2edfac) at kernel/sched/fair.c:1868
#2  pick_next_task_fair (rq=0xc14d46c0 <runqueues>) at kernel/sched/fair.c:3608
#3  pick_next_task (rq=0xc14d46c0 <runqueues>) at kernel/sched/core.c:2832
#4  __schedule () at kernel/sched/core.c:2934
#5  schedule () at kernel/sched/core.c:2979
#6  cpu_idle () at arch/x86/kernel/process.c:367
#7  rest_init () at init/main.c:385
#8  start_kernel () at init/main.c:643
#9  i386_start_kernel () at arch/x86/kernel/head32.c:66


Stack 2:

#0  __dequeue_entity (se=0xc00623fc, cfs_rq=0xc14d4700 <runqueues+64>) at kernel/sched/fair.c:531
#1  set_next_entity (cfs_rq=cfs_rq@entry=0xc14d4700 <runqueues+64>, se=se@entry=0xc00623fc) at kernel/sched/fair.c:1868
#2  pick_next_task_fair (rq=0xc14d46c0 <runqueues>) at kernel/sched/fair.c:3608
#3  pick_next_task (rq=0xc14d46c0 <runqueues>) at kernel/sched/core.c:2832
#4  __schedule () at kernel/sched/core.c:2934
#5  schedule () at kernel/sched/core.c:2979
#6  schedule_hrtimeout_range_clock (expires=expires@entry=0xce30ff7c, delta=delta@entry=999997, 
    mode=mode@entry=HRTIMER_MODE_ABS, clock=clock@entry=1) at kernel/hrtimer.c:1809
#7  schedule_hrtimeout_range (expires=expires@entry=0xce30ff7c, delta=delta@entry=999997, mode=mode@entry=HRTIMER_MODE_ABS)
    at kernel/hrtimer.c:1850
#8  ep_poll (timeout=1000, maxevents=<optimised out>, events=0x97062b8, ep=<optimised out>) at fs/eventpoll.c:1546
#9  sys_epoll_wait (epfd=6, events=0x97062b8, maxevents=1025, timeout=1000) at fs/eventpoll.c:1892
#10 <signal handler called>

Stack 3:

#0  __dequeue_entity (se=0xce288c1c, cfs_rq=0xc14d4700 <runqueues+64>) at kernel/sched/fair.c:531
#1  set_next_entity (cfs_rq=cfs_rq@entry=0xc14d4700 <runqueues+64>, se=se@entry=0xce288c1c) at kernel/sched/fair.c:1868
#2  pick_next_task_fair (rq=0xc14d46c0 <runqueues>) at kernel/sched/fair.c:3608
#3  pick_next_task (rq=0xc14d46c0 <runqueues>) at kernel/sched/core.c:2832
#4  __schedule () at kernel/sched/core.c:2934
#5  schedule () at kernel/sched/core.c:2979
#6  worker_thread (__worker=0xcc4f9c40) at kernel/workqueue.c:2407
#7  kthread (_create=0xcea53ec4) at kernel/kthread.c:168
#8  ret_from_kernel_thread () at arch/x86/kernel/entry_32.S:311
#9  ?? () at kernel/kthread.c:420

Stack 4:

#0  __dequeue_entity (se=0xce9f3bdc, cfs_rq=0xc14d4700 <runqueues+64>) at kernel/sched/fair.c:531
#1  set_next_entity (cfs_rq=cfs_rq@entry=0xc14d4700 <runqueues+64>, se=se@entry=0xce9f3bdc) at kernel/sched/fair.c:1868
#2  pick_next_task_fair (rq=0xc14d46c0 <runqueues>) at kernel/sched/fair.c:3608
#3  pick_next_task (rq=0xc14d46c0 <runqueues>) at kernel/sched/core.c:2832
#4  __schedule () at kernel/sched/core.c:2934
#5  schedule () at kernel/sched/core.c:2979
#6  futex_wait_queue_me (hb=<optimised out>, q=q@entry=0xce183e6c, timeout=timeout@entry=0xce183ea4) at kernel/futex.c:1808
#7  futex_wait (uaddr=uaddr@entry=0x807f524, flags=flags@entry=2, val=val@entry=4733, abs_time=abs_time@entry=0xce183f98, 
    bitset=4294967295) at kernel/futex.c:1923
#8  do_futex (uaddr=uaddr@entry=0x807f524, op=op@entry=393, val=val@entry=4733, timeout=0xce183f98, 
    uaddr2=uaddr2@entry=0x807f550, val2=0, val3=<optimised out>, val3@entry=4294967295) at kernel/futex.c:2669
#9  sys_futex (uaddr=0x807f524, op=393, val=4733, utime=0x8081d5c, uaddr2=0x807f550, val3=4294967295) at kernel/futex.c:2727
#10 <signal handler called>
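All four stacks share the same innermost frames and differ only in how the thread reached schedule(). A minimal sketch for checking this mechanically is shown below; it assumes backtraces in the GDB format used above, and the function names are my own.

```python
import re

# Matches "#3  func (" as well as "#3  0xc10e4e31 in func (".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-f]+ in )?(\S+) \(")


def frame_functions(backtrace):
    """Return function names from a GDB backtrace, innermost frame first.

    Continuation lines and "<signal handler called>" markers are skipped
    because they do not match the frame pattern.
    """
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def common_inner_frames(stacks):
    """Return the leading frames that all the given stacks share."""
    shared = []
    for names in zip(*stacks):
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    return shared
```

Feeding the four backtraces above through `frame_functions` and then `common_inner_frames` should yield the chain from `__dequeue_entity` up to `schedule`, which is the scheduler path every blocking caller goes through.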



Saturday, March 26, 2016

Building Linux kernel for Intel Galileo Quark X1000

The Intel Galileo board features the Intel Quark X1000 SoC, which has a single-core, single-threaded x86 CPU similar to a Pentium.

Intel Galileo with OLIMEX JTAG and FTDI USB-to-Serial


I will show how to build version 3.8.7 of the Linux kernel for the Linux image iot-devkit-201510010757-mmcblkp0-galileo.direct.xz, which can be downloaded from Intel® Galileo Board Downloads under the caption Intel® Galileo Board microSD Card Linux* Operating System Image. At the time of writing this was the latest stable OS image build for the board.

Below are the steps to build a kernel and modules

1. Get the BSPv1.1.0 package from the Intel® Quark™ BSP Release Archive or from my GitHub (BSPv1.1.0), where it is stored for convenience. You need v1.1.0; the file name is Board_Support_Package_Sources_for_Intel_Quark_v1.1.0.7z . The newer BSP versions contain bugs, and I believe Intel never used them to build a kernel. For example, the patch list in BSPv1.2.1.7 doesn't match the kernel version 3.14 specified in the upstream.cfg file, and the gitsetup.py script is missing. Although the script can be borrowed from an older BSP package, this shows that BSPv1.2.1.7 was released without any sanity check.

2. Extract BSPv1.1.0 . Below BSP_DIR must be replaced with a full path to the directory where BSP package was extracted, for example /work/IntelGalileo/BSPv1.1.0/ .

3. Build the tools for cross compilation. Replace BSP_DIR with the path to the extracted BSP package. This could take a couple of hours, depending on the Internet connection and CPU power.

   $ cd BSP_DIR
   $ tar zxf meta-clanton_v1.1.0-dirty.tar.gz
   $ cd meta-clanton_v1.1.0-dirty
   $ ./setup.sh -e meta-clanton-bsp
   $ source iot-devkit-init-build-env build
   $ bitbake image-full

4. Build the kernel. To build the kernel you need a .config file. The Intel documents direct you to use meta/cfg/kernel-cache/bsp/quark/quark.cfg from the kernel sources downloaded and patched by running gitsetup.py, but a kernel built with this config file doesn't contain the modules for the memory card interface, so the kernel is unable to mount the file system from the card. A better way is to borrow the .config file from a running system created from the iot-devkit-201510010757-mmcblkp0-galileo.direct.xz image by executing
   $ zcat /proc/config.gz > .config
and copying the file to the build machine. Alternatively, the file can be downloaded from my GitHub repository (.config).

Extract the kernel package. It contains a script to obtain the source code and patches.

   $ cd BSP_DIR
   $ tar zxf quark_linux_v3.8.7+v1.1.0.tar.gz
   $ mv quark_linux_v3.8.7+v1.1.0 linux_v3.8.7
   $ cd linux_v3.8.7

The following two commands can be skipped if you have already set a git user name and email. If they are not defined, gitsetup.py fails with errors. The user name and email do not have to be genuine.

   $ git config --global user.name  "user"
   $ git config --global user.email "user@hotmail.com"

Execute gitsetup.py, this will clone and patch the required kernel version from the kernel repository.

   $ ./gitsetup.py

Change directory to ./work, which contains the source code, and copy the config file there.

   $ cd work
   $ cp <A .config file obtained as described above> .config

Set the path to the cross-compiler. Replace BSP_DIR with the path to the extracted BSP package.

   $ export PATH=BSP_DIR/meta-clanton_v1.1.0-dirty/build/tmp/sysroots/x86_64-linux/usr/bin/i586-poky-linux:$PATH

The following command will build a kernel and modules.

   $ ARCH=i386 CROSS_COMPILE=i586-poky-linux- make -j4
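The last two steps can be scripted so that BSP_DIR is substituted in one place. This is a minimal sketch under the assumption that the toolchain was built as described in step 3; the function names are hypothetical, and the paths and variables are the ones used in the commands above.

```python
import os

# Toolchain location relative to BSP_DIR, taken from the export command above.
TOOLCHAIN_SUBDIR = ("meta-clanton_v1.1.0-dirty/build/tmp/sysroots/"
                    "x86_64-linux/usr/bin/i586-poky-linux")


def kernel_build_env(bsp_dir, base_env=None):
    """Return an environment with the cross-compiler first in PATH."""
    env = dict(os.environ if base_env is None else base_env)
    toolchain = os.path.join(bsp_dir, TOOLCHAIN_SUBDIR)
    env["PATH"] = toolchain + os.pathsep + env.get("PATH", "")
    env["ARCH"] = "i386"
    env["CROSS_COMPILE"] = "i586-poky-linux-"
    return env


def kernel_build_command(jobs=4):
    """Return the make invocation used to build the kernel and modules."""
    return ["make", f"-j{jobs}"]
```

A build could then be started with `subprocess.run(kernel_build_command(), cwd=work_dir, env=kernel_build_env(bsp_dir), check=True)`, where `work_dir` is the ./work directory created by gitsetup.py and `bsp_dir` is the extracted BSP package.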

Thursday, March 24, 2016

What happens with outstanding IRPs when a process terminates.

When the Windows kernel terminates a process, it inserts an APC into each thread. This APC calls PspExitThread, which calls IoCancelThreadIo to cancel, where possible, all outstanding IRPs by calling IoCancelIrp, and then waits for the IRPs to be cancelled or completed. A thread waits only for IRPs that have been associated with it by a call to IopQueueThreadIrp, which adds the IRP to the list of IRPs associated with the thread. The list head is the IrpList field of the ETHREAD structure.

The thread stays blocked until either of two conditions is met:
 - IrpList becomes empty, which means all outstanding IRPs were completed normally or cancelled
 - a 5-minute timeout expires, in which case IopDisassociateThreadIrp is called to disassociate the IRPs by removing them from IrpList and setting IRP->Tail.Overlay.Thread to NULL
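The wait logic described above can be modelled in a few lines. This is a simplified user-mode model, not the kernel code: the class and function names are invented, time is simulated rather than measured, and the 10 ms polling step is an assumption (the call stack only shows that the thread sleeps in KeDelayExecutionThread between checks).

```python
from dataclasses import dataclass, field

TIMEOUT_MS = 5 * 60 * 1000   # the five-minute limit described above
POLL_STEP_MS = 10            # assumed polling interval, not taken from the kernel


@dataclass
class Irp:
    name: str
    thread: object = None          # stands in for IRP->Tail.Overlay.Thread
    cancel_requested: bool = False


@dataclass
class Thread:
    irp_list: list = field(default_factory=list)   # stands in for ETHREAD.IrpList

    def queue_irp(self, irp):
        """Model of IopQueueThreadIrp: associate the IRP with this thread."""
        irp.thread = self
        self.irp_list.append(irp)

    def complete_irp(self, irp):
        """A driver completed or cancelled the IRP, so it leaves the list."""
        self.irp_list.remove(irp)


def cancel_thread_io(thread, drivers_run):
    """Model of the wait in IoCancelThreadIo; returns the disassociated IRPs.

    `drivers_run(elapsed_ms)` is called once per polling step and stands in
    for the drivers completing IRPs while the exiting thread sleeps.
    """
    for irp in thread.irp_list:
        irp.cancel_requested = True            # models IoCancelIrp
    elapsed_ms = 0
    while thread.irp_list and elapsed_ms < TIMEOUT_MS:
        drivers_run(elapsed_ms)                # models KeDelayExecutionThread
        elapsed_ms += POLL_STEP_MS
    orphaned = list(thread.irp_list)
    for irp in orphaned:                       # models IopDisassociateThreadIrp
        irp.thread = None
    thread.irp_list.clear()
    return orphaned
```

In this model a driver that never completes its IRP delays thread termination for the full timeout, after which the IRP is left without an owning thread.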

Below is a call stack for a terminating thread with four outstanding IRPs (the entries under IRP List).

        THREAD 870788e8  Cid 1100.0f5c  Teb: 7ffab000 Win32Thread: 00000000 WAIT: (DelayExecution) KernelMode Non-Alertable
            86d29460  SynchronizationEvent
        IRP List:
            87305dc8: (0006,0100) Flags: 00060a00  Mdl: 00000000
            86ecd6f8: (0006,0100) Flags: 00060a00  Mdl: 00000000
            862fd7b8: (0006,0100) Flags: 00060a00  Mdl: 00000000
            87221d80: (0006,0100) Flags: 00060a00  Mdl: 00000000
        Not impersonating
        DeviceMap                 975b0820
        Owning Process            8514a030       Image:         explorer.exe
        Attached Process          N/A            Image:         N/A
        Wait Start TickCount      22339          Ticks: 1 (0:00:00:00.015)
        Context Switch Count      2734           IdealProcessor: 1          
        UserTime                  00:00:00.000
        KernelTime                00:00:00.031
        Win32 Start Address 0x769842ed
        Stack Init a86bfed0 Current a86bfa38 Base a86c0000 Limit a86bd000 Call 0
        Priority 10 BasePriority 8 UnusualBoost 0 ForegroundBoost 2 IoPriority 2 PagePriority 2
        ChildEBP RetAddr
        a86bfa50 82ad269d nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
        a86bfa88 82ad14f7 nt!KiSwapThread+0x266
        a86bfab0 82ad11d5 nt!KiCommitThreadWait+0x1df
        a86bfb0c 82cb9171 nt!KeDelayExecutionThread+0x2aa
        a86bfb40 82cbe519 nt!IoCancelThreadIo+0x70
        a86bfbb4 82cd2051 nt!PspExitThread+0x48e
        a86bfbcc 82b058c0 nt!PsExitSpecialApc+0x22
        a86bfc1c 82a922a4 nt!KiDeliverApc+0x28b
        a86bfc1c 778770b4 nt!KiServiceExit+0x64 (FPO: [0,3] TrapFrame @ a86bfc34)
        074fe3e4 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

The main process thread waits for the child threads to terminate.

        THREAD 86e307f0  Cid 1100.1104  Teb: 7ffdf000 Win32Thread: fe9c8a88 WAIT: (Executive) KernelMode Non-Alertable
            870788e8  Thread
        Not impersonating
        DeviceMap                 975b0820
        Owning Process            8514a030       Image:         explorer.exe
        Attached Process          N/A            Image:         N/A
        Wait Start TickCount      21665          Ticks: 675 (0:00:00:10.530)
        Context Switch Count      15565          IdealProcessor: 1          
        UserTime                  00:00:00.249
        KernelTime                00:00:00.530
        Win32 Start Address 0x00e50efa
        Stack Init a87b5ed0 Current a87b5a38 Base a87b6000 Limit a87b3000 Call 4f0
        Priority 12 BasePriority 8 UnusualBoost 0 ForegroundBoost 2 IoPriority 2 PagePriority 5

        ChildEBP RetAddr
        a87b5a50 82ad269d nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
        a87b5a88 82ad14f7 nt!KiSwapThread+0x266
        a87b5ab0 82acb0cf nt!KiCommitThreadWait+0x1df
        a87b5b2c 82cbe28e nt!KeWaitForSingleObject+0x393
        a87b5bb4 82cd2051 nt!PspExitThread+0x203
        a87b5bcc 82b058c0 nt!PsExitSpecialApc+0x22
        a87b5c1c 82a922a4 nt!KiDeliverApc+0x28b
        a87b5c1c 77876fc0 nt!KiServiceExit+0x64 (FPO: [0,3] TrapFrame @ a87b5c34)
        000ffb18 00000000 ntdll!KiUserCallbackDispatcher (FPO: [0,0,0])