Monday, April 3, 2017

TLB flushing call on Windows

A call stack showing nt!MiFlushTbList invoked while the kernel frees a Driver Verifier special pool allocation (read bottom to top):
nt!KiRetireDpcList+0xd7
nt!KxRetireDpcList+0x5 (TrapFrame @ fffff800`cc332e70)
nt!KiDispatchInterruptContinue
nt!KiDpcInterrupt+0xca (TrapFrame @ ffffd000`a9b34d90)
nt!MiFlushTbList+0x20c
nt!MiDeleteSystemPagableVm+0x4d9
nt!MiPurgeSpecialPoolPaged+0x18
nt!MmFreeSpecialPool+0x3cf
nt!ExDeferredFreePool+0x677
nt!VerifierExFreePoolWithTag+0x44

Tuesday, March 7, 2017

RISC-V Linux memory regions on boot

This text is based on memory_areas_on_boot.md from my GitHub repo riscv-notes.

On boot the kernel has the following memory areas required for code execution:
  • vmlinux ELF code and data sections mapped by the bootloader
  • the page tables for virtual memory support created by the bootloader
  • initial stack
The pages used by the above regions must be marked as reserved so they are not used for memory allocations.
As shown here https://github.com/slavaim/riscv-notes/blob/master/linux/memory-initialization.md the kernel makes the following calls for memory reservation.
    memblock_reserve(info.base, __pa(_end) - info.base);
    reserve_boot_page_table(pfn_to_virt(csr_read(sptbr)));
The first call, memblock_reserve, reserves the area from &_start to &_end; this area is defined in the following linker script.
SECTIONS
{
    /* Beginning of code and text segment */
    . = LOAD_OFFSET;
    _start = .;
    __init_begin = .;
    HEAD_TEXT_SECTION
    INIT_TEXT_SECTION(PAGE_SIZE)
    INIT_DATA_SECTION(16)
    /* we have to discard exit text and such at runtime, not link time */
    .exit.text :
    {
        EXIT_TEXT
    }
    .exit.data :
    {
        EXIT_DATA
    }
    PERCPU_SECTION(L1_CACHE_BYTES)
    __init_end = .;

    .text : {
        _text = .;
        _stext = .;
        TEXT_TEXT
        SCHED_TEXT
        LOCK_TEXT
        KPROBES_TEXT
        ENTRY_TEXT
        IRQENTRY_TEXT
        *(.fixup)
        _etext = .;
    }

    /* Start of data section */
    _sdata = .;
    RO_DATA_SECTION(PAGE_SIZE)
    RW_DATA_SECTION(0x40, PAGE_SIZE, THREAD_SIZE)
    .sdata : {
        _gp = . + 0x800;
        *(.sdata*)
    }
    .srodata : {
        *(.srodata*)
    }
    /* End of data section */
    _edata = .;

    BSS_SECTION(0x20, 0, 0x20)

    EXCEPTION_TABLE(0x10)
    NOTES

    .rel.dyn : {
        *(.rel.dyn*)
    }

    _end = .;

    STABS_DEBUG
    DWARF_DEBUG

    DISCARDS
}
As you can see this area encompasses all kernel code and data excluding debug information. This area starts at ffffffff80000000. You can easily find the start and end addresses in the System.map file. For my test kernel these values are
ffffffff80000000 T _start
ffffffff803b10b4 R _end
The second call to reserve_boot_page_table reserves the initial page table pages.
Where is the stack reservation? The stack is covered by the first call to memblock_reserve because the initial stack is allocated from the kernel data section. It is statically allocated as init_thread_union.stack. The init_thread_union has the following type definition in linux/linux-4.6.2/include/linux/sched.h
union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};
For my test kernel the address of the init_thread_union is again extracted from System.map as
ffffffff8035e000 D init_thread_union
As you can see it lies in the range [&_start, &_end), in the data section.
The stack register is set on boot by the kernel entry routine _start defined in linux/linux-4.6.2/arch/riscv/kernel/head.S
__INIT
ENTRY(_start)
...
    /* Initialize stack pointer */
    la sp, init_thread_union + THREAD_SIZE
    /* Initialize current task_struct pointer */
    la tp, init_task
 ...
 END(_start)

RISC-V Linux kernel memory initialization on boot

This text is based on memory-initialization.md from my GitHub repo riscv-notes.
The kernel is started with virtual memory already initialized by the machine-level bootloader BBL. A more detailed description can be found in this document: supervisor_vm_init.md.
The kernel start offset is defined in linux/linux-4.6.2/arch/riscv/include/asm/page.h
/*
 * PAGE_OFFSET -- the first address of the first page of memory.
 * When not using MMU this corresponds to the first free page in
 * physical memory (aligned on a page boundary).
 */
#ifdef CONFIG_64BIT
#define PAGE_OFFSET     _AC(0xffffffff80000000,UL)
#else
#define PAGE_OFFSET     _AC(0xc0000000,UL)
#endif
BBL initializes virtual memory for supervisor mode, maps the Linux kernel at PAGE_OFFSET, sets the sptbr register to the root page table physical address, and switches to supervisor mode with pc set to the entry point _start. BBL does this in the enter_supervisor_mode function defined in riscv-tools/riscv-pk/machine/minit.c
void enter_supervisor_mode(void (*fn)(uintptr_t), uintptr_t stack)
{
  uintptr_t mstatus = read_csr(mstatus);
  mstatus = INSERT_FIELD(mstatus, MSTATUS_MPP, PRV_S);
  mstatus = INSERT_FIELD(mstatus, MSTATUS_MPIE, 0);
  write_csr(mstatus, mstatus);
  write_csr(mscratch, MACHINE_STACK_TOP() - MENTRY_FRAME_SIZE);
  write_csr(mepc, fn);
  write_csr(sptbr, (uintptr_t)root_page_table >> RISCV_PGSHIFT);
  asm volatile ("mv a0, %0; mv sp, %0; mret" : : "r" (stack));
  __builtin_unreachable();
}
An important difference between the RISC-V case and many other CPUs (e.g. x86) is that the Linux kernel's entry point is called with virtual memory already initialized by a bootloader executing at a higher privilege mode.
Memory management is initialized inside the setup_arch routine defined in linux/linux-4.6.2/arch/riscv/kernel/setup.c; below only the memory-management-relevant part of the function is shown
void __init setup_arch(char **cmdline_p)
{
...
    init_mm.start_code = (unsigned long) _stext;
    init_mm.end_code   = (unsigned long) _etext;
    init_mm.end_data   = (unsigned long) _edata;
    init_mm.brk        = (unsigned long) _end;

    setup_bootmem();
    ....
    paging_init();
    ....
}
The _stext, _etext, _edata and _end global variables are defined in the linker script linux/linux-4.6.2/arch/riscv/kernel/vmlinux.lds.S, which defines the kernel memory layout. These variables define the kernel section borders. A thorough description of linker scripts can be found here https://sourceware.org/binutils/docs/ld/Scripts.html .
The first function called is setup_bootmem
static void __init setup_bootmem(void)
{
    unsigned long ret;
    memory_block_info info;

    ret = sbi_query_memory(0, &info);
    BUG_ON(ret != 0);
    BUG_ON((info.base & ~PMD_MASK) != 0);
    BUG_ON((info.size & ~PMD_MASK) != 0);
    pr_info("Available physical memory: %ldMB\n", info.size >> 20);

    /* The kernel image is mapped at VA=PAGE_OFFSET and PA=info.base */
    va_pa_offset = PAGE_OFFSET - info.base;
    pfn_base = PFN_DOWN(info.base);

    if ((mem_size != 0) && (mem_size < info.size)) {
        memblock_enforce_memory_limit(mem_size);
        info.size = mem_size;
        pr_notice("Physical memory usage limited to %lluMB\n",
            (unsigned long long)(mem_size >> 20));
    }
    set_max_mapnr(PFN_DOWN(info.size));
    max_low_pfn = PFN_DOWN(info.base + info.size);

#ifdef CONFIG_BLK_DEV_INITRD
    setup_initrd();
#endif /* CONFIG_BLK_DEV_INITRD */

    memblock_reserve(info.base, __pa(_end) - info.base);
    reserve_boot_page_table(pfn_to_virt(csr_read(sptbr)));
    memblock_allow_resize();
}
The Linux kernel queries the available memory size in setup_bootmem by invoking the SBI interface's sbi_query_memory, which results in a call to the __sbi_query_memory BBL routine executed (surprisingly) in supervisor mode: SBI has been mapped into the supervisor virtual address space, and the ecall instruction is not invoked for sbi_query_memory
uintptr_t __sbi_query_memory(uintptr_t id, memory_block_info *p)
{
  if (id == 0) {
    p->base = first_free_paddr;
    p->size = mem_size + DRAM_BASE - p->base;
    return 0;
  }

  return -1;
}
More about SBI can be found here https://github.com/slavaim/riscv-notes/blob/master/bbl/sbi-to-linux.md .
The kernel reserves the pages occupied by the kernel image with a call to memblock_reserve(info.base, __pa(_end) - info.base);. Then a call to reserve_boot_page_table(pfn_to_virt(csr_read(sptbr))); reserves the pages occupied by the page table allocated by the bootloader, i.e. BBL. The Linux kernel retrieves the page table allocated and initialized by BBL by reading its physical address from the sptbr register and converting it to a virtual address. The page table virtual address is also saved in the master kernel page table pointer init_mm.pgd. The snippet is from linux/linux-4.6.2/arch/riscv/mm/init.c
void __init paging_init(void)
{
    init_mm.pgd = (pgd_t *)pfn_to_virt(csr_read(sptbr));
  ....
}

Sunday, March 5, 2017

RISC-V SBI mapping to Linux

This text is based on sbi-to-linux.md from my GitHub repo riscv-notes.
The machine-level SBI (Supervisor Binary Interface) is exported to the Linux kernel by mapping it at the top of the address space.
The mapping is performed by BBL in supervisor_vm_init defined in riscv-tools/riscv-pk/bbl/bbl.c.
  // map SBI at top of vaddr space
  extern char _sbi_end;
  uintptr_t num_sbi_pages = ((uintptr_t)&_sbi_end - DRAM_BASE - 1) / RISCV_PGSIZE + 1;
  assert(num_sbi_pages <= (1 << RISCV_PGLEVEL_BITS));
  for (uintptr_t i = 0; i < num_sbi_pages; i++) {
    uintptr_t idx = (1 << RISCV_PGLEVEL_BITS) - num_sbi_pages + i;
    sbi_pt[idx] = pte_create((DRAM_BASE / RISCV_PGSIZE) + i, PTE_G | PTE_R | PTE_X);
  }
  pte_t* sbi_pte = middle_pt + ((num_middle_pts << RISCV_PGLEVEL_BITS)-1);
  assert(!*sbi_pte);
  *sbi_pte = ptd_create((uintptr_t)sbi_pt >> RISCV_PGSHIFT);
You can read more on supervisor_vm_init here https://github.com/slavaim/riscv-notes/blob/master/bbl/supervisor_vm_init.md . From the code above you can see that the page ending at the _sbi_end physical address is mapped at the last page of the supervisor virtual address space.
The offsets to SBI entry points are defined in riscv-tools/riscv-pk/machine/sbi.S as
.globl sbi_hart_id; sbi_hart_id = -2048
.globl sbi_num_harts; sbi_num_harts = -2032
.globl sbi_query_memory; sbi_query_memory = -2016
.globl sbi_console_putchar; sbi_console_putchar = -2000
.globl sbi_console_getchar; sbi_console_getchar = -1984
.globl sbi_send_ipi; sbi_send_ipi = -1952
.globl sbi_clear_ipi; sbi_clear_ipi = -1936
.globl sbi_timebase; sbi_timebase = -1920
.globl sbi_shutdown; sbi_shutdown = -1904
.globl sbi_set_timer; sbi_set_timer = -1888
.globl sbi_mask_interrupt; sbi_mask_interrupt = -1872
.globl sbi_unmask_interrupt; sbi_unmask_interrupt = -1856
.globl sbi_remote_sfence_vm; sbi_remote_sfence_vm = -1840
.globl sbi_remote_sfence_vm_range; sbi_remote_sfence_vm_range = -1824
.globl sbi_remote_fence_i; sbi_remote_fence_i = -1808
These definitions are offsets from the top of the address space for the SBI trampoline stubs defined in riscv-tools/riscv-pk/machine/sbi_entry.S
  # hart_id
  .align 4
  li a7, MCALL_HART_ID
  ecall
  ret

  # num_harts
  .align 4
  lw a0, num_harts
  ret

  # query_memory
  .align 4
  tail __sbi_query_memory
The start of the SBI trampoline stub code is defined as sbi_base and is aligned to a page boundary by the .align RISCV_PGSHIFT directive. The first RISCV_PGSIZE - 2048 bytes are reserved by the .skip RISCV_PGSIZE - 2048 directive, so the first stub starts 2048 bytes before the end of the page defined as
.align RISCV_PGSHIFT
  .globl sbi_base
sbi_base:

  # TODO: figure out something better to do with this space.  It's not
  # protected from the OS, so beware.
  .skip RISCV_PGSIZE - 2048
The end of the section is also aligned to a page boundary and is defined as
  .align RISCV_PGSHIFT
  .globl _sbi_end
_sbi_end:
The SBI trampoline stub section .sbi is placed at the end of BBL, just before the payload, by the layout defined in riscv-tools/riscv-pk/bbl/bbl.lds as
   .sbi :
  {
    *(.sbi)
  }

  .payload :
  {
    *(.payload)
  }

  _end = .;
So the supervisor_vm_init code that maps machine-level physical addresses to supervisor virtual addresses maps the BBL .sbi section, which contains the SBI trampoline stubs, at the top of the supervisor virtual address space.
The Linux kernel calls the SBI trampoline stubs through the offsets defined in linux/linux-4.6.2/arch/riscv/kernel/sbi.S, which is a carbon copy of riscv-tools/riscv-pk/machine/sbi.S. For example, a snippet from the Linux kernel entry point _start defined in linux/linux-4.6.2/arch/riscv/kernel/head.S
    /* See if we're the main hart */
    call sbi_hart_id
    bnez a0, .Lsecondary_start
This code is translated by GCC to
   0xffffffff80000018 <+24>:    jalr    -2048(zero) # 0xfffffffffffff800
   0xffffffff8000001c <+28>:    bnez    a0,0xffffffff80000054 <_start+84>
The address 0xfffffffffffff800 is 2048 bytes below the top of the virtual address space, in its last page. As we saw above this page is backed by a physical page containing the SBI trampoline stub code starting at the sbi_base machine-level physical address. Disassembly shows the SBI trampoline stubs at offsets
 0xfffffffffffff800 which is sbi_hart_id = -2048
 0xfffffffffffff810 which is sbi_num_harts = -2032
 0xfffffffffffff820 which is sbi_query_memory = -2016
 0xfffffffffffff830 which is sbi_console_putchar = -2000
 etc.
(gdb) x/48i 0xfffffffffffff800
   0xfffffffffffff800:  li  a7,0
   0xfffffffffffff804:  ecall
   0xfffffffffffff808:  ret
   0xfffffffffffff80c:  nop
   0xfffffffffffff810:  auipc   a0,0xfffff
   0xfffffffffffff814:  lw  a0,-1888(a0)
   0xfffffffffffff818:  ret
   0xfffffffffffff81c:  nop
   0xfffffffffffff820:  j   0xffffffffffff92c0
   0xfffffffffffff824:  nop
   0xfffffffffffff828:  nop
   0xfffffffffffff82c:  nop
   0xfffffffffffff830:  li  a7,1
   0xfffffffffffff834:  ecall
   0xfffffffffffff838:  ret
   0xfffffffffffff83c:  nop
   0xfffffffffffff840:  li  a7,2
   0xfffffffffffff844:  ecall
   0xfffffffffffff848:  ret
   0xfffffffffffff84c:  nop
   0xfffffffffffff850:  unimp
   0xfffffffffffff854:  nop
   0xfffffffffffff858:  nop
   0xfffffffffffff85c:  nop
   0xfffffffffffff860:  li  a7,4
   0xfffffffffffff864:  ecall
   0xfffffffffffff868:  ret
   0xfffffffffffff86c:  nop
   0xfffffffffffff870:  li  a7,5
   0xfffffffffffff874:  ecall
   0xfffffffffffff878:  ret
   0xfffffffffffff87c:  nop
   0xfffffffffffff880:  lui a0,0x989
   0xfffffffffffff884:  addiw   a0,a0,1664
   0xfffffffffffff888:  ret
   0xfffffffffffff88c:  nop
   0xfffffffffffff890:  li  a7,6
   0xfffffffffffff894:  ecall
   0xfffffffffffff898:  nop
   0xfffffffffffff89c:  nop
   0xfffffffffffff8a0:  li  a7,7
   0xfffffffffffff8a4:  ecall
   0xfffffffffffff8a8:  ret
   0xfffffffffffff8ac:  nop
   0xfffffffffffff8b0:  j   0xffffffffffff92f8
   0xfffffffffffff8b4:  nop
   0xfffffffffffff8b8:  nop
   0xfffffffffffff8bc:  nop
As you can see, not all SBI trampoline stubs execute the ecall instruction to enter a higher privilege level (the machine level in this case). For example query_memory is just an unconditional jump to the SBI code mapped into the Linux kernel address space.
 0xfffffffffffff820:    j   0xffffffffffff92c0
In that case the CPU doesn't switch to machine level and continues in supervisor mode with virtual memory enabled. When the CPU switches to machine mode it disables virtual address translation and switches back to physical addresses. Below is a call stack when query_memory is called. As you can see the CPU continues with virtual memory enabled and uses virtual addresses. The debugger was unable to resolve the call into query_memory in BBL as it was not aware of the code being remapped into the Linux system address space.
#0  0xffffffffffff92c8 in ?? ()
#1  0xffffffff80002c38 in setup_bootmem () at arch/riscv/kernel/setup.c:149
#2  setup_arch (cmdline_p=<optimized out>) at arch/riscv/kernel/setup.c:152
#3  0xffffffff80000898 in start_kernel () at init/main.c:500
#4  0xffffffff80000040 in _start () at arch/riscv/kernel/head.S:36
I guess one possible reason for such a query_memory implementation is to simplify development: this function returns a structure, which would otherwise require either packing it into registers or translating addresses in the Linux kernel or in the BBL SBI.
The call stack for sbi_hart_id looks different
#0  0x0000000080000c90 in mcall_trap (regs=0x82660ec0, mcause=9, mepc=18446744073709549572) at ../machine/mtrap.c:210
#1  0x00000000800000ec in trap_vector () at ../machine/mentry.S:116
Backtrace stopped: frame did not save the PC
(gdb) p/x mepc
$4 = 0xfffffffffffff804
Here virtual address translation is disabled and the CPU works with physical addresses. The debugger was unable to cross the boundary back to the Linux kernel stack, which would require handling the address space translation switch. As you can see the mepc register points to the virtual address of the ecall instruction in supervisor mode
   0xfffffffffffff800:  li  a7,0
   0xfffffffffffff804:  ecall
   0xfffffffffffff808:  ret

RISC-V BBL supervisor_vm_init


This text is based on supervisor_vm_init.md from my GitHub repo riscv-notes.
The function builds the page table structures that map the RISC-V BBL payload into supervisor mode. The function operates in the machine-level physical address space. Don't be fooled by the presence of supervisor virtual addresses: they are adjusted to machine-level physical addresses before being accessed.
static void supervisor_vm_init()
{
  uintptr_t highest_va = DRAM_BASE - first_free_paddr;
  mem_size = MIN(mem_size, highest_va - info.first_user_vaddr) & -MEGAPAGE_SIZE;

  pte_t* sbi_pt = (pte_t*)(info.first_vaddr_after_user + info.load_offset);
  memset(sbi_pt, 0, RISCV_PGSIZE);
  pte_t* middle_pt = (void*)sbi_pt + RISCV_PGSIZE;
#if __riscv_xlen == 32
  size_t num_middle_pts = 1;
  pte_t* root_pt = middle_pt;
  memset(root_pt, 0, RISCV_PGSIZE);
#else
  size_t num_middle_pts = (-info.first_user_vaddr - 1) / GIGAPAGE_SIZE + 1;
  pte_t* root_pt = (void*)middle_pt + num_middle_pts * RISCV_PGSIZE;
  memset(middle_pt, 0, (num_middle_pts + 1) * RISCV_PGSIZE);
  for (size_t i = 0; i < num_middle_pts; i++)
    root_pt[(1<<RISCV_PGLEVEL_BITS)-num_middle_pts+i] = ptd_create(((uintptr_t)middle_pt >> RISCV_PGSHIFT) + i);
#endif

  for (uintptr_t vaddr = info.first_user_vaddr, paddr = vaddr + info.load_offset, end = info.first_vaddr_after_user;
       paddr < DRAM_BASE + mem_size; vaddr += MEGAPAGE_SIZE, paddr += MEGAPAGE_SIZE) {
    int l2_shift = RISCV_PGLEVEL_BITS + RISCV_PGSHIFT;
    size_t l2_idx = (info.first_user_vaddr >> l2_shift) & ((1 << RISCV_PGLEVEL_BITS)-1);
    l2_idx += ((vaddr - info.first_user_vaddr) >> l2_shift);
    middle_pt[l2_idx] = pte_create(paddr >> RISCV_PGSHIFT, PTE_G | PTE_R | PTE_W | PTE_X);
  }

  // map SBI at top of vaddr space
  extern char _sbi_end;
  uintptr_t num_sbi_pages = ((uintptr_t)&_sbi_end - DRAM_BASE - 1) / RISCV_PGSIZE + 1;
  assert(num_sbi_pages <= (1 << RISCV_PGLEVEL_BITS));
  for (uintptr_t i = 0; i < num_sbi_pages; i++) {
    uintptr_t idx = (1 << RISCV_PGLEVEL_BITS) - num_sbi_pages + i;
    sbi_pt[idx] = pte_create((DRAM_BASE / RISCV_PGSIZE) + i, PTE_G | PTE_R | PTE_X);
  }
  pte_t* sbi_pte = middle_pt + ((num_middle_pts << RISCV_PGLEVEL_BITS)-1);
  assert(!*sbi_pte);
  *sbi_pte = ptd_create((uintptr_t)sbi_pt >> RISCV_PGSHIFT);

  mb();
  root_page_table = root_pt;
  write_csr(sptbr, (uintptr_t)root_pt >> RISCV_PGSHIFT);
}

Let's look at this function's code flow.
uintptr_t highest_va = DRAM_BASE - first_free_paddr;
The above operation calculates highest_va, the highest supervisor VA (virtual address). DRAM_BASE is less than first_free_paddr, which is the address of the first free megapage after BBL+payload have been loaded into DRAM starting at the DRAM_BASE machine-level address. On my test system DRAM_BASE = 0x80000000 and first_free_paddr = 0x82800000; these are machine-level physical addresses as the CPU starts in machine-level mode. The difference is a negative number which in two's complement arithmetic gives a valid supervisor-mode virtual address near the top of the 64-bit address range, highest_va = 0xfffffffffd800000. This leaves the top VA range in supervisor mode intact, preserving the machine-level code mapped there, see below.
The info structure describes the payload with an ELF header. Typical values on my system, with the Linux kernel as the payload, as shown by the GDB print command, are
(gdb) p/x info
$6 = {entry = 0xffffffff80000000, first_user_vaddr = 0xffffffff80000000, first_vaddr_after_user = 0xffffffff803b2000, load_offset = 0x102800000}
The memory size available in machine-level mode is
(gdb) p/x mem_size
$16 = 0x100000000
It must be adjusted for supervisor mode. The memory size available to the supervisor is calculated as
mem_size = MIN(mem_size, highest_va - info.first_user_vaddr) & -MEGAPAGE_SIZE;
On my system this value is
(gdb) p/x $a5
$11 = 0x7d800000
Then the SBI page table is allocated. This page table is used to map the BBL SBI code at the top of the address space.
pte_t* sbi_pt = (pte_t*)(info.first_vaddr_after_user + info.load_offset);
The sbi_pt value on my machine is
(gdb) p/x $s1
$15 = 0x82bb2000
As you can see the CPU works with machine-level addresses while info.first_vaddr_after_user is a supervisor virtual address. The info.load_offset value adjusts the supervisor virtual address to a machine-level physical address.
Then the supervisor page table hierarchy is built up by allocating a middle (directory) table just after sbi_pt
pte_t* middle_pt = (void*)sbi_pt + RISCV_PGSIZE;
Then the root page table pointer is initialized.
#if __riscv_xlen == 32
  size_t num_middle_pts = 1;
  pte_t* root_pt = middle_pt;
  memset(root_pt, 0, RISCV_PGSIZE);
#else
  size_t num_middle_pts = (-info.first_user_vaddr - 1) / GIGAPAGE_SIZE + 1;
  pte_t* root_pt = (void*)middle_pt + num_middle_pts * RISCV_PGSIZE;
  memset(middle_pt, 0, (num_middle_pts + 1) * RISCV_PGSIZE);
  for (size_t i = 0; i < num_middle_pts; i++)
    root_pt[(1<<RISCV_PGLEVEL_BITS)-num_middle_pts+i] = ptd_create(((uintptr_t)middle_pt >> RISCV_PGSHIFT) + i);
#endif
The supervisor page table structure is then initialized to map supervisor virtual addresses to machine-level physical addresses. Note how info.load_offset is used again to translate a supervisor virtual address to a machine-level physical address.
  for (uintptr_t vaddr = info.first_user_vaddr, paddr = vaddr + info.load_offset, end = info.first_vaddr_after_user;
       paddr < DRAM_BASE + mem_size; vaddr += MEGAPAGE_SIZE, paddr += MEGAPAGE_SIZE) {
    int l2_shift = RISCV_PGLEVEL_BITS + RISCV_PGSHIFT;
    size_t l2_idx = (info.first_user_vaddr >> l2_shift) & ((1 << RISCV_PGLEVEL_BITS)-1);
    l2_idx += ((vaddr - info.first_user_vaddr) >> l2_shift);
    middle_pt[l2_idx] = pte_create(paddr >> RISCV_PGSHIFT, PTE_G | PTE_R | PTE_W | PTE_X);
  }
The machine-level SBI BBL code is remapped at the top of the range reserved above highest_va through the sbi_pt page table allocated earlier. BBL has been loaded at the DRAM_BASE machine-level physical address. This address range is mapped read-only and executable for supervisor mode. The PTEs are also marked global so they are visible in all address spaces.
  // map SBI at top of vaddr space
  extern char _sbi_end;
  uintptr_t num_sbi_pages = ((uintptr_t)&_sbi_end - DRAM_BASE - 1) / RISCV_PGSIZE + 1;
  assert(num_sbi_pages <= (1 << RISCV_PGLEVEL_BITS));
  for (uintptr_t i = 0; i < num_sbi_pages; i++) {
    uintptr_t idx = (1 << RISCV_PGLEVEL_BITS) - num_sbi_pages + i;
    sbi_pt[idx] = pte_create((DRAM_BASE / RISCV_PGSIZE) + i, PTE_G | PTE_R | PTE_X);
  }
After sbi_pt has been filled it is inserted into the supervisor page directory. This establishes the mapping visible from supervisor level.
  pte_t* sbi_pte = middle_pt + ((num_middle_pts << RISCV_PGLEVEL_BITS)-1);
  assert(!*sbi_pte);
  *sbi_pte = ptd_create((uintptr_t)sbi_pt >> RISCV_PGSHIFT);
The last page, ending at the _sbi_end physical address, is mapped at the last page of the virtual address space. The SBI mapping is described in detail here https://github.com/slavaim/riscv-notes/blob/master/bbl/sbi-to-linux.md
Before returning to the caller the function sets the page table base register for supervisor virtual address translation. The memory barrier guarantees that all memory writes have completed, so the page table is in a consistent state.
  mb();
  root_page_table = root_pt;
  write_csr(sptbr, (uintptr_t)root_pt >> RISCV_PGSHIFT);
P.S. BBL sets sptbr to the root_page_table value in enter_supervisor_mode, which makes the above write_csr(sptbr, (uintptr_t)root_pt >> RISCV_PGSHIFT); call redundant.

Saturday, March 4, 2017

rv64 RISC-V booting

This text is based on boot.md from my GitHub repo riscv-notes.

After reset a rv64 RISC-V CPU fetches its first instruction from DEFAULT_RSTVEC = 0x00001000. For example, below is the QEMU CPU reset emulation from riscv-qemu/target-riscv/cpu.c
static void riscv_cpu_reset(CPUState *s)
{
    RISCVCPU *cpu = RISCV_CPU(s);
    RISCVCPUClass *mcc = RISCV_CPU_GET_CLASS(cpu);
    CPURISCVState *env = &cpu->env;
    CPUState *cs = CPU(cpu);

    mcc->parent_reset(s);
#ifndef CONFIG_USER_ONLY
    tlb_flush(s, 1);
    env->priv = PRV_M;
    env->mtvec = DEFAULT_MTVEC;
#endif
    env->pc = DEFAULT_RSTVEC;
    cs->exception_index = EXCP_NONE;
}
The 0x00001000 address is mapped to ROM containing trampoline code to 0x80000000. The AUIPC instruction shifts its immediate value 12 bits to the left and adds it to the current PC, so t0 = (0x7ffff << 12) + 0x1000 = 0x80000000
(gdb) x/2i 0x1000
   0x1000:  auipc   t0,0x7ffff
   0x1004:  jr  t0
For QEMU the above code is defined in riscv-qemu/hw/riscv/riscv_board.c
    uint32_t reset_vec[8] = {
        0x297 + 0x80000000 - 0x1000, /* reset vector */
        0x00028067,                  /* jump to DRAM_BASE */
        0x00000000,                  /* reserved */
        0x0,                         /* config string pointer */
        0, 0, 0, 0                   /* trap vector */
    };
The 0x80000000 address is the start of DRAM, where BBL is loaded. Below are definitions that are the same for both the QEMU and Spike simulators.
#define DEFAULT_RSTVEC     0x00001000
#define DRAM_BASE          0x80000000
After jr t0 has been executed the register contents are as follows (t0 and pc are equal)
(gdb) info registers
ra             0x0000000000000000   0
sp             0x0000000000000000   0
gp             0x0000000000000000   0
tp             0x0000000000000000   0
t0             0x0000000080000000   2147483648
t1             0x0000000000000000   0
t2             0x0000000000000000   0
fp             0x0000000000000000   0
s1             0x0000000000000000   0
a0             0x0000000000000000   0
a1             0x0000000000000000   0
a2             0x0000000000000000   0
a3             0x0000000000000000   0
a4             0x0000000000000000   0
a5             0x0000000000000000   0
a6             0x0000000000000000   0
a7             0x0000000000000000   0
s2             0x0000000000000000   0
s3             0x0000000000000000   0
s4             0x0000000000000000   0
s5             0x0000000000000000   0
s6             0x0000000000000000   0
s7             0x0000000000000000   0
s8             0x0000000000000000   0
s9             0x0000000000000000   0
s10            0x0000000000000000   0
s11            0x0000000000000000   0
t3             0x0000000000000000   0
t4             0x0000000000000000   0
t5             0x0000000000000000   0
t6             0x0000000000000000   0
pc             0x0000000080000000   2147483648
BBL assigns the 0x80000000 address to reset_vector by placing it at the beginning of the .text.init section, which is linked at 0x80000000.
From riscv-tools/riscv-pk/machine/mentry.S
  .section .text.init,"ax",@progbits
  .globl reset_vector
reset_vector:
  j do_reset
From riscv-tools/riscv-pk/pk/pk.lds
  /* Begining of code and text segment */
  . = 0x80000000;
  _ftext = .;
  PROVIDE( eprol = . );

  .text :
  {
    *(.text.init)
  }