Kernel Memory

sbrk() and mmap()

kalloc() -- allocate a regular page.

kfree()

Virtual memory

The kernel has its own virtual memory space. Kernel runs at above KERNBASE.

When kernel run this code:

CPU sees that an instruction is trying to access memory address (0xFFFF800000001000).

Since the virtual addressing mode is enabled, every address is considered virtual (VA) add must be translated into physical address (PA) before use.

CPU traverses the page table using this address (0xFFFF800000001000). The result of this translation should be a physical address.

The page table is essentially a prefix tree. The traversal takes a few bits at a time to navigate on the tree, until reaching the leaf level (or abort translation if a subtree does not exist). 

The root of the page table is stored in system register %cr3. 

Once the CPU finds the PA, it tries to cache the VA-to-PA translation in TLB (translation look-aside buffer). TLB is a small cache siting next to the L1/L2 caches on the CPU.

Accessing the TLB cache is much faster than traversing the prefix tree on DRAM. Virtual addressing can be extremely slow if there is no TLB.

Note that there is no cache coherency between the TLB and the page table. Because of this, the kernel needs to explicitly "flush" the TLB if something is changed on table table.

The following figure depicts the virtual memory layout of coexisting processes. Each process has its own page table (its VA).

The kernel also has a private page table (on the top, kernel VA).

However, kernel quite often borrows a process' page table to run. In fact, every process contains the same (and authentic) kernel page table above KERNBASE. Why?

The story begins with efficiency. Switching page table is very slow because the TLB needs to be flushed. A simple system call, such as reading system time, could waste thousands of cpu cycles to make the switches.

With the kernel page table mapped at the high address space:

Read the CPU simulators' source code and see how a "real" CPU does address translation.

https://github.com/riscv/riscv-angel/blob/release/mmu.js

Page table management (hw4)

Essentially it's a prefix tree. Its job is to map virtual addresses to physical addresses, if any.

The difficulty mostly comes from the fact that we are dealing with MULTIPLE spaces at the same time:

mappages()

parameters:

With some round-down calculations, we get the VAs of the first and the last page (a and last).

For each page between a and last:

Recall that the page table is a (prefix) tree.

Bits in page table entries (and all other nodes in the prefix tree)

Access permission is checked (by logical AND) for all nodes along the translation path. For simplicity xv6 kernel simply sets both PTE_W and PTE_U at any non-leaf nodes.

switchuvm()

"switch to this user's virtual memory". Usually, we may already have this process' page table in the CPU.

In this function, it will call lcr3() to reload the page table. This (re)setting the %cr3 register will cause the TLB cache to be flushed, thus the up-to-date mapping information in the page table will be used by the CPU.

Extended reading: Linux Kernel memory layout. https://www.kernel.org/doc/html/latest/x86/x86_64/mm.html

Page Fault

Let's first recall an important structure for interrupt handling: the trapframe (defined in x86.h)

struct trapframe {

  ......

   uint64 r15;


   uint64 trapno;

   uint64 err;

   uint64 rip;

   uint64 cs;

   uint64 rflags;

   uint64 rsp;

   uint64 ss;

};

What's happening in the interrupt handler context?

https://www.amd.com/system/files/TechDocs/24593.pdf  Chapter 8 "Exceptions and Interrupts"

vectors.S

%cr2 register contains the virtual address which caused the fault. This value is almost always likely to be used by the page fault handler.

https://en.wikipedia.org/wiki/Control_register#Control_registers_in_x86_series

rcr2() reads the %cr2 register, which can be used in your page fault handler. (see the figure below)

Handling page fault (for lazy mmap)

When there is a page fault, the handler could choose to terminate the process, or establish a mapping at the fault address so the program can continue to run.

There must be a good reason for not terminating the process:

On the above conditions, the page fault handler needs to first allocate a page and map it at the fault address.

For mmap and swapping, some information should have been recorded so the handler can figure out what to fill into the page.

Swapping (not available in xv6)

Total memory demand of user applications could be greater than the actual physical memory. (After opening 30+ tabs in your browser, you decided to play Warcraft 3)

The page-fault handling mechanism greatly expands the flexibility in the OS while providing the user a huge and transparent image of memory space.

Swapping:

Real world: the "swapfile" on windows, and the swap partition on *nix OSes.

Tradition view: swapping devices are SLOW. As a result, how to choose the victim pages is of great importance.

zram: emulates a in-memory block device for swapping. pages are compressed in the "zram device". 

Stack overflow protection and lazy stack space allocation

xv6-64 user memory layout (the master branch):

Real systems:

stackoverflow1.c

#include <stdio.h>


void

rec_print(char * ptr_a)

{

  char b;

  printf("a %p b %p a-b %lu\n", ptr_a, &b, ptr_a - &b);

  rec_print(ptr_a);

}


int

main()

{

  char a;

  rec_print(&a);

  return 0;

}

stackoverflow2.c (Linux only)

#include <stdio.h>

#include <sys/resource.h>


void

overflow(char * ptr_a)

{

  ptr_a[-8000000] = 'A';

  printf("%c\n", ptr_a[-8000000]);

}


int

main()

{

  char a;

  struct rusage rs;


  getrusage(RUSAGE_SELF, &rs);

  printf("%lu\n", rs.ru_minflt);

  getrusage(RUSAGE_SELF, &rs);

  printf("%lu\n", rs.ru_minflt);

  getrusage(RUSAGE_SELF, &rs);

  printf("%lu\n", rs.ru_minflt);


  overflow(&a);


  getrusage(RUSAGE_SELF, &rs);

  printf("%lu\n", rs.ru_minflt);


  return 0;

}

retcall.c

bash$ gcc -o retcall -fno-stack-protector retcall.c


#include <stdio.h>


void

print()

{

  printf("A\n");

}


void

f1()

{

  long long a[1];

  a[2] = print;

  a[3] = print;

}


int

main()

{

  f1();

  return 0;

}