Kernel Memory
sbrk() and mmap()
sys_sbrk() -> growproc() -> allocuvm() -> (kalloc(); mappages(); kfree())
sys_mmap() -> ... -> (kalloc(); mappages(); kfree())
kalloc() -- allocate a regular page.
acquire kmem.lock -- multiple CPUs may try to allocate a page.
take one page from the head of the freelist
release the lock
kfree()
first check if the VA is legitimate (cannot be a page at the kernel binary image, and must correspond to a physical page)
acquire kmem.lock
insert the page to the head of the freelist
release the lock
Virtual memory
The kernel has its own virtual memory space. Kernel runs at above KERNBASE.
When the kernel runs this code:
int * ptr = (int *)(KERNBASE + 0x1000); // ptr == 0xFFFF800000001000
*ptr = 100;
CPU sees that an instruction is trying to access memory address (0xFFFF800000001000).
Since the virtual addressing mode is enabled, every address is considered virtual (VA) and must be translated into a physical address (PA) before use.
CPU traverses the page table using this address (0xFFFF800000001000). The result of this translation should be a physical address.
The page table is essentially a prefix tree. The traversal takes a few bits at a time to navigate on the tree, until reaching the leaf level (or abort translation if a subtree does not exist).
The root of the page table is stored in system register %cr3.
Once the CPU finds the PA, it tries to cache the VA-to-PA translation in the TLB (translation look-aside buffer). The TLB is a small cache sitting next to the L1/L2 caches on the CPU.
Accessing the TLB cache is much faster than traversing the prefix tree on DRAM. Virtual addressing can be extremely slow if there is no TLB.
Note that there is no cache coherency between the TLB and the page table. Because of this, the kernel needs to explicitly "flush" the TLB if something is changed in the page table.
The following figure depicts the virtual memory layout of coexisting processes. Each process has its own page table (its VA).
The kernel also has a private page table (on the top, kernel VA).
However, the kernel quite often borrows a process' page table to run. In fact, every process' page table contains the same (and authentic) kernel mappings above KERNBASE. Why?
The story begins with efficiency. Switching page tables is very slow because the TLB needs to be flushed. A simple system call, such as reading the system time, could waste thousands of CPU cycles on the switches.
With the kernel page table mapped at the high address space:
The kernel can directly execute kernel code without switching page tables. (this is also very important for interrupt handling)
It can return to the calling user process without switching back to the user page table. (two flushes saved)
Permissions are carefully set, so user cannot access kernel code/data. (But... we will discuss the Meltdown vulnerability)
The kernel can also directly access the current user's data. For example, when a user process wants to write some data to a file, the data buffer is in the user's address space.
Read the CPU simulators' source code and see how a "real" CPU does address translation.
Page table management (hw4)
Essentially it's a prefix tree. Its job is to map virtual addresses to physical addresses, if any.
The difficulty mostly comes from the fact that we are dealing with MULTIPLE spaces at the same time:
There is always one physical address space.
There can be one or two user address spaces. (especially when you fork()).
The kernel virtual address space is above KERNBASE. It has the same layout on any user or kernel page tables.
When looking at a pointer or an address, be very careful about what it is: PA? KVA? UVA?
mappages()
Called by inituvm() and allocuvm() -- and will be called by your code in hw4
parameters:
pde_t * pgdir -- the root of the page table
void * va -- the virtual address of the mapping
addr_t size -- size of the mapping
addr_t pa -- the physical address of the mapping
int perm -- the permission bits, see the PTE_* macros in mmu.h
With some round-down calculations, we get the VAs of the first and the last page (a and last).
For each page between a and last:
call walkpgdir() to make sure the corresponding structures are ready to use in the page table -- see the calls to kalloc() in walkpgdir(), that's all for the page table itself.
We don't handle remapping in xv6, so panic() if a page is already present
setup the page table entry (pte): pa | perm | PTE_P;
Recall that the page table is a (prefix) tree.
Each non-leaf node contains some pointers to its children
Should we use VAs or PAs in these pointers? It's another "chicken and egg" scenario. (1) Intel's CPUs require physical addresses here. (2) If you were designing a new CPU, using VAs might be possible, but definitely more expensive, as the hardware would have to do recursive virtual-to-physical translations.
When allocating the entries in the page table, the kalloc() returns the VA of the new page. We need to use the macro V2P() when assigning the pointers at their parent nodes.
Bits in page table entries (and all other nodes in the prefix tree)
PTE_P: this page is present
PTE_W: write access is allowed
PTE_U: user can access this page
Access permission is checked (by logical AND) for all nodes along the translation path. For simplicity xv6 kernel simply sets both PTE_W and PTE_U at any non-leaf nodes.
switchuvm()
"Switch to this user's virtual memory." Usually the CPU may already have this process' page table loaded.
This function calls lcr3() to reload the page table. (Re)setting the %cr3 register causes the TLB to be flushed, so the up-to-date mapping information in the page table will be used by the CPU.
Extended reading: Linux Kernel memory layout. https://www.kernel.org/doc/html/latest/x86/x86_64/mm.html
Page Fault
Let's first recall an important structure for interrupt handling: the trapframe (defined in x86.h)
struct trapframe {
......
uint64 r15;
uint64 trapno;
uint64 err;
uint64 rip;
uint64 cs;
uint64 rflags;
uint64 rsp;
uint64 ss;
};
What's happening in the interrupt handler context?
We need to identify the first instruction being executed after the interrupt/exception.
We first get some clue at the beginning of trapasm.S: alltraps -- then the trap() function in C is called.
alltraps begins by pushing %r15, which indicates that the values below %r15 in the struct (from trapno to ss) are already on the stack (they are actually above %r15 in the address space).
Read the comment: vectors.S sends all traps here.
Where is vectors.S? If you haven't built the kernel, the file could be missing. Search for "vectors.S" in the Makefile to get a clue (it depends on vectors.pl).
To generate vectors.S, run "make vectors.S".
These trap entries are configured by tvinit() during bootstrap.
https://www.amd.com/system/files/TechDocs/24593.pdf Chapter 8 "Exceptions and Interrupts"
vectors.S
In each trap entry code (vector0, etc..), a trap number is pushed onto the stack.
For vectors where the hardware does not push an error code itself, a dummy error code of 0 is pushed before the trap number. -- vectors.pl: if(!($i == 8 || ($i >= 10 && $i <= 14) || $i == 17)) print " push \$0\n";
page-fault has the trap number 14 (T_PGFLT). CPU is responsible for pushing an error code to the trap frame.
See more details in Intel's manual, Volume 3, Section 4.7 and 6.15
Now we know that the CPU is responsible for filling the last six values of trapframe upon a page fault exception.
Recall the entry code for syscalls (syscall_entry), the CPU only pushes the last two values (rsp & ss).
The %cr2 register contains the virtual address that caused the fault. This value is almost always needed by the page fault handler.
https://en.wikipedia.org/wiki/Control_register#Control_registers_in_x86_series
rcr2() reads the %cr2 register, which can be used in your page fault handler. (see the figure below)
Handling page fault (for lazy mmap)
When there is a page fault, the handler could choose to terminate the process, or establish a mapping at the fault address so the program can continue to run.
There must be a good reason for not terminating the process:
The fault address is just below the current user stack. In this case, the kernel may allow the user process's stack to grow.
The kernel has previously promised to give the process some memory at the fault address -- consider the financial concept "futures".
The kernel has temporarily swapped out the page (to a paging device)
In the above cases, the page fault handler needs to first allocate a page and map it at the fault address.
For mmap and swapping, some information should have been recorded so the handler can figure out what to fill into the page.
Swapping (not available in xv6)
Total memory demand of user applications could be greater than the actual physical memory. (After opening 30+ tabs in your browser, you decided to play Warcraft 3)
The browser does a lot of caching for site contents. Can the OS ask the user to release some memory used for caching?
OOM killer? (out-of-memory killer)
swapping
The page-fault handling mechanism greatly expands the flexibility in the OS while providing the user a huge and transparent image of memory space.
Swapping:
Save the content of some "victim" pages on the swapping device temporarily.
Recycle these physical pages to serve current or upcoming allocations.
Mark in the victim's page table where the original content can be recovered from, if that page is used by the program again. Unset the PTE_P bit as the page is no longer present.
Upon page fault, reallocate a page for the victim VA and recover its content from the swapping device.
Real world: the "swapfile" on Windows, and the swap partition on *nix OSes.
Traditional view: swapping devices are SLOW. As a result, how to choose the victim pages is of great importance.
A general assumption of the real world workloads: access locality.
page replacement algorithms: First-in First-out (FIFO), Least-recently used (LRU), Least-frequently used (LFU), ARC, LIRS, etc..
# of physical pages = 3; Sample access sequence 1 2 3 4 1 5 2 6 1 7 2 8 1 9 2 ......
LRU: 1, 21, 321, 432, 143, 514, 251, 625, 162, 716, 271, 827...... (there is no hit because the reuse distance of 1 or 2 is always >= 4, and 4 > #pages)
Advanced algorithms see a longer access history.
zram: emulates an in-memory block device for swapping. Pages are compressed in the "zram device".
Stack overflow protection and lazy stack space allocation
xv6-64 user memory layout (the master branch):
The first page (0..4095) is not mapped. Access to this page will cause page fault. Example: NULL pointer dereference.
Starting from the second page (4096..), the program binary is copied there. The "-Ttext 0x1000" in Makefile tells the compiler to generate code with this assumption.
The next page after the binary is left unmapped (a guard page), and the page above it is the user stack.
If the user stack grows beyond the stack page, the process will be killed.
Real systems:
Usually allow the stack to grow to a reasonable size (8MB on Linux by default).
stack pages will be allocated on demand, similar to that on lazy_mmap (hw4).
stackoverflow1.c
#include <stdio.h>
void
rec_print(char * ptr_a)
{
char b;
printf("a %p b %p a-b %ld\n", (void *)ptr_a, (void *)&b, (long)(ptr_a - &b));
rec_print(ptr_a);
}
int
main()
{
char a;
rec_print(&a);
return 0;
}
stackoverflow2.c (Linux only)
#include <stdio.h>
#include <sys/resource.h>
void
overflow(char * ptr_a)
{
ptr_a[-8000000] = 'A';
printf("%c\n", ptr_a[-8000000]);
}
int
main()
{
char a;
struct rusage rs;
getrusage(RUSAGE_SELF, &rs);
printf("%lu\n", rs.ru_minflt);
getrusage(RUSAGE_SELF, &rs);
printf("%lu\n", rs.ru_minflt);
getrusage(RUSAGE_SELF, &rs);
printf("%lu\n", rs.ru_minflt);
overflow(&a);
getrusage(RUSAGE_SELF, &rs);
printf("%lu\n", rs.ru_minflt);
return 0;
}
retcall.c
bash$ gcc -o retcall -fno-stack-protector retcall.c
#include <stdio.h>
void
print()
{
printf("A\n");
}
void
f1()
{
long long a[1];
a[2] = (long long)print; // out-of-bounds: overwrite f1's saved frame pointer / return address
a[3] = (long long)print; // (the casts are needed for this to compile)
}
int
main()
{
f1();
return 0;
}