System Calls
The sbrk() system call
A wrapper function in user space: sbrk()
declared in user.h, together with all other system calls.
implemented in usys.S, by playing tricks!
Read about calling conventions.
A caller and its callee must follow the same calling convention.
Parameters can be passed in registers, on the stack, or using a mix of registers and stack values.
For system calls, the parameters are passed from the "user context" (a thread or a process) to the kernel handler.
The user context records the user's current "progress" of execution: a snapshot of its registers (the trap frame) plus all other related state, such as the file descriptor table and the page table.
The kernel knows everything about the user process/thread. As a result, the kernel can let users pass parameters in registers, in memory (on the stack), or both.
If parameters are passed on the stack, the kernel must check that the user's stack pointer (%rsp) points into a valid memory area. This is more expensive than using registers, which need no such check.
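The cost of stack-passed parameters can be sketched as below. This is a toy user-space model, not xv6 code: struct toy_proc, fetch_u64() and fetch_stack_arg() are hypothetical names, and the "user memory" is assumed to be a single region [0, sz) backed by a byte array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified process: its user memory is one contiguous
 * region [0, sz), backed here by a plain byte array. */
struct toy_proc {
    uint8_t *mem;   /* backing store for the toy user address space */
    uint64_t sz;    /* size of the valid user region in bytes */
};

/* Copy a 64-bit value from user address addr into *out.
 * Returns 0 on success, -1 if any byte of the range is outside [0, sz). */
static int fetch_u64(const struct toy_proc *p, uint64_t addr, uint64_t *out)
{
    /* Compare against sz - addr so that addr + 8 can never overflow. */
    if (addr >= p->sz || p->sz - addr < sizeof(*out))
        return -1;
    memcpy(out, p->mem + addr, sizeof(*out));
    return 0;
}

/* Fetch the n-th stack-passed argument, assuming (for illustration only)
 * that arguments sit in consecutive 8-byte slots starting at user_rsp. */
static int fetch_stack_arg(const struct toy_proc *p, uint64_t user_rsp,
                           unsigned n, uint64_t *out)
{
    uint64_t off = 8ull * n;
    if (user_rsp > UINT64_MAX - off)   /* reject a wrapping address */
        return -1;
    return fetch_u64(p, user_rsp + off, out);
}
```

Every stack argument costs one range check plus one memory copy. A register argument is already sitting in the saved trap frame, so the kernel only reads a struct field.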
Syscall conventions:
On x86-32, parameters for Linux system calls are passed in registers: %eax holds the syscall number, and %ebx, %ecx, %edx, %esi, %edi, %ebp carry up to 6 parameters.
x86-64 (Linux, BSD, OSX, etc.): %rdi, %rsi, %rdx, %r10, %r8 and %r9, syscall number in %rax.
In comparison, regular function calls on x86-64 use %rdi, %rsi, %rdx, %rcx, %r8 and %r9. Only the fourth register differs (%r10 instead of %rcx), because the syscall instruction overwrites %rcx with the return address.
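The x86-64 register assignment above can be tried directly from a user program. The following is a minimal sketch, assuming Linux and GCC or Clang; raw_syscall4() is a hypothetical helper name, not part of any library. On other architectures it falls back to libc's syscall() so that it still compiles.

```c
#define _GNU_SOURCE
#include <assert.h>        /* assert(), for the usage line in the note below */
#include <sys/syscall.h>   /* SYS_getpid and friends */
#include <unistd.h>        /* syscall(), getpid() */

/* Issue a system call with up to four arguments.
 * x86-64 Linux: number in %rax, args in %rdi, %rsi, %rdx, %r10.
 * The syscall instruction itself clobbers %rcx and %r11. */
static long raw_syscall4(long nr, long a1, long a2, long a3, long a4)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* The 4th argument goes in %r10, not %rcx as in a normal call. */
    register long r10 __asm__("r10") = a4;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3), "r"(r10)
                     : "rcx", "r11", "memory");
    return ret;   /* on failure the raw return value is a negative errno */
#else
    /* Fallback (an assumption for portability): let libc do it. */
    return syscall(nr, a1, a2, a3, a4);
#endif
}
```

For example, assert(raw_syscall4(SYS_getpid, 0, 0, 0, 0) == getpid()) should hold, since getpid takes no arguments and cannot fail.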
The System V ABI specification (very readable):
System V Application Binary Interface AMD64 Architecture Processor Supplement
Figure 3.4: Register Usage (Page 23)
A.2 AMD64 Linux Kernel Conventions (Page 145)
The syscall calling convention was intentionally designed to match the function calling convention. This minimizes the syscall complexity at the user side.
glibc's syscall wrapper: syscall.S
System calls in the 64-bit xv6 kernel
What we know about the kernel part
sys_sbrk() in sysproc.c
Who called sys_sbrk() in the kernel?
syscall() in syscall.c
syscall() retrieves the user's %rax, which specifies the syscall number, from the trapframe and uses the table of function pointers to call sys_sbrk().
The return value of sys_sbrk() is saved to %rax in the trapframe.
"proc" is a per-cpu global variable defined at the top of vm.c. The use of __thread (thread-local storage, or TLS) is a trick: the kernel effectively uses a per-cpu variable to access its current user process. With multiple CPUs (cores), each CPU has its own current user process.
During syscall handling, the CPU may context-switch to resume another process by calling yield().
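The dispatch path described above (syscall number read from the saved %rax, a table of function pointers, the return value written back to %rax) can be modeled in a few lines. This is a user-space sketch, not the xv6 source: every toy_ name, the two syscall numbers, and the trimmed-down trapframe are made up for illustration, and _Thread_local stands in for the kernel's per-cpu "proc".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical trapframe holding only the registers this sketch needs. */
struct toy_trapframe {
    uint64_t rax;   /* in: syscall number, out: return value */
    uint64_t rdi;   /* first syscall argument */
};

struct toy_task {
    struct toy_trapframe *tf;
    uint64_t sz;    /* toy "program break" */
};

/* Stand-in for the per-cpu current process. */
static _Thread_local struct toy_task *toy_cur;

enum { TOY_SYS_getsz = 1, TOY_SYS_sbrk = 2 };   /* made-up numbers */

static int64_t toy_sys_getsz(void) { return (int64_t)toy_cur->sz; }

/* Reads its argument from the saved %rdi and returns the old break.
 * No bounds checks: this only illustrates the control flow. */
static int64_t toy_sys_sbrk(void)
{
    int64_t inc = (int64_t)toy_cur->tf->rdi;
    uint64_t old = toy_cur->sz;
    toy_cur->sz = old + (uint64_t)inc;
    return (int64_t)old;
}

/* Table of function pointers indexed by syscall number. */
static int64_t (*const toy_syscalls[])(void) = {
    [TOY_SYS_getsz] = toy_sys_getsz,
    [TOY_SYS_sbrk]  = toy_sys_sbrk,
};

/* Dispatch: look up the saved %rax, call the handler, and store the
 * result back into %rax. Unknown numbers yield -1. */
static void toy_syscall(struct toy_task *t)
{
    uint64_t n = sizeof(toy_syscalls) / sizeof(toy_syscalls[0]);
    uint64_t num = t->tf->rax;

    toy_cur = t;
    if (num > 0 && num < n && toy_syscalls[num])
        t->tf->rax = (uint64_t)toy_syscalls[num]();
    else
        t->tf->rax = (uint64_t)-1;
}
```

The handlers take no C arguments. They find both the process and its parameters through the current-process pointer, which is why the kernel needs a cheap way to reach "proc" on every CPU.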
Who called syscall()?
syscall_entry in trapasm.S
Why is this called from assembly code? -- manipulating registers
What are those pushes?
struct trapframe in x86.h -- architecture-dependent definitions.
Something is already on the trapframe before it starts to push.
There is no free register at the entry point. %rax needs to be saved in a strange place before it can be freely used.
Why CPU entered syscall_entry?
See syscallinit(), vm.c
After executing the syscall instruction, the CPU switches to ring 0 (the "kernel mode", or "privileged mode") and sets up a few registers: %rcx receives the user's return address and %r11 receives the user's %rflags.
xv6 does not support user-space %fs/%gs. The kernel uses %fs.
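What the syscall instruction does to the register state can be written down as a small model. This is a sketch under stated assumptions: struct toy_cpu and toy_exec_syscall() are hypothetical names, segment-register loading from IA32_STAR is left out, and the field names follow the MSRs that a routine such as syscallinit() would typically program (IA32_LSTAR for the entry point, IA32_FMASK for the flag mask).

```c
#include <assert.h>
#include <stdint.h>

/* Toy CPU state: only what the syscall instruction touches here. */
struct toy_cpu {
    uint64_t rip, rflags, rcx, r11, rsp;
    unsigned cpl;          /* privilege level: 3 = user, 0 = kernel */
    uint64_t msr_lstar;    /* IA32_LSTAR: kernel entry point */
    uint64_t msr_fmask;    /* IA32_FMASK: RFLAGS bits cleared on entry */
};

/* Model of `syscall`; next_rip is the address of the instruction that
 * follows it in the user program. */
static void toy_exec_syscall(struct toy_cpu *c, uint64_t next_rip)
{
    c->rcx = next_rip;            /* user return address saved in %rcx */
    c->r11 = c->rflags;           /* user flags saved in %r11 */
    c->rflags &= ~c->msr_fmask;   /* e.g. mask interrupts during entry */
    c->rip = c->msr_lstar;        /* jump to the kernel entry point */
    c->cpl = 0;                   /* now in ring 0 */
    /* %rsp is deliberately left alone: the kernel starts on the user's
     * stack pointer and must switch stacks by itself. */
}
```

Because %rsp and all other general-purpose registers still hold user values on entry, the entry code has nothing free to work with, which is the problem the per-cpu segment register (and swapgs on Linux) solves.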
How Linux saves the user context in this situation:
arch/x86/entry/entry_64.S, entry_SYSCALL_64 (corresponds to syscall_entry in xv6)
swapgs -- why do we need this instruction? (user's gs, no free registers at the entry point, ...)
Making a syscall from userspace
Recap: the kernel can retrieve syscall parameters, passed in registers, from the user's trapframe.
How does a user program pass them to the kernel?
gdb demo with hellobrk
$ nm _hellobrk # find the address of sbrk() at 0x498
in term1 $ make qemu-gdb # listens at port 26000, (or could be 260xx. use $ ss -tln to check for listening ports)
in term2 $ gdb # this automatically attaches to qemu, which is made possible by the magical Makefile...
>>> c # first let it boot and bring up the shell
>>> use "Ctrl + c" to interrupt the execution
>>> sym _hellobrk # switch to _hellobrk symbols
>>> b sbrk # break at sbrk()
>>> disass sbrk (or x/10i sbrk) # see the assembly code
>>> si # for single-step instruction
>>> info reg # see registers (the first parameter is in %rdi)
...
>>> sym kernel # when entering the kernel space, switch back to use the kernel debugging information.
GDB TUI commands https://sourceware.org/gdb/current/onlinedocs/gdb/TUI-Commands.html
Other OSes:
The 32-bit xv6 uses the user stack to pass syscall parameters: same usys.S, different argint().
JOS (the OS used in MIT's 6.828 labs) uses registers to pass syscall parameters: extra settings at the user space, user code, kernel code
// hellobrk.c
#include "types.h"
#include "user.h"

void onesbrk(int inc) {
  printf(1, "brk(%d) old brk is %d\n", inc, sbrk(inc));
}

int main(int argc, char ** argv) {
  // sbrk(n): set the new brk at old brk + n. n can be positive, zero, or negative.
  // returns the old brk on success (non-negative), returns negative number on error
  int xs[10] = {0, 1, 21, 64, 50000, 0, -50000, -50000, 0, -5000};
  for (int i = 0; i < 10; i++)
    onesbrk(xs[i]);
  exit();
}
Hunt for a kernel bug (or a new feature) with this tiny program.
The brk system call in Linux http://man7.org/linux/man-pages/man2/brk.2.html
Two functions are provided to user space programs: brk() and sbrk()
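The same experiment can be tried on Linux. The sketch below assumes glibc's sbrk() from unistd.h; move_break() is a hypothetical helper that reports how far the break actually moved.

```c
#define _DEFAULT_SOURCE
#include <assert.h>    /* assert(), for the usage line in the note below */
#include <stdint.h>
#include <unistd.h>    /* sbrk() */

/* Move the program break by inc bytes (inc may be negative or zero).
 * Returns the distance the break moved, or 0 if sbrk() failed. */
static intptr_t move_break(intptr_t inc)
{
    void *old = sbrk(inc);          /* returns the previous break */
    if (old == (void *)-1)
        return 0;                   /* failure: break unchanged */
    void *now = sbrk(0);            /* sbrk(0) just reads the break */
    return (intptr_t)((char *)now - (char *)old);
}
```

In a single-threaded program with no allocation happening in between, assert(move_break(4096) == 4096) followed by assert(move_break(-4096) == -4096) should hold. Mixing sbrk() with malloc() is best avoided, since the allocator may also move the break.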
IDT
Interrupts: external events that need attention. e.g., received keystroke, timer alarm.
Exceptions: internal events caused by the running instruction flow. e.g., divide-by-zero, memory access violation.
When interrupts or exceptions are sent to the processor, the current execution flow will be immediately interrupted.
The CPU (core) will need to transfer control to a pre-defined handling procedure. The execution flow must be preserved so it can be resumed later (unless it will be terminated).
On x86, the Interrupt Descriptor Table (IDT) tells the CPU where to go for an interrupt or exception.
vectors.S (generated by vectors.pl) # the entry points for different event types
traps.h # what can be handled
trap.c:tvinit() -> mkgate() # the IDT is filled by the boot processor
trap.c:idtinit() # the table is loaded by EVERY processor
trap() # the C function that handles all events. tf has been prepared by alltraps in trapasm.S
lapic.c: lapicinit() # enable several interrupts and the controller
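To make "where to go" concrete, here is a sketch of one 64-bit IDT entry and a helper that fills it. The 16-byte layout follows the x86-64 gate descriptor format; struct idt_gate64, make_gate() and gate_handler() are hypothetical names and do not reproduce xv6's mkgate().

```c
#include <assert.h>
#include <stdint.h>

/* One x86-64 IDT entry (interrupt or trap gate), 16 bytes. */
struct idt_gate64 {
    uint16_t off_15_0;    /* handler address bits 0..15 */
    uint16_t selector;    /* kernel code segment selector */
    uint8_t  ist;         /* bits 0..2: interrupt stack table index */
    uint8_t  type_attr;   /* P (bit 7) | DPL (bits 5..6) | type (bits 0..3) */
    uint16_t off_31_16;   /* handler address bits 16..31 */
    uint32_t off_63_32;   /* handler address bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));

_Static_assert(sizeof(struct idt_gate64) == 16, "gate must be 16 bytes");

/* Build a present gate. is_trap selects a trap gate (type 0xF, leaves
 * interrupts enabled) over an interrupt gate (type 0xE). dpl is the
 * lowest privilege allowed to invoke it with the int instruction. */
static struct idt_gate64 make_gate(uint64_t handler, uint16_t selector,
                                   unsigned dpl, int is_trap)
{
    struct idt_gate64 g = {0};
    g.off_15_0  = (uint16_t)(handler & 0xffff);
    g.off_31_16 = (uint16_t)((handler >> 16) & 0xffff);
    g.off_63_32 = (uint32_t)(handler >> 32);
    g.selector  = selector;
    g.type_attr = (uint8_t)(0x80 | ((dpl & 3) << 5) | (is_trap ? 0xF : 0xE));
    return g;
}

/* Reassemble the handler address the CPU would jump to. */
static uint64_t gate_handler(const struct idt_gate64 *g)
{
    return (uint64_t)g->off_15_0 |
           ((uint64_t)g->off_31_16 << 16) |
           ((uint64_t)g->off_63_32 << 32);
}
```

A table of 256 such entries, one per vector, is what the boot processor fills in and what every processor then loads with lidt.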
Some history of syscall
Originally, user programs used the "software interrupt" instruction (int) to invoke system calls. It emulates an interrupt on the calling processor, and control is then transferred to the kernel.
int $0x80
System calls borrow/reuse the interrupt/exception handling mechanism.
However, the interrupt mechanism is very expensive, as it is designed to interrupt the execution flow at ANY moment. Few assumptions can be made about the execution flow.
For example, the user could be using any of the general-purpose registers, so the interrupt handler must preserve EVERY general-purpose register.
x86 provides dedicated instructions for system calls.
sysenter on 32-bit CPUs (old; you can ignore it)
syscall (64-bit CPU, the standard today)
The setup and usage of the syscall mechanism can be learned from xv6 and the instruction's documentation
vm.c: syscallinit() # let CPU know the entry points
trapasm.S: syscall_entry # the syscall entry point
On older versions of xv6-64, you may still see T_SYSCALL and its handling on the regular trap path. With the recent updates, the syscall code path has been separated from the trap code path to avoid unnecessary confusion (9/8/2019).
Extended reading: Linux/Kernel network flow