Concurrency and Locking
Concepts you already know:
race condition
critical section
Types of synchronization mechanisms
Atomic operations
spinlocks
Semaphores ("sleeplock")
Reader-writer locks
...
Locks in the Linux kernel: https://www.kernel.org/doc/html/latest/locking/index.html
Always think of race conditions when there are two or more execution flows accessing a shared resource.
Obviously, on multi-core & multi-threaded user programs
On multi-core kernels
Kernel on a single-core CPU? You still need to think of locking unless your task is non-preemptive.
Lock
The goal is to establish consensus between multiple independently-running threads.
At any given time, only one thread can have the lock.
The other unlucky threads must know that they didn't get the lock. But they usually don't care who has the lock right now.
Naively, a client-server model can achieve correct locking: there is one server who manages who has the lock, and every process talks to the server to acquire it. This model can be implemented in different ways, but messages must be passed between the clients and the server, and the implementation is non-trivial.
In performance-critical systems, locking must be very efficient.
lock();
a += 1;
b -= 1; // a+b == constant
unlock();
Quite often a critical section has only a few instructions, can you afford to use a 10000-cycle lock()/unlock()?
To achieve fast locking (or other forms of synchronization), hardware provides the first layer of mechanism: atomic instructions.
Atomic operations
C11 has native support for atomic operations: https://en.cppreference.com/w/c/atomic
Before C11, GCC had been providing primitives for atomic operations: the __sync_* builtins.
On x86_64, atomic operations can handle values of 64 bits or smaller.
Let's first see some demo code
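A minimal demo of what that code might look like (this is a sketch, not the actual course demo; run_demo, worker, and the counter names are chosen here): two threads increment both a plain int and a C11 atomic_int. Only the atomic counter reliably reaches the expected total.

```c
#include <pthread.h>
#include <stdatomic.h>

#define N 1000000

int plain = 0;          // unprotected: concurrent ++ can lose updates
atomic_int acnt = 0;    // C11 atomic counter

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) {
        plain++;                        // compiles to load / add / store
        atomic_fetch_add(&acnt, 1);     // one indivisible read-modify-write
    }
    return NULL;
}

// Run two workers; returns the final value of the atomic counter.
// The plain counter usually ends up below 2*N on a multi-core machine.
int run_demo(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&acnt);
}
```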
Atomicity is not cheap:
One statement in C can be compiled into multiple instructions.
One instruction can be translated into multiple micro-operations by the CPU.
Special instructions must be used to enable atomicity, and they are usually 10x to 100x more expensive than regular arithmetic instructions.
https://stackoverflow.com/questions/10109679/how-come-inc-instruction-of-x86-is-not-atomic
C11 atomic operations can be directly used to implement the lock https://en.cppreference.com/w/c/atomic
simple operation on one atomic value: +, -, and, or, xor, etc..
xchg, "exchange": this operation can be used to implement some (simple) locks, such as a spinlock.
CAS, "compare and swap", or compare-exchange.
There is no magic in the code. Each of these operations corresponds to one or a few CPU instructions.
If your CPU does not provide some or all of these instructions, locking can be impossible.
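As a concrete illustration, here is a sketch of a spinlock built directly on C11 atomics: atomic_exchange maps to xchg, and atomic_compare_exchange_strong maps to cmpxchg (the names spin_lock/spin_trylock/spin_unlock are chosen here, not a library API):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int locked; } spinlock_t;   // 0 = free, 1 = held

// xchg-based: atomically write 1; if the old value was 0, we got the lock
static void spin_lock(spinlock_t *lk) {
    while (atomic_exchange(&lk->locked, 1) == 1)
        ;   // spin: someone else still holds the lock
}

// CAS-based try-lock: succeeds only if it changes the lock from 0 to 1
static bool spin_trylock(spinlock_t *lk) {
    int expected = 0;
    return atomic_compare_exchange_strong(&lk->locked, &expected, 1);
}

static void spin_unlock(spinlock_t *lk) {
    atomic_store(&lk->locked, 0);   // release: set the lock back to 0
}
```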
x86: lock prefix + instruction. (It's still one instruction)
lock inc +1
lock add +n
lock xadd +n and return old value
lock xchg exchange value between reg and mem
lock cmpxchg compare mem with a reg, exchange if equal.
ARM:
LDREX load mem and track its "exclusive" status
STREX store a value to mem if the status still holds
MIPS:
LL Load-linked
SC Store-Conditional
Similar instruction pairs are used by several RISC CPUs (PowerPC, RISC-V, Alpha, ...)
Costs: https://www.agner.org/optimize/instruction_tables.pdf
spinlock on xv6
spinlock.c acquire()
void lock() { // pseudo code
pushcli(); // disable interrupts
while (xchg(&lk->locked, 1) == 1); // spin until the old value is 0
}
The thread that receives the old value 0 from xchg gets the lock.
Unlock: set the lock back to 0; enable interrupt.
The use of acquire() and release() marks a non-preemptive critical section.
The reason to disable interrupts is that the interrupted execution may never recover: for example, if the interrupt handler tries to acquire the same spinlock, it will spin forever. This usually applies to the interrupt-handler context.
Sometimes the programmer gets spoiled: even for preemptible code, disabling interrupts can keep the code simple and lower the chance of introducing bugs, at the cost of extra CPU cycles.
See spinlocks in Linux: "They are the most safe ones, and the ones that work under all circumstances, but partly _because_ they are safe they are also fairly slow. They are slower than they'd need to be, because they do have to disable interrupts (which is just a single instruction on a x86, but it's an expensive one - and on other architectures it can be worse)"
pushcli/popcli
There is a counter recording how many times pushcli has been called without being cancelled out by a corresponding popcli.
The first pushcli disables the interrupt; the last popcli enables the interrupt.
Example: pushcli(disable) -> pushcli(do-nothing) -> popcli(do-nothing) -> popcli(enable)
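The nesting behavior above can be sketched like this (a sketch of xv6's logic; cli()/sti() stand in for the real interrupt-disable/enable instructions, and irq_on is a hypothetical stand-in for the CPU interrupt flag):

```c
static int irq_on = 1;          // stand-in for the CPU interrupt-enable flag
static void cli(void) { irq_on = 0; }
static void sti(void) { irq_on = 1; }

static int ncli = 0;            // nesting depth of pushcli calls
static int intena = 0;          // were interrupts on before the first pushcli?

void pushcli(void) {
    int was_enabled = irq_on;
    cli();                      // always disable first, then record state
    if (ncli == 0)
        intena = was_enabled;
    ncli++;
}

void popcli(void) {
    if (--ncli == 0 && intena)
        sti();                  // only the outermost popcli re-enables them
}
```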
Compiler fence and memory fence
It prevents the compiler or the CPU from reordering the loads/stores before and after the call.
Compiler reordering:
the compiler may reorder the statement "lk->cpu = cpu" above the while loop.
The fence tells the compiler to not reorder them.
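With GCC/Clang, the two kinds of fences can be written as follows (a sketch; barrier macros like these appear in many kernels, and the names compiler_fence/cpu_fence are chosen here):

```c
// Compiler fence: the empty asm with a "memory" clobber tells the compiler
// that memory may change here, so it cannot move loads/stores across this
// point. It emits no machine instruction and does not constrain the CPU.
#define compiler_fence() __asm__ __volatile__("" ::: "memory")

// Full memory fence: also orders the CPU (GCC emits mfence on x86_64).
#define cpu_fence() __sync_synchronize()
```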
Out-of-order execution:
Memory-fence instructions are used to prevent the CPU from reordering loads and stores; you can see "mfence" in kernel.asm.
However, since the "locked" compare-exchange already orders memory operations implicitly, the mfence instruction is not actually needed here. Only the compiler fence is necessary on x86.
It is hard to elaborate the actual hardware behavior and the necessity of the fence instructions: "Architectural specifications are “loose” to cover past and future implementations"
Reading: x86-TSO
(Spinlock)
Pros:
Shortest response time and turn-around time
Simple implementation
Cons:
Wastes CPU cycles with long critical sections or heavy contention. Even worse when interrupts are disabled!
Reduce lock contention: multiple locks (e.g., for different slots in hash tables)
Challenge: a large number of locks can compromise cache efficiency. Each lock takes an entire cache line (to avoid false sharing between locks).
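A per-bucket lock padded to a cache line can be sketched like this (64-byte lines are an assumption that holds on most x86 CPUs; the struct and array names are chosen here):

```c
#include <stdatomic.h>

#define CACHE_LINE 64   // typical x86 cache-line size (an assumption)
#define NBUCKET 16

// One lock per hash-table bucket reduces contention; aligning each lock to
// a full cache line keeps two locks from sharing a line (false sharing).
struct padded_lock {
    _Alignas(CACHE_LINE) atomic_int locked;
    // the compiler pads the struct out to CACHE_LINE bytes
};

struct padded_lock bucket_locks[NBUCKET];
```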
Extended reading:
Ticket lock: https://en.wikipedia.org/wiki/Ticket_lock
MCS lock: https://lwn.net/Articles/590243/
queue spinlock in linux: https://lwn.net/Articles/533636/ current kernel code
glibc's pthread_spin_lock: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/nptl/pthread_spin_lock.S;hb=HEAD
pthread_spin_lock.S
the lock variable is initialized to 1.
ENTRY(pthread_spin_lock)
1:
LOCK; decl 0(%rdi) # atomic --(*lock); %rdi (the 1st argument) is the pointer to the lock
jne 2f # jump if the result is not zero (ZF == 0), i.e., the lock was already taken; same as jnz (Jump if Not Zero); 2f means the first label "2" forward
xor %eax, %eax # Return 0: lock acquired! This thread has changed the lock from 1 to 0 atomically.
ret # Challenge question: what can be found in (*lock) when this function returns?
.align 16 # align the jump entry point
2:
rep; nop # this is actually a PAUSE instruction. Explained here
cmpl $0, 0(%rdi) # AT&T order: computes *lock - 0 and sets the flags (i.e., tests whether *lock > 0)
jg 1b # if *lock > 0, then goto 1: (the first label "1" backward)
jmp 2b # still locked. goto 2: (spinning); 2b means the first label "2" backward
END(pthread_spin_lock)
ENTRY(pthread_spin_unlock)
movl $1, (%rdi) # lock = 1;
xorl %eax, %eax
retq
END(pthread_spin_unlock)
An example for ARM
It is a spinlock if "WAIT_FOR_UPDATE" is a no-op.
Sleeplock
When it fails to acquire the lock on the first attempt, a sleeplock puts the thread to sleep, since it does not expect the lock to be released very soon.
A sleeplock must work in the context of a "thread" (either kernel or user), as it must save the current execution flow and resume it later.
Expecting to wait for an uncertain amount of time, a sleeplock voluntarily yields to other threads so it doesn't waste CPU cycles.
Sleeping also gives other threads a better chance to finish their tasks, which MAY potentially shorten the lock-holder's critical section.
xv6's sleeplock: sleeplock.c
It uses a spinlock to protect the lock structure
With the spinlock, at any given time, only one thread can read and change the sleeplock's status -- lk->locked
acquiresleep()
wait until lk->locked == 0, which means no one has the lock.
sleep() releases the spinlock and sleeps. After waking up, it re-acquires the spinlock, so at any given time only one thread can run outside the call to sleep() and between acquire() and release().
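The same structure can be mirrored in user space with a pthread mutex and condition variable (a sketch, not xv6's kernel code: the mutex stands in for xv6's internal spinlock, and pthread_cond_wait plays the role of sleep()/wakeup()):

```c
#include <pthread.h>
#include <stdbool.h>

struct sleeplock {
    pthread_mutex_t mu;     // protects `locked` (xv6 uses a spinlock here)
    pthread_cond_t  cond;   // the "sleep channel"
    bool locked;
};

void acquiresleep(struct sleeplock *lk) {
    pthread_mutex_lock(&lk->mu);
    while (lk->locked)                          // someone has the lock:
        pthread_cond_wait(&lk->cond, &lk->mu);  // release mu and sleep;
                                                // re-acquires mu on wakeup
    lk->locked = true;
    pthread_mutex_unlock(&lk->mu);
}

void releasesleep(struct sleeplock *lk) {
    pthread_mutex_lock(&lk->mu);
    lk->locked = false;
    pthread_cond_signal(&lk->cond);             // wake up one sleeper
    pthread_mutex_unlock(&lk->mu);
}
```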
pthread's mutex lock is a sleeplock
It first checks if the lock can be acquired with a few attempts of atomic compare-exchange.
If that fails, it implies that someone has acquired the lock and will potentially spend some time with it.
It then calls the futex() system call to put itself to sleep. It will be woken up when the lock is released by the other thread, much like the sleeplock in xv6.
In glibc:
nptl/pthread_mutex_lock.c
sysdeps/unix/sysv/linux/lowlevellock-futex.h
Linux Futexes explained: https://eli.thegreenplace.net/2018/basics-of-futexes/
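The fast-path/slow-path structure can be sketched as a minimal futex-based mutex (a sketch following the classic three-state design from the futex literature, not glibc's actual code; state, mutex_lock, and mutex_unlock are names chosen here; Linux-only):

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

// Lock states: 0 = unlocked, 1 = locked, 2 = locked with possible waiters.
static atomic_int state = 0;

static void futex_wait(atomic_int *addr, int val) {
    // sleep only if *addr still equals val (the kernel re-checks atomically)
    syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void futex_wake(atomic_int *addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

void mutex_lock(void) {
    int c = 0;
    // fast path: uncontended case is one compare-exchange, no system call
    if (atomic_compare_exchange_strong(&state, &c, 1))
        return;
    // slow path: mark the lock contended (2), then sleep in the kernel
    if (c != 2)
        c = atomic_exchange(&state, 2);
    while (c != 0) {
        futex_wait(&state, 2);
        c = atomic_exchange(&state, 2);
    }
}

void mutex_unlock(void) {
    // if the old state was 2, some thread may be sleeping: wake one up
    if (atomic_exchange(&state, 0) == 2)
        futex_wake(&state);
}
```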
Deadlock and Livelock
The actual discussion is about how to USE locks correctly, not how to IMPLEMENT locks.
Deadlock
Be careful about deadlocks when 2+ threads are working with 2+ locks.
Thread A has acquired lock X and wants to acquire lock Y.
Thread B has acquired lock Y and wants to acquire lock X.
lock_t locks [3];
int values [3] = {1,2,3}; // sum(values) == 6
// There are two threads:
void thread_body()
{
while (true) { // thread A thread B
int i = random() % 3; // i = 1 i = 2
int j = random() % 3; // j = 2 j = 1
lock(&locks[i]);
lock(&locks[j]); // the two threads MAY block here
values[i]++; // ** there is a bug in this code. (hint: what happens when i == j?)
values[j]--;
unlock(&locks[i]); // does the order of unlocking matter?
unlock(&locks[j]);
}
}
Solution 1: always acquire locks in a pre-determined order.
lock_t locks [3];
int values [3] = {1,2,3}; // sum(values) == 6
// There are two threads:
void thread_body()
{
while (true) {
int i = random() % 3;
int j = random() % 3;
lock(&locks[i < j ? i : j]);
if (i != j)
lock(&locks[i > j ? i : j]);
values[i]++;
values[j]--;
unlock(&locks[i]);
if (i != j)
unlock(&locks[j]);
}
}
Other solutions: "back off" and retry. But this risks starvation: there is no guarantee when you will get those locks.
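The back-off approach can be sketched with pthread_mutex_trylock (a sketch; lock_both is a helper name chosen here):

```c
#include <pthread.h>
#include <sched.h>

// "Back off and retry": take the first lock, *try* the second; if the second
// is busy, release the first and start over instead of blocking while holding
// a lock. No deadlock is possible, but a thread may retry indefinitely
// (starvation: there is no bound on the number of retries).
void lock_both(pthread_mutex_t *a, pthread_mutex_t *b) {
    for (;;) {
        pthread_mutex_lock(a);
        if (pthread_mutex_trylock(b) == 0)
            return;                  // got both locks
        pthread_mutex_unlock(a);     // back off: never block while holding a
        sched_yield();               // let the other thread make progress
    }
}
```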
Livelock: "A real-world example of livelock occurs when two people meet in a narrow corridor, and each tries to be polite by moving aside to let the other pass, but they end up swaying from side to side without making any progress because they both repeatedly move the same way at the same time."
The goal of OS design:
Your program/kernel should keep making progress.
Linux's lockdep: Runtime locking correctness validator
https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c
Use spinlocks in the interrupt context (Linux kernel): disable/save and enable/restore IRQ.
https://www.kernel.org/doc/Documentation/locking/spinlocks.txt
https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h
https://www.kernel.org/doc/html/latest/locking/locktypes.html
Priority inversion
Background: in real-time systems, a task of the highest priority should always run without waiting, unless it is sleeping ("hard deadline").
When does a high-priority task sleep?
For some good reason: a process sends out heartbeats every second. It sleeps most of the time.
Resource sharing: the memory allocator is protected by a mutex lock.
Example: low priority <-- Task 1 < Task 2 < Task 3 --> high priority
Task 1 is running because Task 2 and 3 are sleeping (waiting for timeout).
Task 1:
calls kalloc() to get some memory
in kalloc(), Task 1 acquired the lock that protects the free-pages list.
Now, a timer interrupt is sent to the CPU, kernel wakes up task 3.
task 3:
also calls kalloc() to get memory
in kalloc(), Task 3 tries to acquire the same lock, but it is already held (by Task 1).
Task 3 makes a system call to sleep on the "lock channel".
task 1 now can run:
kalloc() will finish its task.
unlock
... may continue to run for a little while.
preempted by Task 3.
Task 3 now can continue to run and acquire the lock.
This may block Task 3 but could be acceptable in real systems-- hard deadline != immediate execution.
If Task 1's critical section has fixed length, Task 3 can run after a deterministic time window (such as <= 1000 instructions).
It may look slow, but still deterministic.
The real priority-inversion example: Task 2?
Task 2 has the absolute right to run even if Task 1 is runnable (determined by their priorities).
Task 2 does not acquire any locks. It does not try to block anyone!
Because of Task 2's priority, Task 1 cannot run. The consequence is that Task 1 never gets a chance to release its lock, which in turn blocks Task 3.
With some reduction, we see that Task 2 can block Task 3, even though Task 3 has a higher priority.
Solution
If Task X is waiting for a resource that is currently held by Task Y (of lower priority), Y should be temporarily granted the same priority as Task X. This technique is known as priority inheritance.