Concurrency and Locking

Concepts you already know:

Types of synchronization mechanisms

Locks in the Linux kernel: https://www.kernel.org/doc/html/latest/locking/index.html

Always think of race conditions when there are two or more execution flows accessing a shared resource.

Lock

The goal is to establish consensus between multiple independently-running threads.

Naively, a client-server model can achieve correct locking: there is one server that manages who holds the lock, and every process talks to the server to acquire it. This model can be implemented in different ways, but messages need to be passed between the clients and the server, and the implementation is non-trivial.

In performance-critical systems, locking must be very efficient.

lock();

a += 1;

b -= 1;  // a+b == constant

unlock();

Quite often a critical section has only a few instructions, can you afford to use a 10000-cycle lock()/unlock()?

To achieve fast locking (or other forms of synchronization), hardware provides the first layer of the mechanism: atomic instructions.

Atomic operations

C11 has native support for atomic operations: https://en.cppreference.com/w/c/atomic

Before C11, GCC had already been providing primitives for atomic operations: the __sync* builtins.

Let's first see some demo code

Atomicity is not cheap:

C11 atomic operations can be directly used to implement the lock  https://en.cppreference.com/w/c/atomic

There is no magic in the code. Each of these operations corresponds to one or a few CPU instructions.

If your CPU does not provide some or all of these instructions, locking can be impossible.

x86: lock prefix + instruction. (It's still one instruction)

ARM: LDREX/STREX (load-exclusive / store-exclusive)

MIPS: LL/SC (load-linked / store-conditional)

Similar instruction pairs are used by several RISC CPUs (PowerPC, RISC-V, Alpha, ...).

Costs: https://www.agner.org/optimize/instruction_tables.pdf

spinlock on xv6

spinlock.c acquire()

void lock() {  // pseudo-code

  pushcli();  // disable interrupts

  while (xchg(&lock, 1) == 1);  // atomically swap 1 in; the old value is returned

}

The thread that successfully received 0 gets the lock.

Unlock: set the lock back to 0; re-enable interrupts.

The use of acquire() and release() marks a non-preemptive critical section.

pushcli/popcli

Compiler fence and memory fence

It prevents the compiler or the CPU from reordering the loads/stores before and after the call.

Compiler reordering:

Out-of-order execution:

It is hard to pin down the actual hardware behavior and the necessity of the fence instructions: "Architectural specifications are 'loose' to cover past and future implementations."

Reading: x86-TSO

(Spinlock)

Pros:

Cons:

Reduce lock contention: multiple locks (e.g., for different slots in hash tables)

Extended reading:

pthread_spin_lock.S

The lock variable is initialized to 1 (1 means free).

ENTRY(pthread_spin_lock)

1:

  LOCK; decl  0(%rdi) # atomic --(*lock); %rdi (the 1st argument) is the pointer to the lock 

  jne  2f             # jump if ZF==0, i.e., the decl result != 0. same as jnz (Jump if Not Zero); 2f means the first label "2" forward

  xor  %eax, %eax     # Return 0: lock acquired! This thread has changed the lock from 1 to 0 atomically.

  ret  # Challenge question: what can be found in (*lock) when this function returns?


  .align  16  # align the jump entry point

2:

  rep; nop  # this is actually a PAUSE instruction. Explained here

  cmpl  $0, 0(%rdi)   # AT&T order: computes *lock - 0 and sets the flags (the operand order can be confusing)

  jg  1b  # if *lock > 0, then goto 1: (the first label "1" backward)

  jmp  2b # still locked. goto 2: (spinning); 2b means the first label "2" backward

END(pthread_spin_lock)


ENTRY(pthread_spin_unlock)

  movl  $1, (%rdi)  # lock = 1;

  xorl  %eax, %eax

  retq

END(pthread_spin_unlock)

An example for ARM

Sleeplock

When it fails to acquire the lock on the first attempt, a sleeplock puts the caller to sleep, since it does not expect the lock to be released very soon.

xv6's sleeplock: sleeplock.c

acquiresleep()

pthread's mutex lock is a sleeplock

It first checks if the lock can be acquired with a few attempts of atomic compare-exchange.

If those attempts fail, it implies that someone has acquired the lock and will potentially spend some time holding it.

It then calls the futex() system call to put itself to sleep. It will be woken up when the lock is released by the other thread, much like the sleeplock in xv6.

In glibc:

Linux Futexes explained: https://eli.thegreenplace.net/2018/basics-of-futexes/

Deadlock and Livelock

The discussion here is about how to USE locks correctly, not how to IMPLEMENT locks.

Deadlock

Be careful about deadlocks when 2+ threads are working with 2+ locks.

lock_t locks [3];

int values [3] = {1,2,3}; // sum(values) == 6


// There are two threads:

void thread_body()

{

  while (true) {                  // thread A      thread B

    int i = random() % 3;         //  i = 1         i = 2

    int j = random() % 3;         //  j = 2         j = 1

    lock(&locks[i]);

    lock(&locks[j]);              // the two threads MAY block here

    values[i]++;                  // ** there is a bug in this code.

    values[j]--;

    unlock(&locks[i]);            // does the order of unlocking matter?

    unlock(&locks[j]);

  }

}

Solution 1: always acquire locks in a pre-determined order.

lock_t locks [3];

int values [3] = {1,2,3}; // sum(values) == 6


// There are two threads:

void thread_body()

{

  while (true) {

    int i = random() % 3;

    int j = random() % 3;

    lock(&locks[i < j ? i : j]);

    if (i != j)

      lock(&locks[i > j ? i : j]);

    values[i]++;

    values[j]--;

    unlock(&locks[i]);

    if (i != j)

      unlock(&locks[j]);

  }

}

Other solutions: "back off" and retry. But it has the risk of starvation -- there is no guarantee when you will get those locks.

Livelock: "A real-world example of livelock occurs when two people meet in a narrow corridor, and each tries to be polite by moving aside to let the other pass, but they end up swaying from side to side without making any progress because they both repeatedly move the same way at the same time."

The goal of OS design:

Linux's lockdep: Runtime locking correctness validator

https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c

Priority inversion

Background: in real-time systems, a task of the highest priority should always run without waiting, unless it is sleeping. ("hard deadline")

When does a high-priority task sleep?

Example: low priority <-- Task 1 < Task 2 < Task 3 --> high priority

Task 1 is running because Task 2 and 3 are sleeping (waiting for timeout).

Task 1:

Now, a timer interrupt is sent to the CPU, kernel wakes up task 3.

Task 3:

Task 1 now can run:

Task 3 now can continue to run and acquire the lock.

This may block Task 3, but that could be acceptable in real systems -- a hard deadline != immediate execution.

The real priority-inversion example: Task 2?

Solution

If Task X is waiting for a resource that is currently held by Task Y (of lower priority), Y should be temporarily granted the same priority as Task X.