Reader-writer locks and RCU

locks on xv6: spinlock/sleeplock

When to use lock, and what lock(s)?

On a single-core CPU without preemption, it's not necessary to use locks.
with preemption, a sleep lock would be sufficient.
With multi-cores, a spinlock may be faster for short critical sections.

Compared with spinlock, sleeplock does not stress the CPU but it does not always make the programs run faster.

While, yielding to other threads can potentially reduce the time on critical sections. But there is no guarantee.

So far all the locks we have seen causes serialization of the programs.

The program can run concurrently. How about parallelism?

The "Big Kernel Lock" (BKL)

Exploit more parallelism: reader-writer lock

Observation: Readers can concurrently access the shared resources without race condition.

Reader-writer lock (rwlock): allow multiple readers to acquire the lock at the same time.

Possible states of a rwlock (usually a 16-bit value on x86):

0, free, no one holds the lock
N, read-locked: N readers holds the lock, N >= 1
0x8000, write-locked: only one writer can hold the lock at a time. the highest bit of the value is set, assuming the total number of reader never reach 0x8000 (32768)

Simplified pseudo code of x86 writer-reader (spin) lock

(try to draw a state machine to see how it works)

Reader lock:

do {

unsigned short val = *lockptr;

if ((val & 0x8000) == 0) // try to read-lock it unless it's write-locked

if (atomic_compare_exchange(lockptr, &val, val + 1) == SUCCESS) // this may fail

return;

} while (true); // retry until locked

Writer lock:

do {

unsigned short val = *lockptr;

if (val == 0) // free now?

if (atomic_compare_exchange(lockptr, val, 0x8000) == SUCCESS) // this may fail

return;

} while (true);

reader unlock:

atomic_subtract(lockptr, 1);

writer unlock:

*lockptr = 0;

Readers are our "VIP" customers. We want to maximize the efficiency at the reader side.

Is there any room for improvement? Check the real cost of the reader-lock:

Reader needs to perform 2+ memory operations:

First, it performs a memory read to check the current value of the lock
Second, it uses the atomic-CAS to LOCK-read-modify-write the value

Also, there can be many unnecessary retries especially with many concurrent readers:

Suppose two readers are trying to acquire the lock at the same time, both reads a value of 0
However, only one reader can successfully change the value from 0 to 1. The other unlucky reader need to retry and change the value from 1 to 2

An improved reader-lock implementation

Reader-lock:

do {

if ((atomic_fetch_add(lockptr, 1) & 0x8000) == 0) // just try, don't "think twice"

return;

atomic_subtract(lockptr, 1); // undo the optimistic increment

} while (true);

This reader-lock only fails if a writer comes and grabs the lock.

only one LOCK-read-modify-write operation is used to acquire the lock.

This two improvements lead to more than 2x improvement compared to the previous version.

Starvation

OK, readers are all VIPs. What about the writers?

Although readers frequently comes and go, it's possible that the number of active readers never drop to 0.

writers may have to wait indefinitely.

One possible solution:

writer directly use a CAS to flip the highest bit (0x8000), and wait until all existing readers to leave. Note that once the highest bit is set, new readers will read that value and wait.

By doing this, we're possibly doing too much favor for the writer. Readers may also starve, or near-starve, when trying to acquire the lock.

Fairness

It's highly subjective.

One way to maintain "fairness" in reader-writer locks is to use "tickets" (we mentioned tickets in spinlock/sleeplock).

Ticket is a monotonically increasing number. By taking tickets, the threads form a line and the lock is taken in a first-come, first-serve fashion.

More details here

Fine-grained locking

Hash tables: N locks, each protects a few hash buckets.

Trees: one lock per node. user acquires and releases the locks on the path as it traverses on the tree.

RCU (Read-Copy-Update)

What are the remaining issues of reader-writer lock?

writers can block readers
the atomic-inc is slow

What if we want to let readers run even faster? RCU.

use RCU when you really don't care about the feeling of writers...

Writers have to do several expensive operations:

make a copy of the structure to-be-modified, and then apply the changes.
replace the structure with the new one by changing the pointers.
Wait for readers working on the old copy to leave and garbage-collect the old data.

Example: change the content of a node on a linked list.

Treat the list as a path, once the new node is ready, we change the pointer at the parent node to let new readers "detour" using the new path.

The difference between RCU and COW

RCU merely allows concurrent read/write operations. The only purpose of making copies is to prevent the program to "crash".
It does not automatically enforce "transaction" behavior on the structure. How to support that is beyond the scope of RCU.
COW can be considered as an efficient way to create and protect multiple versions of a structure.
When several operations are batched in one copy-on-write operation, the change are visible to others atomically. COW is not a synchronization technique.

(user-space) RCU implementations

QSBR RCU, and a few others

QSBR RCU

For "read-copy-update", changing the pointer to the new data exposes the updates to the readers.

That pointer is an integral part of the structure.

Readers:

It's (almost) free for readers: just "read" the structure following the pointers.
Every time a reader finished reading on the structure, it calls rcu_quiescent_state()

Writers:

First to the read-copy-update and update the pointer so future reads will use the new version of the data.
The risk is at GC: How do you know that all readers working on the old version have left?
Solution: the writer wait until every reader has called rcu_quiescent_state() (for at least once).

rcu_quiescent_state()

Each reader has a private variable. rcu_quiescent_state() simply perform "++" operation on that variable.
The writer reads it and waits until it changes to another value, which indicates that the corresponding reader has finished the work on the old version.

A simplified (and incomplete) user-space QSBR RCU

Suppose there is only ONE reader and ONE writer. Unregister() is also omitted.

(A realistic implementation should be able to handle multiple readers, and uses a lock to coordinate multiple writers.)

struct rcu { // the main rcu structure

void * pvar; // pointer to the shared variable

struct rcu_reader * reader;

};

struct rcu_reader { // each user's private structure.

void * last_pvar; // user record the last pvar it has seen

}

// reader calls this to register itself with the rcu

void rcu_register(struct rcu * rcu, struct rcu_reader * reader)

{

rcu->reader = reader;

}

// reader calls this to find the shared variable

void * rcu_read(struct rcu * rcu, struct rcu_reader * reader)

{

reader->last_pvar = rcu->pvar; // the only overhead is a memory write

return reader->last_pvar;

}

// writer calls this to perform update

void rcu_update(struct rcu * rcu, void * new_pvar)

{

void * old_pvar = rcu->pvar;

rcu->pvar = new_pvar;

while (rcu->reader->last_pvar != new_pvar)

sleep(1); // wait until the reader has been using new_pvar

free(old_pvar);

}