Mutexes, atomics, lockfree programming

Some rough lab notes on these topics to record the current state of my knowledge. I’m not an expert, so there may be inaccuracies.

Mutexes

  • On Linux, libpthread mutexes are implemented using the underlying futex syscall
  • They are basically a userspace spinlock/atomic fast path, backed by the kernel for wait/wake operations only when absolutely necessary (i.e. when there’s contention). In the common case of an uncontended lock acquire there is no syscall and no context switch, which improves performance (see the sketch after this list)
  • The userspace spinlock portion uses atomics as spinlocks usually do, specifically because the compare and set must be atomic
  • Jeff Preshing (see below) writes that each OS/platform has an analogous concept to this kind of “lightweight” mutex – Windows and macOS have them too
  • Before futex(2), other syscalls were used for blocking. One option might have been the semaphore API, but commit 56c910668cff9131a365b98e9a91e636aace337a in glibc is before futex, and it seems like they actually use signals. (pthread_mutex_lock -> __pthread_lock (still has spinlock elements, despite being before futex) -> suspend() -> __pthread_suspend -> __pthread_wait_for_restart_signal -> sigsuspend)
  • A primary advantage of futex over previous implementations is that futexes only require kernel resources when there’s contention
  • Like atomics, mutex implementations include memory barriers (maybe even implicitly, via the atomics they use) to prevent loads/stores from inappropriately crossing the lock/unlock boundary due to compiler and/or hardware instruction reordering optimizations
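
To make the fast path / slow path split concrete, here is a minimal sketch of a futex-backed lock – assuming Linux and C++11, loosely following the classic three-state design from Drepper’s “Futexes Are Tricky”, and definitely not production-ready:

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // state: 0 = unlocked, 1 = locked, 2 = locked with possible waiters
    struct FutexLock {
        std::atomic<int> state{0};

        static void futex(std::atomic<int>* addr, int op, int val) {
            // futex(2) operates on a plain 32-bit int; std::atomic<int> matches that here
            syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val, nullptr, nullptr, 0);
        }

        void lock() {
            int expected = 0;
            // Fast path: an uncontended acquire is a single atomic CAS -- no syscall,
            // no context switch.
            if (state.compare_exchange_strong(expected, 1, std::memory_order_acquire))
                return;
            // Slow path: mark the lock contended and sleep in the kernel until woken.
            // FUTEX_WAIT returns immediately if the value is no longer 2.
            while (state.exchange(2, std::memory_order_acquire) != 0)
                futex(&state, FUTEX_WAIT, 2);
        }

        void unlock() {
            // If the state was 2, a thread may be sleeping in the kernel -- wake one.
            if (state.exchange(0, std::memory_order_release) == 2)
                futex(&state, FUTEX_WAKE, 1);
        }
    };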

pthreads

  • pthread is the POSIX threads API and defines thread creation and management APIs, mutexes, reader-writer locks (rwlock), and condition variables. It also defines semaphores, but funnily enough those APIs aren’t prefixed with “pthread” like the others (see the snippet below).
  • There is also an older System V semaphore API (semget, semop)
  • Originally there was “LinuxThreads” in glibc – a basic threads API backed by clone(2). But it wasn’t POSIX compliant, so new thread libraries were created; the winner was NPTL by Red Hat, which is what glibc uses today.

https://en.wikipedia.org/wiki/Native_POSIX_Thread_Library
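
For reference, a tiny usage sketch of the APIs mentioned above (standard POSIX calls, nothing glibc-specific) – note the mutex functions carry the pthread_ prefix while the POSIX semaphore ones don’t:

    #include <pthread.h>
    #include <semaphore.h>

    pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;  // pthread_-prefixed
    sem_t sem;                                       // POSIX semaphore: no pthread_ prefix

    void example(void) {
        pthread_mutex_lock(&mu);
        /* ... critical section ... */
        pthread_mutex_unlock(&mu);

        sem_init(&sem, /*pshared=*/0, /*value=*/1);  // 0 = shared between threads, not processes
        sem_wait(&sem);
        sem_post(&sem);
        sem_destroy(&sem);
    }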

Atomics

  • Atomics are a foundational primitive provided by the hardware, required for any kind of concurrency/preemptive multitasking.
  • At the architectural level, they involve special instructions (or instruction prefixes, e.g. “lock” on x86) which have different semantics than their non-atomic memory instruction counterparts
  • They are roughly an order of magnitude (~10x) slower than their non-atomic counterparts (the exact cost depends on contention and the architecture). This can be easy to overlook when reading code.
  • Atomic operations serialize memory operations that may be happening at the same time across all cores, which is why they are slower. This involves e.g. cross core communication, cache invalidation, and locking the memory bus
  • They guarantee that the read/write operation happens in its entirety – e.g. there won’t be “tearing” when reading an int that another thread is writing at the same time. On many architectures, aligned operations at the native word size are atomic anyway even without atomics, but this is not guaranteed at the language level (e.g. by the C++ standard), so atomics should always be used when it’s relied upon. This matters specifically for multithreaded programming on multicore; on a single core it matters much less, because all memory accesses are effectively serialized anyway.
  • Some people use “volatile” instead of atomics – this is generally discouraged. They assume that memory accesses at the bit widths they’re using are already atomic and just need to force the compiler to always emit memory instructions instead of keeping values in registers. What they are missing is the other thing atomics provide beyond atomic access: memory ordering. When programming like this, it’s important to include memory barriers at the right points to stop compiler and hardware optimizations from reordering memory access instructions in ways that violate correctness.
  • They are available from C using GCC intrinsics (for example), and via “#include <atomic>” in C++. Java, Rust, and most other non-scripting languages offer support for them as well (see the example after this list).
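
A small example of both flavors – the std::atomic half is standard C++11, the builtin half assumes GCC/Clang:

    #include <atomic>

    std::atomic<long> hits{0};  // C++: #include <atomic>
    long hits_c = 0;            // C-style: a plain variable accessed via GCC's __atomic builtins

    void record_hit() {
        hits.fetch_add(1, std::memory_order_relaxed);      // standard C++ atomic RMW
        __atomic_fetch_add(&hits_c, 1, __ATOMIC_RELAXED);  // GCC/Clang intrinsic equivalent
        // Either way, each increment is a single indivisible read-modify-write:
        // no tearing and no lost updates, even with many threads calling this.
    }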

Lockfree programming

  • Concurrent/multithreaded programming without locks (mutexes, even spinlocks) is known as lockfree programming
  • Threads must be able to continue to “make progress” at all times
  • The Tony van Eerd video below has a good example of this. Even though spinlocks don’t use OS blocking and do use atomics, there is a risk that one thread grabs the lock and then gets descheduled – the other threads spinning on the lock are then stuck and can’t make progress.
  • With CAS loops in lockfree programming, progress is guaranteed even in the above situation, because one thread getting descheduled just gives another thread the opportunity to complete its operation
  • CAS – compare-and-swap, a.k.a. compare-and-set or compare-exchange – is a fundamental atomic operation that is essential to lockfree programming. It might even be the single fundamental operation?
  • A major problem with lockfree programming is that your code must be robust against any thread interleaving – with mutexes the whole point is to prevent interleaving inside critical sections, but with lockfree programming there are no critical sections and anything can interleave at any time. To cope with the fact that variables can “silently” change value at “any point”, including “between lines” (e.g. right after an if statement), CAS is the key tool: it lets you attempt an operation on a variable but abort and retry (or otherwise handle it gracefully) if it turns out another thread has concurrently modified the value
  • Handling it gracefully usually means looping and trying again – i.e. the CAS “loop”
  • The key concept here is: read the shared value into a local variable, compute the new state from it, then CAS to attempt to write it back to the global shared state. Loop if the CAS doesn’t succeed (see the sketch after this list).
  • Sometimes compilers need to secretly emit CAS loops when a certain kind of memory access can’t be guaranteed atomic at the hardware level. https://stackoverflow.com/a/47570868
  • There is much more to lockfree programming (memory models, access semantics (acquire/release), etc.) that I don’t know – it seems to be truly a topic for the experts of the experts.
  • One failure mode of the CAS loop is the “ABA problem”: the shared variable is modified, but then modified again back to the original value that one of the concurrent writers has cached. From that writer’s perspective the shared variable hasn’t changed, even though it actually has, so the CAS succeeds when it shouldn’t – e.g. when the shared variable is a pointer that, due to memory allocation reuse, now has the same value it used to have but points at a different object.
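
Here is the CAS-loop pattern above as a minimal sketch; the “multiply by 2” is just a stand-in for whatever new state gets computed from the old one:

    #include <atomic>

    int double_shared(std::atomic<int>& shared) {
        int old_val = shared.load(std::memory_order_relaxed);  // 1. read into a local
        int new_val;
        do {
            new_val = old_val * 2;  // 2. compute the new state from the local copy
            // 3. CAS: write new_val only if shared still holds old_val; on failure,
            //    old_val is refreshed with the current value and we loop.
        } while (!shared.compare_exchange_weak(old_val, new_val));
        return new_val;
    }

Note that this only works because the value alone is what matters; if the shared variable were a pointer into a data structure, the ABA problem described above is exactly what could bite.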

std::atomic isn’t always lock free

  • std::atomic has this interesting is_lock_free member function. I was surprised to learn that it doesn’t always return true, but it makes sense when you consider that std::atomic<T> is a template that can wrap semi-arbitrary types. While atomic<int> and other small, approximately word-sized types can be made atomic, a struct with a char[100] inside can’t possibly be made lock-free (see the sketch after this section)
  • And in cases like these, the compiler/libc++ will literally emit a mutex/lock into the generated code!
  • In the small cases, libc++ uses an atomic builtin that transparently codegens to the appropriate atomic instructions.
  • But in the large cases where it can’t do this, it codegens to a call to the compiler runtime library (compiler-rt in llvm speak; this is known as libgcc in gcc world) which implements the code that calls the mutex
  • I was unable to observe this in godbolt or even when disassembling binaries with r2 on my mac, but by stepping through with lldb, I was able to observe locks being created when creating an atomic around a comically large struct
  • https://libcxx.llvm.org/DesignDocs/AtomicDesign.html
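
A quick way to see this for yourself (output is implementation-defined, and with gcc the large case may need to be linked against libatomic):

    #include <atomic>
    #include <cstdio>

    struct Big { char bytes[100]; };  // far wider than any native atomic access

    std::atomic<int> small_one{0};
    std::atomic<Big> big_one{};

    int main() {
        // On mainstream platforms this typically prints "1 0": the large struct can't
        // be read/written with plain atomic instructions, so the library falls back
        // to a lock behind the scenes.
        std::printf("%d %d\n", (int)small_one.is_lock_free(), (int)big_one.is_lock_free());
    }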

re: pthread mutex using signals before futex was available, check out this 20-year-old thread:

https://comp.programming.threads.narkive.com/4xSgVvf6/

compare_exchange_weak vs compare_exchange_strong

  • Generally _weak seems to be preferred (it’s typically used in a loop)
  • The main difference is whether spurious failures are permitted. In both cases the function may fail if there was a concurrent write, but _weak is additionally allowed to fail “spuriously”, i.e. more or less at random (random note: this may be similar to how waiting on a condition variable is allowed to spuriously wake). See the sketch below.
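
A sketch with made-up names showing where each variant fits:

    #include <atomic>

    std::atomic<bool> winner_chosen{false};

    // One-shot attempt with no retry loop: a spurious failure would be
    // indistinguishable from "another thread won the race", so use _strong.
    bool try_become_winner() {
        bool expected = false;
        return winner_chosen.compare_exchange_strong(expected, true);
    }

    // Inside a retry loop (same shape as the CAS-loop sketch earlier) a spurious
    // failure only costs one extra iteration, so _weak is the usual choice -- and
    // it can be cheaper on LL/SC architectures (next section).
    void add_to(std::atomic<int>& total, int x) {
        int old_val = total.load();
        while (!total.compare_exchange_weak(old_val, old_val + x)) {
            // old_val was refreshed with the current value; just try again
        }
    }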

https://stackoverflow.com/questions/17914630/when-should-stdatomic-compare-exchange-strong-be-used

https://devblogs.microsoft.com/oldnewthing/20180330-00/?p=98395

https://stackoverflow.com/questions/25199838/understanding-stdatomiccompare-exchange-weak-in-c11

Load link/store conditional

  • Seems to be a common pattern used on RISC architectures for providing atomicity
  • https://en.wikipedia.org/wiki/Load-link/store-conditional

Resources

https://stackoverflow.com/questions/368322/differences-between-system-v-and-posix-semaphores

https://stackoverflow.com/questions/14624776/can-a-bool-read-write-operation-be-not-atomic-on-x86

https://eli.thegreenplace.net/2018/basics-of-futexes/

https://preshing.com/20111124/always-use-a-lightweight-mutex/

Introduction to Lock-free Programming – Tony van Eerd

https://stackoverflow.com/questions/38447226/atomicity-of-loads-and-stores-on-x86

https://stackoverflow.com/questions/59063744/is-a-read-of-8-bytes-on-modern-intel-x86-guaranteed-to-by-sane-if-8byte-written

https://stackoverflow.com/questions/46092373/are-reads-and-writes-of-an-int-in-c-atomic-on-x86-64-multi-core-machine

https://stackoverflow.com/questions/50951011/how-does-a-mutex-lock-and-unlock-functions-prevents-cpu-reordering

https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/

Any thoughts?