Category Archives: Concurrency

WIP: What’s the deal with memory ordering? (seq_cst, acquire, release, etc)

(This is a high level summary of my current knowledge, primarily to help me crystallize the knowledge. It comes entirely from from Jeff Preshing’s blog (see end of post) and youtube talk. This is not intended to be a comprehensive overview; for that, please see the aforementioned materials. I am very much a non-expert on this topic; please treat everything with skepticism.)

When programming with atomics, how are you suppose to know which of the ~four memory orderings to use? For example, the main ones (C++ terminology) are:

memory_order_seq_cst
memory_order_acquire
memory_order_release
memory_order_relaxed
(and a few other niche ones: acq_rel, consume)

First, as Jeff Preshing states, there is a distinction between “sequentially consistent” atomics and “low level” atomics. He describes it as two libraries for atomics masquerading as a single one within the C++ standard library.

The first, “sequentially consistent”, can be considered a higher level way of using atomics. You can safely use seq_cst everywhere. You get simpler semantics and higher likelihood of correctness, just at the expensive of performance. As an optimization, you can then port the code to the second form of “low level atomics”. This is where you must choose the explicit memory orderings.

But why do sequentially consistent atomics come with a performance hit?

The performance hit comes from cross core communication. The sequentially consistent memory model offers a very strong guarantee to the programmer; in addition to the ordering of atomic operations being consistent across cores (which is always the case), the ordering of non-atomic operations is also guaranteed to be consistent (i.e. no reordering) relative to the atomic ones.

This is relevant because programming with atomics often involves “guard” (atomic) variables who regulate access to “normal” (non-atomic) data that is transferred between threads. This guarantee requires extra effort from the memory subsystem of the CPU in the form of cross core communication as the cores need to effectively synchronize their caches.

When one moves to “low level” atomics, the strict constraints required of the memory subsystem are relaxed. Not all orderings of non-atomic accesses relative to atomic accesses must be maintained. The consequence is less cross-core coordination is required. This can be exploited for higher performance in specific scenarios where the strict ordering constraint is not required in both (or any) directions (i.e. non-atomic memory accesses are allowed to move before or after the atomic access).

Exercise: Would one expect to see a performance improvement from porting code from sequentially consistent atomics to low level atomics, if the code is run on a single core system?

The whole point of low level atomics is to optimize performance by relaxing constraints and reducing cross core communication, so no. There is no cross core communication in a single core system, so there is nothing substantial to optimize.

(I am not 100% sure of this answer. This is the current state of my knowledge and I would appreciate being corrected or affirmed either way!)

So how does one choose between all those memory orderings?

With my non-expert understanding, I believe there are some simple rules that make the decision much easier than it might seem.

First off: Decide whether you’re using sequentially consistent or low level atomics. If the former, you use seq_cst everywhere (this is even the default with C++ if you don’t specify anything).

If you want to optimize to use low level atomics, then for most cases, you then only have three choices: acquire, release, and relaxed. (seq_cst is no longer an option; acq_rel is more niche; consume is actively discouraged). Then:

If you’re deciding for a load operation, you then only choose between acquire and relaxed. Loads are never release.
And vice verse, If you’re deciding for a store operation, you then only choose between release and relaxed. Stores are never acquire.

This narrows it down to two choices. To determine whether it’s acquire/release or relaxed, determine whether the load/store has a synchronizes-with relation to a corresponding store/load. If there is one, you want acquire/release. Otherwise, choose relaxed.

Read these blog posts for a fuller answer to this:

Links:

https://www.youtube.com/watch?v=X1T3IQ4N-3g

4 basic multithreading & lockfree programming exercises

3 Replies

Four exercises that touch basic multithreaded and lockfree programming concepts.

Implement a program that attempts to use two threads to increment a global counter to 10,000 with each thread incrementing 5000. But make it buggy so that there are interleaving problems and the end result of the counter is less than 10,000.
Fix the above with atomics.
Implement a variant of the program: instead of simply incrementing the counter, make the counter wrap every 16 increments (as if incrementing through indices of an array of length 16). Make two threads each attempt to increment the counter (16 * 5000) times. The end state should have the counter be back at index zero. Implement it in a buggy naive way that causes the counter to often be nonzero, even if atomics are used.
Fix the above using a CAS loop.
(Bonus question for the above: Why isn’t std::atomic::compare_exchange_strong a good fit here?)

Continue reading →

Mutexes, atomics, lockfree programming

offlinemark

Life, art, and systems programming

Category Archives: Concurrency

WIP: What’s the deal with memory ordering? (seq_cst, acquire, release, etc)

4 basic multithreading & lockfree programming exercises

Mutexes, atomics, lockfree programming