(If you're interested in concurrency in a Linux context as seen from a broader systems point of view, you might like Paul McKenney's "Is Parallel Programming Hard, And If So, What Can You Do About It?" https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul... )
edit: already mentioned elsethread.
```
Thread 1                    Memory                      Thread 2
--------                    ------                      --------
   |                          |                            |
   | write(data, 100)         |                            |
   | -----------------------> |                            |
   |                          |                            |
   | ===Memory Barrier===     |                            |
   | store(ready, true)       |                            |
   | -----------------------> |                            |
   | ===Memory Barrier===     |                            |
   |                          |                            |
   |                          |     ===Memory Barrier===   |
   |                          |     load(ready) == true    |
   |                          | <------------------------  |
   |                          |     ===Memory Barrier===   |
   |                          |                            |
   |                          |     read(data)             |
   |                          | <------------------------  |
   |                          |                            |
```
I.e. barriers prevent reordering of operations within a thread, not across threads. It also makes immediately obvious why the seq_cst ordering of both the thread 1 atomic store and the thread 2 atomic load can be relaxed: the last barrier in thread 1 does not prevent any reordering in this example, so it can be omitted, leaving only the barrier before the store and making it a release operation. Similarly, we can omit the barrier before the first load in thread 2, leaving only the barrier after it, making it an acquire operation.[1]

[1] Well, the original diagram is showing the effect of sequential consistency as opposed to acquire-release, so a logical barrier spanning threads is not necessarily wrong, but then you would still need to show a barrier before the last store and the first load.
Assuming that the memory barrier is syncing across a single variable (in this case ready), why would it be correct to think of it as two separate barriers? If it were correct to think of it as two separate barriers on two separate threads, wouldn't there need to be some form of synchronization or linkage between the two barriers themselves so that memory barriers can be coupled together?
For instance, if I had release-acquire models on two variables, ready and not_ready, using separate barriers as representation might look something like this
```
Thread 1                    Memory                      Thread 2
--------                    ------                      --------
   |                          |                            |
   | write(data, 100)         |                            |
   | -----------------------> |                            |
   |                          |                            |
   | ===Memory Barrier===     |                            |
   | store(ready, true)       |                            |
   | -----------------------> |                            |
   | ===Memory Barrier===     |                            |
   |                          |                            |
   | ===Memory Barrier===     |                            |
   | store(not_ready, true)   |                            |
   | -----------------------> |                            |
   | ===Memory Barrier===     |                            |
   |                          |                            |
   |                          |     ===Memory Barrier===   |
   |                          |     load(ready) == true    |
   |                          | <------------------------  |
   |                          |     ===Memory Barrier===   |
   |                          |                            |
   |                          |     ===Memory Barrier===   |
   |                          |     load(not_ready) == true|
   |                          | <------------------------  |
   |                          |     ===Memory Barrier===   |
   |                          |                            |
   |                          |     read(data)             |
   |                          | <------------------------  |
   |                          |                            |
```
Now, how does the processor know which memory barriers are linked together? I ask because without understanding which barriers are linked together, how is instruction re-ordering determined?
edit: AFAIK, seq_cst ordering (as opposed to acq_rel) is only relevant when you have more than two threads and you care about things like IRIW. In this case acquires and releases are not enough to capture the full set of constraints, although at the hardware level it is still everything local.
edit2: I guess the missing bit is that beyond the hardware fences you have the hardware cache coherency protocol that makes sure that a total order of operations always exist once load and stores reach the coherence fabric.
In order to enqueue an element, you have to change two atomic pointers: tail->next and tail. That cannot be done atomically.
enqueue() assumes that the queue is not empty.
enqueue() happily overwrites current_tail->next even if it is not null (which may happen if some other producer has enqueued something since we read current_tail).
Then there is the old-time favourite ABA problem.
By the way, when used in a loop, compare_exchange_weak is preferred to compare_exchange_strong, and there is no need to reload the atomic every time: compare_exchange_* do that for you by updating the argument containing the expected value.
With added indirection, there are ways of doing this atomically.
"enqueue() happily overwrites current_tail->next even if it is not null (which may happen if some other producer has enqueued something since we read current_tail)."
It's probably one of the bigger problems with this queue; this bug basically negates the whole structure.
I added in a note about the ABA problem but perhaps you're seeing a cached version of the post.
Memory allocation gets forgotten in "lock free" algorithms all the time, especially in Java, where allocation tends to be brushed aside.
Every thread has its own unique tag.
```
long original = me->realend;
int tag = (int)(original & TAG_MASK);
long changed = ((original & END_MASK) >> 32) % me->size;
long new = (data->thread_tag) | (((changed + 1) % me->size) << 32);
```
Then compare and swap as usual. If another thread updates first, we fail the compare-and-swap and have to reloop and try again.

I am still learning TLA+ to write a model.
Is it really? Sorry, but this is barely an introduction.
Some additional links to important documentation:
https://en.cppreference.com/w/cpp/language/memory_model
https://en.cppreference.com/w/cpp/atomic/memory_order
Effective Concurrency series by Sutter: https://herbsutter.com/2009/07/15/effective-concurrency/
The Art of Multiprocessor Programming by Maurice Herlihy, Nir Shavit et al. ISBN: 978-0124159501
An excellent book that I used for my Parallel Programming course in my undergrad. While looking up Nir Shavit (back then), I came across a "Summer School on Practice and Theory of Concurrent Computing (2017)" [1] which had notable people give talks about interesting & fundamental topics related to Parallel Computing such as "Wait-free computing 'for dummies'", "Lock-free concurrent data structures", and "Locking, from traditional to modern" (taught by Nir Shavit). [2] is a link to a YouTube playlist containing all the videos from that Summer School. I highly recommend it.
[1] https://neerc.ifmo.ru/sptcc/courses.html
[2] https://www.youtube.com/playlist?list=PLVe-2wcL84b9G9o7KPubp...
I think for someone who has had no exposure to it before, it's quite dense (perhaps "long" wasn't the best choice of wording).
Also, I've been meaning to read The Art of Multiprocessor Programming. I've heard great things about it!
Just for example, right off the bat:
> Atomics are simply operations or instructions that cannot be split by the compiler or the CPU or re-ordered in any way.
C++ atomics can in fact be reordered by the compiler or CPU, depending on ordering semantics. Acquire loads cannot be reordered after subsequent program-order loads or stores. But they can be reordered before previous program-order operations. Similarly, release stores cannot be reordered before prior program-order operations, but can be reordered after subsequent program-order operations. Relaxed atomic operations can be reordered arbitrarily (other than with accesses to the same object).