(If you're interested in concurrency in a Linux context as seen from a broader systems point of view, you might like Paul McKenney's "Is Parallel Programming Hard, And If So, What Can You Do About It?" https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul... )
edit: already mentioned elsethread.
```
Thread 1                    Memory                      Thread 2
--------                    ------                      --------
   |                          |                            |
   | write(data, 100)         |                            |
   | -----------------------> |                            |
   |                          |                            |
   | ===Memory Barrier===     |                            |
   | store(ready, true)       |                            |
   | -----------------------> |                            |
   | ===Memory Barrier===     |                            |
   |                          |                            |
   |                          |     ===Memory Barrier===   |
   |                          |     load(ready) == true    |
   |                          | <------------------------  |
   |                          |     ===Memory Barrier===   |
   |                          |                            |
   |                          |     read(data)             |
   |                          | <------------------------  |
   |                          |                            |
```
I.e. barriers prevent reordering of operations within a thread, not across threads. It also makes immediately obvious why the seq_cst ordering of both the thread 1 atomic store and the thread 2 atomic load can be relaxed: the last barrier in thread 1 does not prevent any reordering in this example, so it can be omitted, leaving only the barrier before the store and making it a release operation. Similarly, we can omit the barrier before the first load in thread 2, leaving only the barrier after it, making it an acquire operation.[1]

[1] Well, the original diagram is showing the effect of sequential consistency as opposed to acquire-release, so a logical barrier spanning threads is not necessarily wrong, but then you would still need to show a barrier before the last store and the first load.
Assuming that the memory barrier is syncing across a single variable (in this case ready), why would it be correct to think of it as two separate barriers? If it were correct to think of it as two separate barriers on two separate threads, wouldn't there need to be some form of synchronization or linkage between the two barriers themselves so that memory barriers can be coupled together?
For instance, if I had release-acquire models on two variables, ready and not_ready, using separate barriers as representation might look something like this
```
Thread 1                    Memory                      Thread 2
--------                    ------                      --------
   |                          |                            |
   | write(data, 100)         |                            |
   | -----------------------> |                            |
   |                          |                            |
   | ===Memory Barrier===     |                            |
   | store(ready, true)       |                            |
   | -----------------------> |                            |
   | ===Memory Barrier===     |                            |
   |                          |                            |
   | ===Memory Barrier===     |                            |
   | store(not_ready, true)   |                            |
   | -----------------------> |                            |
   | ===Memory Barrier===     |                            |
   |                          |                            |
   |                          |     ===Memory Barrier===   |
   |                          |     load(ready) == true    |
   |                          | <------------------------  |
   |                          |     ===Memory Barrier===   |
   |                          |                            |
   |                          |     ===Memory Barrier===   |
   |                          |     load(not_ready) == true|
   |                          | <------------------------  |
   |                          |     ===Memory Barrier===   |
   |                          |                            |
   |                          |     read(data)             |
   |                          | <------------------------  |
   |                          |                            |
```
Now, how does the processor know which memory barriers are linked together? I ask because without understanding which barriers are linked together, how is instruction re-ordering determined?
edit: AFAIK, seq_cst ordering (as opposed to acq_rel) is only relevant when you have more than two threads and you care about things like IRIW. In this case acquires and releases are not enough to capture the full set of constraints, although at the hardware level it is still everything local.
edit2: I guess the missing bit is that beyond the hardware fences you have the hardware cache coherency protocol that makes sure that a total order of operations always exist once load and stores reach the coherence fabric.
In order to enqueue an element, you have to change two atomic pointers: tail->next and tail. That cannot be done atomically.
enqueue() assumes that the queue is not empty.
enqueue() happily overwrites current_tail->next even if it is not null (which may happen if some other producer has enqueued something since we read current_tail).
Then there is the old-time favourite ABA problem.
By the way, when used in a loop, compare_exchange_weak is preferred to compare_exchange_strong, and there is no need to reload the atomic every time: compare_exchange_* do that for you by updating the argument containing the expected value.
With added indirection, there are ways of doing this atomically.
"enqueue() happily overwrites current_tail->next even if it is not null (which may happen if some other producer has enqueued something since we read current_tail)."
It's probably one of the bigger problems with this queue; this bug basically negates the whole structure.
I added in a note about the ABA problem but perhaps you're seeing a cached version of the post.
Memory allocation gets forgotten in "lock free" algorithms all the time, especially in Java, where allocation tends to be brushed aside.
Every thread has its own unique tag.
```
long original = me->realend;
int tag = (int)(original & TAG_MASK);
long changed = ((original & END_MASK) >> 32) % me->size;
long new = (data->thread_tag) | (((changed + 1) % me->size) << 32);
```
Then compare and swap as usual. If another thread updates first, we fail the compare-and-swap and have to reloop and try again.

I am still learning TLA+ to write a model.
Is it really? Sorry, but this is barely an introduction.
Some additional links to important documentation:
https://en.cppreference.com/w/cpp/language/memory_model
https://en.cppreference.com/w/cpp/atomic/memory_order
Effective Concurrency series by Sutter: https://herbsutter.com/2009/07/15/effective-concurrency/
The Art of Multiprocessor Programming by Maurice Herlihy, Nir Shavit et al. ISBN: 978-0124159501
An excellent book that I used for my Parallel Programming course in my undergrad. While looking up Nir Shavit (back then), I came across a "Summer School on Practice and Theory of Concurrent Computing (2017)" [1] which had notable people give talks about interesting & fundamental topics related to Parallel Computing such as "Wait-free computing 'for dummies'", "Lock-free concurrent data structures", and "Locking, from traditional to modern" (taught by Nir Shavit). [2] is a link to a YouTube playlist containing all the videos from that Summer School. I highly recommend it.
[1] https://neerc.ifmo.ru/sptcc/courses.html
[2] https://www.youtube.com/playlist?list=PLVe-2wcL84b9G9o7KPubp...
I think for someone who has had no exposure to it before, it's quite dense (perhaps "long" wasn't the best choice of wording).
Also, I've been meaning to read The Art of Multiprocessor Programming. I've heard great things about it!
Just for example, right off the bat:
> Atomics are simply operations or instructions that cannot be split by the compiler or the CPU or re-ordered in any way.
C++ atomics can in fact be reordered by the compiler or CPU, depending on ordering semantics. Acquire loads cannot be reordered after subsequent program-order loads or stores. But they can be reordered before previous program-order operations. Similarly, release stores cannot be reordered before prior program-order operations, but can be reordered after subsequent program-order operations. Relaxed atomic operations can be reordered arbitrarily (other than with accesses to the same object).