While sequentially consistent semantics are efficient to implement on CPUs, that seems to be much less true on GPUs. Thus, Vulkan eliminates sequential consistency entirely and provides only acquire/release semantics[1].
It is extremely difficult to reason about programs using these advanced memory semantics. For example, there is a discussion about whether a spinlock implemented in terms of acquire and release can be reordered in a way that introduces deadlock (see the reddit discussion linked from [2]). I was curious enough about this that I tried to model it in CDSChecker, but did not get definitive results (the deadlock checker in that tool is enabled for mutexes provided by the API, but not for mutexes built out of primitives). I'll also note that AcqRel semantics are not supported by the Rust version of compare_exchange_weak (perhaps a nit on TFA's assertion that Rust adopts the C++ memory model wholesale), so if acquire to lock the spinlock is not adequate, it would likely need to go to SeqCst.
Thus, I find myself quite unsure whether this kind of spinlock would work on Vulkan or would be prone to deadlock. It's also possible it could be fixed by putting a release barrier before the lock loop.
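For concreteness, here is the construction in question sketched with C11 atomics (which mirror the C++ model under discussion): acquire on the CAS that takes the lock, release on the store that drops it. This is a minimal illustration, not a recommendation; whether acquire alone on the lock path suffices is exactly the open question.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal acquire/release spinlock sketch. */
typedef struct { atomic_bool locked; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    bool expected = false;
    /* Acquire on success; relaxed on failure is fine, since a failed
       attempt publishes nothing. */
    while (!atomic_compare_exchange_weak_explicit(
               &l->locked, &expected, true,
               memory_order_acquire, memory_order_relaxed)) {
        expected = false; /* the CAS wrote back the value it observed */
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```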
We have some serious experts on HN, so hopefully someone who knows the answer can enlighten us - mixed in of course with all the confidently wrong assertions that inevitably pop up in discussions about memory model semantics.
[1]: https://www.khronos.org/blog/comparing-the-vulkan-spir-v-mem...
I based my claim about Rust from https://doc.rust-lang.org/nomicon/atomics.html. ("Rust pretty blatantly just inherits the memory model for atomics from C++20.") Perhaps that is out of date?
Taking a lock only needs to be an acquire operation plus a compiler barrier against other lock operations. Using seq_cst or acq_rel semantics is stronger than needed. From my reading, and from discussions with people from WG21, the current argument for why taking a lock requires only acquire semantics is that a compiler optimization that transforms a non-deadlocking program into a potentially deadlocking program is not allowed. There's an interesting twitter thread where we discuss this that I can't find anymore :(.
Transforming

    #include <stdio.h>

    int stop = 1;

    void maybeStop() {
        if (stop)
            for (;;);
    }

    int main() {
        printf("hello, ");
        maybeStop();
        printf("world\n");
    }

into

    int main() {
        printf("hello, world\n");
    }
(as Clang does today) does not inspire confidence about disallowing moving the loop in the other example. If the compiler is allowed to assume that this loop terminates, why not the lock loop? Maybe there is a reason, but none of this inspires confidence.
Is this true? AcqRel seems to be accepted by the compiler for the success ordering of compare_exchange_weak.
It's accepted by the compiler, but if provided, it compiles to a panic.
Even then, I'm pretty sure the spinlock is a bad idea: you should probably be using the GPU as a coprocessor and enforcing "orderings" via CUDA Streams or OpenCL Task Graphs. The kernel-start and kernel-end mechanism provides the synchronization ("happens-before") you need, when you need it.
---------
From there on out, the GPU low-level synchronization of choice is the thread barrier (which can extend beyond a wavefront, but only up to a block).
--------
So that'd be my advice: use a thread barrier at the lowest level for thread blocks (synchronization among 1024 threads or fewer), and use kernel-start / kernel-end graphs (aka CUDA Streams and/or OpenCL Task Graphs) for synchronizing groups of more than 1024 threads.
Otherwise, I've done some experiments with acquire/release and basic lock/unlock mechanisms, and they seem to work as expected. You get deadlocks immediately on older hardware because of the implicit SIMD execution (so you want only thread #0, or the first active thread, to take the lock on behalf of the whole wavefront / thread block). You'll still want thread barriers for higher-performance synchronization.
Frankly, I'm not exactly sure why you'd want to use a spinlock since thread-barriers are simply higher performance in the GPU world.
In any case, I'm interested in pushing the boundaries of lock-free algorithms. It is of course easy to reason about kernel-{start/end} synchronization, but the granularity may be too coarse for some interesting applications.
Another somewhat recently posted (but years-old) page with different but related content is 'Memory Models that Underlie Programming Languages': http://canonical.org/~kragen/memory-models/
a few previous hn discussions of that one:
https://news.ycombinator.com/item?id=17099608
This is not true for Java; see
http://gee.cs.oswego.edu/dl/html/j9mm.html
https://docs.oracle.com/en/java/javase/16/docs/api/java.base...
If you want to test out weaker acquire/release semantics, you need to buy an ARM or POWER9 processor.
As I mentioned in the post (https://research.swtch.com/plmm#sc), Herb Sutter claimed in 2017 that POWER was going to do something to make SC atomics cheaper. If it did, then that might end up being cheaper than the old sync-based acq/rel too, same as ARM, in which case we'd end up with SC = acq/rel on both ARM and POWER. It looks like that didn't happen, but I'd be very interested to know what did, if anything.
Conversely, acq/rel operations range from somewhat to very expensive to implement on ARM/POWER.
The idea of extending programming languages and type systems in that direction is not new: folk who've been using distributed computing for computations already have to think about this, and could teach a few things to folk who use shared-memory multiprocessors.
Here's an idea for ISA primitives that could help a language group variables together: bind/propagate operators on (combinations of) address ranges. https://pure.uva.nl/ws/files/1813114/109501_19.pdf
All variables inside an object (i.e., any instance of a class) are assumed to be related to each other. synchronized(foobar_object){ baz(); } ensures that all uses of foobar_object inside the synchronized {} block are sequential (and therefore correct).
--------
The issue is that some people (a minority) are interested in "beating locks" and making something even more efficient.
synchronized(foobar_object){ foo(); }
synchronized(foobar_object){ bar(); }
synchronized(foobar_object){ baz(); }
Will have the foo, bar, and baz methods well behaved with respect to any data they share, regardless of whether they are foobar methods or methods of any other class(es). It is exactly analogous to the S(a) -> S(a) synchronizing instruction from the article, which establishes a happens-before edge partitioning each thread into before/after the S(a). The only time synchronized(explicit_object) relates to anything else is when also using the keyword form: `synchronized void foo()` is equivalent (with a minor performance difference) to wrapping the entire body of foo in `synchronized(this) { ... }`.
You can read more about this here if you're interested: https://www.isa-afp.org/entries/JinjaThreads.html
AKA why can't I stumble upon such stuff more often. Thanks OP!
Alternative solution: forget all the "atomic" semantics and simply avoid "optimizing" accesses to global variables. Access to any global variable should always go directly to memory. Sure, this will be less than optimal in some cases, but such is the price of using globals. Their use should be discouraged anyway.
In other words, make "atomic" the sensible and logical default with globals. Assignment is an "atomic" operation, just don't circumvent it by using a local copy as an "optimization".
And yes, you can put a full memory fence around every access to a variable that is shared across threads. But doing so would just destroy the performance of your program. Compared to using a register, accessing main memory typically takes something on the order of 100 times as long. Given that we're talking about concerns that are specific to a relatively low-level approach to parallelism, I think it's safe to assume that performance is the whole point, so that would be an unacceptable tradeoff.
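To make the middle ground concrete: the usual compromise between "full fence on every access" and "cached in a register forever" is a relaxed atomic load, which forces a genuine memory access on each iteration without paying for any fencing. A minimal C11 sketch (the flag name is illustrative):

```c
#include <stdatomic.h>

_Atomic int stop_flag; /* hypothetical flag shared with another thread */

void wait_for_stop(void) {
    /* The relaxed load forces a real memory read on every iteration,
       but imposes no ordering and emits no fence. A plain `int` here
       could legally be hoisted into a register, turning the loop into
       an infinite spin that never observes the other thread's write. */
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed))
        ;
}
```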
Indeed.
Just a reminder to everyone: your pthread_mutex_lock() and pthread_mutex_unlock() functions already contain the appropriate compiler / memory barriers in the correct locations.
This "Memory Model" discussion is only for people who want to build faster systems: for people searching for a "better spinlock", or for writing lock-free algorithms / lock-free data structures.
This is the stuff of cutting-edge research right now: it's a niche subject. Your typical programmer _SHOULD_ just stick a typical pthread_mutex_t onto an otherwise single-threaded data structure and call it a day. Locks work. They're not "the best", but "the best" is constantly being researched and developed right now. I'm pretty sure that any new lock-free data structure with decent performance is pretty much instant Ph.D. thesis material.
-----------
Anyway, the reason "single-threaded data structure behind a mutex" works is that the data structure keeps all of its performance benefits (staying in L1 cache, or letting the compiler "manually cache" data in registers when appropriate), and you only lose performance on the lock() and unlock() calls (which inherently include memory barriers to publish the results).

That's two memory barriers (one for lock() and one for unlock()). The thing about lock-free algorithms is that they __might__ get you down to __one__ memory barrier per operation, if you're a really, really good programmer. But it's not exactly easy. (Or: they might still have two memory barriers, but the lock-free properties of guaranteed forward progress and/or deadlock freedom might be easier to prove.)
Writing a low-performance but otherwise correct lock free algorithm isn't actually that hard. Writing a lock free algorithm that beats your typical mutex + data-structure however, is devilishly hard.
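The "single-threaded data structure behind a mutex" pattern is just this (names are illustrative; the counter stands in for any ordinary structure):

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t mu;
    int count;   /* stand-in for any single-threaded data structure */
} counter;

void counter_inc(counter *c) {
    pthread_mutex_lock(&c->mu);   /* one memory barrier here */
    c->count++;                   /* plain code: registers, L1 cache,
                                     and full compiler optimization
                                     all still apply */
    pthread_mutex_unlock(&c->mu); /* one memory barrier here */
}
```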
---------
"Volatile" is close but not good enough semantically to describe what we want. That's why these new Atomic-variables are being declared with seqcst semantics (be it in Java, C++, C, or whatever you program in).
That's the thing: we need a new class of variables that wasn't known 20 years ago. Variables that follow the sequential-consistency requirements, for this use case.
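In C11 terms, the distinction looks like this (a sketch: volatile forces the access to happen but orders nothing, while a plain operation on an _Atomic object is sequentially consistent by default):

```c
#include <stdatomic.h>

volatile int v_flag;  /* the access can't be elided, but it imposes no
                         ordering on surrounding memory operations */
_Atomic int a_flag;   /* plain ops on _Atomic default to seq_cst */

void publish_both(void) {
    v_flag = 1;  /* just a memory access */
    a_flag = 1;  /* a sequentially consistent atomic store */
}
```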
---------
Note: on ARM systems, even if the compiler doesn't mess up, the hardware can still mess you up. ARM offers multiple load/store orderings, and a load with relaxed (default) ordering may observe a "stale" value.

That is to say: it's not enough for the compiler to know "don't cache the value in a register"; the memory subsystem also needs to be told when a load must observe other cores' writes in the expected order.

x86 does not allow rearranging loads/stores in this manner, but ARM is more "Relaxed" than x86. If observing the freshest value in the right order is important, you need the appropriate memory barriers on ARM (or PowerPC).
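Concretely, the portable fix is a release store paired with an acquire load. A minimal message-passing sketch in C11 (on x86 these compile to plain loads and stores; on ARM/POWER they emit the needed barriers):

```c
#include <stdatomic.h>

int payload;        /* ordinary, non-atomic data */
_Atomic int ready;

void publish(int v) {
    payload = v;
    /* Release: everything written above becomes visible to any thread
       that acquire-loads `ready` and observes 1. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consume(void) {
    /* Acquire: once we see ready == 1, reading `payload` is
       guaranteed to see the published value. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    return payload;
}
```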
Since as you say, they are very similar, wouldn't it be reasonable to assume for access purposes that they are effectively global?
What if your function takes a pointer that might be pointing to a global variable? Does that mean all accesses through a pointer are now exempt from optimization unless the compiler can prove the pointer will never point to a global variable?
Pointers can be used to circumvent most safety measures. If you obscure the access, you should assume responsibility for the result.
The common programmer does not understand that you've just transformed their program - for which they were taught merely that multiple threads need synchronization - into a new game, one with an entire separate specification, where every shared variable obeys a set of abstruse rules revolving around the happens-before relationship. Locks, mutexes, and atomic variables are all one thing. Fences are a completely different thing. At least in the way most people intuit programs to work.
Go tries to appeal to programmers as consumers (that is, when given a choice between a cleaner design and pleasing the user who just wants to "get stuff done", they choose the latter), and yet it also adds in traditional complexities like this. Yes, there is a performance trade-off to making shared memory behave intuitively, but that's much better than bugs that 99% of your CHOSEN userbase do not know how to avoid. Also remember that Go has lots of weird edge cases, like sharing a slice across threads leading to memory corruption (in the C / assembly sense, not merely within that array) despite the rest of the language being memory-safe. Multiply that by the "memory model".
Edit: forgot spaces between paragraphs.
It would be nice if sometime we stopped pretending that beginners are too slow to know/understand things and instead faced the fact that their instructors and mentors are bad at teaching.
Also, maybe you are different, but I can only keep so much in my head at a time. If I can keep something simple or abstract it away so I can focus on other details, that doesn't make me a dilettante. It makes me more effective at what I'm actually trying to do.
Source?
The "Exploiting the slice type" section.
Go has no VM but it has a GC. WASM has a VM but no GC.
Everything has been tried, and Java still kicks everything's ass to the moon on the server.
Fragmentation is bad; let's stop using bad languages and focus on the products we build instead.
"While I'm on the topic of concurrency I should mention my far too brief chat with Doug Lea. He commented that multi-threaded Java these days far outperforms C, due to the memory management and a garbage collector. If I recall correctly he said "only 12 times faster than C means you haven't started optimizing"." - Martin Fowler https://martinfowler.com/bliki/OOPSLA2005.html
"Many lock-free structures offer atomic-free read paths, notably concurrent containers in garbage collected languages, such as ConcurrentHashMap in Java. Languages without garbage collection have fewer straightforward options, mostly because safe memory reclamation is a hard problem..." - Travis Downs https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
From my perspective, Go in the context of serverless programming seems to currently be the best choice for server-side programming.
In the next 20 years I expect Go will be supplanted by a language that is a lot like Go (automatic memory management, simple, easy to learn and write, and performant enough) but with the addition of algebraic data types, named parameters, and a slightly higher level of abstraction.
I'd love for this to be Crystal: https://crystal-lang.org/
> I haven't seen server work being done in Java in ages.
In the meantime, I've been doing a large amount of Java backend server work for the past 10 years.
What have you built with go that is interesting?
To each their own.