Better HN

anematode 6mo ago

Curious: even with hardware atomics, wouldn't it be a good idea to first perform a non-atomic load to check whether the store might be necessary (the store would require the cache line to be locked), and then run the atomic max only if it might change the value?


adwn 6mo ago

Yes, this can make sense if

- the value often doesn't require an update, and

- there's contention on the cache line, i.e., at least two cores frequently read or write that cache line.

But there are important details to consider:

1) The probing load must be atomic. Both the compiler and the processor in general are allowed to split non-atomic loads into two or more partial loads. Only atomic loads – even with relaxed ordering – are guaranteed to not return intermediate or mixed values from other atomic stores.

2) If the ordering on the read part of the atomic read-modify-write operation is not relaxed, the probing load must reflect this. For example, an acq-rel RMW op would require an acquire ordering on the probing read.
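A minimal Rust sketch of both points, assuming a `u64` counter. The helper name `fetch_max_if_needed` is made up for illustration. The probe is an atomic load, and its ordering is derived from the read half of the ordering requested for the RMW.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Atomic max that probes first and only issues the read-modify-write when
/// the candidate could raise the stored value.
/// Returns true if the RMW was attempted. (Hypothetical helper, not a std API.)
pub fn fetch_max_if_needed(shared: &AtomicU64, candidate: u64, order: Ordering) -> bool {
    // The probe must carry the ordering of the RMW's read half.
    let probe_order = match order {
        Ordering::Acquire | Ordering::AcqRel => Ordering::Acquire,
        Ordering::SeqCst => Ordering::SeqCst,
        // Relaxed and Release RMWs have a relaxed read half.
        _ => Ordering::Relaxed,
    };

    // Atomic probe: cannot tear, and the compiler cannot assume the value
    // is untouched by other threads.
    if shared.load(probe_order) >= candidate {
        return false; // the max would not change the value, skip the RMW
    }

    shared.fetch_max(candidate, order);
    true
}
```

The fast path only reads, so the cache line can stay shared between cores; exclusive ownership is requested only when an update looks necessary.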

anematode (OP) 6mo ago

Thanks for your insights. (2) makes sense to me, but for (1), on ARM64 can an aligned 64-bit store really tear in a 64-bit non-atomic load? The spec says "A write that is generated by a store instruction that stores a single general-purpose register and is aligned to the size of the write in the instruction is single-copy atomic" (B2.2.1).

adwn 6mo ago

> […] on ARM64 […]

Well, if you target a specific architecture, then of course you can assume more guarantees than in general, portable code. And in general, a processor might distinguish between non-atomic and relaxed-atomic reads and writes – in theory.

But more important, and relevant in practice, is the behavior of the compiler. C, C++, and Rust compilers are allowed to assume that non-atomic reads aren't influenced by concurrent writes, so the compiler is allowed to split non-atomic reads into smaller reads (unlikely) or even optimize the reads away if it can prove that the memory location isn't written to by the local thread (more likely).
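As a rough illustration of the compiler side, here is a Rust sketch of a polling loop. The function name `wait_until_at_least` is an assumption for the example. With a plain, non-atomic read the compiler could legally read once and then spin on a register; a relaxed atomic load tells it that other threads may change the value, and in practice compilers re-read it on every iteration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Spins until `shared` holds at least `threshold` and returns the value seen.
/// (Hypothetical helper for illustration.)
pub fn wait_until_at_least(shared: &AtomicU64, threshold: u64) -> u64 {
    loop {
        // Relaxed is enough to rule out tearing and to stop the compiler from
        // treating the location as private to this thread.
        let seen = shared.load(Ordering::Relaxed);
        if seen >= threshold {
            return seen;
        }
        // Hint to the CPU that this is a busy-wait loop.
        std::hint::spin_loop();
    }
}
```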


adgjlsfhk1 6mo ago

This depends heavily on what concurrency optimizations your processor implements (and unfortunately this is the sort of thing that doesn't get documented and is somewhat hard to test).

anematode (OP) 6mo ago

I did a little unscientific test here on an Apple M4 Pro with n threads spamming atomic operations with pseudorandom values on one memory location (the worst case). Used inline asm to make sure there was no funny business going on.

  atomic adds
  n = 1 ->  333e6 adds/second
  n = 2 ->  174e6
  n = 4 ->   95e6
  n = 8 ->   63e6

  atomic maxs
  n = 1 ->  161e6 maxs/second
  n = 2 ->   59e6
  n = 4 ->   39e6
  n = 8 ->   27e6

  atomic maxs with preceding check
  n = 1 ->  929e6 maxs/second
  n = 2 -> 1541e6
  n = 4 -> 3494e6
  n = 8 -> 5985e6

So evidently the M4 doesn't do this optimization. Of course, with a different distribution you'd get different results, and this level of contention is unrealistic, but I don't see why you'd EVER not do a check before running atomic max. I also find it interesting that atomic max is significantly slower than atomic add.
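For anyone who wants to try something similar, a test of this shape could be sketched in Rust as below. This is not the code behind the numbers above (that used inline asm); the function name `hammer_max`, the xorshift generator, and the seed are all assumptions, and throughput will vary by machine and compiler.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Small xorshift64 generator (assumption: any cheap PRNG would do).
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Runs `threads` workers that each apply `iters` atomic maxes to one shared
/// location, optionally guarded by a relaxed probing load.
/// Returns (final shared value, true max of all inputs, elapsed time).
pub fn hammer_max(threads: u64, iters: u64, with_check: bool) -> (u64, u64, Duration) {
    let shared = AtomicU64::new(0);
    let start = Instant::now();

    let expected = thread::scope(|s| {
        let handles: Vec<_> = (0..threads)
            .map(|t| {
                let shared = &shared;
                s.spawn(move || {
                    // Per-thread seed; nonzero so xorshift never gets stuck.
                    let mut state = 0x9E37_79B9_7F4A_7C15u64 ^ (t + 1);
                    let mut local_max = 0u64;
                    for _ in 0..iters {
                        let x = xorshift(&mut state);
                        local_max = local_max.max(x);
                        // Without the check, every iteration issues the RMW.
                        if !with_check || shared.load(Ordering::Relaxed) < x {
                            shared.fetch_max(x, Ordering::Relaxed);
                        }
                    }
                    local_max
                })
            })
            .collect();
        // Largest value any thread generated.
        handles.into_iter().map(|h| h.join().unwrap()).max().unwrap_or(0)
    });

    (shared.load(Ordering::Relaxed), expected, start.elapsed())
}
```

Dividing `threads * iters` by the elapsed time gives operations per second for each variant. Because the shared value only ever grows here, the checked variant ends with the same final value as the unchecked one.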

thequux 6mo ago

I think that this can change the semantics though; with the preceding check you can miss the shared variable being decremented from another thread. In some cases, such as if the shared value is monotonic, this is fine, but not in the general case.
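To make that concrete, here is a Rust sketch that replays one hand-picked interleaving on a single thread, so it is deterministic. The function name `replay_interleaving` and the "thread A / thread B" labels are illustrative assumptions. The probe and the RMW are two separate steps, so a write that lands between them is invisible to the decision.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Replays: A probes, B overwrites the value, A acts on its (now stale) probe.
/// Returns the final value for (checked variant, unchecked variant), where the
/// unchecked variant issues its fetch_max after B's write.
pub fn replay_interleaving(initial: u64, written_by_b: u64, candidate: u64) -> (u64, u64) {
    // Checked variant: the decision is based on the value seen before B's write.
    let shared = AtomicU64::new(initial);
    let probed = shared.load(Ordering::Relaxed); // A: probe
    shared.store(written_by_b, Ordering::Relaxed); // B: write in between
    if probed < candidate {
        shared.fetch_max(candidate, Ordering::Relaxed); // A: RMW, maybe skipped
    }
    let checked = shared.load(Ordering::Relaxed);

    // Unchecked variant: the RMW always reads the latest value.
    let shared = AtomicU64::new(initial);
    shared.store(written_by_b, Ordering::Relaxed); // B: same write
    shared.fetch_max(candidate, Ordering::Relaxed); // A: unconditional RMW
    let unchecked = shared.load(Ordering::Relaxed);

    (checked, unchecked)
}
```

With `replay_interleaving(10, 3, 7)` the checked variant skips the update and ends at 3, while the unchecked one ends at 7. If B only ever raises the value, for example `replay_interleaving(10, 12, 7)`, both end at 12. Whether the first outcome is a problem depends on what the caller expects from the operation.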

