adgjlsfhk1 8mo ago

This depends heavily on what concurrency optimizations your processor implements (and unfortunately this is the sort of thing that doesn't get documented and is somewhat hard to test).

anematode 8mo ago

I did a little unscientific test here on an Apple M4 Pro with n threads spamming atomic operations with pseudorandom values on one memory location (the worst case). Used inline asm to make sure there was no funny business going on.

  atomic adds
  n = 1 ->  333e6 adds/second
  n = 2 ->  174e6
  n = 4 ->   95e6
  n = 8 ->   63e6

  atomic maxs
  n = 1 ->  161e6 maxs/second
  n = 2 ->   59e6
  n = 4 ->   39e6
  n = 8 ->   27e6

  atomic maxs with preceding check
  n = 1 ->  929e6 maxs/second
  n = 2 -> 1541e6
  n = 4 -> 3494e6
  n = 8 -> 5985e6

So evidently the M4 doesn't do this optimization. Of course, if your distribution is different you'd get different results, and this level of contention is unrealistic, but I don't see why you'd EVER not do a check before running atomic max. I also find it interesting that atomic max is significantly slower than atomic add.
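
For anyone who wants to try the "preceding check" variant, here is a minimal Rust sketch of what I understand it to mean. The helper name `checked_fetch_max` and the choice of `AtomicU64` with relaxed ordering are assumptions for illustration, not the code used for the numbers above (which was inline asm).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: atomic max with a plain-load fast path.
/// Returns true if the read-modify-write was actually issued.
fn checked_fetch_max(shared: &AtomicU64, candidate: u64) -> bool {
    // Fast path: an ordinary load. On a cache-coherent machine this can
    // typically be served from a shared copy of the line, without taking
    // exclusive ownership.
    if shared.load(Ordering::Relaxed) >= candidate {
        return false; // already large enough, skip the RMW entirely
    }
    // Slow path: the real atomic max. It stays correct even if another
    // thread raised the value between the load and this instruction.
    shared.fetch_max(candidate, Ordering::Relaxed);
    true
}
```

With pseudorandom inputs the shared value quickly gets close to the top of the range, so nearly every call takes the fast path. That would explain why throughput scales with thread count in the third table.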

thequux 8mo ago

I think that this can change the semantics though; with the preceding check you can miss the shared variable being decremented from another thread. In some cases, such as if the shared value is monotonic, this is fine, but not in the general case.

anematode 8mo ago

With a relaxed ordering I'm not sure that's right, since the ldumax would have no imposed ordering relation with the (atomic) decrement on another thread, and so could very well have operated on the old value obtained by the non-atomic load.

gpderetta 8mo ago

All operations on a single memory location are always totally ordered in a cache-coherent (CC) system, no matter how relaxed the memory model is.

Also, am I understanding correctly that n is the number of threads in your example? Don't you find it suspicious that the number of operations goes up as the thread count goes up?

edit: ok, you are saying that under heavy contention the check avoids having to do the store at all. This is racy, and whether it is correct or not would be very application-specific.

edit2: I thought about this a bit, and I'm not sure I can come up with a scenario where the race matters...

edit3: ... as long as all threads are only doing atomic_max operations on the memory location, which an implementation can't assume.
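
The max-only case from edit3 can be sanity-checked with a small Rust sketch. Everything here (`xorshift64`, `max_only_run`, the seed constant) is a hypothetical name or value picked for illustration. It only covers the situation where every writer performs a max, so the shared value never decreases, and it says nothing about the general case with stores or decrements from other threads.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical xorshift64 step, so the sketch needs no external crate.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Runs `n_threads` threads that only ever do "load, then maybe fetch_max"
/// on one location. Returns (final shared value, true max of all inputs).
fn max_only_run(n_threads: u64, iters: u64) -> (u64, u64) {
    let shared = AtomicU64::new(0);
    let shared_ref = &shared;
    let expected = thread::scope(|s| {
        let handles: Vec<_> = (0..n_threads)
            .map(|t| {
                s.spawn(move || {
                    // Nonzero per-thread seed (assumed constant, any would do).
                    let mut state = 0x9E37_79B9_7F4A_7C15u64 ^ (t + 1);
                    let mut local_max = 0u64;
                    for _ in 0..iters {
                        let x = xorshift64(&mut state);
                        local_max = local_max.max(x);
                        // Preceding check: skip the RMW when it cannot win.
                        if shared_ref.load(Ordering::Relaxed) < x {
                            shared_ref.fetch_max(x, Ordering::Relaxed);
                        }
                    }
                    local_max
                })
            })
            .collect();
        // Largest value any thread tried to publish.
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .max()
            .unwrap_or(0)
    });
    (shared.load(Ordering::Relaxed), expected)
}
```

Since the value is monotonic here, a skipped update means the location was already at least as large as the candidate and stays that way, so both returned numbers should agree. One thing the fast path does change even in this case: a skipped call is no longer a read-modify-write, which matters if the max was also meant to carry release ordering.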

ibraheemdev 8mo ago

It does make a difference, of course, if you're running fetch_max from multiple threads; adding a load fast-path introduces a race condition.
