atomic adds
n = 1 -> 333e6 adds/second
n = 2 -> 174e6
n = 4 -> 95e6
n = 8 -> 63e6
atomic maxs
n = 1 -> 161e6 maxs/second
n = 2 -> 59e6
n = 4 -> 39e6
n = 8 -> 27e6
atomic maxs with preceding check
n = 1 -> 929e6 maxs/second
n = 2 -> 1541e6
n = 4 -> 3494e6
n = 8 -> 5985e6
So evidently the M4 doesn't do this optimization. Of course, if your distribution is different you'd get different results, and this level of contention is unrealistic, but I don't see why you'd EVER not do a check before running atomic max. I also find it interesting that atomic max is significantly slower than atomic add.

Also, am I understanding it correctly that n is the number of threads in your example? Don't you find it suspicious that the number of operations goes up as the thread count goes up?
edit: OK, you are saying that under heavy contention the check avoids having to do the store at all. This is racy, and whether it is correct would be very application-specific.
edit2: I thought about this a bit, and I'm not sure I can come up with a scenario where the race matters...
edit3: ... as long as all threads are only doing atomic_max operations on the memory location, which an implementation can't assume.
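
For anyone following along, here is a rough sketch of what "atomic max with preceding check" could look like. It is not the benchmark code from the parent comment. The names `checked_fetch_max`, `xorshift` and `run`, the relaxed ordering, and the pseudo-random value distribution are all assumptions made for illustration. The stale load is harmless here only because every writer does nothing but max operations, so the stored value never decreases.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Atomic max with a preceding plain load (hypothetical helper).
/// Returns true if the read-modify-write was actually issued.
/// Assumes every writer to `cell` only ever performs max operations,
/// so a stale load can only underestimate the stored value.
fn checked_fetch_max(cell: &AtomicU64, candidate: u64) -> bool {
    if cell.load(Ordering::Relaxed) >= candidate {
        return false; // fast path: read-only, no store needed
    }
    cell.fetch_max(candidate, Ordering::Relaxed);
    true
}

/// Tiny xorshift generator, used only to get an arbitrary value distribution.
fn xorshift(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Runs `threads` workers doing `per_thread` checked maxes each.
/// Returns (final atomic value, true max of all inputs, RMWs issued).
fn run(threads: u64, per_thread: u64) -> (u64, u64, u64) {
    let cell = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let cell = Arc::clone(&cell);
            thread::spawn(move || {
                let mut state = t + 1; // nonzero seed per thread
                let mut local_max = 0u64;
                let mut rmws = 0u64;
                for _ in 0..per_thread {
                    let v = xorshift(&mut state);
                    local_max = local_max.max(v);
                    if checked_fetch_max(&cell, v) {
                        rmws += 1;
                    }
                }
                (local_max, rmws)
            })
        })
        .collect();

    let mut expected = 0u64;
    let mut rmws = 0u64;
    for h in handles {
        let (m, r) = h.join().unwrap();
        expected = expected.max(m);
        rmws += r;
    }
    (cell.load(Ordering::Relaxed), expected, rmws)
}
```

Calling `run(4, 100_000)` and comparing the third value against the 400,000 total operations shows how many calls actually reached the read-modify-write. With a distribution like this one, most calls should take the read-only path, which would be consistent with throughput going up with thread count, though that depends heavily on the input values.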