- the value is often doesn't require an update, and
- there's contention on the cache line, i.e., at least two cores frequently read or write that cache line.
But there are important details to consider:
1) The probing load must be atomic. Both the compiler and the processor in general are allowed to split non-atomic loads into two or more partial loads. Only atomic loads – even with relaxed ordering – are guaranteed to not return intermediate or mixed values from other atomic stores.
2) If the ordering on the read part of the atomic read-modify-write operation is not relaxed, the probing load must reflect this. For example, an acq-rel RMW op would require an acquire ordering on the probing read.
Well, if you target a specific architecture, then of course you can assume more guarantees than in general, portable code. And in general, a processor might distinguish between non-atomic and relaxed-atomic reads and writes – in theory.
But more important, and relevant in practice, is the behavior of the compiler. C, C++, and Rust compilers are allowed to assume that non-atomic reads aren't influenced by concurrent writes, so the compiler is allowed to split non-atomic reads into smaller reads (unlikely) or even optimize the reads away if it can prove that the memory location isn't written to by the local thread (more likely).
atomic adds
n = 1 -> 333e6 adds/second
n = 2 -> 174e6
n = 4 -> 95e6
n = 8 -> 63e6
atomic maxs
n = 1 -> 161e6 maxs/second
n = 2 -> 59e6
n = 4 -> 39e6
n = 8 -> 27e6
atomic maxs with preceding check
n = 1 -> 929e6 maxs/second
n = 2 -> 1541e6
n = 4 -> 3494e6
n = 8 -> 5985e6
So evidently the M4 doesn't do this optimization. Of course if your distribution is different you'd get different results, and this level of contention is unrealistic, but I don't see why you'd EVER not do a check before running atomic max. I also find it interesting that atomic max is significantly slower than atomic add