> An atomic read-modify-write.
No, this also applies to (non-relaxed) atomic loads and stores, depending on the platform.
> Atomic non-seq-cst load/stores can be cheap.
Relaxed atomic loads and stores are always cheap, but anything stronger requires additional memory-ordering instructions (barriers or dedicated acquire/release instructions) on many platforms, most notably ARM.
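To make the distinction concrete, here is a minimal C++ sketch of a release store paired with an acquire load. The names (`payload`, `ready`, `run_handoff`) are made up for illustration. On AArch64 a compiler will typically emit `stlr`/`ldar` for these two accesses instead of the plain `str`/`ldr` it would use for relaxed ones, whereas on x86 they compile to ordinary `mov`s.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                   // plain, non-atomic data
std::atomic<bool> ready{false};    // flag that publishes the payload

void producer() {
    payload = 42;
    // Release: the write to payload cannot be reordered after this store.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire: once we observe true, the producer's earlier writes are visible.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Hypothetical helper: runs one producer/consumer handoff and
// returns the value the consumer saw.
int run_handoff() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

With `memory_order_relaxed` on both sides, the read of `payload` would be a data race, so the stronger ordering is exactly what you pay for.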
Here we are talking specifically about mutexes, which follow acquire/release semantics.
To be clear: locking an uncontended mutex is indeed much, much cheaper than an actual call into the kernel, but it is not free either.
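As a rough illustration of where that cost comes from, here is a toy spinlock in C++. This is only a sketch, not how any particular mutex is implemented, and `SpinLock` and `count_with_lock` are hypothetical names. It does have the same shape as an uncontended fast path, though: one atomic read-modify-write with acquire ordering to lock, one release store to unlock, and no system call.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Atomic RMW with acquire ordering; succeeds on the first try
        // when the lock is uncontended.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release store: publishes the critical section's writes.
        locked_.store(false, std::memory_order_release);
    }
};

// Hypothetical helper: each thread increments a plain counter
// `iters` times under the lock.
long count_with_lock(int threads, int iters) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Even with a single thread and no contention at all, every `lock()`/`unlock()` pair still executes the atomic exchange and the ordered store.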