x86, as wonderfully CISC-y as it is, has register-memory ops (plus a few memory-memory ones, such as the string instructions) with various fun addressing modes.
The most important atomic read-modify-write instruction is fetch-and-add. Next in importance are fetch-and-or, fetch-and-and and fetch-and-xor, followed by fetch-and-max and fetch-and-min (signed and unsigned).
AArch64 has had all of them since Armv8.1-A (in Cortex cores, since Cortex-A55 and Cortex-A75), while RISC-V also has all of them as the AMO instructions of its "A" extension.
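For anyone who wants to see these operations from the software side, here is a minimal sketch using C11 <stdatomic.h>. The helper names (fetch_add_i32, fetch_or_i32, fetch_max_i32, usage_example) are made up for illustration. C11 has no fetch-and-max, so that one is emulated with a compare-exchange loop. Whether a compiler turns any of this into a single AMO or LSE instruction depends on the target and flags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Fetch-and-add: returns the old value, memory ends up holding old + v. */
static inline int32_t fetch_add_i32(_Atomic int32_t *p, int32_t v) {
    return atomic_fetch_add_explicit(p, v, memory_order_relaxed);
}

/* Fetch-and-or: returns the old value, memory ends up holding old | v. */
static inline int32_t fetch_or_i32(_Atomic int32_t *p, int32_t v) {
    return atomic_fetch_or_explicit(p, v, memory_order_relaxed);
}

/* Signed fetch-and-max, emulated with a CAS loop (an assumption of this
 * sketch, since C11 offers no atomic_fetch_max). Returns the old value. */
static inline int32_t fetch_max_i32(_Atomic int32_t *p, int32_t v) {
    int32_t old = atomic_load_explicit(p, memory_order_relaxed);
    while (old < v &&
           !atomic_compare_exchange_weak_explicit(p, &old, v,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* A failed exchange reloads 'old' with the current value; retry. */
    }
    return old;
}

/* Hypothetical usage: every helper hands back the value seen before the update. */
static inline void usage_example(void) {
    _Atomic int32_t x = 5;
    assert(fetch_add_i32(&x, 3) == 5);   /* x is now 8  */
    assert(fetch_max_i32(&x, 20) == 8);  /* x is now 20 */
    assert(fetch_max_i32(&x, 1) == 20);  /* x stays 20  */
    assert(fetch_or_i32(&x, 3) == 20);   /* x is now 23 */
    assert(atomic_load(&x) == 23);
}
```

The relaxed memory order is used only to keep the sketch short; real code would pick the ordering its algorithm needs.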
The RISC-V AMO instructions are designed and intended to be implemented such that the arithmetic part is NOT executed in the CPU in a read-modify-write sequence -- only very low-end CPUs (microcontrollers that don't have a real memory hierarchy or multiple processors anyway) do it that way.
What actually happens is this:
amoadd.w rd,rs2,(rs1)
All of rs1 (the memory address), rs2 (the amount to be added), and a field indicating that this is an AMOADD, not AMOMIN, AMOXOR etc. or even a plain store, are sent out in parallel fields on the peripheral bus (e.g. TileLink-C or TileLink-UH) until the request gets to either the actual endpoint containing the target address (perhaps an I/O device and register), or the point where the target address is found to be accessed by a simple read/write (TileLink-UL) bus -- this is often the last-level cache controller. But it could also be an L1 or L2 cache for another CPU core, or in another CPU cluster, or even in a completely different computer, with the addr/data/op triple passing over 400G Ethernet or NVMe or something on the way.
In either case, this point is much closer to the data than the originating CPU is. The TileLink device at that point atomically reads the memory contents, performs the arithmetic, stores the new value, and then sends the old value back to the CPU just the same as for a memory read.
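The flow described above can be modelled in a few lines of C. This is only a toy, single-threaded sketch and not TileLink itself: the types amo_op, amo_request and endpoint, the 16-word memory, and the functions endpoint_handle and core_amoadd_w are all invented for illustration. The unsigned min/max variants are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode field that travels with the address and operand. */
typedef enum { OP_SWAP, OP_ADD, OP_XOR, OP_OR, OP_AND, OP_MIN, OP_MAX } amo_op;

/* The addr/data/op triple the core sends out. */
typedef struct {
    size_t  addr;     /* word index at the endpoint (stands in for rs1) */
    int32_t operand;  /* value from rs2 */
    amo_op  op;
} amo_request;

/* Hypothetical endpoint that owns the data, e.g. a last-level cache controller. */
typedef struct {
    int32_t words[16];
} endpoint;

/* Runs at the endpoint: read, compute, store, reply with the old value.
 * The model is single-threaded, so the sequence is trivially atomic. */
static inline int32_t endpoint_handle(endpoint *ep, amo_request req) {
    assert(req.addr < 16);
    int32_t old = ep->words[req.addr];
    int32_t result = old;
    switch (req.op) {
    case OP_SWAP: result = req.operand; break;
    case OP_ADD:  /* add in unsigned arithmetic so overflow wraps around */
        result = (int32_t)((uint32_t)old + (uint32_t)req.operand); break;
    case OP_XOR:  result = old ^ req.operand; break;
    case OP_OR:   result = old | req.operand; break;
    case OP_AND:  result = old & req.operand; break;
    case OP_MIN:  result = old < req.operand ? old : req.operand; break;
    case OP_MAX:  result = old > req.operand ? old : req.operand; break;
    }
    ep->words[req.addr] = result;
    return old;
}

/* Core side: amoadd.w is a request that carries data and gets a
 * load-style reply; the core itself does no arithmetic. */
static inline int32_t core_amoadd_w(endpoint *ep, size_t addr, int32_t rs2) {
    amo_request req = { addr, rs2, OP_ADD };
    return endpoint_handle(ep, req);  /* rd receives the old value */
}
```

Note that the core-side function only builds the request and waits for the reply, which is the sense in which the AMO looks like a load carrying extra data.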
From the CPU's point of view, the AMO is just like a memory read, except extra data (the value to be swapped, added, xored etc.) is sent with the address ... so in that respect it's like a write.
AMO instructions do not add any new complexity or state sequencing to the CPU core, compared to simple load/store.
I use TileLink as the example bus, as it was co-developed with RISC-V at Berkeley, but many RISC-V CPU cores can use AXI (or both), and Arm has recently added similar capability to AXI.
So this RISC-V extension, while very useful and actually mandatory for any CPU with a high core count, contains nothing original or new; it just follows a 40-year-old established practice.
From the point of view of the memory, these atomic operations are always read-modify-write cycles. Whether the data travels all the way up to a CPU core or only along a shorter path depends on the implementation, but this does not change the meaning of the operation.