
musicale 2y ago

Classic RISC "load/store" architectures don't have register-memory ops at all, other than load and store.

x86, as wonderfully CISC-y as it is, has register-memory and memory-memory ops with various fun addressing modes.

adrian_b 2y ago

While load and store instructions are enough most of the time, all RISC ISAs were eventually forced to add a few atomic read-modify-write instructions, because without them programs for systems with a large number of cores become too inefficient.

The most important atomic read-modify-write instruction is fetch-and-add. Next in importance are fetch-and-or, fetch-and-and and fetch-and-xor, followed by fetch-and-max and fetch-and-min (signed and unsigned).

AArch64 has had all of them since Armv8.1-A (in cores starting with Cortex-A55 and Cortex-A75), while RISC-V also has all of them in one of its extensions (the AMO instructions of the "A" extension).
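As a rough illustration of the operations listed above, here is a minimal sketch using C11 `<stdatomic.h>`. The names `fetch_max_int` and `demo_fetch_ops` are hypothetical helpers made up for this example. C11 has no fetch-and-max, so it is emulated here with a compare-and-swap loop, which is not how a native AMO would do it. Compilers targeting Armv8.1-A or RISC-V with the "A" extension will typically lower the `atomic_fetch_*` calls to single instructions, but that depends on the target flags.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper: C11 has no atomic_fetch_max, so emulate it with a
   compare-and-swap loop. Returns the value seen before the update. */
int fetch_max_int(atomic_int *obj, int arg) {
    int old = atomic_load(obj);
    while (old < arg && !atomic_compare_exchange_weak(obj, &old, arg)) {
        /* On failure, 'old' has been reloaded with the current value; retry. */
    }
    return old;
}

/* Hypothetical demo: every fetch-and-op returns the OLD value and
   stores the combined value back in one atomic step. */
int demo_fetch_ops(void) {
    atomic_int counter;
    atomic_uint flags;
    atomic_init(&counter, 5);
    atomic_init(&flags, 0xF0u);

    int old_add = atomic_fetch_add(&counter, 3);        /* counter: 5 -> 8 */
    unsigned old_or = atomic_fetch_or(&flags, 0x0Fu);   /* flags: 0xF0 -> 0xFF */
    unsigned old_and = atomic_fetch_and(&flags, 0x3Cu); /* flags: 0xFF -> 0x3C */
    unsigned old_xor = atomic_fetch_xor(&flags, 0xFFu); /* flags: 0x3C -> 0xC3 */
    int old_max = fetch_max_int(&counter, 20);          /* counter: 8 -> 20 */

    assert(old_add == 5 && old_max == 8);
    assert(old_or == 0xF0u && old_and == 0xFFu && old_xor == 0x3Cu);
    assert(atomic_load(&flags) == 0xC3u);
    return atomic_load(&counter);
}
```

Calling `demo_fetch_ops()` from a `main` function exercises all five operations on local variables; in real use the atomics would be shared between threads.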

brucehoult 2y ago

That is solving a completely different problem.

The RISC-V AMO instructions are designed and intended to be implemented such that the arithmetic part is NOT executed in the CPU in a read-modify-write sequence -- only very low-end CPUs (microcontrollers that don't have a real memory hierarchy or multiple processors anyway) do it that way.

What actually happens is this:

    amoadd.w rd,rs2,(rs1)

The contents of rs1 (the memory address) and rs2 (the amount to be added), along with a field indicating that this is an AMOADD, not AMOMIN, AMOXOR, etc., or even a plain store, are sent out in parallel fields on the peripheral bus (e.g. TileLink-C or TileLink-UH) until the request reaches either the actual endpoint containing the target address (perhaps an I/O device and register), or the point where the target address is found to be accessed by a simple read/write (TileLink-UL) bus -- this is often the last-level cache controller. But it could also be an L1 or L2 cache for another CPU core, or in another CPU cluster, or even in a completely different computer, with the addr/data/op triple passing over 400G Ethernet or NVMe or something on the way.

In either case, this point is much closer to the data than the originating CPU is. The TileLink device at that point atomically reads the memory contents, performs the arithmetic, stores the new value, and then sends the old value back to the CPU just the same as for a memory read.
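The flow described above can be sketched as a small single-threaded model in C. Everything here is an assumption made for illustration: the `amo_op` codes, the `amo_request` struct, and the functions `endpoint_execute` and `cpu_amoadd_w` are hypothetical names, not TileLink's actual message encoding, and a plain array stands in for memory. The point is only that the CPU side sends an (address, operand, op) triple and gets the old value back, while the read-modify-write happens at the endpoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation codes; not the real bus encoding. */
typedef enum { OP_SWAP, OP_ADD, OP_AND, OP_OR, OP_XOR, OP_MIN, OP_MAX } amo_op;

/* The triple that travels from the CPU toward the data. */
typedef struct {
    size_t  index;   /* stands in for the address held in rs1 */
    int32_t operand; /* the value held in rs2 */
    amo_op  op;      /* which arithmetic the endpoint should perform */
} amo_request;

/* Model of the endpoint (e.g. a cache controller). In hardware this whole
   read-modify-write is atomic; here it is just a single-threaded function. */
int32_t endpoint_execute(int32_t *memory, amo_request req) {
    assert(memory != NULL);
    int32_t old = memory[req.index];
    int32_t updated = old;
    switch (req.op) {
    case OP_SWAP: updated = req.operand; break;
    case OP_ADD:  /* add as unsigned so overflow wraps like the hardware */
        updated = (int32_t)((uint32_t)old + (uint32_t)req.operand); break;
    case OP_AND:  updated = old & req.operand; break;
    case OP_OR:   updated = old | req.operand; break;
    case OP_XOR:  updated = old ^ req.operand; break;
    case OP_MIN:  updated = old < req.operand ? old : req.operand; break;
    case OP_MAX:  updated = old > req.operand ? old : req.operand; break;
    }
    memory[req.index] = updated;
    return old; /* sent back to the CPU like the result of a read */
}

/* CPU side of "amoadd.w rd,rs2,(rs1)": build the request, receive rd. */
int32_t cpu_amoadd_w(int32_t *memory, size_t index, int32_t rs2) {
    amo_request req = { index, rs2, OP_ADD };
    return endpoint_execute(memory, req);
}
```

Note that `cpu_amoadd_w` contains no arithmetic of its own, which mirrors the claim that the core only issues something shaped like a read carrying extra data.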

From the CPU's point of view, the AMO is just like a memory read, except extra data (the value to be swapped, added, xored etc) is sent with the address ... so that's like a write.

AMO instructions do not add any new complexity or state sequencing to the CPU core, compared to simple load/store.

I use TileLink as the example bus, as it was co-developed with RISC-V at Berkeley, but many RISC-V CPU cores can use AXI (or both), and Arm has recently added similar capability to AXI.

adrian_b 2y ago

This kind of implementation (with the computations done locally in the memory, memory controller or cache controller) was already proposed and implemented in 1981, in the NYU Ultracomputer project, i.e. the first time such fetch-and-operation instructions were proposed as a superior alternative to the swap or test-and-set instructions that had previously been used to ensure mutual exclusion in multiprocessor systems.

So this RISC-V extension, while very useful and actually mandatory for any CPU with a high core count, contains nothing original or new; it just follows a 40-year-old established practice.

From the point of view of the memory, these atomic operations are always read-modify-write cycles. Whether the data travels up to a CPU core or only along a shorter path depends on the implementation, but this does not change the meaning of the operation.
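To make the comparison between test-and-set and fetch-and-operation concrete, here is a minimal sketch in C11 of the same shared-counter increment done both ways. The type `locked_counter` and the functions `locked_counter_init`, `locked_increment` and `fetch_add_increment` are hypothetical names chosen for this example; they are not taken from the Ultracomputer work or from any ISA manual.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Counter guarded by a test-and-set spinlock. */
typedef struct {
    atomic_flag lock; /* set while some thread is inside the critical section */
    long value;       /* only touched while the lock is held */
} locked_counter;

void locked_counter_init(locked_counter *c) {
    assert(c != NULL);
    atomic_flag_clear(&c->lock);
    c->value = 0;
}

/* Test-and-set version: acquire, read, modify, write, release.
   Every contending core spins on the same flag. */
long locked_increment(locked_counter *c) {
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire)) {
        /* spin until the current holder clears the flag */
    }
    long old = c->value;
    c->value = old + 1;
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
    return old;
}

/* Fetch-and-add version: one atomic operation, no lock, no spinning. */
long fetch_add_increment(atomic_long *c) {
    return atomic_fetch_add(c, 1);
}
```

Both functions return the previous value, so callers see the same semantics. The difference is that the second needs a single memory operation, which can be carried out wherever the data currently lives.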
