On x86_64, on modern micro-architectures: "rep stos[bwdq]" and "rep movs[bwdq]". I bet that, in modern binaries, memcpy/memset call sites are actually placeholders for such instructions, patched in before the memory segment goes back to read/executable. The argument registers are rdi, rsi, rdx (rcx would be pushed on the stack, or the code generated to account for rcx being available at the call site).
Also, expect x86_64 -> RISC-V porting bugs, because the size names shift: x86 byte = RISC-V byte, x86 word = RISC-V halfword, x86 doubleword = RISC-V word, x86 quadword = RISC-V doubleword.
You'd lose that bet.
An optimised memcpy/memset using normal instructions is typically much faster than "rep movsb" (etc).
It is however a lot of code, so "rep movsb" has its place in low-memory low-performance settings.
> hardware optimized and accelerated short and big memcpy/memset
Classic CISC fallacy: if they made an instruction to do it then it must be the fastest way to do it.
Nope. That wasn't even the intention of the designers of things such as the VAX and 8086. Those complex instructions were provided to let assembly language programmers write code a little more quickly, even if it ran a little more slowly, because according to the 1970s "software crisis" theory computers were rapidly getting cheaper, but programmers were scarce and expensive, and vast amounts of software needed to be written.
The whole key was to make (assembly language) programmers more productive, and if that made the code slightly inefficient that didn't matter because you could easily buy an extra computer or two, and anyway next year's model will be faster.
> Beginning with processors based on Ivy Bridge microarchitecture, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations.
> Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long.
* https://www-ssl.intel.com/content/www/us/en/architecture-and...
An optimized memcpy/memset using normal instructions is much faster than "rep movsb"/"rep stosb" only in certain ranges of the copy/fill size (on all modern Intel/AMD CPUs).
Using normal instructions for memcpy can be about twice as fast for copy sizes under 1 kB, but it is always slower for very big copies.
For an optimized memcpy/memset, one must choose between normal instructions and "rep movsb"/"rep stosb" for each copy/fill, depending on the CPU model and on the copy/fill size.
Yeah, I don't know why everyone doesn't just call it int8, int16 and so on. That would be much better. This "word" naming is just confusing.
On all CPUs, "rep movsb" is slower for very short copies. The threshold under which "rep movsb" becomes slower depends on the CPU model. For example, on a Zen 3 the threshold is slightly above 2 kilobytes, so for copies up to 2 kB one should use SSE/AVX and above 2 kB one should use "rep movsb".
On Zen 3 there is a second range where "rep movsb" is slower, approximately between 1 MB and 20 MB (i.e. when the operands are in the L3 cache memory).
For any larger copies "rep movsb" is again faster.
So depending on the size of the copy and on the CPU model, an optimized memcpy should choose either "rep movsb" or SSE/AVX for each copy.
A simplified criterion that should be acceptable on most recent CPUs would be to always use "rep movsb" for sizes of one memory page (4 kB) or more and to use SSE/AVX for the shorter copies.
RISC-V does not have memory to memory ops.
x86, as wonderfully CISC-y as it is, has register-memory and memory-memory ops with various fun addressing modes.
    memcpy:
        mv      a3, a0                  # Copy destination
    loop:
        vsetvli t0, a2, e8, m8, ta, ma  # Vectors of 8-bit elements
        vle8.v  v0, (a1)                # Load bytes
        add     a1, a1, t0              # Bump source pointer
        sub     a2, a2, t0              # Decrement count
        vse8.v  v0, (a3)                # Store bytes
        add     a3, a3, t0              # Bump destination pointer
        bnez    a2, loop                # Any more?
        ret                             # Return
Are you sure `rep stos/movs` are actually optimal on x86_64 systems?

Edit: I just ran tinymembench on my CPU (Ryzen 5 1600X):
    C copy backwards                                 : 7300.7 MB/s (1.2%)
    C copy backwards (32 byte blocks)                : 7330.5 MB/s (1.5%)
    C copy backwards (64 byte blocks)                : 7313.6 MB/s (0.7%)
    C copy                                           : 7385.3 MB/s (1.0%)
    C copy prefetched (32 bytes step)                : 7737.9 MB/s (1.0%)
    C copy prefetched (64 bytes step)                : 7701.1 MB/s (1.6%)
    C 2-pass copy                                    : 6414.2 MB/s (2.1%)
    C 2-pass copy prefetched (32 bytes step)         : 6947.9 MB/s (1.4%)
    C 2-pass copy prefetched (64 bytes step)         : 6985.8 MB/s (1.5%)
    C fill                                           : 9197.2 MB/s (1.2%)
    C fill (shuffle within 16 byte blocks)           : 9193.0 MB/s (1.4%)
    C fill (shuffle within 32 byte blocks)           : 9175.0 MB/s (2.2%)
    C fill (shuffle within 64 byte blocks)           : 9229.0 MB/s (1.1%)
    ---
    standard memcpy                                  : 11302.6 MB/s (1.2%)
    standard memset                                  : 11046.1 MB/s (1.4%)
    ---
    MOVSB copy                                       : 7668.6 MB/s (1.5%)
    MOVSD copy                                       : 7607.0 MB/s (0.8%)
    SSE2 copy                                        : 7987.0 MB/s (5.0%)
    SSE2 nontemporal copy                            : 11989.2 MB/s (2.7%)
    SSE2 copy prefetched (32 bytes step)             : 7739.9 MB/s (1.3%)
    SSE2 copy prefetched (64 bytes step)             : 7807.6 MB/s (2.9%)
    SSE2 nontemporal copy prefetched (32 bytes step) : 12503.7 MB/s (1.5%)
    SSE2 nontemporal copy prefetched (64 bytes step) : 12605.2 MB/s (2.5%)
    SSE2 2-pass copy                                 : 6977.1 MB/s (1.7%)
    SSE2 2-pass copy prefetched (32 bytes step)      : 7311.1 MB/s (1.8%)
    SSE2 2-pass copy prefetched (64 bytes step)      : 7334.7 MB/s (1.5%)
    SSE2 2-pass nontemporal copy                     : 3223.3 MB/s
    SSE2 fill                                        : 10919.1 MB/s (1.8%)
    SSE2 nontemporal fill                            : 30713.9 MB/s (1.8%)

Might be useful when I'm tinkering with my toy compiler.
https://www.amd.com/en/support/tech-docs/amd64-architecture-...
https://developer.arm.com/documentation/ddi0602/2023-06/
There seem to be some instruction set summaries on the web too like:
https://developer.arm.com/documentation/qrc0001/m
https://www.cs.swarthmore.edu/~kwebb/cs31/resources/ARM64_Ch...
https://courses.cs.washington.edu/courses/cse469/18wi/Materi...
https://www.felixcloutier.com/x86/
That last one seems like a real gem for intel/x86.
These architectures might have a few more instructions than RISC-V though, and the encoding (especially amd64) may be more complicated.