On x86_64, on modern micro-architectures: "rep stos[bwdq]" and "rep movs[bwdq]". I bet that, in modern binaries, memcpy/memset call sites are actually placeholders for such instructions, patched in before the memory segment goes back to read/executable. The argument registers are rdi, rsi, rdx (rcx would be pushed on the stack, or the code generated to account for rcx being available at the call site).
Also, expect x86_64 -> RISC-V porting bugs, because the size names shift: x86 byte = RISC-V byte, x86 word = RISC-V halfword, x86 doubleword = RISC-V word, x86 quadword = RISC-V doubleword.
You'd lose that bet.
An optimised memcpy/memset using normal instructions is typically much faster than "rep movsb" (etc).
It is however a lot of code, so "rep movsb" has its place in low-memory low-performance settings.
> hardware optimized and accelerated short and big memcpy/memset
Classic CISC fallacy: if they made an instruction to do it then it must be the fastest way to do it.
Nope. That wasn't even the intention of the designers of things such as the VAX and 8086. Those complex instructions were provided to let assembly language programmers write code a little more quickly, even if it ran a little more slowly, because according to the 1970s "software crisis" theory computers were rapidly getting cheaper, but programmers were scarce and expensive, and vast amounts of software needed to be written.
The whole key was to make (assembly language) programmers more productive, and if that made the code slightly inefficient that didn't matter because you could easily buy an extra computer or two, and anyway next year's model will be faster.
> Beginning with processors based on Ivy Bridge microarchitecture, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations.
> Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long.
* https://www-ssl.intel.com/content/www/us/en/architecture-and...
An optimized memcpy/memset using normal instructions is much faster than "rep movsb"/"rep stosb" only in certain ranges of the copy/fill size (on all modern Intel/AMD CPUs).
Using normal instructions for memcpy can be about twice as fast for copy sizes under 1 kB, but it is always slower for very big copies.
For an optimized memcpy/memset, one must choose between normal instructions and "rep movsb"/"rep stosb" for each copy/fill, depending on the CPU model and on the copy/fill size.
Yeah, I don't know why everyone doesn't just call it int8, int16 and so on. That would be much better. This "word" naming is just confusing.
On all CPUs, "rep movsb" is slower for very short copies. The threshold under which "rep movsb" becomes slower depends on the CPU model. For example, on a Zen 3 the threshold is slightly above 2 kilobytes, so for copies up to 2 kB one should use SSE/AVX and above 2 kB one should use "rep movsb".
On Zen 3 there is a second range where "rep movsb" is slower, approximately between 1 MB and 20 MB (i.e. when the operands are in the L3 cache memory).
For any larger copies "rep movsb" is again faster.
So depending on the size of the copy and on the CPU model, an optimized memcpy should choose either "rep movsb" or SSE/AVX for each copy.
A simplified criterion that should be acceptable on most recent CPUs would be to always use "rep movsb" for sizes of one memory page (4 kB) or more and to use SSE/AVX for the shorter copies.
RISC-V does not have memory to memory ops.
x86, as wonderfully CISC-y as it is, has register-memory and memory-memory ops with various fun addressing modes.
    memcpy:
        mv      a3, a0                  # Copy destination
    loop:
        vsetvli t0, a2, e8, m8, ta, ma  # Vectors of 8-bit elements
        vle8.v  v0, (a1)                # Load bytes
        add     a1, a1, t0              # Bump source pointer
        sub     a2, a2, t0              # Decrement count
        vse8.v  v0, (a3)                # Store bytes
        add     a3, a3, t0              # Bump destination pointer
        bnez    a2, loop                # Any more?
        ret                             # Return
Are you sure `rep stos/movs` are actually optimal on x86_64 systems?

Edit: I just ran tinymembench on my CPU (Ryzen 5 1600X):
    C copy backwards                                 : 7300.7 MB/s (1.2%)
    C copy backwards (32 byte blocks)                : 7330.5 MB/s (1.5%)
    C copy backwards (64 byte blocks)                : 7313.6 MB/s (0.7%)
    C copy                                           : 7385.3 MB/s (1.0%)
    C copy prefetched (32 bytes step)                : 7737.9 MB/s (1.0%)
    C copy prefetched (64 bytes step)                : 7701.1 MB/s (1.6%)
    C 2-pass copy                                    : 6414.2 MB/s (2.1%)
    C 2-pass copy prefetched (32 bytes step)         : 6947.9 MB/s (1.4%)
    C 2-pass copy prefetched (64 bytes step)         : 6985.8 MB/s (1.5%)
    C fill                                           : 9197.2 MB/s (1.2%)
    C fill (shuffle within 16 byte blocks)           : 9193.0 MB/s (1.4%)
    C fill (shuffle within 32 byte blocks)           : 9175.0 MB/s (2.2%)
    C fill (shuffle within 64 byte blocks)           : 9229.0 MB/s (1.1%)
    ---
    standard memcpy                                  : 11302.6 MB/s (1.2%)
    standard memset                                  : 11046.1 MB/s (1.4%)
    ---
    MOVSB copy                                       : 7668.6 MB/s (1.5%)
    MOVSD copy                                       : 7607.0 MB/s (0.8%)
    SSE2 copy                                        : 7987.0 MB/s (5.0%)
    SSE2 nontemporal copy                            : 11989.2 MB/s (2.7%)
    SSE2 copy prefetched (32 bytes step)             : 7739.9 MB/s (1.3%)
    SSE2 copy prefetched (64 bytes step)             : 7807.6 MB/s (2.9%)
    SSE2 nontemporal copy prefetched (32 bytes step) : 12503.7 MB/s (1.5%)
    SSE2 nontemporal copy prefetched (64 bytes step) : 12605.2 MB/s (2.5%)
    SSE2 2-pass copy                                 : 6977.1 MB/s (1.7%)
    SSE2 2-pass copy prefetched (32 bytes step)      : 7311.1 MB/s (1.8%)
    SSE2 2-pass copy prefetched (64 bytes step)      : 7334.7 MB/s (1.5%)
    SSE2 2-pass nontemporal copy                     : 3223.3 MB/s
    SSE2 fill                                        : 10919.1 MB/s (1.8%)
    SSE2 nontemporal fill                            : 30713.9 MB/s (1.8%)

Might be useful when I'm tinkering with my toy compiler.
https://www.amd.com/en/support/tech-docs/amd64-architecture-...
https://developer.arm.com/documentation/ddi0602/2023-06/
There seem to be some instruction set summaries on the web too like:
https://developer.arm.com/documentation/qrc0001/m
https://www.cs.swarthmore.edu/~kwebb/cs31/resources/ARM64_Ch...
https://courses.cs.washington.edu/courses/cse469/18wi/Materi...
https://www.felixcloutier.com/x86/
That last one seems like a real gem for intel/x86.
These architectures might have a few more instructions than RISC-V though, and the encoding (especially amd64) may be more complicated.