godbolt clang compiles it to:
.LBB5_2: // =>This Inner Loop Header: Depth=1
mul x13, x11, x10
umulh x14, x11, x10
eor x13, x14, x13
mul x14, x13, x12
umulh x13, x13, x12
eor x13, x13, x14
str x13, [x0, x8, lsl #3]
add x8, x8, #2 // =2
cmp x8, x1
add x11, x11, x9
b.lo .LBB5_2
[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...Just staring at the machine code, it looks like the hottest loop for wyrng is about 10 instructions with a store in it. If the processor can do that loop in 1 cycle on average then...holy fuck.
edit: I was looking at similar code generated by clang on my machine. Again, holy fuck.
I don't think the story here is that 64x64=128 multiply is fast, honestly. The real story is the insane level of speculation and huge ROB that is necessary to make that non-unrolled loop go so fast. Everything has to go pretty much perfect to achieve that throughput.
.LBB0_2:
eor x13, x9, x9, lsr #30 # 2 \* p1-6
mul x13, x13, x11 # 1 \* p5-6
eor x13, x13, x13, lsr #27 # 2 \* p1-6
mul x13, x13, x12 # 1 \* p5-6
eor x13, x13, x13, lsr #31 # 2 \* p1-6
str x13, [x0, x10, lsl #3] # 1 \* p7-8
add x13, x10, #2 # 1 \* p1-6
add x9, x9, x8 # 1 \* p1-6
mov x10, x13 # none
cmp x13, x1 #
b.lo .LBB0_2 # Fused into 1 \* p1-3
# Total: 11 uops
.LBB1_2:
mul x13, x9, x11 # 1 \* p5-6
umulh x14, x9, x11 # 1 \* p5-6
eor x13, x14, x13 # 1 \* p1-6
mul x14, x13, x12 # 1 \* p5-6
umulh x13, x13, x12 # 1 \* p5-6
eor x13, x13, x14 # 1 \* p1-6
str x13, [x0, x10, lsl #3] # 1 \* p7-8
add x13, x10, #2 # 1 \* p1-6
add x9, x9, x8 # 1 \* p1-6
mov x10, x13 # none
cmp x13, x1 #
b.lo .LBB1_2 # Fused into 1 \* p1-3
# Total: 10 uops
Purely based on number of uops, there's a slight win for wyhash, all other things being equal. However, I doubt that you're really getting one iteration per second here; there are 6 integer units, and even if you perfectly exploited instruction parallelism you're limited to 6 ALU instructions per cycle, which are less than the extent of either loop. It would be possible if the mul-umulh pairs are getting fused, which would bring it down to 8 uops per iteration.Taking into account the port distribution, each iteration of wyhash involves 4 uops being dispatched to ports 5 and 6, which means you should be getting at least 2 cycles/iteration purely for the multiplications. If it's much lower than that, the whole multiplication being fused into a single port 5-6 uop might be right.
However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.
The very wide execution is though.
The benchmark probably get rids of that by doing it 40,000 times in quick succession, but why not measure the time of all 40,000 iterations in one go, and decrease the risk (and the overhead of calling gettimeofday 39,999 times)?