.LBB0_2:
eor x13, x9, x9, lsr #30 # 2 \* p1-6
mul x13, x13, x11 # 1 \* p5-6
eor x13, x13, x13, lsr #27 # 2 \* p1-6
mul x13, x13, x12 # 1 \* p5-6
eor x13, x13, x13, lsr #31 # 2 \* p1-6
str x13, [x0, x10, lsl #3] # 1 \* p7-8
add x13, x10, #2 # 1 \* p1-6
add x9, x9, x8 # 1 \* p1-6
mov x10, x13 # none
cmp x13, x1 #
b.lo .LBB0_2 # Fused into 1 \* p1-3
# Total: 11 uops
.LBB1_2:
mul x13, x9, x11 # 1 \* p5-6
umulh x14, x9, x11 # 1 \* p5-6
eor x13, x14, x13 # 1 \* p1-6
mul x14, x13, x12 # 1 \* p5-6
umulh x13, x13, x12 # 1 \* p5-6
eor x13, x13, x14 # 1 \* p1-6
str x13, [x0, x10, lsl #3] # 1 \* p7-8
add x13, x10, #2 # 1 \* p1-6
add x9, x9, x8 # 1 \* p1-6
mov x10, x13 # none
cmp x13, x1 #
b.lo .LBB1_2 # Fused into 1 \* p1-3
# Total: 10 uops
Purely based on number of uops, there's a slight win for wyhash, all other things being equal. However, I doubt that you're really getting one iteration per second here; there are 6 integer units, and even if you perfectly exploited instruction parallelism you're limited to 6 ALU instructions per cycle, which are less than the extent of either loop. It would be possible if the mul-umulh pairs are getting fused, which would bring it down to 8 uops per iteration.Taking into account the port distribution, each iteration of wyhash involves 4 uops being dispatched to ports 5 and 6, which means you should be getting at least 2 cycles/iteration purely for the multiplications. If it's much lower than that, the whole multiplication being fused into a single port 5-6 uop might be right.
However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.