This requires a lot more digging to understand.
Simply put, I don't accept the hastily arrived-at conclusion, and wish Daniel would put more effort into investigation in the future. This experiment is a poor example of how to investigate performance on small kernels. You should be looking at the assembly code output by the compiler at this point instead of spitballing.
for (size_t i = 0; i < N; i++) {
out[i++] = g();
}
N is 20000 and the time measured is divided by N. [1] However, that loop has two increments and only computes 10000 numbers. This is also visible in the assembly:
add x8, x8, #2
So if I see this correctly, the results are off by a factor of 2.

[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
The relative speed between the two hashes is still the same, but it is no longer one iteration per cycle.
The article got updated by now :)
It is of course speculating all the way through the loop; a short backwards conditional branch will be speculated as "taken" by even very simple predictors.
Op fusion is very likely, as is register renaming: I suspect that "mul" always computes both products, and the upper one is left in a register which isn't visible to the programmer until they use "mulh" with the same argument. At which point it's just renamed into the target register.
Anyway, the more important fact is that a 64x64b -> 128b mul might be one instruction on x86, but it's broken into 2 µops, because modern CPUs generally aren't designed around a single µop writing two registers in the same register file.
I don't think the conclusion is hasty. Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".
But that doesn't follow either. Only by inspecting the machine code do we get to see what's really going on in a loop, and the ultimate result is dependent on a lot of factors: if the compiler unrolled the loop (here: no), whether there were any spills in the loop (here: no), what the length of the longest dependency chain in the loop is, how many micro-ops for the loop, how many execution ports there are in the processor, and what type, the frontend decode bandwidth (M1: seems up to 5 ins/cycle), whether there is a loop stream buffer (M1: seems no, but most intel processors, yes), the latency of L1 cache, how many loads/stores can be in-flight, etc, etc. These are the things you gotta look at to know the real answer.
It's also worth saying that if Apple were dead set on throughput in this area they could've implemented some non-trivial fusion to improve performance. I don't have an M1 so I can't find out for you (and Apple are steadfast on not documenting anything about the microarchitecture...)
"If both the high and low bits of the same product are required, then the recommended code sequence is [...]. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies."
Iterations: 10000
Instructions: 100000
Total Cycles: 25011
Total uOps: 100000
Dispatch Width: 4
uOps Per Cycle: 4.00
IPC: 4.00
Block RThroughput: 2.5
No resource or data dependency bottlenecks discovered.
That seems like 2.5 cycles per iteration to me (on Zen3).
Tigerlake is a bit worse, at about 3 cycles per iteration, due to running more uOps per iteration, by the looks of it.

For the following loop core (extracted from `clang -O3 -march=znver3`, using trunk (5a8d5a2859d9bb056083b343588a2d87622e76a2)):
.LBB5_2: # =>This Inner Loop Header: Depth=1
mov rdx, r11
add r11, r8
mulx rdx, rax, r9
xor rdx, rax
mulx rdx, rax, r10
xor rdx, rax
mov qword ptr [rdi + 8*rcx], rdx
add rcx, 2
cmp rcx, rsi
jb .LBB5_2

Instruction-recognition logic (DAG analysis, BTW) is harder to implement than a pipelined multiplier. The former is a research project, while the latter was done at the dawn of computing.
How about if there is an instruction between them that does not do arithmetic? (What I'm wondering here is whether the processor recognizes the specific two-instruction sequence, or whether it's something more general, like mul internally producing the full 128 bits, returning the lower 64, and caching the upper 64 bits somewhere, so that if a mulh arrives before something overwrites that cache, it can use it.)
It would be fun to experiment with, for someone who has the hardware. My guess is that swapping the order will make it slower, but adding an independent instruction or two between them probably won't have a measurable effect. It would be fun to try to consistently interrupt the CPU between the two instructions as well somehow, to see if that short-circuits the optimization.
Worse, for some of us when it does finally wake up the monitor, sometimes it wakes it up with all the wrong colors, and rebooting is the only reliable fix. (and before anyone asks, yes, I tried a different HDMI cable)
It's much faster if the monitor has been used recently, though, so I always figured it was the monitor that was causing the delay by going into some deep sleep state?
I don't have any performance issues waking up though.
Have you tried a USB-C/thunderbolt cable/controller tho?
Sometimes this will happen multiple times per page load if I deselect and reselect the password field.
I suspect it’s because I have five monitors and 20 million pixels (actually more as that’s the post-retina resolution).
Rendering an FPS game at 1080p is about 2 million pixels per frame. At 60fps, that's 120 million pixels per second.
What am I missing?
Of course for general tasks it was slower, but I really remember that thing waking up instantly when I raised the lid, every time.
(It's still way faster than the same set of apps on an Intel Mac laptop, where it could sometimes take on the order of 30 seconds to get to a usable desktop after a long sleep. On Intel Macs it seemed more obvious that the GPU was the bottleneck)
I have buggy apps (like Facebook Messenger) locking up, but I guess that's normal, I just uninstall them.
Maybe desktop platforms sleep differently than laptops?
I do occasionally have an issue where the brightness on the built in display is borked and won’t adjust back to the correct level for anywhere between 30s to a few minutes.
And then I don’t know if it’s my monitor or the M1, but sometimes there will be a messed up run of consecutive pixel columns about 1/10th of the screen wide starting about 30% from the left of the display. The entire screen in that region is shifted a few pixels upwards. Sometimes it’s hard to notice it but once you do it can’t be unseen. Replugging the monitor into the M1 resolves the issue.
Because Apple has a lot of capital and they wouldn’t need to compete as hard for their share of tsmc production capacity.
Even then, does Apple use enough chips to justify running a fab, let alone one that would be locked into the node of its time? I really don't see it happening, for many reasons; the only reason they would is some tax-break incentive to onshore some of the money they hold offshore, so that it pays for itself, win or fail.
I don't think that's relevant anymore. My understanding is that the 2017 TCJA required prior unrepatriated earnings to be recognized and taxed over eight years (so still ongoing) and future foreign earnings not subject to US tax (except if the foreign tax is below the corporate alternative minimum tax rate). As a result of those changes, there's no need to hold cash offshore.
https://www.apple.com/shop/buy-mac/macbook-air
https://www.amazon.com/Lenovo-IdeaPad-Laptop-Newest-Display/...