This requires a lot more digging to understand.
Simply put, I don't accept the hastily arrived-at conclusion, and wish Daniel would put more effort into investigation in the future. This experiment is a poor example of how to investigate performance on small kernels. You should be looking at the assembly code output by the compiler at this point instead of spitballing.
for (size_t i = 0; i < N; i++) {
out[i++] = g();
}
N is 20000 and the time measured is divided by N. [1] However, that loop has two increments and only computes 10000 numbers. This is also visible in the assembly:
add x8, x8, #2
So if I see this correctly, the results are off by a factor of 2.

[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
The relative speed between the two hashes is still the same, but it is no longer one iteration per cycle.
The article got updated by now :)
It is of course speculating all the way through the loop; a short backwards conditional branch will be speculated as "taken" by even very simple predictors.
Op fusion is very likely, as is register renaming: I suspect that "mul" always computes both products, and the upper one is left in a register which isn't visible to the programmer until they use "mulh" with the same argument. At which point it's just renamed into the target register.
Anyway, the more important fact is that a 64x64b -> 128b mul might be one instruction on x86, but it's broken into 2 µops, because modern CPUs generally aren't designed around a single µop writing two registers in the same register file.
I don't think the conclusion is hasty. Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".
But that doesn't follow either. Only by inspecting the machine code do we get to see what's really going on in a loop, and the ultimate result is dependent on a lot of factors: if the compiler unrolled the loop (here: no), whether there were any spills in the loop (here: no), what the length of the longest dependency chain in the loop is, how many micro-ops for the loop, how many execution ports there are in the processor, and what type, the frontend decode bandwidth (M1: seems up to 5 ins/cycle), whether there is a loop stream buffer (M1: seems no, but most intel processors, yes), the latency of L1 cache, how many loads/stores can be in-flight, etc, etc. These are the things you gotta look at to know the real answer.
It's also worth saying that if Apple were dead set on throughput in this area they could've implemented some non-trivial fusion to improve performance. I don't have an M1 so I can't find out for you (and Apple are steadfast on not documenting anything about the microarchitecture...)
"If both the high and low bits of the same product are required, then the recommended code sequence is [...]. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies."
Iterations: 10000
Instructions: 100000
Total Cycles: 25011
Total uOps: 100000
Dispatch Width: 4
uOps Per Cycle: 4.00
IPC: 4.00
Block RThroughput: 2.5
No resource or data dependency bottlenecks discovered.
That seems like 2.5 cycles per iteration to me (on Zen3).
Tigerlake is a bit worse, at about 3 cycles per iteration, due to running more uOps per iteration, by the looks of it.

For the following loop core (extracted from `clang -O3 -march=znver3`, using trunk (5a8d5a2859d9bb056083b343588a2d87622e76a2)):
.LBB5_2: # =>This Inner Loop Header: Depth=1
mov rdx, r11
add r11, r8
mulx rdx, rax, r9
xor rdx, rax
mulx rdx, rax, r10
xor rdx, rax
mov qword ptr [rdi + 8*rcx], rdx
add rcx, 2
cmp rcx, rsi
jb .LBB5_2

Instruction-recognition logic (DAG analysis, BTW) is harder to implement than a pipelined multiplier. The former is a research project, while the latter was done at the dawn of computing.
How about if there is an instruction between them that does not do arithmetic? (What I'm wondering here is whether the processor recognizes the specific two-instruction sequence, or whether it's something more general, like mul internally producing the full 128 bits, returning the lower 64, and caching the upper 64 bits somewhere, so that if a mulh arrives before something overwrites that cache, it can use it.)
It would be fun to experiment with, for someone who has the hardware. My guess is that swapping the order will make it slower, but adding an independent instruction or two between them probably won't have a measurable effect. It would be fun to try to consistently interrupt the CPU between the two instructions as well somehow, to see if that short-circuits the optimization.
Worse, for some of us when it does finally wake up the monitor, sometimes it wakes it up with all the wrong colors, and rebooting is the only reliable fix. (and before anyone asks, yes, I tried a different HDMI cable)
It's much faster if the monitor has been used recently, though, so I always figured it was the monitor that was causing the delay by going into some deep sleep state?
I don't have any performance issues waking up though.
Have you tried a USB-C/thunderbolt cable/controller tho?
Sometimes this will happen multiple times per page load if I deselect and reselect the password field.
I suspect it’s because I have five monitors and 20 million pixels (actually more as that’s the post-retina resolution).
Rendering an FPS game at 1080p is about 2 million pixels per frame. At 60fps, that's 120 million pixels per second.
What am I missing?
Of course for general tasks it was slower, but I really remember that thing waking up instantly when I raised the lid, every time.
(It's still way faster than the same set of apps on an Intel Mac laptop, where it could sometimes take on the order of 30 seconds to get to a usable desktop after a long sleep. On Intel Macs it seemed more obvious that the GPU was the bottleneck)
I have buggy apps (like Facebook Messenger) locking up, but I guess that's normal, I just uninstall them.
Maybe desktop platforms sleep differently than laptops?
I do occasionally have an issue where the brightness on the built in display is borked and won’t adjust back to the correct level for anywhere between 30s to a few minutes.
And then I don’t know if it’s my monitor or the M1, but sometimes there will be a messed up run of consecutive pixel columns about 1/10th of the screen wide starting about 30% from the left of the display. The entire screen in that region is shifted a few pixels upwards. Sometimes it’s hard to notice it but once you do it can’t be unseen. Replugging the monitor into the M1 resolves the issue.
Because Apple has a lot of capital and they wouldn’t need to compete as hard for their share of tsmc production capacity.
Even then, does Apple use enough chips to justify running a fab, let alone one that would be locked into the node of its time? I really don't see it happening, for many reasons; the only reason they would is some tax-break incentive to onshore some of the money they hold offshore, so that it pays for itself, win or fail.
I don't think that's relevant anymore. My understanding is that the 2017 TCJA required prior unrepatriated earnings to be recognized and taxed over eight years (so still ongoing) and future foreign earnings not subject to US tax (except if the foreign tax is below the corporate alternative minimum tax rate). As a result of those changes, there's no need to hold cash offshore.
https://www.apple.com/shop/buy-mac/macbook-air
https://www.amazon.com/Lenovo-IdeaPad-Laptop-Newest-Display/...