For modern use, something about ARM CPUs would be much more useful, since that's what most microcontrollers use now. Hardly anyone is doing asm programming on x86 CPUs these days (and certainly not on Pentium 4 CPUs).
For instance Daniel Lemire's blog [1] is quite often featured here, and very often features very low-level performance analysis and improvements.
I take your point, but I think there’s still a fair bit of x86 asm out there. For example, in ffmpeg and zstd.
I don't think that's entirely true...it's still pretty common to write high-performance / performance sensitive computation kernels in assembly or intrinsics.
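To make the point concrete, here is a hypothetical example of the kind of hand-vectorized kernel people still write with intrinsics (the function name and shape are my own, not from any particular library):

```c
#include <immintrin.h>
#include <stddef.h>

/* Horizontal sum of a float array using SSE intrinsics -- the sort of
   small, performance-sensitive kernel that is still hand-tuned today. */
float sum_sse(const float *a, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)                 /* 4 floats per iteration */
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    float s = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; i++)                         /* scalar tail */
        s += a[i];
    return s;
}
```

In projects like ffmpeg the hot loops go further and are written in raw asm, but intrinsics are the common middle ground.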
I'm not familiar with the Pentium, but my guess is that stores are relatively cheaper than loads on many modern (out-of-order) microarchitectures: stores can sit in the store buffer and retire in the background, while a load is often on the critical path of whatever consumes its result.
> (Intermediate)14. Parallelization.
I feel like this is where compilers come into handy, because juggling critical paths and resource pressures at the same time sounds like a nightmare to me
> (Advanced)4. Interleaving 2 loops out of sync
Software pipelining!
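A crude way to get the flavor of this in C (a sketch, not real software pipelining, which a compiler or asm programmer does at the instruction level): run two independent dependency chains "out of sync" so the latency of one chain is hidden by work on the other.

```c
#include <stddef.h>

/* Dot product with two interleaved accumulator chains. s0 and s1 are
   independent, so the FP adds of one chain can execute while the other
   chain is still waiting on its multiply. */
double dot2(const double *x, const double *y, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += x[i]     * y[i];      /* chain 0 */
        s1 += x[i + 1] * y[i + 1];  /* chain 1, one iteration "ahead" */
    }
    if (i < n)
        s0 += x[i] * y[i];          /* odd-length tail */
    return s0 + s1;
}
```

The single-accumulator version serializes every add behind the previous one; splitting the chain is the same idea the article applies to whole loops.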
"The Willamette and Northwood cores contain a 20-stage instruction pipeline. This is a significant increase in the number of stages compared to the Pentium III, which had only 10 stages in its pipeline. The Prescott core increased the length of the pipeline to 31 stages."
https://en.wikipedia.org/wiki/NetBurst
And many of those tricks actually work for long pipelines.
Can someone explain how this can work? Obviously, you can't just multiply the same numbers instead of dividing.
A longer tutorial that goes into more depth: https://homepage.cs.uiowa.edu/~jones/bcd/divide.html
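The short version of the trick: you don't multiply by the same number, you multiply by a precomputed fixed-point reciprocal of a *constant* divisor and take the high bits. A minimal C sketch for dividing by 10 (the magic constant here is ceil(2^35 / 10) = 3435973837, which happens to be exact for all 32-bit inputs):

```c
#include <stdint.h>

/* x / 10 without a divide instruction: multiply by a 35-bit fixed-point
   reciprocal of 10, then shift the fractional bits away. */
uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 3435973837u) >> 35);
}
```

This only works when the divisor is known at compile time (so the reciprocal and shift can be precomputed); compilers do exactly this transformation for constant divisors today.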