But I disagree that the three sequences are actually identical in semantics: the ones containing adds and xors will also affect the flags, while xlat, and the movs that do the arithmetic in the addressing mode, don't.
The other thing to note is that pushes and pops are essentially free despite containing both a memory access and arithmetic --- I believe they added a special "stack engine" to make this fast starting with the P6.
I remember benchmarking AAD/AAM and they were basically exactly the same as the longer equivalent sequences, although that was on a 2nd-generation i7. The (relative) timings do change a little between CPUs, but Intel seems to keep optimising these instructions each generation, so they're never all that much slower. It would be interesting to see this benchmark run on some other CPU models (e.g. AMD's, which tend to have very different relative timings, or something like an Atom or even NetBurst).
But the store-then-load pattern is optimised by the store buffers, which do store-forwarding: the result of the in-flight store is forwarded to the load without having to go through the L1 cache.
It's not quite free: you still have to complete the store (the CPU can't assume optimising away a stack push is safe unless the slot is actually overwritten), and there is still a 4-cycle forwarding latency, though that probably isn't an issue thanks to out-of-order execution.
There is a stack engine. But memory accesses and arithmetic are free even without it!
I do wonder, though, if there could still be some gems hidden deep in the legacy instructions that compilers could make use of for some very peculiar algorithms.
The third panel is generally meant to be the correct technical answer, while the last panel is reserved for the punchline.
Understanding the 'galaxy brain' format might have saved the author the trouble (or at least set proper expectations), although it was a cool exercise.
Some bad examples I found on Google:
https://i.redd.it/j0wwzqe2287z.jpg
https://in.pinterest.com/pin/366128644701746892/
The x86 comic may or may not count, depending on whether you expect the reader to know that using those sorts of legacy instructions is not actually an improvement…
If space efficiency (or fitting in cache) is important, then this instruction being more compact but having worse execution performance could be a good tradeoff!
The meme gets used in a number of similar but different ways. Sometimes the last panel is the sequence taken to a logical but unrealistic extreme.
Nah, it's rather that the meme is deliberately absurd, exactly as intended.
For more information on supported addressing modes, see the manual: https://www.intel.com/content/www/us/en/develop/download/int... (specifically volume 1, section 3.7.5)
movzx eax,al ; could also do "and eax,0ffh"
mov al,[rbx+rax]

Is there a way Intel could expose the microcode and its commands to the outside world, so compilers could target them directly instead of the x86 instruction set?
If yes, would there be anything to gain or lose?
The disadvantage is of course that this is complex to do in silicon, and the CPU might lack some insights that the compiler had. As I understand it Itanium was HP's and Intel's attempt to give a lot more power to the compiler, with an instruction set that better matches what's going on under the hood. But we all know how that ended: performance was lackluster and the Itanic was nothing but a waste of money for everyone involved.
GPUs have successfully moved the microcode translation one layer up, you generally compile to an intermediate ISA (let's call it a bytecode) and when you load the program (or shader) the GPU driver translates it to GPU-specific instructions. But that model doesn't easily translate to CPUs.
Maybe, but MOV is still MOV, so Intel is, for the most part, simply using a subset of x86 (or AMD64) instructions internally. Except for a few proprietary operations used to implement the more complex instructions, most simple instructions are implemented as-is and just passed through anyway.
> If yes, would there be anything to gain or lose?
Gains: Very slightly faster performance (cutting out a translation step is always nice, but realistically it doesn't matter unless you're doing supercomputer stuff).
Losses: It's pretty much like the kernel land of Linux or NT's undocumented functions: subject to change and completely unsupported. Also, it can't be done on the current CPU families anyway, since the microcode can't be updated in a way that would make it worthwhile.
Moreover, Intel (and, I assume, AMD) will take a sequence of micro-ops corresponding to a sequence of instructions and optimize the micro-op sequence based on dynamic usage, together with an "undo" for when the usage assumptions turn out to be wrong.
Sometimes it's not about being faster, sometimes it's about taking up less space. The graphic doesn't say what it's aiming for, and based on what I see in the graphic, the 4th panel seems to take up the least space.
It's an idiomatic way to populate a register with the value zero.
Not sure if it's still true, but IIRC it took fewer cycles than the more obvious "load #0 into $rcx" instruction.
48 31 c9             xor rcx,rcx
48 c7 c1 00 00 00 00 mov rcx,0x0

(This is even shorter: 31 c9 xor ecx,ecx)

(Neighbor's got it, and I am as unsure of the contemporary relevance as they are.)
xlatb is looking better here. There are also some front-end concerns that may favor xlatb, in particular if it's friendlier to the decoder. xlat is also fewer µops, so it takes up less of the µop cache once decoded.
I honestly don't know anything about this stuff, but the title is awesome.