But I disagree that the three sequences are actually identical in semantics: the ones containing adds and xors will also affect the flags, while xlat, and the movs that do the arithmetic in the addressing mode, don't.
The other thing to note is that pushes and pops are essentially free despite containing both a memory access and arithmetic --- I believe they added a special "stack engine" to make this fast starting with the P6.
I remember benchmarking AAD/AAM and they were basically exactly the same as the longer equivalent sequences, although that was on a 2nd-generation i7. The (relative) timings do change a little between CPUs, but Intel seems to keep optimising these instructions each generation, so they're never all that much slower. It would be interesting to see this benchmark run on some other CPU models (e.g. AMD's, which tend to have very different relative timings, or something like an Atom or even NetBurst).
But the store-then-load pattern is optimised by the store buffers, which do store-forwarding: the result of the in-flight store is forwarded to the load without having to go through the L1 cache.
It's not quite free: you still have to complete the store (the CPU can't assume optimising away a stack push is safe unless the slot is actually overwritten), and there is still a 4-cycle forwarding latency, though that probably isn't an issue thanks to out-of-order execution.
There is a stack engine. But memory accesses and arithmetic are free even without it!
I do wonder, though, if there could still be some gems hidden deep in the legacy instructions that compilers could make use of for some very peculiar algorithms.
The third panel is generally meant to be the correct technical answer, while the last panel is reserved for the punchline.
Understanding the 'galaxy brain' format might have saved the author the trouble (or at least set proper expectations), although it was a cool exercise.
Some bad examples I found on Google:
https://i.redd.it/j0wwzqe2287z.jpg
https://in.pinterest.com/pin/366128644701746892/
The x86 comic may or may not count, depending on whether you expect the reader to know that using those sorts of legacy instructions is not actually an improvement…
If space efficiency (or fitting in cache) is important, then this instruction being more compact but having worse execution performance could be a good tradeoff!
The meme gets used in a number of similar but different ways. Sometimes the last panel is the sequence taken to a logical but unrealistic extreme.
Nah, it's rather that the meme is deliberately absurd, exactly as intended.
For more information on supported addressing modes, see the manual: https://www.intel.com/content/www/us/en/develop/download/int... (specifically volume 1, section 3.7.5)
movzx eax,al ; could also do "and eax,0ffh"
mov al,[rbx+rax]

Is there a way Intel could expose the microcode and its commands to the outside world, so compilers could target them directly instead of the x86 instruction set?
If yes, would there be anything to gain or lose?
The disadvantage is of course that this is complex to do in silicon, and the CPU might lack some insights that the compiler had. As I understand it Itanium was HP's and Intel's attempt to give a lot more power to the compiler, with an instruction set that better matches what's going on under the hood. But we all know how that ended: performance was lackluster and the Itanic was nothing but a waste of money for everyone involved.
GPUs have successfully moved the microcode translation one layer up, you generally compile to an intermediate ISA (let's call it a bytecode) and when you load the program (or shader) the GPU driver translates it to GPU-specific instructions. But that model doesn't easily translate to CPUs.
Maybe, but MOV is still MOV, so Intel is, for the most part, simply using a subset of x86 (or AMD64) instructions internally. Except for a few proprietary operations used to implement the more complex instructions, most simple instructions are implemented as-is and just passed through anyway.
> If yes, would there be anything to gain or lose?
Gains: Very slightly faster performance (cutting out a translation step is always nice, but realistically it doesn't matter unless you're doing supercomputer stuff).
Losses: It's pretty much like the kernel land of Linux or NT's undocumented functions: subject to change and completely unsupported. Also, it can't be done on the current CPU families anyway, since the microcode can't be updated in a way that would make it worthwhile.
Moreover, Intel (and, I assume, AMD) will take a sequence of micro-ops corresponding to a sequence of instructions and optimize the micro-op sequence based on dynamic usage, together with an "undo" for when the usage assumptions turn out to be wrong.
Sometimes it's not about being faster, sometimes it's about taking up less space. The graphic doesn't say what it's aiming for, and based on what I see in the graphic, the 4th panel seems to take up the least space.
It's an idiomatic way to populate a register with the value zero.
Not sure if it's still true, but IIRC it took fewer cycles than the more obvious "load #0 into $rcx" instruction.
48 31 c9             xor rcx,rcx
48 c7 c1 00 00 00 00 mov rcx,0x0

(This is even shorter: 31 c9 xor ecx,ecx)

(Neighbor's got it, and I am as unsure of the contemporary relevance as they are.)
xlatb is looking better here. There are also some front-end concerns that may favor xlatb, in particular if it's friendlier to the decoder. xlat is also fewer µops, so it takes up less of the µop cache once decoded.
I honestly don't know anything about this stuff, but the title is awesome.