Looking at the disassembly screenshots in the article, the total runtime for the benchmark doesn’t appear to have decreased by much. “Time per op” has halved in max_lol(), but the total number of ops being performed has likely increased too; specifically, extra work was done “for free” (as was shown in min_max).
This experiment is showing us that the compiler is in fact doing exactly what we want - maximizing throughput in the face of a stall by pipelining!
In this experiment, maxV is potentially being written to with each iteration of the loop. Valid execution of the next iteration requires us to wait on that updated value of maxV. This comparison and write takes longer than just running an instruction - it’s a stall point.
In the first profile, the compare instruction gets full credit for all the time the CPU is stalled waiting on that value to be written - there’s nothing else it can be doing at the time.
In the other profiles, we see a more “honest” picture of how long the comparison takes. After the compare instruction is done, we move on to other things which DON’T rely on knowing maxV - printing “lol”, or doing the other compare and write for minV.
I propose that the processor is doing our added work on every iteration of the loop, no matter what (and why not? Nothing else would be happening in that time). That BLT instruction isn’t making things faster, it’s just deciding whether to throw away the results of our extra work or keep them.
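A minimal Go sketch of the two loop shapes being discussed (the function names and the exact placement of the print are my reconstruction from the thread, not the article's code):

```go
package main

import "fmt"

// maxPlain: each iteration's compare may depend on the maxV written by
// the previous one; compiled as a conditional move, the whole loop
// becomes one serial dependency chain.
func maxPlain(nums []int) int {
	maxV := nums[0]
	for _, v := range nums {
		if v > maxV {
			maxV = v
		}
	}
	return maxV
}

// maxLol: the extra statement on the not-taken path pushes the compiler
// toward a real branch (BLT on ARM64), letting later iterations run
// speculatively while the update is still in flight.
func maxLol(nums []int) int {
	maxV := nums[0]
	for _, v := range nums {
		if v > maxV {
			maxV = v
			continue
		}
		fmt.Print("") // stand-in for the article's print("lol")
	}
	return maxV
}

func main() {
	data := []int{3, 1, 4, 1, 5, 9, 2, 6}
	fmt.Println(maxPlain(data), maxLol(data)) // prints: 9 9
}
```

Both versions compute the same result; the interesting difference is only in what the compiler emits for the `if`.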
Throughput is important, but not always the same thing as speed. It’s good to keep that in mind with metrics like ops/time, particularly if the benchmarking tool tries to blame stuff like cache misses or other stalls on a (relatively) innocent compare instruction!
> Following standard practice, I use the benchstat tool to compare their speeds
That tool (at least as used here) would be suitable for comparing execution of the same code between two processors with the same architecture. For comparing different programs on the same architecture, you need a different tool that focuses on total execution time.
Update: It seems to be the conditional move, see https://news.ycombinator.com/item?id=37245325
It seems that this is mostly luck in a strange situation. And of course if you ever hit the `print()` it will be way slower than not. You can probably do better by adding something like a `__builtin_expect(...)` intrinsic in the right place to be more explicit about what the goal is here.
There is branch prediction around the length of loops. This is a case where the processor is not able to accurately predict how long it needs to stay in the loop. The BLT instruction changes the prediction model, causing the processor to be more likely to assume the loop will continue.
Honestly, though, worrying about this level of optimization is generally silly. If you're looping through an array often enough that optimizing the code this way is worth your time, you should use a data structure that automatically maintains the max (and min) values for fast retrieval.
This sounds... wrong? Unless ARM64 is designed in an absurd way?
I'd love to see the full disassembly; something seems funny here. If it was x86 I would say it's a conditional move causing this, but I don't know what's going on on ARM.
If you are interested in this sort of thing, check out comp.arch!
The reason I think this is: most languages target C or LLVM, and both have fundamentally lossy compilation processes.
To get around this, you'd need a hodgepodge of precompiler directives, or to take a completely different approach.
I found a cool project that uses a "Tower of IRs" to re-establish source-to-binary provenance, which seems to me to be on the right track:
https://github.com/trailofbits/vast
I'd definitely like to see the compilation processes be more transparent and easy to work with.
But these kinds of tricks feel like we need to con the compiler into optimising this correctly, which is of course ridiculous. What we probably need instead is if-statements where we can tell the compiler which branch is most likely.
Something like:
    if v > maxV predict true
        maxV = v
        continue

So really you end up having to make assumptions about the input to get the performance boost.
Unfortunately, the issue here is that the performance depends on the input and so such hints wouldn't help (unless you knew a-priori you were dealing with mostly-sorted data). Presumably the min-max (and lol) versions perform worse for descending arrays?
In the original, it essentially faces

    if <cond>:
        mov
    else:
        noop

Considering these branches unpredictable, it generates a CMOV. With

    if <cond>:
        mov
    else:
        print

it now considers the first branch hot and the second cold, and thus branch prediction valuable, and generates a branch instead. Turns out that for this use case choice (1) is a misfire, as the branch is extremely predictable, so all the conditional move does is create an unnecessary data dependency.
It’s not necessarily the wrong choice in general: for unpredictable branches a cmov is usually a huge win, incurring a cycle or two of latency but saving 15+ cycles of penalty on a mispredicted branch (which, if the prediction only works half the time, averages 7.5 cycles per iteration).
You can find older posts which demonstrate that side of the coin e.g. https://owen.cafe/posts/six-times-faster-than-c/
Happens to be extremely predictable for this data. In general, over all possible inputs, it’s not extremely predictable.
If you assume all inputs are different (not something the compiler can assume, of course) the probability of having to update the max value goes down from 1 for the first iteration to 1/n for the last, so, possibly, the loop should be split into two halves. Go through the start of the sequence assuming the value needs updating more often than not, and switch to one where it doesn’t at some point.
For truly large inputs you could even add heuristics looking at how much room there is above the current maximum (in the limit, if you’ve found MAX_INT, you don’t have to look further)
Sorting programs used to have all kinds of such heuristics (and/or command line arguments) trying to detect whether input data already is mostly sorted/mostly reverse sorted, how many different keys there are, etc. to attempt avoiding hitting worst case behavior, but I think that’s somewhat of a lost art
Edit to add: does removing it make any difference?
As other comments point out, this construct can be replaced by a cmov instruction:

    if a > b:
        b = a

The following construct, however, cannot be replaced by cmov:

    if a > b:
        b = a
        continue

Only by first eliminating the pointless "continue" is the replacement valid. But by including it, you can make it look like it's the 'print("lol")' that makes the difference, which is only true lexically.

https://godbolt.org/z/ds1raTYc9
https://godbolt.org/z/rbWsxM83b
The `print("lol")` output looks remarkably different.
It is there only to match the continue in the second code sample, where it is needed.
I'm curious if the performance difference noted in the article happens on Intel/AMD as well...
Apparently the Go toolchain has its own assembly language which partially abstracts away some architectural differences: https://go.dev/doc/asm
I wonder what the advantages are? It feels like as soon as you move away from the basics, the architecture-specific differences will negate most usefulness of the abstraction.
EDIT: found the document talking about changing the linker: https://docs.google.com/document/d/1D13QhciikbdLtaI67U6Ble5d... . Favorite quote:
> The original linker was also simpler than it is now and its implementation fit in one Turing award winner’s head, so there’s little abstraction or modularity. Unfortunately, as the linker grew and evolved, it retained its lack of structure, and our sole Turing award winner retired.
...which is referring to Ken Thompson I guess.
My guess is some kind of measurement error or one of the "load bearing nop" phenomena. By that I mean the alignment of instructions (esp branch targets?) can dramatically affect performance and compilers apparently have rather simplistic models for this or don't consider it at all.
> We don't want to complicate the language
So I can understand if this complicates the implementation but I don't know if totally optional pragmas or annotations complicates the language itself. Like C has this but I don't think people say "Ah C is alright but the pragmas are a bit confusing and make things complicated".
> experience shows that programmers are often mistaken as to whether branches are likely or not
Your average programmer may mess that up, but those who would give optimisation hints aren't quite your average programmer. And insisting on introducing PGO to your build process (so build, run-with-profile, rebuild-with-profile) for some cases where someone isn't mistaken as to whether branches are likely (or whether some loops run minimum X times, etc) feels a bit needless.
Please remember though that I'm neither a Go programmer nor contributor so I'm really just an outsider looking in, it could be that this is a total non-issue or is really low-priority.
Ask me about that one time I optimized code that was deadlocking because the Go compiler managed to not insert a Gosched[1] call into a loop transforming data that took ~30 minutes or so. The solution could've been to call Gosched, but optimizing the loop to a few seconds turned out to be easier.
I assume the inverse - the go compiler adding too many Goscheds - can happen too. It's not that expensive - testing a condition - but if you do that a few million times, things add up.
-32.31% geomean across the different tests looks rather great. Any ideas how to make it even faster?
That could be entirely branchless, right?
[1] https://stackoverflow.com/questions/40196817/what-is-the-ins...
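On the "entirely branchless" question: a max update can be written with no branch at all in Go itself, selecting via the sign bit. This is a sketch only; it misbehaves if a-b overflows, and the compiler may already emit a csel/cmov for the plain if-statement anyway:

```go
package main

import (
	"fmt"
	"math/bits"
)

// branchlessMax returns the larger of a and b without a comparison
// branch, by masking with the sign bit of a-b. Caveat: a-b must not
// overflow, so this suits values of moderate magnitude only.
func branchlessMax(a, b int) int {
	diff := a - b
	// Arithmetic right shift: mask is all 1s when a < b, all 0s otherwise.
	mask := diff >> (bits.UintSize - 1)
	return a - (diff & mask) // yields a, or a-(a-b) = b when a < b
}

func main() {
	fmt.Println(branchlessMax(3, 5), branchlessMax(5, 3)) // prints: 5 5
}
```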
Case in point, I'm slowly being replaced by Salesforce muppets for all my projects at work. They're little code monkeys with amazon ebook type knowledge, projects cost 20x more, and I look like the mad scientist for speaking the truth. The products are worse by every possible metric, I'm not crazy. The politics at play is the reason why I'm losing ground, not logic.
Cabinet designers are being replaced by Ikea flat pack artists in the software world. All we can do is stand by and watch.
And in regards to this blog, when Medium eventually go, that knowledge will go too. Blogs have died, personal websites as well, and their ability to be found in Google is almost non-existent.
Sorry I don't have anything more positive to add, except maybe that they're still there, slowly being alienated by the modern tech world!
Partly because that's often not what we're supposed to do; the stuff under the hood "just works" and we're meant to use it to write features, not worry about optimising the stuff that happens under the hood.
And partly it's because the stuff under the hood is increasingly weird and bizarre. Branch prediction is weird, and I still don't understand why that extra print statement changes the branch prediction. Why does it predict `v > maxV` is true when the alternative is to print something, but it doesn't predict that when the alternative is to do nothing?
Is it because printing is expensive, and therefore the branch predictor is going to strongly prefer avoiding that? It's weird that we'd basically have to deceive our code into compiling into a more performant form.
I don't want to have to second guess the compiler.
But under the hood there is a hood. And under that hood there is another hood. And under that hood there is a brand new car you don't know how to open the hood, and so on.
I cannot devote my life to know everything or I won't be able to provide for my family.
That analogy doesn't really work if the new projects "cost 20x more" upfront, unless you mean long-term associated costs.
My grouchy mother in law also laments that they teach typing in school and not handwriting.
Lamentable? I guess. But it’s not realistic to hope people just learn more every generation.
Despite the tradition of “kids these days with their frameworks and libraries!” complaining, you want to see this evolution.
Well, layering abstraction is the most effective way to deal with complexity. Our brain is too limited to know everything.
The people who understand "under the hood" probably won't know much about what's "above the hood", like writing a website or mobile app. Therefore, we need experts on every layer.
You encounter far, far more dead-ends than anyone ever admits, and every unsolved mystery is a mild nerd snipe, an open case; years from now you'll see someone else explain something and realise it answers that question from years prior.
For me, the hard bit is not over-indexing on this... you learn things, but biasing too much toward them is a surefire way to over-engineer, or to increase complexity to the point where something is now worse because of what you know. But once in a while that tiny thing you learned years before is a 20% savings across the board, with an associated performance increase and everyone wondering how on Earth you could possibly have made those jumps.
Also related... incidents. "Why" and "I'm going to find out" is the best way these things don't recur in future. A high degree of observation and understanding is a happy engineer life as it can improve what can often be the most stressful parts of the work (on-call, etc).
That XKCD comic about everyone learning something for the first time factors too... there is stuff you know that others do not, share it.
For me that means that in our world of computers there is infinite curiosities to discover. (Not that the same isn't true for the natural world too)
And it drives me nuts dealing with people who don't think this way. I'm not a jerk about it, but my personality is "let's see what we can figure out about the why".
These skills stay with you, and if you read articles like this then you can keep broadly up-to-date with the insanity that is current CPUs. Things like pipelines, branch prediction, and different levels of caching are optimisations that you can learn about as you go.
If you're an auto-didact web developer then you never have the opportunity to learn these skills, or the need to do so.
I know a lot of people who are comfortable with doing this, but in my case it's a generational thing. If you want to do it then you can. It's not hard, it's just a different skill from those you already have, though sufficiently related that you wouldn't be starting from scratch.
But starting with modern CPUs can be hard. Learning the basics from older, simpler CPUs can help. Doing some kind of embedded programming might be the way to get started, or working on an emulator.
As always, YMMV.
Not sure if it's still asked, but this reminds me of a classic interview question: "When a user presses a button on a web page, describe in as much detail as possible what happens."
That being said, of all the people at all of the tech companies I've worked at, maybe ~5% of them had this sort of mentality and drive to execute on it.
And remember Spectre and Meltdown? Security vulnerabilities caused by branch prediction. If I recall correctly, the pipeline was executing code it wasn't meant to execute because it's executing it before it knows the result of the check that decides if it has to execute it.
Programming is a lot easier if the actual control flow is as linear as I'm writing it.
My broad takeaway of the whole ordeal is that I'm basically avoiding if-statements these days. I feel like I can't trust them anymore.
Seriously, that threw me. Maybe it makes sense in this context, but it seems strange for someone with such apparent depth of technical knowledge to lean on an LLM for anything.
I don't see any assembly here. The analysis is done with a profiler, a very common tool available for most programming languages.
https://timotijhof.net/wp-content/uploads/2020_profiling_fig...
People who tend to care tend to talk about what they care about. So, you wind up getting a lot of blogs who sort of self select themselves to do this kind of stuff. And everyone feeds off that energy and gets better :)
If you visit #c / #asm on any popular IRC network, you'll find a lot of skilled people that can do this routinely.
Also, if you read the other subthreads, there are a number of points to be criticized about this writeup.