Looking at the disassembly screenshots in the article, the total runtime for the benchmark doesn’t appear to have decreased by much. “Time per op” has halved in max_lol(), but the total number of ops being performed has likely increased too; specifically, extra work was done “for free” (as was shown in min_max).
This experiment is showing us that the compiler is in fact doing exactly what we want - maximizing throughput in the face of a stall by pipelining!
In this experiment, maxV is potentially being written to with each iteration of the loop. Valid execution of the next iteration requires us to wait on that updated value of maxV. This comparison and write takes longer than just running an instruction - it’s a stall point.
In the first profile, the compare instruction gets full credit for all the time the CPU is stalled waiting on that value to be written - there’s nothing else it can be doing at the time.
In the other profiles, we see a more “honest” picture of how long the comparison takes. After the compare instruction is done, we move on to other things which DON’T rely on knowing maxV - printing “lol”, or doing the other compare and write for minV.
I propose that the processor is doing our added work on every iteration of the loop, no matter what (and why not? Nothing else would be happening in that time). That BLT instruction isn’t making things faster, it’s just deciding whether to throw away the results of our extra work or keep them.
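A minimal Go sketch of the two loop shapes being discussed (the function names and the exact placement of the print are my reconstruction from the thread, not the article's code):

```go
package main

import "fmt"

// maxPlain: each iteration's compare may depend on the maxV written by
// the previous one; compiled as a conditional move, the whole loop
// becomes one serial dependency chain.
func maxPlain(nums []int) int {
	maxV := nums[0]
	for _, v := range nums {
		if v > maxV {
			maxV = v
		}
	}
	return maxV
}

// maxLol: the extra statement on the not-taken path pushes the compiler
// toward a real branch (BLT on ARM64), letting later iterations run
// speculatively while the update is still in flight.
func maxLol(nums []int) int {
	maxV := nums[0]
	for _, v := range nums {
		if v > maxV {
			maxV = v
			continue
		}
		fmt.Print("") // stand-in for the article's print("lol")
	}
	return maxV
}

func main() {
	data := []int{3, 1, 4, 1, 5, 9, 2, 6}
	fmt.Println(maxPlain(data), maxLol(data)) // prints: 9 9
}
```

Both versions compute the same result; the interesting difference is only in what the compiler emits for the `if`.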
Throughput is important, but not always the same thing as speed. It’s good to keep that in mind with metrics like ops/time, particularly if the benchmarking tool tries to blame stuff like cache misses or other stalls on a (relatively) innocent compare instruction!
> Following standard practice, I use the benchstat tool to compare their speeds
That tool (at least as used here) would be suitable for comparing execution of the same code between two processors with the same architecture. For comparing different programs on the same architecture, you need a different tool that focuses on total execution time.
Update: It seems to be the conditional move, see https://news.ycombinator.com/item?id=37245325
It seems that this is mostly luck in a strange situation. And of course if you ever hit the `print()` it will be way slower than not. You can probably do better by adding something like a `__builtin_expect(...)` intrinsic in the right place to be more explicit about what the goal is here.
There is branch prediction around the length of loops. This is a case where the processor is not able to accurately predict how long it needs to stay in the loop. The BLT instruction changes the prediction model, causing the processor to be more likely to assume the loop will continue.
Honestly, though, worrying about this level of optimization is generally silly. If you're looping through an array often enough that optimizing the code this way is worth your time, you should use a data structure that automatically maintains the max (and min) values for fast retrieval.
This sounds... wrong? Unless ARM64 is designed in an absurd way?
I'd love to see the full disassembly; something seems funny here. If it was x86 I would say it's a conditional move causing this, but I don't know what's going on on ARM.
If you are interested in this sort of thing, check out comp.arch!
The reason I think this is: most languages target C or LLVM, and both have fundamentally lossy compilation processes.
To get around this, you'd need a hodgepodge of precompiler directives, or to take a completely different approach.
I found a cool project that uses a "Tower of IRs" to re-establish source-to-binary provenance, which seems to me to be on the right track:
https://github.com/trailofbits/vast
I'd definitely like to see the compilation processes be more transparent and easy to work with.
But these kinds of tricks feel like we need to con the compiler into optimising this correctly, which is of course ridiculous. What we probably need instead is if-statements where we can tell the compiler which branch is most likely.
Something like:
    if v > maxV predict true
        maxV = v
        continue

So really you end up having to make assumptions about the input to get the performance boost.
Unfortunately, the issue here is that the performance depends on the input and so such hints wouldn't help (unless you knew a-priori you were dealing with mostly-sorted data). Presumably the min-max (and lol) versions perform worse for descending arrays?
In the original, it essentially faces

    if <cond>:
        mov
    else:
        noop

Considering these branches unpredictable, it generates a CMOV. With

    if <cond>:
        mov
    else:
        print

it now considers the first branch hot and the second cold, and thus branch prediction valuable, and generates a branch instead. Turns out that for this use case choice (1) is a misfire, as the branch is extremely predictable, so all the conditional move does is create an unnecessary data dependency.
It’s not necessarily the wrong choice in general: for unpredictable branches a cmov is usually a huge win, incurring a cycle or two of latency but saving 15+ cycles of penalty on a mispredicted branch (which, if the prediction only works half the time, averages 7.5 cycles per iteration).
You can find older posts which demonstrate that side of the coin e.g. https://owen.cafe/posts/six-times-faster-than-c/
Happens to be extremely predictable for this data. In general, over all possible inputs, it’s not extremely predictable.
If you assume all inputs are different (not something the compiler can assume, of course) the probability of having to update the max value goes down from 1 for the first iteration to 1/n for the last, so, possibly, the loop should be split into two halves. Go through the start of the sequence assuming the value needs updating more often than not, and switch to one where it doesn’t at some point.
For truly large inputs you could even add heuristics looking at how much room there is above the current maximum (in the limit, if you’ve found MAX_INT, you don’t have to look further)
Sorting programs used to have all kinds of such heuristics (and/or command line arguments) trying to detect whether input data already is mostly sorted/mostly reverse sorted, how many different keys there are, etc. to attempt avoiding hitting worst case behavior, but I think that’s somewhat of a lost art
Edit to add: does removing it make any difference?
As other comments point out, this construct can be replaced by a cmov instruction:

    if a > b:
        b = a

The following construct, however, cannot be replaced by cmov:

    if a > b:
        b = a
        continue

Only by first eliminating the pointless "continue" is the replacement valid. But by including it, you can make it look like it's the 'print("lol")' that makes the difference, which is only true lexically.

https://godbolt.org/z/ds1raTYc9
https://godbolt.org/z/rbWsxM83b
The `print("lol")` output looks remarkably different.
It is there only to match the continue in the second code sample, where it is needed.
I'm curious if the performance difference noted in the article happens on Intel/AMD as well...
Apparently the Go toolchain has its own assembly language which partially abstracts away some architectural differences: https://go.dev/doc/asm
I wonder what the advantages are? It feels like as soon as you move away from the basics, the architecture-specific differences will negate most usefulness of the abstraction.
EDIT: found the document talking about changing the linker: https://docs.google.com/document/d/1D13QhciikbdLtaI67U6Ble5d... . Favorite quote:
> The original linker was also simpler than it is now and its implementation fit in one Turing award winner’s head, so there’s little abstraction or modularity. Unfortunately, as the linker grew and evolved, it retained its lack of structure, and our sole Turing award winner retired.
...which is referring to Ken Thompson I guess.
My guess is some kind of measurement error or one of the "load bearing nop" phenomena. By that I mean the alignment of instructions (esp branch targets?) can dramatically affect performance and compilers apparently have rather simplistic models for this or don't consider it at all.
> We don't want to complicate the language
So I can understand if this complicates the implementation but I don't know if totally optional pragmas or annotations complicates the language itself. Like C has this but I don't think people say "Ah C is alright but the pragmas are a bit confusing and make things complicated".
> experience shows that programmers are often mistaken as to whether branches are likely or not
Your average programmer may mess that up, but those who would give optimisation hints aren't quite your average programmer. And insisting on introducing PGO to your build process (so build, run-with-profile, rebuild-with-profile) for some cases where someone isn't mistaken as to whether branches are likely (or whether some loops run minimum X times, etc) feels a bit needless.
Please remember though that I'm neither a Go programmer nor contributor so I'm really just an outsider looking in, it could be that this is a total non-issue or is really low-priority.
Ask me about that one time I optimized code that was deadlocking because the Go compiler managed to not insert a Gosched[1] call into a loop transforming data that took ~30 minutes or so. The solution could've been to call Gosched, but optimizing the loop to a few seconds turned out to be easier.
I assume the inverse - the go compiler adding too many Goscheds - can happen too. It's not that expensive - testing a condition - but if you do that a few million times, things add up.
-32.31% geomean across the different tests looks rather great. Any ideas how to make it even faster?
That could be entirely branchless, right?
[1] https://stackoverflow.com/questions/40196817/what-is-the-ins...
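On the "entirely branchless" question: a max update can be written with no branch at all in Go itself, selecting via the sign bit. This is a sketch only; it misbehaves if a-b overflows, and the compiler may already emit a csel/cmov for the plain if-statement anyway:

```go
package main

import (
	"fmt"
	"math/bits"
)

// branchlessMax returns the larger of a and b without a comparison
// branch, by masking with the sign bit of a-b. Caveat: a-b must not
// overflow, so this suits values of moderate magnitude only.
func branchlessMax(a, b int) int {
	diff := a - b
	// Arithmetic right shift: mask is all 1s when a < b, all 0s otherwise.
	mask := diff >> (bits.UintSize - 1)
	return a - (diff & mask) // yields a, or a-(a-b) = b when a < b
}

func main() {
	fmt.Println(branchlessMax(3, 5), branchlessMax(5, 3)) // prints: 5 5
}
```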
Case in point, I'm slowly being replaced by Salesforce muppets for all my projects at work. They're little code monkeys with amazon ebook type knowledge, projects cost 20x more, and I look like the mad scientist for speaking the truth. The products are worse by every possible metric, I'm not crazy. The politics at play is the reason why I'm losing ground, not logic.
Cabinet designers are being replaced by Ikea flat pack artists in the software world. All we can do is stand by and watch.
And in regards to this blog, when Medium eventually go, that knowledge will go too. Blogs have died, personal websites as well, and their ability to be found in Google is almost non-existent.
Sorry I don't have anything more positive to add, except maybe that they're still there, slowly being alienated by the modern tech world!
Partly because that's often not what we're supposed to do; the stuff under the hood "just works" and we're meant to use it to write features, not worry about optimising the stuff that happens under the hood.
And partly it's because the stuff under the hood is increasingly weird and bizarre. Branch prediction is weird, and I still don't understand why that extra print statement changes the branch prediction. Why does it predict `v > maxV` is true when the alternative is to print something, but it doesn't predict that when the alternative is to do nothing?
Is it because printing is expensive, and therefore the branch predictor is going to strongly prefer avoiding that? It's weird that we'd basically have to deceive our code into compiling into a more performant form.
I don't want to have to second guess the compiler.
But under the hood there is a hood. And under that hood there is another hood. And under that hood there is a brand new car you don't know how to open the hood, and so on.
I cannot devote my life to know everything or I won't be able to provide for my family.
That analogy doesn't really work if the new projects "cost 20x more" upfront, unless you mean long-term associated costs.
My grouchy mother in law also laments that they teach typing in school and not handwriting.
Lamentable? I guess. But it’s not realistic to hope people just learn more every generation.
Despite the tradition of “kids these days with their frameworks and libraries!” complaining, you want to see this evolution.
Well, layering abstraction is the most effective way to deal with complexity. Our brain is too limited to know everything.
The people who understand "under the hood" probably won't know much about what's "above the hood", like writing a website or mobile app. Therefore, we need experts on every layer.
You encounter far, far more dead-ends than anyone ever admits, and every unsolved mystery is a mild nerd snipe, an open case; years from now you'll see someone else explain something and realise it answers that question from years prior.
For me, the hard bit is not over-indexing on this... you learn things, but biasing too much toward them is a surefire way to over-engineer, or to increase complexity to the point where something is now worse because of what you know. But once in a while that tiny thing you learned years before is a 20% savings across the board, with an associated performance increase and everyone wondering how on Earth you could possibly have made those jumps.
Also related... incidents. "Why" and "I'm going to find out" is the best way these things don't recur in future. A high degree of observation and understanding is a happy engineer life as it can improve what can often be the most stressful parts of the work (on-call, etc).
That XKCD comic about everyone learning something for the first time factors too... there is stuff you know that others do not, share it.
For me that means that in our world of computers there is infinite curiosities to discover. (Not that the same isn't true for the natural world too)
And it drives me nuts dealing with people who don't think this way. I'm not a jerk about it, but my personality is "let's see what we can figure out about the why".
These skills stay with you, and if you read articles like this then you can keep broadly up-to-date with the insanity that is current CPUs. Things like pipelines, branch prediction, and different levels of caching are optimisations that you can learn about as you go.
If you're an auto-didact web developer then you never have the opportunity to learn these skills, or the need to do so.
I know a lot of people who are comfortable with doing this, but in my case it's a generational thing. If you want to do it then you can. It's not hard, it's just a different skill from those you already have, though sufficiently related that you wouldn't be starting from scratch.
But starting with modern CPUs can be hard. Learning the basics from older, simpler CPUs can help. Doing some kind of embedded programming might be the way to get started, or working on an emulator.
As always, YMMV.
Not sure if it's still asked, but this reminds me of a classic interview question: "When a user presses a button on a web page, describe in as much detail as possible what happens."
That being said, of all the people at all of the tech companies I've worked at, maybe ~5% of them had this sort of mentality and drive to execute on it.
And remember Spectre and Meltdown? Security vulnerabilities caused by branch prediction. If I recall correctly, the pipeline was executing code it wasn't meant to execute because it's executing it before it knows the result of the check that decides if it has to execute it.
Programming is a lot easier if the actual control flow is as linear as I'm writing it.
My broad takeaway of the whole ordeal is that I'm basically avoiding if-statements these days. I feel like I can't trust them anymore.
Seriously, that threw me. Maybe it makes sense in this context, but it seems strange for someone with such apparent depth of technical knowledge to lean on an LLM for anything.
I don't see any assembly here. The analysis is done with a profiler, a very common tool available for most programming languages.
https://timotijhof.net/wp-content/uploads/2020_profiling_fig...
People who tend to care tend to talk about what they care about. So, you wind up getting a lot of blogs who sort of self select themselves to do this kind of stuff. And everyone feeds off that energy and gets better :)
If you visit #c / #asm on any popular IRC network, you'll find a lot of skilled people that can do this routinely.
Also, if you read the other subthreads, there are a number of points to be criticized about this writeup.