How to speed up the Rust compiler one last time (opens in new tab)

(blog.mozilla.org)

319 pointsnnethercote5y ago57 comments

57 comments

__s5y ago

Nicholas Nethercote didn't just speed up Rust. He went in & did the dirty work of dredging through Firefox profiling

> It’s rare that a single micro-optimization is a big deal, but dozens and dozens of them are. Persistence is key

Persistence is work. Mozilla is cutting the people who put in the work of staving off bitrot

nnethercoteOP5y ago

Thank you for the kind words.

To clarify: I am still at Mozilla! But I will be working fully on Firefox for the foreseeable future. I have edited the opening paragraph of the post to make this clearer.

qzw5y ago

Rust's loss is Firefox's gain. I'm sorry to see you leave your Rust work, but I think Mozilla is right to have their best engineers focus on their core product. If we hope to see Firefox survive and remain relevant, then Mozilla really needed to refocus their energies onto it. Also, I assume someone of Nicholas's caliber has a great deal of agency over their own career path, so perhaps a return to Firefox is not entirely unwelcomed by him.

Ygg25y ago

I'd honestly rather see Rust survive, than Mozilla. Mozilla is few years away from going Blink, and on deathbed.

Rust is up and coming language.

2 more replies

exmozilla5y ago

Glad you're still there. I had 10 years of time with Mozilla but was let go with the layoffs. First contribution was 20 years ago. From the folks I still stay in touch with morale is at an all time low but I hope Firefox recovers.

__s5y ago

Glad to hear, as someone who continues to use Firefox it's reassuring to know we'll continue to benefit from your skills

agumonkey5y ago

I still remember your blog entries about chasing memory use in firefox before quantum :)

ndesaulniers5y ago

> The improvements I did are mostly what could be described as “bottom-up micro-optimizations”.

> I also did two larger “architectural” or “top-down” changes

My summer intern started doing profiling work on compile times with clang: https://lists.llvm.org/pipermail/llvm-dev/2020-July/143012.h...

Some things we found:

* for a large C codebase like the Linux kernel, we're spending way more time in the front-end (clang) than the backend (llvm). This was surprising based on rustc's experience with llvm. Experimental patches simplifying header inclusion dependencies in the kernel's sources can potentially cut down on build times by ~30% with EITHER gcc or clang.

* There's a fair amount of low hanging fruit that stands out from bottom up profiling. We've just started fixing these, but the most immediate was 13% of a Linux kernel build recomputing target information for every inline assembly statement in a way that was accidentally quadratic and not being memoized when it could be (in fact, my intern wrote patches to compute these at compile time, even). Fixed in clang-11. That was just the first found+fixed, but we have a good list of what to look at next. The only real samples showing up in the llvm namespace (vs clang) is llvm's StringMap bucket lookup but that's from clang's preprocessor.

* GCC beats the crap out of Clang in compile times of the Linux kernel; we need to start looking for top down optimizations to do less work overall. I suspect we may be able to get some wins out of lazy parsing at the cost of missing diagnostics (warnings and errors) in dead code.

* Don't speculate on what could be slow; profiles will surprise you.

> Using instruction counts to compare the performance of two entirely different programs (e.g. GCC vs clang) would be foolish, but it’s reasonable to use them to compare the performance of two almost-identical programs

Agree. We prefer cycle counts via LBR, but only for comparing diffs of the same program, as you describe.

epage5y ago

> for a large C codebase like the Linux kernel, we're spending way more time in the front-end (clang) than the backend (llvm). This was surprising based on rustc's experience with llvm.

rustc sends large, generally unoptimized chunks to llvm, compared to clang. In Rust, the translation unit is at the crate level, causing llvm to do more analysis. MIR is also still relatively new and I think there is still work to be done doing optimizations in it to get less data sent to llvm.

ndesaulniers5y ago

Clang doesn't do optimizations though. It generates llvm via a simple tree walk.

fluffything5y ago

What's big in C or C++ translation units are header files, but since these mainly contain declarations, and declarations do not require code generation, they don't create any work for the compiler backend.

Rust translation units do not have the header file problem (so the frontend does less work), and they are also much larger in terms of definitions, often spawning multiple files, so there is more for the backend to do per translation unit.

The consequence is that Rust spends more time on LLVM relatively speaking than C and C++.

The solution to this problem in Rust is naively simple: write smaller translation units.

Rust programmers just want to structure their code however their want, and still have good compile-times. Which is kind of the opposite of how C and C++ programmers typically structure their code in the largests projects, because they value faster compile-times over that kind of "ergonomics"/code organization.

1 more reply

cesarb5y ago

> I was surprised by how many people said they enjoyed reading this blog post series. The appetite for “I squeezed some more blood from this stone” tales is high.

There's something satisfying about seeing code get cleaned up and optimized. I also enjoyed following the LibreOffice commits back when they were in their "heavy cleanup" phase after it became clear OpenOffice was dead (which meant they didn't have to worry about diverging from the upstream anymore).

zem5y ago

the early neovim posts were very absorbing too

vlovich1235y ago

> Contrary to what you might expect, instruction counts have proven much better than wall times when it comes to detecting performance changes on CI, because instruction counts are much less variable than wall times (e.g. ±0.1% vs ±3%; the former is highly useful, the latter is barely useful). Using instruction counts to compare the performance of two entirely different programs (e.g. GCC vs clang) would be foolish, but it’s reasonable to use them to compare the performance of two almost-identical programs (e.g. rustc before PR #12345 and rustc after PR #12345). It’s rare for instruction count changes to not match wall time changes in that situation. If the parallel version of the rustc front-end ever becomes the default, it will be interesting to see if instruction counts continue to be effective in this manner.

This is a supremely surprising conclusion, especially in 2020. Is instruction count really still tied to wall clock count? I would have thought that some instructions could be slower than others (especially on x86) so that using more faster individual instructions could be faster than 1 slower instruction. Similarly, cache effects & data dependencies can result in more instructions being faster than fewer instructions.

I think what the author is trying to say is that when evaluating micro-optimizations, cycle counts are pretty valuable still because you're making a small intentional change & evaluating its impact & usually the correlation holds. The dashboard clearly still measures wall-clock since just comparing instruction count over time would be misleading.

I'm curious if the Rust team has evaluated stabilizer to be more robust about the optimizations they choose: https://emeryberger.com/research/stabilizer/

nnethercoteOP5y ago

> This is a supremely surprising conclusion

That's why I started the paragraph with "Contrary to what you might expect".

As for Stabilizer: "Stabilizer eliminates measurement bias by comprehensively and repeatedly randomizing the placement of functions, stack frames, and heap objects in memory." Those placements can affect cycle counts and wall times a lot, but don't affect instruction counts.

vlovich1235y ago

So have you not found in practice any data dependencies or cache issues show up as bottle necks? Or do current tools just make this more of a blind spot for optimization?

Also is there any work to multi-thread the Rust compiler on a more fine-grained level like the recent GCC work? I know you allude to that potentially that would make the instruction counts potentially less reliable so wondering if that's something being explored.

Finally, while I have you, I'm wondering if there's been any exploration of the idea of keeping track of information across builds so that incremental compilation is faster (i.e. only bother recompiling/relinking the parts of the code impacted by a code change). I've always thought that should almost completely eliminate compilation/linking times (at least for debug builds where full utmost optimization is less important).

nnethercoteOP5y ago

I mentioned in the post several areas I myself haven't looked at, including cache misses. There may be room for improvements there.

There is an experimental parallel rustc front-end, e.g. see https://internals.rust-lang.org/t/help-test-parallel-rustc/1...

> any exploration of the idea of keeping track of information across builds so that incremental compilation is faster

That's exactly what incremental compilation does.

1 more reply

JoshuaRLi5y ago

It isn't surprising to me that instruction counts as a performance metric would have less variance than wall time. Did you know that the environment size can actually affect wall time?

> We see that something external and orthogonal to the program, i.e., changing the size (in bytes) of an unused environment variable, can dramatically (frequently by about 33% and once by almost 300%) change the performance of our program. This phenomenon occurs because the UNIX environment is loaded into memory before the call stack. Thus, changing the UNIX environment size changes the location of the call stack which in turn affects the alignment of local variables in various hardware structures.

From https://www.inf.usi.ch/faculty/hauswirth/publications/asplos....

vlovich1235y ago

That's exactly my point! The only thing that matters is wall clock time so if cache layout is making a 300% difference to wall clock, focusing on instructions counts will mislead you on the most impactful optimization to make (i.e. cache layout optimizations won't show up in instruction counts).

And yes. I'm aware of that result because of Professor Berger's talks on Coz & the other work he's done in this space.

JoshuaRLi5y ago

> The only thing that matters is wall clock time so if cache layout is making a 300% difference to wall clock, focusing on instructions counts will mislead you on the most impactful optimization to make

Ahh, very well said!

Twirrim5y ago

> I would have thought that some instructions could be slower than others (especially on x86) so that using more faster individual instructions could be faster than 1 slower instruction..

There are some fun cases where that is definitely true, to whit pdep / pexp on Zen based architectures. https://dolphin-emu.org/blog/2020/02/07/dolphin-progress-rep...

https://twitter.com/uops_info/status/1202950247900684290

> I just ran some tests: the performance seems to depend heavily on the value in the last operand; this is also the case for the register variants. If the last operand is set to -1 (i.e., all bits are 1), the instr. has 518 uops and needs more than 289 cycles!

makomk5y ago

From what I can tell, Zen based architectures have a slow compatibility-only emulation of those two instructions because the fast implementation is patented - and it's patented by a university who've got some kind of deal with Intel involving co-development of the instruction, rather than by Intel themselves, so AMD's patent cross-license doesn't cover it.

hajile5y ago

I'd guess there are instructions (eg, x87) that are simply forbidden from use as being universally slower. Once you account for those and are comparing only desirable instructions, there should be a pretty reasonable correlation barring the occasional edge case like mixing AVX512 lightly with other instructions (due to downclocking).

est315y ago

It's sad to see your rustc contributions stop, nnethercote. I guess rustc now has to run an experiment on how quickly performance improves without you :(.

IMO compiler speed still remains the main ergonomics hurdle in developing Rust software.

steveklabnik5y ago

Thanks for all you've done over the years here. I'm sad you won't be able to do more of it.

The_rationalist5y ago

Nnerthercote own the best blog on performance profiling that I've ever seen, congrats to your huge skill set, Firefox, chromium, and programming languages need more people like you.

Ar-Curunir5y ago

Thank you for your excellent work over the years! Your efforts have gone a long way to making Rust enjoyable to write =)

If there are any smart rust-using company, they should definitely hire nnethercote to continue their excellent work!

dblohm75y ago

Considering that nnethercote is still with us at Mozilla, I sure hope that he doesn’t get hired away! :-)

Ar-Curunir5y ago

Oops, missed that part :)

alex_reg5y ago

> ... Perhaps this relates to the high level of interest in Rust ...

I would have loved these blog posts regardless of what code was actually being optimised.

They offer a fascinating glimpse into a workflow that requires expertise, experimentation and creativity.

Sadly something that most developers can't engage in very often, due to the nature of their work or time constraints.

oshea64bit5y ago

This is a fascinating blog series. I've been dabbling in Rust lately and really appreciate how powerful and helpful the compiler is even to beginners.

> Due to recent changes at Mozilla my time working on the Rust compiler is drawing to a close.

This sort of statement makes me a bit worried though. I don't mean to echo what a lot of the community has said over the past month, but I really hope that development on Rust doesn't stagnate because of the layoffs.

jimbob455y ago

How hasn’t Google taken over and hired the Rust team? Weren’t they practically funding them by funding their parent, Mozilla?

steveklabnik5y ago

Well, most of the Rust team was not employed by Mozilla, so that’s one reason why they have not.

nanagojo5y ago

What about the Servo team? Wouldn't they have an impact on Rust development?

steveklabnik5y ago

They certainly contributed, yes, but there are like two hundred people total on all of the Rust teams. Losing them hurts, they’re fantastic folks, but Rust is just way bigger these days.

1 more reply

FartyMcFarter5y ago

Does Google care about Rust?

pjmlp5y ago

They use it on ChromeOS, Fuchsia, are thinking of introducing it on Android, and driving the conversations of using Linux kernel modules written in Rust.

FartyMcFarter5y ago

Thanks, that is more involvement than I expected / knew about!

1 more reply

stackzero5y ago

gg man. Really enjoyed your posts since starting rust.

xiaodai5y ago

Hmmm... Rust needs alot more given its slow reputation

cp-r5y ago

Yes, compilation got a lot slower in 1.46 https://github.com/rust-lang/rust/issues/75992 and there has been some breakage with `type_length_limit` errors.

xiaodai5y ago

Don't know why I get down votes for pointing out the slowness of Rust which is well known

lucozade5y ago

Probably because you just stated something that is well known without adding anything constructive?

k__5y ago

The title made it sound like the Rust compiler is at its performance limit and they doing the last possible optimization.

j / k navigate · click thread line to collapse

57 comments

__s5y ago

Nicholas Nethercote didn't just speed up Rust. He went in & did the dirty work of dredging through Firefox profiling

> It’s rare that a single micro-optimization is a big deal, but dozens and dozens of them are. Persistence is key

Persistence is work. Mozilla is cutting the people who put in the work of staving off bitrot

nnethercoteOP5y ago

Thank you for the kind words.

To clarify: I am still at Mozilla! But I will be working fully on Firefox for the foreseeable future. I have edited the opening paragraph of the post to make this clearer.

qzw5y ago

Ygg25y ago

I'd honestly rather see Rust survive, than Mozilla. Mozilla is few years away from going Blink, and on deathbed.

Rust is up and coming language.

2 more replies

exmozilla5y ago

__s5y ago

Glad to hear, as someone who continues to use Firefox it's reassuring to know we'll continue to benefit from your skills

agumonkey5y ago

I still remember your blog entries about chasing memory use in firefox before quantum :)

ndesaulniers5y ago

> The improvements I did are mostly what could be described as “bottom-up micro-optimizations”.

> I also did two larger “architectural” or “top-down” changes

My summer intern started doing profiling work on compile times with clang: https://lists.llvm.org/pipermail/llvm-dev/2020-July/143012.h...

Some things we found:

* Don't speculate on what could be slow; profiles will surprise you.

Agree. We prefer cycle counts via LBR, but only for comparing diffs of the same program, as you describe.

epage5y ago

> for a large C codebase like the Linux kernel, we're spending way more time in the front-end (clang) than the backend (llvm). This was surprising based on rustc's experience with llvm.

ndesaulniers5y ago

Clang doesn't do optimizations though. It generates llvm via a simple tree walk.

fluffything5y ago

The consequence is that Rust spends more time on LLVM relatively speaking than C and C++.

The solution to this problem in Rust is naively simple: write smaller translation units.

1 more reply

cesarb5y ago

> I was surprised by how many people said they enjoyed reading this blog post series. The appetite for “I squeezed some more blood from this stone” tales is high.

zem5y ago

the early neovim posts were very absorbing too

vlovich1235y ago

I'm curious if the Rust team has evaluated stabilizer to be more robust about the optimizations they choose: https://emeryberger.com/research/stabilizer/

nnethercoteOP5y ago

> This is a supremely surprising conclusion

That's why I started the paragraph with "Contrary to what you might expect".

vlovich1235y ago

So have you not found in practice any data dependencies or cache issues show up as bottle necks? Or do current tools just make this more of a blind spot for optimization?

nnethercoteOP5y ago

I mentioned in the post several areas I myself haven't looked at, including cache misses. There may be room for improvements there.

There is an experimental parallel rustc front-end, e.g. see https://internals.rust-lang.org/t/help-test-parallel-rustc/1...

> any exploration of the idea of keeping track of information across builds so that incremental compilation is faster

That's exactly what incremental compilation does.

1 more reply

JoshuaRLi5y ago

It isn't surprising to me that instruction counts as a performance metric would have less variance than wall time. Did you know that the environment size can actually affect wall time?

From https://www.inf.usi.ch/faculty/hauswirth/publications/asplos....

vlovich1235y ago

And yes. I'm aware of that result because of Professor Berger's talks on Coz & the other work he's done in this space.

JoshuaRLi5y ago

Ahh, very well said!

Twirrim5y ago

> I would have thought that some instructions could be slower than others (especially on x86) so that using more faster individual instructions could be faster than 1 slower instruction..

There are some fun cases where that is definitely true, to whit pdep / pexp on Zen based architectures. https://dolphin-emu.org/blog/2020/02/07/dolphin-progress-rep...

https://twitter.com/uops_info/status/1202950247900684290

makomk5y ago

hajile5y ago

est315y ago

It's sad to see your rustc contributions stop, nnethercote. I guess rustc now has to run an experiment on how quickly performance improves without you :(.

IMO compiler speed still remains the main ergonomics hurdle in developing Rust software.

steveklabnik5y ago

Thanks for all you've done over the years here. I'm sad you won't be able to do more of it.

The_rationalist5y ago

Nnerthercote own the best blog on performance profiling that I've ever seen, congrats to your huge skill set, Firefox, chromium, and programming languages need more people like you.

Ar-Curunir5y ago

Thank you for your excellent work over the years! Your efforts have gone a long way to making Rust enjoyable to write =)

If there are any smart rust-using company, they should definitely hire nnethercote to continue their excellent work!

dblohm75y ago

Considering that nnethercote is still with us at Mozilla, I sure hope that he doesn’t get hired away! :-)

Ar-Curunir5y ago

Oops, missed that part :)

alex_reg5y ago

> ... Perhaps this relates to the high level of interest in Rust ...

I would have loved these blog posts regardless of what code was actually being optimised.

They offer a fascinating glimpse into a workflow that requires expertise, experimentation and creativity.

Sadly something that most developers can't engage in very often, due to the nature of their work or time constraints.

oshea64bit5y ago

This is a fascinating blog series. I've been dabbling in Rust lately and really appreciate how powerful and helpful the compiler is even to beginners.

> Due to recent changes at Mozilla my time working on the Rust compiler is drawing to a close.

jimbob455y ago

How hasn’t Google taken over and hired the Rust team? Weren’t they practically funding them by funding their parent, Mozilla?

steveklabnik5y ago

Well, most of the Rust team was not employed by Mozilla, so that’s one reason why they have not.

nanagojo5y ago

What about the Servo team? Wouldn't they have an impact on Rust development?

steveklabnik5y ago

They certainly contributed, yes, but there are like two hundred people total on all of the Rust teams. Losing them hurts, they’re fantastic folks, but Rust is just way bigger these days.

1 more reply

FartyMcFarter5y ago

Does Google care about Rust?

pjmlp5y ago

They use it on ChromeOS, Fuchsia, are thinking of introducing it on Android, and driving the conversations of using Linux kernel modules written in Rust.

FartyMcFarter5y ago

Thanks, that is more involvement than I expected / knew about!

1 more reply

stackzero5y ago

gg man. Really enjoyed your posts since starting rust.

xiaodai5y ago

Hmmm... Rust needs alot more given its slow reputation

cp-r5y ago

Yes, compilation got a lot slower in 1.46 https://github.com/rust-lang/rust/issues/75992 and there has been some breakage with `type_length_limit` errors.

xiaodai5y ago

Don't know why I get down votes for pointing out the slowness of Rust which is well known

lucozade5y ago

Probably because you just stated something that is well known without adding anything constructive?

k__5y ago

The title made it sound like the Rust compiler is at its performance limit and they doing the last possible optimization.

j / k navigate · click thread line to collapse