Lolbench: automagically and empirically discovering Rust performance regressions (opens in new tab)

(blog.anp.lol)

199 pointsanp7y ago40 comments

40 comments

Do you do have any plans to better distinguish between noise and regressions? I run a similar performance testing infrastructure for Chakra, and found that comparing against the previous run makes the results noisy. That means more manual review of results, which gets old fast.

What I do now is run a script that averages results from the preceding 10 runs and compares that to the average of the following 5 runs to see if the regression is consistent or anomalous. If the regression is consistent, then the script automatically files a bug in our tracker.

There is still some noise in the results, but it cuts down on those one-off issues.

anpOP7y ago

I talked about this a little bit in the meetup talk I linked and I intend to write a bit more about this, but I'll try to summarize.

There are kind of three prongs here:

First, using criterion.rs does a ton for giving us more stable metrics. It handles things like warmups, accounting for obvious statistical outliers in the sample runs, postprocessing the raw data to provide more meaningful statistics, etc. I'm currently using a fork of the library which additionally does this recording and processing of a variety of metrics we get from `perf_event_open` on Linux but which I assume you could get through ETW or Intel/AMD's userspace PMC libraries.

Second, I try to provide a stable environment so that results over long time deltas are comparable and we can store the data for offline analysis rather than having to checkout recent prior commits and compare the current PR/nightly/etc against them. Prior to the current deployment I was using cgroups to move potentially competing processes off of the benchmark cores which produced some nice results. However I had some issues with the version of the cpuset utility I installed on the debian machines and I haven't sorted it out yet.

Third, we do a few things with the time-series-esque data we get from measuring multiple toolchains to try and only surface relevant results. Those are mostly in src/analysis.rs if you want to poke around. It basically boils down to calculating the Kernel Density Estimate of the current toolchain's value being from the same population (I hope these terms are halfway correct) as all prior toolchains' value.

I hope that with a few extensions to the above we can get close to being reliable enough to include in early PR feedback, but I think the likely best case scenario is a manually invoked bot on PRs followed by me and a few other people triaging the regressions surfaced by the tool after something merges.

Here are a few issues that I think will help improve this too:

https://github.com/anp/lolbench/issues/20

https://github.com/anp/lolbench/issues/17

https://github.com/anp/lolbench/issues/14

mkl7y ago

Do you mean 10 preceding versions, or 10 repeated timings of the same version? If you repeat the timing for the each version many times, why is that not enough to smooth out the noise?

dsamarin7y ago

Instead of averaging, I can recommend my go-to L-estimator for this sort of thing: the midsummary. Take the average of the 40% and 60% percentile as your measure of central tendency of performance.

gus_massa7y ago

You usually have to kill the outliers. A cheap trick is to measure 12 times the same thing and dropping the biggest and the smallest one.

chriswarbo7y ago

For those wanting to do similar tracking of benchmarks across commits, I've found Airspeed Velocity to be quite nice ( https://readthedocs.org/projects/asv ). It allows (but doesn't require) benchmarks to be kept separate to the project's repo, can track different configurations separately (e.g. using alternative compilers, dependencies, flags, etc.), keeps results from different machines separated, generates JSON data and HTML reports, performs step detection to find regressions, etc.

It was intended for use with Python (virtualenv or anaconda), but I created a plugin ( http://chriswarbo.net/projects/nixos/asv_benchmarking.html ) which allows using Nix instead, so we can provide any commands/tools/build-products we like in the benchmarking environment (so far I've used it successfully with projects written in Racket and Haskell).

anpOP7y ago

hi! author here if you want to ask questions or (nicely pls) let me know where I've made mistakes!

valarauca17y ago

How do you determine baseline load of the test machine in order to qualify the correctness of the benchmark?

Assuming the compiling, and testing is done in the cloud how do you ensure the target platform (processor) doesn't change, and that you aren't being subjected to neighbors who are stealing RAM bandwidth, or CPU cache resources from your VM and impacting the results?

anpOP7y ago

Each benchmark result is only compared against values from running on literally the same machine, actually. I agree that good results here would be extremely difficult to produce on virtualized infra, so I rented a few cheap dedicated servers from Hetzner. I'm glad that I decided to pin results to a single machine, because even between these identically binned machines from Hetzner I saw 2-4% variance between them when I ran some phoronix benches to compare.

I go into a little bit of detail on this in the talk I link to towards the bottom of the post, here's a direct link for convenience: https://www.youtube.com/watch?v=gSFTbJKScU0.

usefulcat7y ago

A suggestion: consider using callgrind to measure performance (instructions retired, cache misses, branch mispredictions, whatever) instead of wall clock time. It will be much slower per run, but since it will also be precise you shouldn't need to do multiple runs, and you should be able to run a bunch of different benchmarks concurrently without them interfering with each other or having anything else interfere with them.

anpOP7y ago

I currently do something pretty similar by using the perf subsystem in the Linux kernel to track the behavior of each benchmark function. In my early measurements I found concurrent benchmarking to introduce unacceptable noise even with this measurement tool and with cgroups/cpusets used to pin the different processes to their own cores. Instead of trying to tune the system to account for this, I chose to build tooling for managing a single runner per small cheap machine.

1 more reply

v_lisivka7y ago

Please don't, because memory access pattern will be very different.

CUViper7y ago

Some of those target benchmarks are on Rayon, and we've found that valgrind interferes with threading way too much to be useful there.

shepmaster7y ago

This is one of the many metrics of the official Rust compiler performance benchmarks [1].

[1]: https://perf.rust-lang.org/nll-dashboard.html

MikeHolman7y ago

I haven't used callgrind, but wouldn't running benchmarks concurrently still lead to cache interference?

1 more reply

valarauca17y ago

Thanks for the link. I'll give it watch :D

panic7y ago

The "More Like Rocket Science Rule of Software Engineering" has been WebKit policy for a while: https://web.archive.org/web/20061011203328/http://webkit.org... (now at https://webkit.org/performance/).

twtw7y ago

> Common excuses people give when they regress performance are, “But the new way is cleaner!” or “The new way is more correct.” We don’t care. No performance regressions are allowed, regardless of the reason. There is no justification for regressing performance. None.

This seems a bit extreme. Would they accept a regression to fix a critical security vulnerability? Code can be infinitely fast if it need not be correct.

bloomer7y ago

A better version...

Common excuses people give when introducing security vulnerabilities are, “but the new way is faster” or “the new way is more clever”. We don’t care. No security vulnerabilities are allowed, regardless of the reason. There is no justification for security vulnerabilities. None.

I love that “correct” is a “justification” in the original. I would be embarrassed to be associated with such a juvenile page. Move fast with broken things...

SatvikBeri7y ago

The bigger and less centralized a team is, the less you can afford edge cases in your rules. Even running a team of 6 people I found I had to make rules more rigid than I would ideally have liked.

On my own, I'll have many internal "rules" that are very flexible. But in a team I need a small number of rules that are rigidly enforced.

samontar7y ago

Besides I know they’ve sacrificed speed at the altar of correctness many times. My own renderer actually is much faster than theirs though at the price of a mild incorrectness: it discards all input and draws nothing.

habitue7y ago

This project looks awesome, but as a complete aside:

How long do we expect it to take before "automagically" completely replaces "automatically" in English?

I am guessing less than a decade to go now

anpOP7y ago

I use this word the way we did when I worked as a PC technician and help desker, where there's a lot of automation but then we sneak a bit of manual labor in to make it actually useful. Like how user accounts would be maintained in the correct state automagically.

daurnimator7y ago

I've used it for decades to mean "automatically as if by magic" in the sense of Arthur C Clake's quote: "Any sufficiently advanced technology is indistinguishable from magic."

To flip this adage around: calling your own tech as performing something "automagically" is tantamount to calling it "sufficiently advanced technology".

LukeShu7y ago

automagically: /aw·toh·maj´i·klee/, adv. Automatically, but in a way that, for some reason (typically because it is too complicated, or too ugly, or perhaps even too trivial), the speaker doesn't feel like explaining to you.

http://www.catb.org/jargon/html/A/automagically.html

anpOP7y ago

Well this interpretation certainly explains some of the reactions that word has gotten in my use of it!

habitue7y ago

Sorry, I didn't mean to imply disapproval, it was basically an idle thought since I see that replacement pretty commonly

1 more reply

hsivonen7y ago

Very nice!

Do you track opt_level=2 (the Firefox Rust opt level) in addition to the default opt_level=3?

anpOP7y ago

Thanks!

Not yet, but I am tracking this as a desired feature: https://github.com/anp/lolbench/issues/9. The benchmark plan generation, storage keys, and results presentation will at a minimum need to be extended to support a matrix of inputs to each benchmark function. Right now there are a number of implicit assumptions that each benchmark function is tracked as a single series of results.

thsowers7y ago

This is really cool, love the project and the writeup! I regularly use nightly (I work with Rocket) and I had always wondered about this. Thank you!

Twirrim7y ago

Can I suggest you consider putting https://github.com/anp/lolbench/issues/1 in to the README.md file, so people can easily see where to look for some TODO items?

dikaiosune7y ago

great idea!

awake7y ago

Is there any equivalent project for java.

kodablah7y ago

Surely JDK devs have a corpus of projects they test releases against, but JVM devs tend to not do the microbenchmarks this would need very often. So in general a corpus of centralized JMH benchmarks would probably have more value than referencing them from other projects. I'm sure an entity could offer this service if, e.g., projects with JMH benchmarks invoked in a common form from maven central or github or whatever were submitted to a central curator but not sure who would want to curate that.

dralley7y ago

Java has a JIT, right? Seems more difficult to get consistent results.

anothergoogler7y ago

You can benchmark reliably by warming up the JVM and disabling garbage collection.

https://www.ibm.com/developerworks/java/library/j-benchmark1...

https://www.ibm.com/developerworks/library/j-benchmark2/inde...

http://www.ellipticgroup.com/html/benchmarkingArticle.html

Java now ships with microbenchmarking helpers:

http://openjdk.java.net/projects/code-tools/jmh/

LukeShu7y ago

https://tratt.net/laurie/blog/entries/why_arent_more_users_m...

j / k navigate · click thread line to collapse

40 comments

MikeHolman7y ago

There is still some noise in the results, but it cuts down on those one-off issues.

anpOP7y ago

I talked about this a little bit in the meetup talk I linked and I intend to write a bit more about this, but I'll try to summarize.

There are kind of three prongs here:

Here are a few issues that I think will help improve this too:

https://github.com/anp/lolbench/issues/20

https://github.com/anp/lolbench/issues/17

https://github.com/anp/lolbench/issues/14

mkl7y ago

Do you mean 10 preceding versions, or 10 repeated timings of the same version? If you repeat the timing for the each version many times, why is that not enough to smooth out the noise?

dsamarin7y ago

Instead of averaging, I can recommend my go-to L-estimator for this sort of thing: the midsummary. Take the average of the 40% and 60% percentile as your measure of central tendency of performance.

gus_massa7y ago

You usually have to kill the outliers. A cheap trick is to measure 12 times the same thing and dropping the biggest and the smallest one.

chriswarbo7y ago

anpOP7y ago

hi! author here if you want to ask questions or (nicely pls) let me know where I've made mistakes!

valarauca17y ago

How do you determine baseline load of the test machine in order to qualify the correctness of the benchmark?

anpOP7y ago

I go into a little bit of detail on this in the talk I link to towards the bottom of the post, here's a direct link for convenience: https://www.youtube.com/watch?v=gSFTbJKScU0.

usefulcat7y ago

anpOP7y ago

1 more reply

v_lisivka7y ago

Please don't, because memory access pattern will be very different.

CUViper7y ago

Some of those target benchmarks are on Rayon, and we've found that valgrind interferes with threading way too much to be useful there.

shepmaster7y ago

This is one of the many metrics of the official Rust compiler performance benchmarks [1].

[1]: https://perf.rust-lang.org/nll-dashboard.html

MikeHolman7y ago

I haven't used callgrind, but wouldn't running benchmarks concurrently still lead to cache interference?

1 more reply

valarauca17y ago

Thanks for the link. I'll give it watch :D

panic7y ago

twtw7y ago

This seems a bit extreme. Would they accept a regression to fix a critical security vulnerability? Code can be infinitely fast if it need not be correct.

bloomer7y ago

A better version...

I love that “correct” is a “justification” in the original. I would be embarrassed to be associated with such a juvenile page. Move fast with broken things...

SatvikBeri7y ago

The bigger and less centralized a team is, the less you can afford edge cases in your rules. Even running a team of 6 people I found I had to make rules more rigid than I would ideally have liked.

On my own, I'll have many internal "rules" that are very flexible. But in a team I need a small number of rules that are rigidly enforced.

samontar7y ago

habitue7y ago

This project looks awesome, but as a complete aside:

How long do we expect it to take before "automagically" completely replaces "automatically" in English?

I am guessing less than a decade to go now

anpOP7y ago

daurnimator7y ago

I've used it for decades to mean "automatically as if by magic" in the sense of Arthur C Clake's quote: "Any sufficiently advanced technology is indistinguishable from magic."

To flip this adage around: calling your own tech as performing something "automagically" is tantamount to calling it "sufficiently advanced technology".

LukeShu7y ago

http://www.catb.org/jargon/html/A/automagically.html

anpOP7y ago

Well this interpretation certainly explains some of the reactions that word has gotten in my use of it!

habitue7y ago

Sorry, I didn't mean to imply disapproval, it was basically an idle thought since I see that replacement pretty commonly

1 more reply

hsivonen7y ago

Very nice!

Do you track opt_level=2 (the Firefox Rust opt level) in addition to the default opt_level=3?

anpOP7y ago

Thanks!

thsowers7y ago

This is really cool, love the project and the writeup! I regularly use nightly (I work with Rocket) and I had always wondered about this. Thank you!

Twirrim7y ago

Can I suggest you consider putting https://github.com/anp/lolbench/issues/1 in to the README.md file, so people can easily see where to look for some TODO items?

dikaiosune7y ago

great idea!

awake7y ago

Is there any equivalent project for java.

kodablah7y ago

dralley7y ago

Java has a JIT, right? Seems more difficult to get consistent results.

anothergoogler7y ago

You can benchmark reliably by warming up the JVM and disabling garbage collection.

https://www.ibm.com/developerworks/java/library/j-benchmark1...

https://www.ibm.com/developerworks/library/j-benchmark2/inde...

http://www.ellipticgroup.com/html/benchmarkingArticle.html

Java now ships with microbenchmarking helpers:

http://openjdk.java.net/projects/code-tools/jmh/

LukeShu7y ago

https://tratt.net/laurie/blog/entries/why_arent_more_users_m...

j / k navigate · click thread line to collapse