Here's my advice for how to run benchmarks and be happy with the results.
- Any experiment you perform has the risk of producing an outcome that misleads you. You have to viscerally and spiritually accept this fact if you run any benchmarks. Don't rely on the outcome of a benchmark as if it's some kind of Truth. Even if you do everything right, there's something like a 1/10 risk that you're fooling yourself. This is true for any experiment, not just ones involving JavaScript, or JITs, or benchmarking.
- Benchmark large code. Language implementations (including ahead of time compilers for C!) have a lot of "winning in the average" kind of optimizations that will kick in or not based on heuristics, and those heuristics have broad visibility into large chunks of your code. AOTs get there by looking at the entire compilation unit, or sometimes even your whole program. JITs get to see a random subset of the whole program. So, if you have a small snippet of code then the performance of that snippet will vary wildly depending on how it's used. Therefore, putting some small operation in a loop and seeing how long it runs tells you almost nothing about what will happen when you use that snippet in anger as part of a larger program.
How do you benchmark large code? Build end-to-end benchmarks that measure how your whole application is doing perf-wise. This is sometimes easy (if you're writing a database you can easily benchmark TPS, and then you're running the whole DB impl and not just some small snippet of the DB). This is sometimes very hard (if you're building UX then it can be hard to measure what it means for your UX to be responsive, but it is possible). Then, if you want to know whether some function should be implemented one way or another way, run an A:B test where you benchmark your whole app with one implementation versus the other.
Why is that better? Because then, you're measuring how your snippet of code is performing in the context of how it's used, rather than in isolation. So, your measurement will account for how your choices impact the language implementation's heuristics.
Even then, you might end up fooling yourself, but it's much less likely.
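If it helps make the A:B idea concrete, here's a rough sketch (Node-flavored; runAppWithA and runAppWithB are hypothetical stand-ins for "drive the whole app end to end with one implementation or the other"):

  // Rough sketch: time the whole app end to end with each implementation,
  // interleaving runs so background noise hits both A and B roughly equally.
  const { performance } = require('node:perf_hooks');

  async function timeRun(runWholeApp) {
    const start = performance.now();
    await runWholeApp();                 // drive the full application path
    return performance.now() - start;
  }

  async function abTest(runAppWithA, runAppWithB, trials = 30) {
    const a = [], b = [];
    for (let i = 0; i < trials; i++) {
      a.push(await timeRun(runAppWithA));
      b.push(await timeRun(runAppWithB));
    }
    const median = xs => xs.slice().sort((x, y) => x - y)[xs.length >> 1];
    console.log('A median ms:', median(a), '| B median ms:', median(b));
  }

The point is that the thing being timed is the whole application path, not the snippet in isolation.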
For UX stuff, 2 steps I’d add if you're expecting a big improvement:
1) Ship some way of doing a sampled measurement in production before the optimization goes out. Networks and the spec of the client devices may be really important to the UX thing you're trying to improve. Likely user devices are different from your local benchmarking environment.
2) Try to tie it to a higher level metric (e.g. time on site, view count) that should move if the UI thing is faster. You probably don't just want it to be faster, you want the user to have an easier time doing their thing, so you want something that ties to that. At the very least this will build your intuition about your product and users.
[1] https://github.com/leeoniya/uDSV/issues/2
You’re being generous or a touch ironic. It’s at least 1/10 and probably more like 1/5 on average and 1/3 for people who don’t take advice.
Beyond testing changes in a larger test fixture, I also find that sometimes multiplying the call count for the code under examination can help clear things up. Putting a loop in to run the offending code 10 times instead of once is a clearer signal. Of course it still may end up being a false signal.
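Something like this, roughly (doWork is just a stand-in for the code under examination):

  // Time a small batch instead of a single call, so the interesting cost
  // rises above timer resolution and fixed overhead.
  const REPS = 10;
  const t0 = performance.now();
  for (let i = 0; i < REPS; i++) {
    doWork();                            // the code under examination
  }
  console.log('approx ms per call:', (performance.now() - t0) / REPS);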
I like a two-phase approach, where you use a small-scale benchmark while iterating on optimization ideas, then check the larger context once you feel you've made progress, and again before you file a PR.
At the end of the day, eliminating accidental duplication of work is the most reliable form of improvement, and one that current and previous generation analysis tools don’t do well. Make your test cases deterministic and look at invocation counts to verify that you expect n calls of a certain shape to call the code in question exactly kn times. Then figure out why it’s mn instead. (This is why I say caching is the death of perf analysis. Once it’s added this signal disappears)
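A crude version of that invocation-count check, if you don't have a tool that does it for you (wrap, realParseRow, and runDeterministicTest are illustrative names, not anything real):

  // Wrap the function under suspicion with a counter so a deterministic
  // test can assert it runs exactly k*n times for n top-level calls.
  function wrap(fn) {
    const counted = (...args) => { counted.calls++; return fn(...args); };
    counted.calls = 0;
    return counted;
  }

  const parseRow = wrap(realParseRow);   // realParseRow: the code in question
  runDeterministicTest();                // hypothetical fixture with a known n
  console.log('parseRow calls:', parseRow.calls);  // expect k*n; investigate if it's m*n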
Oh, and if each op takes under a nanosecond then your benchmark is almost certainly completely broken.
It tries to deal with the uncertainties of different browsers, JITs, GCs, CPU throttling, varying hardware, etc., in several ways:
- Runs benchmarks round-robin to hopefully subject each implementation to varying CPU load and thermal properties evenly.
- It reports the confidence interval for an implementation, not the mean. Doesn't throw out outlier samples.
- For multiple implementations, compares the distributions of samples, de-emphasizing the mean
- For comparisons, reports an NxM difference table, showing how each impl compares to the others.
- Can auto-run until confidence intervals for different implementations no longer overlap, giving high confidence that there is an actual difference.
- Uses WebDriver to run benchmarks in multiple browsers, also round-robin, and compares results.
- Can manage npm dependencies, so you can run the same benchmark with different dependencies and see how different versions change the result.
Lit and Preact use Tachometer to tease out the performance changes in PRs, even on unreliable GitHub Actions hardware. We needed the advanced statistical comparisons exactly because certain things could be faster or slower in different JIT tiers, different browsers, or different code paths.
We wanted to be able to test changes that might have small but reliable overall perf impact, in the context of a non-micro-benchmark, and get reliable results.
Tachometer is browser-focused, but we made it before there were so many server runtimes. It'd be really interesting to make it run benchmarks against Node, Bun, Deno, etc. too.
(a) sometimes the jit doesn’t run
(b) sometimes it makes performance worse
(c) sometimes you don’t even get to a steady state with performance
(d) and obviously in the real world you may not end up with the same jitted version that you get in your benchmarks
Benchmarks can be a useful tool, but they should not be mistaken for real-world performance.
My wheelhouse is making lots of 4-15% improvements and laptops are no good for those.
Iirc, the effects on long running benchmarks in that paper are usually < 1%, which is a big deal for runtime optimizations, but typically dwarfed by the differences between two methods you might measure.
And since I didn’t eliminate the logic I just halved the cost, that means we were spending about twice that much. But I did lower the slope of the regression line quite a lot, and I believe enough that new nodeJS versions improve response time faster than it was organically decaying. There were times it took EC2 instance type updates to see forward progress.
The primary motivation for limiting timer resolution was the rise of speculative execution attacks (Spectre / Meltdown), where high-resolution timers are integral for differentiating between timings within the memory hierarchy.
https://github.com/google/security-research-pocs/tree/master...
If you look at when various browsers changed their timer resolutions, it's entirely a response to Spectre.
https://blog.mozilla.org/security/2018/01/03/mitigations-lan...
https://issues.chromium.org/issues/40556716 (SSCA -> "speculative side channel attacks")
Only final-stage, fully JITted, profile-optimized code is what matters.
Short-lived interpreted / level-1 JITted code is not interesting at all from a benchmarking perspective, because it will be compiled quickly enough that it doesn't matter in the grand scheme of things.
Sure, if you make a 100% consistent environment of a VM running just the single microbenchmark you may get a consistent result on one system, but is a consistent result in any way meaningful if it may be a massive factor away from what you'd get in a real environment? And even then I've had cases of like 1.5x-2x differences for the exact same benchmark run-to-run.
Granted, this may be less of a benchmarking issue, more just a JIT performance issue, but it's nevertheless also a benchmarking issue.
Also, for JS, in browser specifically, pre-JIT performance is actually a pretty meaningful measurement, as each website load starts anew.
For simple methods I usually run the benchmarked method 100k times; 10k is the minimum for full JIT.
For large programs I have noticed the performance keeps getting better for the first 24 hours, after which I take a profiling dump.
Java gives you exceptional control over the JVM, allowing you to create really good benchmark harnesses. That is not the case with JavaScript today, and the proliferation of different runtimes makes it harder still. To the best of my knowledge there is no JMH equivalent for JavaScript today.
When JITing JavaScript, every single fundamental operation has profiling. Adding stuff has multiple bits of profiling. Every field access. Every array access. Like, basically everything, including also callsites. And without that profiling, the JS JIT can't do squat, so it depends entirely on that profiling. So the randomness due to profiling has a much more extreme effect on what the compiler can even do.
This is true for servers but extremely not true for client-side GUI applications and web apps. Often, the entire process of [ user starts app > user performs a few tasks > user exits app ] can be done in a second. Often, the JIT never has a chance to warm up.
In such a case you need a "binary" benchmark: does the user need to wait or not? You don't need fancy graphs, percentiles, etc.
And in such a case your worst enemy is not the JIT but the variance of users' hardware, from an old Atom netbook to a high-end workstation with tens of 5 GHz cores. Same for RAM and screen size.
- running the benchmark in thin slices, interspersed and shuffled, rather than in one big batch per item (which also avoids having one scenario penalized by transient noise)
- displaying graphs that show possible multi-modal distributions when the JIT gets in the way
- varying the lengths of the thin slices between runs to work around the poor timer resolution in browsers
- assigning the results of the benchmark to a global (or a variable in the parent scope, as it is in the WEB demo below) to avoid dead code elimination (sketched just below)
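The global-sink trick from the last point looks roughly like this (a minimal sketch; computeUnderTest is a placeholder for the code being measured):

  // Assign each result to a global the engine can't prove is unused,
  // so the JIT can't dead-code-eliminate the expression being measured.
  globalThis.__sink = 0;
  const N = 1e6;
  for (let i = 0; i < N; i++) {
    globalThis.__sink = computeUnderTest(i);   // placeholder for the code under test
  }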
This isn't a panacea, but it is better than the existing solutions, as far as I'm aware.
There are still issues because, sometimes, even if the task order is shuffled for each slice, the literal source order can influence how/if a bit of code is compiled, resulting in unreliable results. The "thin slice" approach can also dilute the GC runtime between scenarios if the amount of garbage isn't identical between scenarios.
I think it is, however, a step in the right direction.
- CLI runner for NODE: https://github.com/pygy/bunchmark.js/tree/main/packages/cli
- WIP WEB UI: https://flems.io/https://gist.github.com/pygy/3de7a5193989e0...
In both cases, if you've used JSPerf you should feel right at home in the WEB UI. The CLI UI is meant to replicate the WEB UI as closely as possible (see the example file).
You can have some function that iterates over something and benchmark two different implementations and draw conclusions that one is better than the other.
Then, in real world, when it's in the context of some other code, you just can't draw conclusions because different engines will optimize the very same paths differently in different contexts.
Also, your micro benchmark may tell you that A is faster than B... when it's a hot function that has been optimized due to being used frequently. But then you find that B, which is used only a few times and doesn't get optimized, runs faster by default.
It is really neither easy nor obvious to benchmark different implementations, let alone account for the differences across engines, browsers, devices and OSs (which will use different OS calls and compiler behaviors).
Dead code elimination is the most obvious way this happens, but you can also have issues where you give the branch predictor "help", or you can use a different number of implementations of a method so you get different inlining behavior (this can make a benchmark better or worse than reality), and many others.
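To make the inlining point concrete (a rough sketch, not specific to any engine): a micro-benchmark that only ever exercises one object shape at a call site can stay monomorphic and get inlined, while the real program makes the same site polymorphic and ends up with different machine code.

  function getX(o) { return o.x; }

  // micro-benchmark: one shape at the call site -> monomorphic, easily inlined
  for (let i = 0; i < 1e6; i++) getX({ x: i });

  // real program: several shapes at the same call site -> polymorphic,
  // so the number from the loop above may not transfer
  for (let i = 0; i < 1e6; i++) {
    getX(i % 2 ? { x: i } : { x: i, y: i, z: i });
  }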
As for runtime, if you're creating a library, you probably care at least a little bit about alternate runtimes, though you may well just target node/V8 (on the JVM, I've done limited benchmarking on runtimes other than HotSpot, though if any of my projects get more traction, I'd anticipate needing to do more).
You might get there eventually but a lot of people don’t persevere where perf is concerned. They give up early which leaves either a lot of perf on the table or room for peers to show them up.
Not knowing what performance to expect is what allows you to build a website and expect it to run properly years later, on browsers that haven’t been released yet, running on future mobile phones that use chips that haven’t been designed yet, over a half-working WiFi connection in some cafe somewhere.
Being ignorant of performance is what allows you to create Docker images that work on random servers in arbitrary datacenters, at the same time that perfect strangers are running their jobs and arbitrarily changing what hardware is available for your code to use.
It’s also what allows you to depend on a zillion packages written by others and available for free, and upgrade those packages without things horribly breaking due to performance differences, at least most of the time.
If you want fixed performance, you have to deploy on fixed, dedicated hardware, like video game consoles or embedded devices, and test on the same hardware that you’ll use in production. And then you drastically limit your audience. It’s sometimes useful, but it’s not what the web is about.
But faster is better than slower, so we try anyway. Understanding the performance of portable code is a messy business because it’s mostly not the code, it’s our assumptions about the environment.
We run tests that don’t generalize. For scientific studies, this is called the “external validity” problem. We’re often doing the equivalent of testing on mice and assuming the results are relevant for humans.
Most optimizations will improve things on all or at least some VMs, and most will not make things slower on the others.
If you write code that will be scaled up, optimization can save a lot of money and give better uptime; it's not a bad thing, and the better code is not less portable in most cases.
In short, the JavaScript backend people now need to do what we JavaScript frontend people have been doing since SPAs became a thing: run benchmarks across multiple engines instead of just one.
The right design is probably one that:
1) runs different tests in different forked processes, to avoid variance based on the order in which tests are run changing the JIT’s decisions.
2) runs tests for a long time (seconds or more per test) to ensure full JIT compilation and statistically meaningful results (a rough sketch of both is below)
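Roughly, for both points, something like this (a sketch using Node's child_process; bench-foo.js and bench-bar.js are hypothetical scripts that each run one test for several seconds and print a result):

  // Run each test in a fresh process so one test's JIT/GC state can't
  // influence another, and let each child run long enough to fully warm up.
  const { execFileSync } = require('node:child_process');

  for (const script of ['bench-foo.js', 'bench-bar.js']) {   // hypothetical scripts
    const out = execFileSync(process.execPath, [script], { encoding: 'utf8' });
    console.log(script, '->', out.trim());
  }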
Then you need to realize that your micro benchmarks give you information and help you understand, but the acid test is improving the performance of actual code.
I hate measuring things because accuracy is hard. I wish I could just make up my own numbers to make myself feel better.
It is surprising to me how many developers cannot measure things, do so incorrectly, and then look for things to blame for their emotional turmoil.
Here is a quick guide to solve for this:
1. Know what you are measuring and what its relevance is to your product. It is never about big or small because numerous small things make big things.
2. Measuring things means generating numbers and comparing those numbers against other numbers from a different but similar measure. The numbers are meaningless if there is no comparison.
3. If precision is important, use the high-performance timing tools provided by the browser and Node for measuring things (sketched after this list). You can get precision down to the nanosecond range and then account for the variance, that plus/minus range, in your results. If you are measuring real-world usage and your numbers get smaller due to performance refactoring, expect variance to increase. It's ok, I promise.
4. Measure a whole bunch of different shit. The point of measuring things isn’t about speed. It’s about identifying bias. The only way to get faster is to know what’s really happening and just how off base your assumptions are.
5. Never ever trust performance indicators from people lacking objectivity. Expect to have your results challenged and be glad when they are. Rest on the strength of your evidence and ease of reproduction that you provide.
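For point 3, the timers in question look roughly like this (a sketch; doWork is a placeholder for whatever you're measuring):

  // Browser and Node: performance.now() gives sub-millisecond timestamps
  // (browsers deliberately coarsen it, per the Spectre discussion elsewhere in the thread).
  const t0 = performance.now();
  doWork();                                        // placeholder for the measured work
  const elapsedMs = performance.now() - t0;

  // Node only: process.hrtime.bigint() gives nanosecond-resolution timestamps.
  const n0 = process.hrtime.bigint();
  doWork();
  const elapsedNs = process.hrtime.bigint() - n0;  // BigInt, in nanoseconds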
had this discussion about GC pressure a bit ago: https://github.com/leeoniya/uDSV/issues/2
It does JIT warmup and ensures that your code doesn't get optimized out (by making it produce a side effect from its result).
It provides a bunch of features to help avoid JIT optimization foot-guns during benchmarking, and dips into more advanced stuff like hardware CPU counters to see what the end result of the JIT is on the CPU.
I think they are used to waiting because they no longer know the speed of desktop applications.
it's all fun and games until your battery dies 3 hours too soon.
You'd have to run benchmarks for all sorts of little things because no browser would leave things be. If they thought one popular benchmark was using string+string, it was all or nothing to optimize that, harming everything else. Next week, if that benchmark changed to string[].join... you get the idea. Your code was all over the place in performance: flying today, molasses next week... Sometimes Chrome and FF would switch the optimizations, so you'd serve string+string to one and array.join to the other. Sigh.
Look at quickjs, and use your own very lean OS interfaces.
More code does not inherently mean worse performance.
For performance, don't use javascript anyway...
That said, a much less bad middle ground would be to have performance-critical blocks written in assembly (RISC-V Now!) orchestrated by JavaScript.
for (let i = 0; i < 1000; i++) {
  console.time()
  // do some expensive work
  console.timeEnd()
}
Take your timing before and after the loop and divide by the count; too much jitter otherwise. d8 and node have many options for benchmarking, and if you really care, go command line. JSC is what is behind Bun, so you can go that direction as well.
And BTW: console.time et al. do a bunch of stuff themselves. You will get the JIT looking to optimize them as well in that loop above, lol.
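In other words, something like this instead (a sketch; the loop body is a placeholder):

  const ITERATIONS = 1000;
  const start = performance.now();
  for (let i = 0; i < ITERATIONS; i++) {
    // do some expensive work
  }
  // one timing around the whole loop, divided by the count
  console.log('avg ms per iteration:', (performance.now() - start) / ITERATIONS);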
Which gives an average rather than a time?
var innerCount = 2000; // should run about 2 seconds
for (var i = 0; i < 1000; i++) {
  var start = currentMillis();
  for (var j = 0; j < innerCount; j++) {
    benchmarkMethod();
  }
  best = min(best, (currentMillis() - start) / (double) innerCount);
}
That way I can both get enough precision from the millisecond resolution and run the whole thing enough times to get the best result without JIT/GC pauses. The result is usually very stable, even when benchmarking calls to a database (running locally).
I upvoted that.
Evolution chart:
https://egbert.net/blog/articles/javascript-jit-engines-time...