https://github.com/TeXitoi/benchmarksgame-rs/pull/44
I'm also working on some additional changes that make it even faster than the C version (again, on my computer) by improving how it divides work across CPUs:
https://github.com/TeXitoi/benchmarksgame-rs/pull/46
These improved Rust programs have not yet been added to the benchmarksgame site. Previous entries are ranked at:
http://benchmarksgame.alioth.debian.org/u64q/performance.php...
Minor improvements to the Rust programs for "mandelbrot" and "binary-trees" are also awaiting review!
UPDATE: Pushed a commit to my latest PR to replace the unsafe `File::from_raw_fd` with your `File::open`. Thanks!
I like Rust for what it does in advancing the state of the art of the languages, but I also like how this example demonstrates how hard it is to avoid "unsafe" constructs and remain competitive.
https://github.com/TeXitoi/benchmarksgame-rs/blob/master/src...
https://github.com/mbrubeck/benchmarksgame-rs/blob/reverse_c...
It's very easy to write extremely fast safe Rust code. (The safe Rust version above is faster than the fastest C++ submission, on my computer.) Using "unsafe" for optimization is usually only helpful to get a few extra percent speedup in an inner loop. If this were production code rather than the benchmarks game, I'd probably ship the safe version.
[46.03 secs] http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
and then the hash function was changed to FnvHasher --
[17.10 secs] http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
and then better use of quad core with futures_cpupool --
[9.44 secs] http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
and now use of std::collections::HashMap has been replaced with an experimental hash table inspired by Python 3.6's new dict implementation --
[5.30 secs] http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
afaict comparing #4 to #5 is all about differences between that experimental hash table and std::collections::HashMap --
[9.14 secs] http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
There was also a small contribution (~6%) of working on bytes rather than strings according to the original PR: https://github.com/TeXitoi/benchmarksgame-rs/pull/39
When discussing whether or not to use this, at least one person mentioned that their company was using it in production. I don't think it really counts as experimental.
> Experimental hash table implementation in just Rust (stable, no unsafe code).
People who were trying to castigate Go early on as having "Java-like speeds" were really just showing their ignorance of the state of the art of JIT compilation for managed languages and the JVM. Such outdated folk knowledge of performance in the programming field seems to be a constant over the decades. (Programmers have had such distorted views since the mid 80's at least.) Maybe this kind of knowledge needs to be used in job interview questions for a while? Very soon, people will just memorize such trivia for interviews, but it would serve to squash this form of folk programming "alternative fact."
The part where Java really falls down is in memory use & management, which you can see on the binary-tree & mandelbrot benchmarks, where it's roughly 4x slower than C. There are inherent penalties to pointer chasing that you can't get around. While HotSpot is often (amazingly) smart enough to inline & stack-allocate small private structs, typical Java coding style relies on complex object graphs. In C++ or Rust these would all have well-defined object ownership and be contained within a single block of memory, so access is just "add a constant to this pointer, and load". In Java, you often need to trace a graph of pointers 4-5 levels deep, each of which may cause a cache miss.
Rule of thumb while I was at Google was to figure on real-world Java being about 2-3x slower than real-world C++.
binary-tree is not useful for comparing GCed and non-GCed languages. For non-GCed languages, you are allowed to use a memory pool of your choice (the C version uses the Apache Portable Runtime library), for GCed languages you are required to use the standard GC with the default settings (no adjustment of GC parameters permitted). This is apples and oranges.
For mandelbrot, the C version uses handcoded SIMD intrinsics. I.e. it's not even portable to non-x86 processors.
Before I left the Smalltalk part of my career behind, someone had the occasion to compare the parser-compiler of one Smalltalk, implemented in C with Yacc/Lex, against one implemented in pure Smalltalk on a JIT VM. It turns out that once the console logging was disabled, the JIT VM's parser was just as fast as the one in C.
In my experience of almost 2 decades, it has been a constant that uninformed programmers are especially uninformed about the relative performance of managed languages.
http://benchmarksgame.alioth.debian.org/u64q/java.html
(And as always, it's possible that someone can write a much faster Java program than the ones submitted so far.)
Also, many younger developers believe that C was always fast, and are unaware that early 8- and 16-bit compilers for home computers performed like managed languages do today: the compilers generated far worse code than hobbyist assembly developers wrote by hand.
Yeah, I wrote several published games entirely in assembly back when that was really the only reasonable option.
Of course, some of the things we had to deal with meant that even if the compilers had been good, they couldn't have been good enough.
Extremely limited memory is the obvious one, but ROM pages that need to be swapped in and out at runtime are the other. An example: the Game Boy had 64k of memory addressing, but would ship with 128k and larger cartridges. 32k of the memory space was reserved for video memory, RAM, and hardware registers (if I remember correctly). The first 16k of memory was always mapped to the first 16k of ROM, but the second 16k of memory was mapped to arbitrary 16k blocks of ROM. 16k wasn't enough space to hold your entire game, so some code would leak into other pages (hopefully not more than ONE other page), but you also had to be able to swap other ROM pages into that second 16k memory region to, e.g., load graphics and game data.
There isn't a compiler around even today that can juggle all of that automatically. At a minimum you'd need to be marking different functions as belonging to different memory regions, but you'd still have to manually keep track of which function was where and ensure you don't try to call a region-2 function from a region-1 function when region-2 is some arbitrary ROM page instead of the extra code page. But often something in region-2 needs to call region-1 to load graphics into sprite registers and then restore region-2 so that the stack frame becomes valid again. :)
And 32k is SUCH a small amount of memory that being able to chip away at every single function was important. You have a function that does two things, but sometimes you just need to do the second? Put a label halfway through and call directly into the middle. You can save a byte here by loading one register into another, because you know that the registers will have the right values? Or you can save another byte there because you've realized that a particular constant is loaded a lot, and you can store that constant into one of the 128 cheap-memory locations? Do it! We need every byte...
Yes it would be possible to write such a compiler, but the level of effort to customize it to the specific architecture would be extreme. Easier to just make programmers do the hard work.
359% more RAM isn't very impressive. Even less so when one considers the 20+ years of effort spent to achieve it.
You know what impresses me? A 22 month old language besting everything else while guaranteeing no segfaults or NPEs at compile time. That's impressive.
It's also worth remembering that this is just a game, and as such it's somewhat of an apples-to-oranges comparison. Java's RAM usage may not be impressive compared to C, but that doesn't make the results overall any less impressive, especially knowing what it was like before those 20+ years of effort.
I don't see a reason that both Rust and Java's results can't be impressive in their own right. Java's numbers are (for the most part) impressive compared to C and Rust's numbers are also impressive compared to C, just in a different way.
I can sell a 2x slowdown to my boss if I can show that that's the only cost of a significantly more elegant and productive language, but at 100x that's a much harder sell.
1. Cache performance and locality may differ considerably for large applications. For example, a microbenchmark may perform well because it can monopolize the L1 cache in a way that is not possible for larger applications.
2. Aggressive inlining, a common factor in microbenchmark performance, may not scale well for large code bases. The tradeoffs here can be rather complex: https://webkit.org/blog/2826/unusual-speed-boost-size-matter...
3. Microbenchmarks often avoid abstraction in order to achieve high performance. This can create unacceptable software engineering costs.
But come on, now Rust can legitimately be called "faster than C" ;)
At least until the Clang C version is added... or maybe it will still be faster.
Actually it's pretty easy: if you can write a faster version in any language, given whatever is there in the language, even if it's C-implemented standard library stuff, do it.
The implementations are not meant to be final -- people can contribute faster ones.
FFI is a fact of life. I can imagine it would be perverting the intent of the game if you explicitly used FFI to delegate the actual core of the benchmark program to C as opposed to using "standard" building blocks, but the distinction is necessarily vague.
So, the usual. :)
I stopped taking into account the benchmarks game completely after I saw apples-to-potatoes comparisons a few years back.
Long ago the Ada program contributors told me -- It's only the same program if the same assembler is generated ;-)
Perhaps sort by cpu (seconds) rather than (elapsed) secs.
http://benchmarksgame.alioth.debian.org/u64q/performance.php...
e.g. Felix or Pony would dominate it then, over C++ with OpenMP.
Here, only the lack of a fast C/C++ hash table is skewing the picture.
EDIT: Never mind, Felix compiles to C++ and Pony looks like an academic project. If you want to provide benchmarks or try to change my mind, I am open to it, but it would require significant evidence.
> Please don't implement your own custom "hash table" - it will not be accepted.
> The work is to use the built-in or library hash table implementation to accumulate count values - lookup the count for a key and update the count in the hash table.
The C++ implementation is thereby testing an old version of a non-standard extension of libstdc++ that I had never heard of and which was likely contributed once by IBM and never really looked at again (by either maintainers or users ;P), while the C implementation is testing the specified khash library, which is apparently something a number of people actively contribute to and attempt to optimize, giving it some notoriety.
If I were to do this in C++, and I wasn't allowed to use my hash table, I would almost certainly not be using __gnu_pbds::cc_hash_table<>. If I were to just want to use something "included", I would use the C++11 std::unordered_map<> (note that this code is compiled already as C++11). But we all know the STL is designed for flexibility and predictability, not performance, and the culture of C++ is "that's OK, as if you actually care about performance no off-the-shelf data structure is going to be correct". If I decided I wanted speed, I know I'd want to check out Folly, and I might even end up using khash.
Reading other comments, what happened here is the Rust version is now using some "experimental" hash table based on ongoing work inspired by the Python 3.6 dict implementation. This is just not a useful benchmark. What we are benchmarking is "how maintained is the implementation's built-in hash table and is it tunable for this particular workload".
That's why you should not be surprised to see Java doing so well: the code actually being written here is just some glue... your programming language has to be incompetent to do poorly at this benchmark (especially as many commenters here are using a "within a power of 2" rule of thumb). There are even multiple listings for the Java one, and the one that is faster is using it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap?!?
What we really should be asking here is: why is any language doing "poorly" in this benchmark? It just isn't surprising that Rust is competitive with C/C++, nor is it surprising that Java is also; what is surprising is that Swift, Go, Haskell, and C# are not, and so I bet the issue is something (such as "is allocating memory for a thing which is not required") that can be trivially fixed for each (though by the rules of engagement, it might... or might not :/ as Java "cheated", right? ;P... require a minor fix upstream).
I mean, since the "work" explicitly is not "write a hash table using nothing but primitives from this language", there is no particular reason why Perl and Python (which is using a non-destructive array .replace, which is likely brutal... again: I bet this is almost always a benchmark of ancillary memory allocations) should be doing as poorly as they are: if we all made "optimize for this benchmark" a top priority for a weekend hackathon, I bet we could get every open source language to nail this under 25s. But do we care?
http://benchmarksgame.alioth.debian.org/play.html#contribute
I am not lamenting that someone should spend more time on this: I am lamenting that tons of people seem to care about it at all. It is not a "microbenchmark" (as some are calling it), and I think the main lesson we can learn from it is "there is something subtly wrong, either with the implementation that was contributed for this benchmark, the language's runtime, or its standard library", as given these rules we really should expect every language to perform similarly.
And so, if we all cared about this benchmark, I bet we could figure out what is going on and get every open source language down under 25s. Past that point, I think the rules are such that this isn't even a fun game much less a useful metric of anything worth measuring, and you are probably wasting your time contributing. I guess, to make this subthread go somewhere: why do you disagree?
So thanks for that!
That's just a library that provides data structures without boxing of int, double, etc. in Java. This eliminates a lot of overhead.
impl Hasher for NaiveHasher {
fn finish(&self) -> u64 {
self.0
}
fn write(&mut self, _: &[u8]) {
unimplemented!()
}
fn write_u64(&mut self, i: u64) {
self.0 = i ^ i >> 7;
}
}
It implements a very simple hash function for 64-bit numbers: each hashed number n overwrites the state with n xor (n >> 7), and finish() returns the last written state.
This seems to work okay since the hashed structure "Code" contains exactly one 64-bit field.
I don't know if it's fishy, but it's certainly very custom. For a generic hashing implementation you'd at least want to mix in the previous state. I assume it was done more for brevity than performance though.
// Define a custom hash function to use instead of khash's default hash
// function. This custom hash function uses a simpler bit shift and XOR which
// results in several percent faster performance compared to when khash's
// default hash function is used.
#define CUSTOM_HASH_FUNCTION(key) (khint32_t)((key) ^ (key)>>7)
https://doc.rust-lang.org/std/collections/struct.HashMap.htm...
Though that's also an open addressed map.
Do the other programs use that same hash table algorithm?
http://www.sebastiansylvan.com/post/robin-hood-hashing-shoul...
I think an adaptive hash that switches from fast to secure when collisions are detected would make a better default choice (at the cost of some implementation complexity).
Or possibly even an implementation with log(n) worst case complexity.
Also, look how ginormous the binaries are for all the VM languages. I kinda would have thought it would be the opposite, what with not needing to link in as much runtime.
http://science.raphael.poss.name/rust-for-functional-program...
This post was a total revelation to me, even if I assume it's well known in the community, because it is very credible on the Sophie's Choice issue of mathematical purity versus acknowledgement of the reality of the instruction pointer-based imperative machine that exists underneath.
I've looked at other languages that "do" multiprocessing recently. Go is great, but it's essentially about getting large teams of variable-skill people to work together well. It's not an inspiring language, whereas Rust clearly is. Erlang (via Elixir) is very interesting, but the actor model will never be as performant in reality as the shared memory architecture. Julia is just a modern interpretation of matlab. A number of the JVM languages are great, but the "culture" of that ecosystem will always be corporate.
This is why I believe the science crowd could really gravitate to Rust, because it may have the ability, like the functional crowd, to satisfy the "search for beauty" aspect which motivates many academics, scientists, and indeed, programmers, all the while staying just the right side of pragmatism. And clearly targeting "where the puck is going" on massively multicore hardware.
The trial-and-error nature of scientific/data-science discovery inevitably requires a REPL. If I had the right compiler/interpreter skills I would gladly contribute to making a Rust REPL happen. Unfortunately I don't. As it stands, all I can say is that if the REPL happens, I would be eager to contribute to the Rust scientific ecosystem with great motivation and pleasure.
This was my favourite feature of the benchmarks game.
In this situation, it's helpful to slow down, look at the source code, and think about what's being compared…
#pragma omp parallel sections