https://github.com/senderista/hashtable-benchmarks/blob/mast...
https://github.com/senderista/hashtable-benchmarks/wiki/64-b...
...and since I've done a lot of work with Robin Hood on small-key lookups, I can point out some little tweaks that have made a big difference for me. I have 8-byte lookups at just over 3ns/lookup[0], albeit at a very low load factor, typically <50%. A key step was to use the maximum possible hash as a sentinel value, handling it specially in case it shows up in the data. This way, instead of probing until finding an empty bucket or greater hash, probing just finds the first slot that's greater than or equal to the requested key's hash. So the lookup code[1] is very simple (the rest, not so much). The while loop is only needed on a hash collision, so at a low load factor a lookup is effectively branchless. However, these choices are specialized for a batched search where the number of insertions never has to be higher than the number of searches, and all the insertions can be done first. And focused on small-ish (under a million entries) tables.
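Roughly, the sentinel trick looks like this (a Python sketch under my reading of the description above, not the actual CBQN code; the names, the toy hash, and the fixed-size/no-wraparound layout are all mine):

```python
EMPTY = 0xFFFFFFFF           # maximum 32-bit hash doubles as the empty-slot sentinel
SHIFT = 24                   # bucket index = top 8 bits, so probe order matches hash order
NBUCKETS = 1 << (32 - SHIFT)
OVERFLOW = 8                 # a few spill slots at the end instead of wrapping (sketch only)

def h32(key):
    # toy Fibonacci hash; real code must special-case a key that hashes to EMPTY
    return (key * 2654435761) & 0xFFFFFFFF

def make_table():
    return {"hashes": [EMPTY] * (NBUCKETS + OVERFLOW),
            "keys":   [None] * (NBUCKETS + OVERFLOW)}

def rh_insert(t, key, h):
    # Keep slots sorted by hash along the probe sequence: displace any entry
    # whose hash is >= ours and carry it forward (an insertion-sort shift).
    i = h >> SHIFT
    while True:
        if t["hashes"][i] >= h:
            t["hashes"][i], h = h, t["hashes"][i]
            t["keys"][i], key = key, t["keys"][i]
            if h == EMPTY:               # we displaced the sentinel: done
                return
        i += 1

def rh_lookup(t, key, h):
    # No empty-bucket test needed: empty slots hold the max hash, so we just
    # probe to the first slot whose hash is >= the requested key's hash.
    i = h >> SHIFT
    while t["hashes"][i] < h:
        i += 1
    while t["hashes"][i] == h:           # only entered on a hash collision
        if t["keys"][i] == key:
            return True
        i += 1
    return False
```

At low load factors the first loop almost never iterates, which is where the "effectively branchless" behavior comes from.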
[0] https://mlochbaum.github.io/bencharray/pages/search.html
[1] https://github.com/dzaima/CBQN/blob/5c7ab3f/src/singeli/src/...
A snapshot of my happiness after running first experiments with Robin Hood: https://twitter.com/jerrinot/status/1730147245285150743 :)
I made the initial suggestion to look into Robin Hood hashing when it was first posted on Reddit.
Glad to see it make its way into the repo!
Research results from the last five years show that Robin Hood hashing performs better than the other approaches under the right conditions. See this eval paper:
https://15721.courses.cs.cmu.edu/spring2023/papers/11-hashjo...
> For example, we could already start searching for elements at the slot with expected (average) displacement from their perfect slot and probe bidirectional from there. In practice, this is not very efficient due to high branch misprediction rates and/or unfriendly access pattern.
I think this indicates a regular Robin Hood insertion and modified search, which doesn't sound that similar to Amble and Knuth's method. And anyway the relative costs of mispredictions and cache misses vary wildly based on workload (the paper studies 8-byte keys only). The paper also doesn't present Robin Hood as a clear winner, which is how I interpreted your comment. It's shown as one of five suggestions in the decision graph at the end, and only recommended for load factors between 50% and 80%, among other conditions.
Edit: And the paper is from 2015, not the last five years. Is this the right link?
Can highly recommend his personal blog as well: https://puzpuzpuz.dev/
"Imagine that we run this query over a few hundred million rows. This means at least a few hundred million hash table operations. As you might imagine, a slow hash table would make for a slower query. A faster hash table? Faster queries!"
I'll read the article properly after this (this is just from a quick skim), but I can't see how this quote can be correct. Unless I'm missing something, the hash function is fast compared to bouncing randomly around in RAM; random memory accesses are very much slower. So I can't see how a faster hash table would make a difference.
Okay, I'll read the article now…
Edit:
"If you insert "John" and then "Jane" string keys into a FastMap, then that would later become the iteration order. While it doesn't sound like a big deal for most applications, this guarantee is important in the database world.
If the underlying table data or index-based access returns sorted data, then we may want to keep the order to avoid having to sort the result set. This is helpful in case of a query with an ORDER BY clause. Performance-wise, direct iteration over the heap is also beneficial as it means sequential memory access."
but "...if the underlying table data or index-based access returns sorted data..." Then you've got sorted data, in which case use a merge join instead of a hash join surely.
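The quoted design (a dense append-only entry array with a hash slot table holding only offsets) can be sketched like this; purely illustrative, with my own names and no resizing, so it's not QuestDB's actual FastMap:

```python
class OrderedOpenMap:
    """Map storing entries in a dense append-only array ("heap"); the hash
    slots hold only offsets into it, so iteration is a sequential scan in
    insertion order. Sketch only: fixed capacity, no resize, no deletion."""
    EMPTY = -1

    def __init__(self, nslots=64):
        self.slots = [self.EMPTY] * nslots    # slot -> offset into entries
        self.entries = []                     # dense (key, value) pairs
        self.mask = nslots - 1

    def _find_slot(self, key):
        i = hash(key) & self.mask
        # linear probing until a free slot or a slot holding this key
        while self.slots[i] != self.EMPTY and self.entries[self.slots[i]][0] != key:
            i = (i + 1) & self.mask
        return i

    def put(self, key, value):
        i = self._find_slot(key)
        if self.slots[i] == self.EMPTY:
            self.slots[i] = len(self.entries)
            self.entries.append((key, value))
        else:                                 # overwrite keeps original position
            self.entries[self.slots[i]] = (key, value)

    def get(self, key):
        i = self._find_slot(key)
        return None if self.slots[i] == self.EMPTY else self.entries[self.slots[i]][1]

    def __iter__(self):                       # insertion order, sequential memory
        return iter(self.entries)
```

Inserting "John" then "Jane" makes that the iteration order, and iterating touches only the dense array, never the slot table.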
In a GROUP BY, you may have a few hundred million rows but only a few hundred groups within them. A slow hash function would slow things down dramatically in that case, since the hash table remains small and data access is potentially linear.
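To make the point concrete, here's a trivial hash-aggregation sketch (my own toy example, not QuestDB code): the table stays tiny and cache-resident, so the per-row hash and probe cost dominates the whole aggregation.

```python
def group_by_sum(rows):
    # Hash aggregation: one table operation per row, but only as many
    # entries as there are distinct groups, so the table stays cache-sized.
    sums = {}
    for group, value in rows:
        sums[group] = sums.get(group, 0) + value
    return sums
```

With a few hundred million rows and a few hundred groups, the per-operation cost of the table is paid a few hundred million times while the table itself never leaves cache.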
> Then you've got sorted data, in which case use a merge join instead of a hash join surely.
This property is beneficial for GROUP BY which includes a timestamp or a function over timestamp. QuestDB organizes data sorted by time, so relying on insertion order may help to avoid redundant sorting if there is an ORDER BY clause with the timestamp column.
As for merge join, we also use it in ASOF join: https://questdb.io/docs/reference/sql/join/#asof-join
ISWYM although that is rather a specific case. For your purposes though it may be a common case, I don't know.
> QuestDB organizes data sorted by time, so relying on insertion order may help to avoid redundant sorting if there is an ORDER BY clause with the timestamp column.
If data is already sorted and you have an 'order by' then just use the data directly – bingo, instant merge join, no hash table needed.
Especially considering they use unsafe "heavily": for big joins they could easily call out to some native code if the surrounding code reaaaaally must be Java (again, why?). Using unsafe Java is the worst of both worlds: you don't get native speed, there's loads of memory overhead from everything being an Object (besides the rest of the VM machinery), you get to "enjoy" GC pauses in the middle of your hot loops, and you have fewer safety guarantees than something like Rust.
https://questdb.io/blog/leveraging-rust-in-our-high-performa...
Would be interesting to see benchmarks here, do you know if there are any public ones available, specific to QuestDB's interop code?
I couldn't even try to count the number of great posts I've read about fast hash tables from e.g. Paul Khuong alone...