(My understanding is that the GP wrote LMDB, works on openLDAP, and was a maintainer for BerkelyDB for a number of years. But even if he'd only written 'hello, world!' I'm much more interested in the specific arguments).
Andy and I have had this debate going for a long time already.
Isn't LMBD closer to an embedded key-value store than an RDBMS, though? Also there's a section in the paper that mentions it's single-writer.
Interestingly, most of the reason for these problems has to do with theoretical limitations of cache replacement algorithms as drivers of I/O scheduling. There are alternative approaches to scheduling I/O that work much better in these cases but mmap() can’t express them, so in those cases bypassing mmap() offers large gains.
<<<There is one significant drawback that should not be understated. Algorithm design using topology manipulation can be enormously challenging to reason about. You are often taking a conceptually simple algorithm, like a nested loop or hash join, and replacing it with a much more efficient algorithm involving the non-trivial manipulation of complex high-dimensionality constraint spaces that effect the same result. Routinely reasoning about complex object relationships in greater than three dimensions, and constructing correct parallel algorithms that exploit them, becomes easier but never easy.>>>
- Queries can trigger blocking page faults when accessing (transparently) evicted pages, causing unexpected I/O stalls
- mmap() complicates transactionality and error-handling
- Page table contention, single-threaded page eviction, and TLB shootdowns become bottlenecks
2 - complexity? this is simply false. LMDB's ACID txns using MVCC are much simpler than any "traditional" approach.
3 - contention is a red herring since this approach is already single-writer, as is common for most embedded k/v stores these days. You lose more perf by trying to make the write path multi-threaded, in lock contention and cache thrashing.
Excuse me for a silly question, but whilst an I/O stall may be unavoidable, wouldn't a thread stall be avoidable if you're not using mmap?
Assuming that you're not swapping, you'll generally know if you've loaded something into memory or not, whilst mmap doesn't help you know if the relevant page is cached. If the data isn't in memory, you can send the I/O request to a thread to retrieve it, and the initiating thread can then move onto the next connection. I suspect this isn't doable under mmap based access?
We shouldn't apply a higher bar to the counterargument than we applied to the argument in the first place.