I always like it when a company focuses on its real problems in job interviews and manages to avoid the brain-teaser trap. That way you get a feel for the work you would actually be doing there, and can see whether you really like both the work and the people.
It is closely aligned with what our core service does (distributing and streaming files), and it's a great chance to talk with the interviewee and figure out where their strengths are.
I totally encourage other people to interview this way. It's what I've done at the past few companies I've been at, and it's worked excellently: just think of a problem you're working on or have worked on, and distill it into an interview problem.
That said, I intend to publish some sort of performance comparison (code and results). The downsides of me doing it are: 1) I know the sparkey code much better than I know LevelDB or any other solution, so the tuning parameters will probably be suboptimal for the other solutions. 2) I will only focus on our specific use case (writing large bulks, doing lots of random reads), which may seem a bit unfair to the more general solutions.
The sparkey usage is fairly optimized, but I just threw something together for LevelDB, so consider the results extremely biased.
How does your test compare to http://symas.com/mdb/microbench/ ? If you're going to try to talk about numbers, talk about them in a meaningful context. Right now you're just handwaving.
On Monday I added some slightly more proper benchmark code; you can find it at https://github.com/spotify/sparkey/blob/master/src/bench.c
I didn't add the LevelDB code to this benchmark, however, since I 1) didn't want to manage that dependency and 2) didn't know how to write optimized code against it.
I'm using very small records: a couple of bytes each for key and value. The insert order is strictly increasing (key_0, key_1, ...), though that doesn't really matter for sparkey, since it uses a hash for lookups instead of ordered lists or trees.
As for the Symas MDB microbench, I only looked at it briefly, but it seems like it's not actually reading the value it's fetching, only doing the lookup of where the value is. Is that correct?
"MDB's zero-memcpy reads mean its read rate is essentially independent of the size of the data items being fetched; it is only affected by the total number of keys in the database."
Doing a lookup and not using the values seems like a very unrealistic use case.
Here's the part of the benchmark I'm referring to:

  for (int i = 0; i < reads_; i++) {
    const int k = rand_.Next() % reads_;
    key.mv_size = snprintf(ckey, sizeof(ckey), "%016d", k);
    mdb_cursor_get(cursor, &key, &data, MDB_SET);
    FinishedSingleOp();
  }
LevelDB, on the other hand, supports concurrent writes and provides features to handle data consistency and cheap gradual reindexing.
A quick glance over LevelDB's features gives me the impression that its bulk-write performance would not be sufficient.
What problem are you solving?
What hurdles did you face with existing solutions, and how did you overcome them with your custom one?
How do you compare from a performance standpoint? (Granted, they still need to do this, but at least put in a section about it.)
Not only that, but it provides lightning-fast conjunctive normal form queries, a.k.a. logical combinations of primitive keys. Plus it has Python and Erlang bindings.
The command-line argument processing is also quite haphazardly done; it's not as if using getopt or whatever would pose compatibility issues. And is writing and packaging with a Makefile really that difficult?
I think it's a miracle they produced something they feel comfortable sharing with the world. If you write a database in-house, and the toolchain and the argument processing are the only things done haphazardly, then hats off to you :)
I wrote something like this to heavily optimize disk seeks, by returning an 8-byte reference and keeping a hashtable in memory. It was a mostly-append-only record store that allowed mutations of the same key and rounded blob sizes up to a power of 2. It was written to optimize the storage layer for Membase.