More like the "BTRFS for key-value stores" ;)
Kidding aside, I dislike when new unproven software claims the name of industry standards like this. When I saw the headline, I was hoping this somehow leveraged ZFS's actual storage layer, but it's just a new database that thinks copy-on-write is cool.
I implemented pretty much the same trade-off set in an authenticated storage system:
single writer, radix merkle tree, persistent storage, hashed keys, proofs.
I guess it is a local maximum within that trade-off space.
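For concreteness, here is a minimal sketch of that trade-off set (my own illustration, not the parent's actual system): keys are hashed, the hash bits index a binary radix tree, and inserts are copy-on-write, so every old root remains a readable snapshot:

```python
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

EMPTY = b"\x00" * 32  # placeholder hash for a missing child

class Leaf:
    def __init__(self, key_hash, value):
        self.key_hash, self.value = key_hash, value
        self.hash = h(b"leaf", key_hash, value)

class Branch:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.hash = h(b"branch",
                      left.hash if left else EMPTY,
                      right.hash if right else EMPTY)

def bit(key_hash, depth):
    return (key_hash[depth // 8] >> (7 - depth % 8)) & 1

def insert(node, key_hash, value, depth=0):
    """Persistent (copy-on-write) insert: returns a new root and never
    mutates old nodes, so every prior root is still a valid snapshot."""
    if node is None:
        return Leaf(key_hash, value)
    if isinstance(node, Leaf):
        if node.key_hash == key_hash:
            return Leaf(key_hash, value)            # overwrite same key
        node = (Branch(node, None) if bit(node.key_hash, depth) == 0
                else Branch(None, node))            # push old leaf down
    if bit(key_hash, depth) == 0:
        return Branch(insert(node.left, key_hash, value, depth + 1), node.right)
    return Branch(node.left, insert(node.right, key_hash, value, depth + 1))

# Hashing keys first balances the tree and makes the root hash a
# canonical commitment, independent of insertion order:
v1 = insert(None, h(b"alice"), b"100")
v2 = insert(v1, h(b"bob"), b"200")   # v1 is untouched; subtrees are shared
```

The single-writer constraint falls out naturally: whoever produces the new root publishes it, and readers can hold any old root for as long as they like.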
I like how time travelling/history is always touted as a feature (which it is), but it really just means the garbage-collection/pruning part of the transaction engine is missing. Postgres and other MVCC systems could all be doing this, but they don't. The hard part of the feature is being able to turn it off.
I'll probably have a look around later, the diffing looks interesting, not sure yet if it's done using the merkle tree (likely) or some commit walking algorithm.
Postgres actually did tout it as a feature in "THE IMPLEMENTATION OF POSTGRES" by Michael Stonebraker, Lawrence A. Rowe and Michael Hirohama[1] search for "time travel" in the PDF. I added the relevant quotes below for easier access ;)
This was back when PostgreSQL had the postquel language (before SQL was added); there was special syntax to access data at specific points in time:
> The second benefit of a no-overwrite storage manager is the possibility of time travel. As noted earlier, a user can ask a historical query and POSTGRES will automatically return information from the record valid at the correct time.
Quoting the paper again:
> For example to find the salary of Sam at time T one would query:
    retrieve (EMP.salary)
    using EMP [T]
    where EMP.name = "Sam"
> POSTGRES will automatically find the version of Sam’s record valid at the correct time and get the appropriate salary.

My use-case is a system that serves as an OLAP data warehouse of representations of how another system’s state looked at various points in history. You’d open a handle against the store, passing in a snapshot version, and then do OLAP queries against that snapshot.
Things that make this a hard problem: The dataset is too large to just store the versions as independent copies; so it really needs some level of data-sharing between the snapshots. But it also needs to be fast for reads, especially whole-bucket reads—it’s an OLAP data warehouse. Merkle-tree-based designs really suck for doing indexed table scans.
But, things that can be traded off: there’d only need to be one (trusted) writer, who would just be batch-inserting new snapshots generated by reducing over a CQRS/ES event stream. It’d be that (out-of-band) event stream that’d be the canonical, integrity-verified, etc. representation for all this data. These CQRS state-aggregate snapshots would just be a cache. If the whole thing got corrupted, I could just throw it all away and regenerate it from the CQRS/ES event stream; or, hopefully, “rewind” the database back to the last-known-good commit (i.e. purge all snapshots above that one) and then regenerate only the rest from the event stream.
I’m not personally aware of anything that targets exactly this use case. I’m working on something for it myself right now.
Two avenues I’m looking into:
• something that acts like a hybrid between LMDB and btrfs (i.e. a B-tree with copy-on-write ref-counted pages shared between snapshots, where those snapshots appear as B-tree nodes themselves)
• “keyframe” snapshots as regular independent B-trees, maybe relying on L2ARC-like block-level dedup between them; “interstitial” snapshots as on-disk HAMT ‘overlays’ of the last keyframe B-tree, that share nodes with other on-disk HAMTs, but only within their “generation” (i.e. up to the next keyframe), such that they can all be rewritten/compacted/finalized once the next keyframe arrives, or maybe even converted into “B-frames” that have forward-references to data embedded in the next keyframe.
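A toy sketch of the first avenue, under hypothetical names (Store/alloc/retain/release): snapshots are just retained root pages, structure is shared via refcounts, and purging any snapshot costs only the pages unique to it, with no mark-and-sweep pass:

```python
import itertools

class Store:
    """Toy page store: immutable pages with refcounts (hypothetical API).

    A snapshot is just a retained root page id; subtrees shared between
    snapshots stay alive until the last root referencing them is released."""

    def __init__(self):
        self.pages = {}               # page_id -> (payload, child_ids)
        self.refs = {}                # page_id -> refcount
        self._ids = itertools.count()

    def alloc(self, payload, children=()):
        pid = next(self._ids)
        self.pages[pid] = (payload, tuple(children))
        self.refs[pid] = 0            # unreferenced until a parent or retain()
        for c in children:
            self.refs[c] += 1
        return pid

    def retain(self, pid):
        self.refs[pid] += 1           # pin a snapshot root

    def release(self, pid):
        """Purge costs O(pages unique to this snapshot), not O(database)."""
        self.refs[pid] -= 1
        if self.refs[pid] == 0:
            _, children = self.pages.pop(pid)
            del self.refs[pid]
            for c in children:
                self.release(c)

# Snapshot v1 is a 3-page tree; v2 copies the root, replaces one leaf,
# and shares the other leaf with v1 (copy-on-write path copying):
s = Store()
leaf_a = s.alloc(b"A"); leaf_b = s.alloc(b"B")
v1 = s.alloc(b"root", [leaf_a, leaf_b]); s.retain(v1)
leaf_b2 = s.alloc(b"B'")
v2 = s.alloc(b"root", [leaf_a, leaf_b2]); s.retain(v2)
s.release(v1)   # drops v1's root and leaf_b; shared leaf_a survives
```

Note this is exactly why purging arbitrary snapshots is cheap here: releasing a root only ever touches pages reachable solely through that root.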
LevelDB / RocksDB (and related) may be close, but not sure about MVCC aspects (see https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/)
Except, unlike git, this database wouldn’t need to be able to create new commits off of anywhere but the HEAD; and also wouldn’t need to be able to have more than one in-progress write transaction open at a time. No need for MVCC at all; and no need for a DAG. The “refs” would always just be a dense linear sequence.
Also, unlike git (or a cryptographically-verified / append-only store), there’s no need to keep around “deleted” snapshots. It would actually be a huge benefit to be able to purge arbitrary snapshots from the database, without needing to do a mark-and-sweep liveness pass to write out a new copy of the database store.
The key constraint that differentiates this from e.g. a Kafka Streams-based CQRS/ES aggregate, is that you should be able to reopen and work with any historical database version instantly, with equal-amortized-cost lookups from any version, without needing to first do O(N) work to “replay” from the beginning of time to the snapshot, or to “rewind” from the HEAD to the snapshot. This system would need all snapshots to be equally “hot” / “online” for query access, not just the newest one.
In other words, such a database should work just like filesystem snapshots do in a copy-on-write filesystem.
Under the hood they are based on building immutable layers of data that are implicitly merged. Clones that share data are cheap (check out rocksdb-cloud for a rocksdb fork which adds this copy-on-write-esque snapshotting). When overwriting a key, the old value will get lazily garbage-collected, but there are ways around that.
I haven't explored it for this use case, but it seems like it might work...
If you used a stored procedure to compute the range that becomes the view, then all you need to store are the parameters to feed to the stored procedure again, and those parameters could themselves be stored in a separate table.
Given just 100M keys (let’s call it a 20GB exported snapshot size), and 1M versions, that’s an overwhelming amount of data — and 99.9999% of it is redundant copies of the same information, i.e. the stuff that didn’t change between versions.
Solving the problem of the concurrent materializations requiring petabytes of storage for almost-entirely-redundant heap tuples, is essentially solving the problem of creating a tuple-deduplicating DBMS storage engine — which is equivalent to the problem of building a versioned embedded database :)
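A quick back-of-envelope check of those numbers, using the 20GB snapshot size and the 99.9999% redundancy figure from above:

```python
keys = 100_000_000                            # ~200 bytes per key, for context
snapshot_bytes = 20 * 10**9                   # ~20 GB per materialized snapshot
versions = 1_000_000

# Every version stored as an independent copy:
naive = snapshot_bytes * versions             # 2e16 bytes
print(naive // 10**15, "PB")                  # 20 PB

# "99.9999% redundant" means ~1e-6 of each snapshot is new data:
delta = snapshot_bytes // 1_000_000           # ~20 KB of churn per version
shared = snapshot_bytes + versions * delta    # one full copy + all the deltas
print(shared // 10**9, "GB")                  # 40 GB with perfect sharing
```

So the gap between naive materialization and perfect structural sharing is roughly six orders of magnitude, which is the whole argument for a tuple-deduplicating engine.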
Supporting efficient range scans is hard though.
EDIT: yeah, OLAP will be hard.
Also,
> Superfast proof generation time of around 1000 proofs per second per core.

Does this limit in any way things like read/write performance or usability in general?
The cardinal rule of database development:
http://www.dbms2.com/2013/03/18/dbms-development-marklogic-h...
I truly like the project and I have a few things in mind that could make use of it already. (Heck, one of them is pretty much Graviton + a front end).
But I cannot just jump into it as there's some real money involved and no one wants to experiment with that.
I see a bright future for Graviton, once it becomes tested and stable in production environments.
So proof "generation" is just fetching the nodes and sending them to the client.
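Right: in a Merkle tree an inclusion proof is just the sibling hashes along the leaf-to-root path, and the verifier replays that path. A generic sketch (illustrative only, not Graviton's actual format):

```python
import hashlib

def h(b):
    return hashlib.sha256(b).digest()

def build(leaves):
    """All levels of a binary Merkle tree (leaf count padded to a power of two)."""
    level = [h(x) for x in leaves]
    while len(level) & (len(level) - 1):      # pad until power of two
        level.append(h(b""))
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Proof 'generation' really is just reading sibling nodes along the path."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])        # sibling at this level
        index //= 2
    return proof

def verify(root, leaf, index, proof):
    acc = h(leaf)
    for sibling in proof:
        acc = h(acc + sibling) if index % 2 == 0 else h(sibling + acc)
        index //= 2
    return acc == root
```

So the per-proof cost is dominated by node fetches (O(log n) of them), which is presumably where the ~1000 proofs/sec/core figure comes from rather than from any heavy computation.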
The post claims: "The features included in Graviton provide the missing functionality that prevented Stargate RC1 from reaching deployment on our mainnet."
I'm not sure, but I guess that this checksumming is relevant for storing the Merkle trees encoding the blockchain. I don't know why the previous choice of database wasn't suitable.
https://aws.amazon.com/about-aws/whats-new/2020/07/announcin...
For best results, run Graviton on a Graviton:
We can just call this: ad58cd9088995cfb528187b11c275dad60ce2ec5
And the chip: 59b54f61dd17c27744e884542e35b34172e2cc79
So easy!
Unfortunately I don't think this one is multiprocess safe.