More like the "BTRFS for key-value stores" ;)
Kidding aside, I dislike when new unproven software claims the name of industry standards like this. When I saw the headline, I was hoping this somehow leveraged ZFS's actual storage layer, but it's just a new database that thinks copy-on-write is cool.
I implemented pretty much the same trade-off set in an authenticated storage system:
single writer, radix merkle tree, persistent storage, hashed keys, proofs.
I guess it is a local maximum within that trade-off space.
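For concreteness, here is a minimal sketch of that trade-off set (my own illustration, not the parent's actual system): keys are hashed, the hash bits index a binary radix tree, and inserts are copy-on-write, so every old root remains a readable snapshot:

```python
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

EMPTY = b"\x00" * 32  # placeholder hash for a missing child

class Leaf:
    def __init__(self, key_hash, value):
        self.key_hash, self.value = key_hash, value
        self.hash = h(b"leaf", key_hash, value)

class Branch:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.hash = h(b"branch",
                      left.hash if left else EMPTY,
                      right.hash if right else EMPTY)

def bit(key_hash, depth):
    return (key_hash[depth // 8] >> (7 - depth % 8)) & 1

def insert(node, key_hash, value, depth=0):
    """Persistent (copy-on-write) insert: returns a new root and never
    mutates old nodes, so every prior root is still a valid snapshot."""
    if node is None:
        return Leaf(key_hash, value)
    if isinstance(node, Leaf):
        if node.key_hash == key_hash:
            return Leaf(key_hash, value)            # overwrite same key
        node = (Branch(node, None) if bit(node.key_hash, depth) == 0
                else Branch(None, node))            # push old leaf down
    if bit(key_hash, depth) == 0:
        return Branch(insert(node.left, key_hash, value, depth + 1), node.right)
    return Branch(node.left, insert(node.right, key_hash, value, depth + 1))

# Hashing keys first balances the tree and makes the root hash a
# canonical commitment, independent of insertion order:
v1 = insert(None, h(b"alice"), b"100")
v2 = insert(v1, h(b"bob"), b"200")   # v1 is untouched; subtrees are shared
```

The single-writer constraint falls out naturally: whoever produces the new root publishes it, and readers can hold any old root for as long as they like.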
I like how time travelling/history is always touted as a feature (which it is), but it really just means the garbage-collection/pruning part of the transaction engine is missing. Postgres and other MVCC systems could all be doing this, but they don't. The hard part of the feature is being able to turn it off.
I'll probably have a look around later, the diffing looks interesting, not sure yet if it's done using the merkle tree (likely) or some commit walking algorithm.
Postgres actually did tout it as a feature in "THE IMPLEMENTATION OF POSTGRES" by Michael Stonebraker, Lawrence A. Rowe and Michael Hirohama[1] search for "time travel" in the PDF. I added the relevant quotes below for easier access ;)
This was back when PostgreSQL had the postquel language (before SQL was added); there was special syntax to access data at specific points in time:
> The second benefit of a no-overwrite storage manager is the possibility of time travel. As noted earlier, a user can ask a historical query and POSTGRES will automatically return information from the record valid at the correct time.
Quoting the paper again:
> For example to find the salary of Sam at time T one would query:
    retrieve (EMP.salary)
    using EMP [T]
    where EMP.name = "Sam"
> POSTGRES will automatically find the version of Sam’s record valid at the correct time and get the appropriate salary.

My use-case is a system that serves as an OLAP data warehouse of representations of how another system’s state looked at various points in history. You’d open a handle against the store, passing in a snapshot version, and then do OLAP queries against that snapshot.
Things that make this a hard problem: The dataset is too large to just store the versions as independent copies; so it really needs some level of data-sharing between the snapshots. But it also needs to be fast for reads, especially whole-bucket reads—it’s an OLAP data warehouse. Merkle-tree-based designs really suck for doing indexed table scans.
But, things that can be traded off: there’d only need to be one (trusted) writer, who would just be batch-inserting new snapshots generated by reducing over a CQRS/ES event stream. It’d be that (out-of-band) event stream that’d be the canonical, integrity-verified, etc. representation for all this data. These CQRS state-aggregate snapshots would just be a cache. If the whole thing got corrupted, I could just throw it all away and regenerate it from the CQRS/ES event stream; or, hopefully, “rewind” the database back to the last-known-good commit (i.e. purge all snapshots above that one) and then regenerate only the rest from the event stream.
I’m not personally aware of anything that targets exactly this use case. I’m working on something for it myself right now.
Two avenues I’m looking into:
• something that acts like a hybrid between LMDB and btrfs (i.e. a B-tree with copy-on-write ref-counted pages shared between snapshots, where those snapshots appear as B-tree nodes themselves)
• “keyframe” snapshots as regular independent B-trees, maybe relying on L2ARC-like block-level dedup between them; “interstitial” snapshots as on-disk HAMT ‘overlays’ of the last keyframe B-tree, that share nodes with other on-disk HAMTs, but only within their “generation” (i.e. up to the next keyframe), such that they can all be rewritten/compacted/finalized once the next keyframe arrives, or maybe even converted into “B-frames” that have forward-references to data embedded in the next keyframe.
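A toy sketch of the first avenue, under hypothetical names (Store/alloc/retain/release): snapshots are just retained root pages, structure is shared via refcounts, and purging any snapshot costs only the pages unique to it, with no mark-and-sweep pass:

```python
import itertools

class Store:
    """Toy page store: immutable pages with refcounts (hypothetical API).

    A snapshot is just a retained root page id; subtrees shared between
    snapshots stay alive until the last root referencing them is released."""

    def __init__(self):
        self.pages = {}               # page_id -> (payload, child_ids)
        self.refs = {}                # page_id -> refcount
        self._ids = itertools.count()

    def alloc(self, payload, children=()):
        pid = next(self._ids)
        self.pages[pid] = (payload, tuple(children))
        self.refs[pid] = 0            # unreferenced until a parent or retain()
        for c in children:
            self.refs[c] += 1
        return pid

    def retain(self, pid):
        self.refs[pid] += 1           # pin a snapshot root

    def release(self, pid):
        """Purge costs O(pages unique to this snapshot), not O(database)."""
        self.refs[pid] -= 1
        if self.refs[pid] == 0:
            _, children = self.pages.pop(pid)
            del self.refs[pid]
            for c in children:
                self.release(c)

# Snapshot v1 is a 3-page tree; v2 copies the root, replaces one leaf,
# and shares the other leaf with v1 (copy-on-write path copying):
s = Store()
leaf_a = s.alloc(b"A"); leaf_b = s.alloc(b"B")
v1 = s.alloc(b"root", [leaf_a, leaf_b]); s.retain(v1)
leaf_b2 = s.alloc(b"B'")
v2 = s.alloc(b"root", [leaf_a, leaf_b2]); s.retain(v2)
s.release(v1)   # drops v1's root and leaf_b; shared leaf_a survives
```

Note this is exactly why purging arbitrary snapshots is cheap here: releasing a root only ever touches pages reachable solely through that root.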
LevelDB / RocksDB (and related) may be close, but not sure about MVCC aspects (see https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/)
Except, unlike git, this database wouldn’t need to be able to create new commits off of anywhere but the HEAD; and also wouldn’t need to be able to have more than one in-progress write transaction open at a time. No need for MVCC at all; and no need for a DAG. The “refs” would always just be a dense linear sequence.
Also, unlike git (or a cryptographically-verified / append-only store), there’s no need to keep around “deleted” snapshots. It would actually be a huge benefit to be able to purge arbitrary snapshots from the database, without needing to do a mark-and-sweep liveness pass to write out a new copy of the database store.
The key constraint that differentiates this from e.g. a Kafka Streams-based CQRS/ES aggregate, is that you should be able to reopen and work with any historical database version instantly, with equal-amortized-cost lookups from any version, without needing to first do O(N) work to “replay” from the beginning of time to the snapshot, or to “rewind” from the HEAD to the snapshot. This system would need all snapshots to be equally “hot” / “online” for query access, not just the newest one.
In other words, such a database should work just like filesystem snapshots do in a copy-on-write filesystem.
Under the hood they are based on building immutable layers of data that are implicitly merged. Clones that share data are cheap (check out rocksdb-cloud for a rocksdb fork which adds this copy-on-write-esque snapshotting). When overwriting a key, the old value will get lazily garbage-collected, but there are ways around that.
I haven't explored it for this use case, but it seems like it might work...
If you used a stored procedure to compute the range that becomes the view, then all you need to store are the parameters to feed to the stored procedure again, and those parameters could themselves be stored in a separate table.
Given just 100M keys (let’s call it a 20GB exported snapshot size), and 1M versions, that’s an overwhelming amount of data — and 99.9999% of it is redundant copies of the same information, i.e. the stuff that didn’t change between versions.
Solving the problem of the concurrent materializations requiring petabytes of storage for almost-entirely-redundant heap tuples, is essentially solving the problem of creating a tuple-deduplicating DBMS storage engine — which is equivalent to the problem of building a versioned embedded database :)
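A quick back-of-envelope check of those numbers, using the 20GB snapshot size and the 99.9999% redundancy figure from above:

```python
keys = 100_000_000                            # ~200 bytes per key, for context
snapshot_bytes = 20 * 10**9                   # ~20 GB per materialized snapshot
versions = 1_000_000

# Every version stored as an independent copy:
naive = snapshot_bytes * versions             # 2e16 bytes
print(naive // 10**15, "PB")                  # 20 PB

# "99.9999% redundant" means ~1e-6 of each snapshot is new data:
delta = snapshot_bytes // 1_000_000           # ~20 KB of churn per version
shared = snapshot_bytes + versions * delta    # one full copy + all the deltas
print(shared // 10**9, "GB")                  # 40 GB with perfect sharing
```

So the gap between naive materialization and perfect structural sharing is roughly six orders of magnitude, which is the whole argument for a tuple-deduplicating engine.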
Supporting efficient range scans is hard though.
EDIT: yeah, OLAP will be hard.
Also,
> Superfast proof generation time of around 1000 proofs per second per core.

Does this limit in any way things like read/write performance or usability in general?
The cardinal rule of database development:
http://www.dbms2.com/2013/03/18/dbms-development-marklogic-h...
I truly like the project and I have a few things in mind that could make use of it already. (Heck, one of them is pretty much Graviton + a front end).
But I cannot just jump into it as there's some real money involved and no one wants to experiment with that.
I see a bright future for Graviton, once it becomes tested and stable in production environments.
So proof "generation" is just fetching the nodes and sending them to the client.
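Right: in a Merkle tree an inclusion proof is just the sibling hashes along the leaf-to-root path, and the verifier replays that path. A generic sketch (illustrative only, not Graviton's actual format):

```python
import hashlib

def h(b):
    return hashlib.sha256(b).digest()

def build(leaves):
    """All levels of a binary Merkle tree (leaf count padded to a power of two)."""
    level = [h(x) for x in leaves]
    while len(level) & (len(level) - 1):      # pad until power of two
        level.append(h(b""))
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Proof 'generation' really is just reading sibling nodes along the path."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])        # sibling at this level
        index //= 2
    return proof

def verify(root, leaf, index, proof):
    acc = h(leaf)
    for sibling in proof:
        acc = h(acc + sibling) if index % 2 == 0 else h(sibling + acc)
        index //= 2
    return acc == root
```

So the per-proof cost is dominated by node fetches (O(log n) of them), which is presumably where the ~1000 proofs/sec/core figure comes from rather than from any heavy computation.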
The post claims: "The features included in Graviton provide the missing functionality that prevented Stargate RC1 from reaching deployment on our mainnet."
I'm not sure, but I guess that this checksumming is relevant for storing the Merkle trees encoding the blockchain. I don't know why the previous choice of database wasn't suitable.
https://aws.amazon.com/about-aws/whats-new/2020/07/announcin...
For best results, run Graviton on a Graviton:
We can just call this: ad58cd9088995cfb528187b11c275dad60ce2ec5
And the chip: 59b54f61dd17c27744e884542e35b34172e2cc79
So easy!
Unfortunately I don't think this one is multiprocess safe.