http://web.archive.org/web/20150325003241/http://blog.founda...
I'm not trying to make a comparison between a system I used to work on and one that I frankly know little to nothing about; rather, I'd suggest that building a system like this just isn't enough to be compelling on its own.
As a business it was always an ambitious effort, and I'm not sure what could or should have been done differently. But since then I've used a number of other systems and thought to myself "boy, I wish I had FDB right now."
You're right that distributed consistency is a beginning, not an end. We are painfully aware of all the startups that have died or are dying on this beach.
It's great to be scalable and consistent, but you have to be more than an operationally-better replacement for legacy SQL. That's one reason we built our own query language that plays to modern application development patterns (serverless, functional, change feeds, etc.) instead of the typical slow, never-quite-there, distributed SQL planner.
On mobile so cannot read article
You're doing 3k batches per second with 4 logical writes each, right? So that is at most 3k-12k writes per second, counted the way every other distributed database benchmark and paper counts.
Or otherwise - if you continue counting writes in this special/misleading way - you'd have to multiply every other distributed db benchmark's performance numbers by a factor of 3-15x to get an apples-to-apples comparison.
The 12k batched writes/sec through what I assume is a Paxos variant is still pretty impressive, though! Good to get more competition/alternatives for ZooKeeper & friends!
If you want to include write amplification, then multiply by 6x again to account for the replicated log and the tables themselves.
Counting any kind of "internal write effects" that result from a user write (i.e. write amplification) is obviously done to make the benchmark look better, and it does not make the numbers comparable to key-value stores.
12k writes/s is the number of rows that are written from a user's perspective, so 12k/s is also the number you have to use when comparing it to key-value stores. Of course, comparing Fauna with eventually consistent systems is not really a fair comparison in the first place - but you don't make it fairer by misleading in your benchmark.
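To make the two counting conventions concrete, here's the arithmetic as a small Python sketch. The 3k batches/sec, 4 rows/batch, and ~6x amplification (replicated log + tables) figures are the ones quoted in this thread; the benchmark's own internal breakdown may differ:

```python
# Figures taken from the thread; illustrative only.
batches_per_sec = 3_000        # committed transaction batches per second
rows_per_batch = 4             # logical row writes per batch

# How every other distributed-database benchmark counts:
logical_writes = batches_per_sec * rows_per_batch
print(logical_writes)          # 12000 user-visible row writes/sec

# Counting internal write effects (write amplification) on top of that,
# using the ~6x factor mentioned above for the replicated log + tables:
amplification = 6
physical_writes = logical_writes * amplification
print(physical_writes)        # 72000 storage-level writes/sec
```

Same workload, two very different headline numbers, which is exactly why the counting convention has to be stated up front.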
Also, just because some other vendor posted a misleading benchmark on HN (I don't know whether they did), that doesn't make it right or mean you should do it too. Just call them out on it as well.
120,000 writes per second is accurate, talking about actual durable storage (disk) writes. But it's only 3,330 transactions, which should be the number that a user cares about.
I don't have proper data and I'm a bit rusty, but I feel like Cassandra could blow that away if you set similar consistency requirements on the client side (QUORUM on read, same for write?). Am I understanding this correctly, or does Fauna/Calvin give you something functionally better than what C* can do?
YMMV, but we've found the performance of Cassandra writing out similar-sized multi-row atomic batches at QUORUM to be similar in this hardware configuration.
FaunaDB transactions are quite a bit more powerful, as they can span multiple keys, use conditionals and read-modify-write logic, and still resolve with serializable semantics.
The 1M "TPS" you're referring to is a read-only benchmark (e.g. http://akorotkov.github.io/blog/2016/05/09/scalability-towar...). Those are reads (most likely from the buffer cache), not writes or transactions in any real sense.
On the other hand, the Fauna numbers don't seem that impressive to me. On a mid-2011 Macbook Air, I get 2600 transactions per second (read-committed) in PostgreSQL 9.6. Setup is as follows:
CREATE TABLE IF NOT EXISTS foo(a TEXT, b TEXT, c TEXT, d TEXT);
CREATE INDEX IF NOT EXISTS idx_foo_a ON foo(a);
CREATE INDEX IF NOT EXISTS idx_foo_b ON foo(b);
CREATE INDEX IF NOT EXISTS idx_foo_c ON foo(c);
CREATE INDEX IF NOT EXISTS idx_foo_d ON foo(d);
-- prepared statement, the inserted strings are 4 chars wide
INSERT INTO foo(a, b, c, d)
VALUES
($1, $2, $3, $4),
($5, $6, $7, $8),
($9, $10, $11, $12),
($13, $14, $15, $16);
These numbers are for one thread doing the writing. Am I missing something?
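For what it's worth, the single-threaded measurement loop I'm describing looks roughly like this. The `execute` callable is a stand-in for the real `cur.execute(INSERT_SQL, params)` call (psycopg2 or any other driver — my assumption, not part of the original setup), so only the rate math is shown:

```python
import time

def bench(execute, n_batches=10_000, rows_per_batch=4):
    """Run n_batches 4-row INSERT batches on one thread; return (tx/s, rows/s)."""
    start = time.perf_counter()
    for _ in range(n_batches):
        execute()  # real run: cur.execute(INSERT_SQL, params); conn.commit()
    elapsed = time.perf_counter() - start
    return n_batches / elapsed, (n_batches * rows_per_batch) / elapsed

# Stand-in executor so the harness itself runs without a database.
tx_per_sec, rows_per_sec = bench(lambda: None)
```

By the per-row counting convention, 2,600 transactions/sec with 4-row batches corresponds to ~10,400 row writes/sec.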
If you ran the benchmark yourself, how did you achieve 1M durable writes/sec on a Postgres machine/instance? [It's quite an achievement] On what kind of crazy hardware? How large was each write/row? Did you use the Postgres network protocol to perform the writes?
How does this algorithm compare to whatever Google Spanner does?
This makes the read performance equivalent to something like Cassandra at CONSISTENCY.ONE, without giving up the cross-partition write linearization of something like Spanner.
I've personally seen a Cassandra ring go to more than 2M ops/sec.
Is this specifically for distributed SQL only? I think there are some scalable SQL systems that don't support sessions either.
Multi-query transactions can be useful, but the FaunaDB query language is functional rather than declarative like SQL, so composing queries that do everything you want is usually easier than in SQL.
Would you create a single operation that reads one record, checks that it's enough, then adds the amount to another record?
Or maybe you'd first read both accounts, then issue a conditional write operation that makes sure the data hasn't changed before doing the write?
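That second pattern is basically optimistic concurrency control. A hedged sketch of what I mean, modeled against an in-memory dict with per-record versions standing in for the database's conditional-write primitive (the names and structure here are mine, not FaunaDB's actual API):

```python
# Toy store: each record carries a version for conditional writes.
store = {"alice": {"balance": 100, "version": 1},
         "bob":   {"balance": 20,  "version": 1}}

def transfer_cas(src, dst, amount):
    # 1. Read both accounts (snapshot balances and versions).
    s, d = store[src].copy(), store[dst].copy()
    if s["balance"] < amount:
        return False  # insufficient funds
    # 2. Conditional write: apply only if neither record changed since the read.
    if (store[src]["version"] != s["version"]
            or store[dst]["version"] != d["version"]):
        return False  # conflict detected; the caller retries from step 1
    store[src] = {"balance": s["balance"] - amount, "version": s["version"] + 1}
    store[dst] = {"balance": d["balance"] + amount, "version": d["version"] + 1}
    return True
```

The first pattern (read, check, write inside one transaction) pushes the whole check-and-update into the server so there is nothing to retry; the compare-and-swap version trades that for shorter server-side transactions at the cost of retries under contention.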
http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...
Also, a single SSD from 2015 is rated at 120K writes per second:
PM1725: http://www.samsung.com/semiconductor/global/file/insight/201...