InfluxDB vs. Cassandra for timeseries data

(influxdata.com)

61 points · rar_ram · 9y ago · 32 comments

32 comments

paulasmuth 9y ago

The linked article is an obviously bullshit benchmark that makes influxdb look good and cassandra look bad (by, surprise, the influxdb folks).

I'm far from a cassandra fanboy, but this really is just dishonest marketing. Not sure that will work if your product is open source and the target audience is developers.

Some thoughts:

- The reason why cassandra uses so much more space to store the same data is that they've set up the cassandra table schema in such a way that cassandra needs to write the series ID string for each sample (while influxdb only needs to write the values). You easily get a 10-100x blowup just from that. There is no superior "compression" technology here but just an apples-to-oranges comparison.

- Then, comparing the queries is even worse, because they are testing a kind of query (aggregation) that cassandra does not support. To still get a benchmark where they're much faster, they just wrote some code that retrieves all the data from cassandra into a process and then executes the query within their own process. If anything, they're benchmarking one query tool they've written against another one of their own tools.

- Also, if I didn't miss anything, the article doesn't say what kind of cluster they actually ran this on, or even whether they ran both tests on the same hardware. There definitely are cassandra clusters handling more than 100k writes/sec in production right now. So I guess they picked a peculiar configuration in which they outperform cassandra in terms of write ops (given a good distribution of keys, cassandra is more or less linearly scalable in this dimension).

- A better target to benchmark against would probably be http://opentsdb.net/ or http://prometheus.io/ - both seem to have somewhat similar semantics to InfluxDB (which cassandra and elasticsearch do not)
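
To put rough numbers on the series ID point above, here is a minimal sketch of the arithmetic. The series key, the 8-byte timestamp and the 8-byte value are assumptions chosen for illustration, not the schema used in the benchmark, and real storage engines add their own overhead and compression. The ratio grows with the length of the ID relative to the per-sample payload, and grows further once timestamps and values are themselves compressed.

```python
# Rough size estimate only; real on-disk formats add overhead and compression.
SERIES_ID = "cpu,hostname=host_0,region=us-west-1"  # hypothetical series key
TIMESTAMP_BYTES = 8  # assumed int64 timestamp
VALUE_BYTES = 8      # assumed float64 value


def bytes_with_id_per_sample(n_samples: int, series_id: str = SERIES_ID) -> int:
    """Every row carries the full series ID string plus timestamp and value."""
    id_bytes = len(series_id.encode("utf-8"))
    return n_samples * (id_bytes + TIMESTAMP_BYTES + VALUE_BYTES)


def bytes_with_id_once(n_samples: int, series_id: str = SERIES_ID) -> int:
    """The series ID is written once; each sample adds only timestamp and value."""
    id_bytes = len(series_id.encode("utf-8"))
    return id_bytes + n_samples * (TIMESTAMP_BYTES + VALUE_BYTES)


if __name__ == "__main__":
    n = 86_400  # one sample per second for a day
    repeated = bytes_with_id_per_sample(n)
    once = bytes_with_id_once(n)
    print(f"ID per sample: {repeated} bytes")
    print(f"ID once:       {once} bytes")
    print(f"ratio:         {repeated / once:.1f}x")
```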

Disclosure: I also work on a distributed database product (https://eventql.io), but it's not a direct competitor to Cassandra, InfluxDB, or any of the other products I've mentioned. I hope the comment doesn't come across as too harsh. The article made some very big (and harsh) claims, so I think it's fair to respond in the same tone.

brianwawok 9y ago

I don't understand this benchmark at all. It says performance of a 1000 node cluster, but then shows 100k inserts per second in Cassandra. Then later follow up comments say that this test was on a single machine. Without seeing the schema, 100k inserts / sec is reasonable for a single machine. For 1000 machines it would mean there is a pretty massive configuration issue.

If you are going to benchmark a distributed system, you really need to set up more than 1 server.

(Disclaimer - work at Datastax)

paulasmuth 9y ago

This confused me, too.

I think what they meant with "1000 nodes" is that the dataset they're using for the benchmark is synthetic monitoring data (where the things being monitored are servers).

And the way they generated the synthetic data set is by having 1000 imaginary servers produce one sample per second (i.e. a script writes out 1000 * duration_in_sec fake samples; I believe this is the code that does it: https://github.com/influxdata/influxdb-comparisons/tree/mast...)
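
A minimal sketch of what such a generator might look like, to make the 1000 * duration_in_sec arithmetic concrete. This is not the code from the linked repository; the function name, the host naming scheme and the uniform random values are all assumptions.

```python
import random
from typing import Iterator, Tuple


def generate_samples(
    num_servers: int,
    duration_in_sec: int,
    start_ts: int = 0,
    seed: int = 42,
) -> Iterator[Tuple[str, int, float]]:
    """Yield (hostname, timestamp, value) with one sample per server per second."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    for second in range(duration_in_sec):
        for server in range(num_servers):
            # Hypothetical metric: a fake CPU usage percentage.
            yield (f"host_{server}", start_ts + second, rng.uniform(0.0, 100.0))


if __name__ == "__main__":
    samples = list(generate_samples(num_servers=1000, duration_in_sec=60))
    print(len(samples))  # 1000 servers * 60 seconds
```

So "1000 nodes" here describes the shape of the data set, not the size of the database cluster.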

bsg75 9y ago

Does it ever make sense to use Cassandra on a single node for anything but dev/test?

I am under the impression that Cassandra's performance comes from its distribution capabilities.

pauldix 9y ago

The tests were run on the same hardware, a single server. Bare metal, not VMs. InfluxDB writes the series string with everything. We tried to imitate what you'd need to do to get close to similar functionality doing time series like InfluxDB does in Cassandra.

If you're just going to write a bunch of uint64 keys with float64 values, of course Cassandra will get much faster. It would be trivial to make a time series database that outperforms InfluxDB with those limitations as well.

The point of the comparison is that InfluxDB gives you a ton of functionality out of the box and has great performance.

Again, the point is that if you want to do time series on Cassandra, you're going to write a bunch of the code yourself.

paulasmuth 9y ago

> The point of the comparison is that InfluxDB gives you a ton of functionality out of the box and has great performance. [...] if you want to do time series on Cassandra, you're going to write a bunch of the code yourself.

Fair enough. I'm sure InfluxDB is very good/fast at timeseries data (although I have to admit I haven't actually tried it out so far). Still, if that was your point, consider removing these statements from the blog:

> InfluxDB outperformed Cassandra by 4.5x when it came to data ingestion.

> InfluxDB outperformed Cassandra by delivering 10.8x better compression.

> InfluxDB outperformed Cassandra by delivering up to 168x better query performance.

I think it would help make the point and not put the reader in a defensive position (when the statements are clearly not based on a fair comparison of the two products and will not hold under most conditions). Just my two cents.

coredog64 9y ago

Hasn't that work already been done? Cyanite and KairosDB both plug in to the broader Graphite ecosystem (more or less) and use Cassandra as a data store.

Time series data has also been a particular focus in the Cassandra community. DTCS was too complicated, so they came up with the easier and faster TWCS. I don't think this is on you, but I'd love to see a comparison with the latest stable 3.x and a multiple node cluster.

twa927 9y ago

Thanks for the analysis of their benchmark. I wanted to look at the details myself, but it required creating an account on their page.

> There is no superior "compression" technology

Isn't it feasible to employ special encoding for time series data? For example, to encode a series of timestamps like 1473333629, 1473333630, 1473333631 you could encode it as 1473333629, +1, +2 (where +1, +2 are encoded in one byte). And there are many cases of such metrics with adjacent values, like averages, counters.
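
A minimal sketch of the idea, assuming integer timestamps in seconds. Note that it stores the difference to the previous value (+1, +1) rather than the offset from the first value (+1, +2) as in the example above; both variants keep the numbers small enough to fit in one byte.

```python
from typing import List


def delta_encode(timestamps: List[int]) -> List[int]:
    """Keep the first timestamp as-is, then store each difference to its predecessor."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        encoded.append(cur - prev)
    return encoded


def delta_decode(encoded: List[int]) -> List[int]:
    """Rebuild the original timestamps by summing up the stored differences."""
    decoded: List[int] = []
    total = 0
    for i, value in enumerate(encoded):
        total = value if i == 0 else total + value
        decoded.append(total)
    return decoded


if __name__ == "__main__":
    ts = [1473333629, 1473333630, 1473333631]
    print(delta_encode(ts))
    print(delta_decode(delta_encode(ts)) == ts)
```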

paulasmuth 9y ago

Yes, the delta encoding scheme you described (and other fancy coding schemes such as bitpacking, varints, RLE or a combination thereof) are frequently employed in columnar storage formats and databases. Columnar storage is basically a generalization that allows one to apply these optimizations to all kinds of data (not just timeseries). One popular open-source implementation of columnar storage that I am not affiliated with is https://parquet.apache.org/.
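
Two of the coding schemes named above, sketched in a few lines: an unsigned LEB128-style varint and a simple run-length encoder. This is only an illustration of the techniques, not the encoding used by Parquet or any particular database.

```python
from typing import List, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte (LEB128 style)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def run_length_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse runs of equal values into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


if __name__ == "__main__":
    print(len(encode_varint(1473333629)))  # a full timestamp needs several bytes
    print(len(encode_varint(1)))           # a small delta fits in one byte
    print(run_length_encode([1, 1, 1, 1, 2, 1]))
```

Combined with delta encoding, a series sampled at a fixed interval turns into a long run of identical small deltas, which is where RLE and varints pay off.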

(On the other hand, columnar storage also has a bunch of tradeoffs/downsides so it's not a superior choice for every db product.)

My point about no "superior compression technology here" was specific to the linked benchmark. I.e. the lack of this potential optimization in cassandra does not appear to be the reason for the space blowup in the benchmark, but rather that they're duplicating the series ID for each sample.

msiebuhr 9y ago

Facebook does this (and quite a few other tricks) for storing time-series data in Gorilla (in-memory TSDB, paper: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf), getting down to 1.37 bytes per sample.

Prometheus implemented the Gorilla bits (see https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chu...) and reports getting down to 1.28 bytes per sample on some workloads, though at the cost of increased query latencies.
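
As far as I understand the Gorilla paper, the two central tricks are delta-of-delta encoding for timestamps and XOR-ing each float value with its predecessor. The sketch below shows only those two transforms; the variable-length bit packing that turns the resulting zeros into single bits is left out, and the function names are my own.

```python
import struct
from typing import List


def delta_of_deltas(timestamps: List[int]) -> List[int]:
    """Second-order differences: all zeros when samples arrive at a fixed interval."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_to_bits(x: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_with_previous(values: List[float]) -> List[int]:
    """XOR each value's bit pattern with its predecessor; equal values give 0."""
    bits = [float_to_bits(v) for v in values]
    return [cur ^ prev for prev, cur in zip(bits, bits[1:])]


if __name__ == "__main__":
    print(delta_of_deltas([0, 60, 120, 180, 241]))
    print(xor_with_previous([12.0, 12.0, 12.0]))
```

Both outputs are mostly zeros for regular, slowly changing metrics, which is why the packed result can end up well below two bytes per sample.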

daenney 9y ago

The conclusion isn't entirely surprising, "we from X say that engine X is better than engine Y" but there are many companies that have monitoring stacks built on top of Cassandra, like SignalFX. They have a presentation or two on the topic too that might be interesting: http://www.slideshare.net/planetcassandra/signalfx-making-ca...

Ultimately this benchmark will be heavily influenced by the code written to "emulate" the InfluxDB parts on top of Cassandra and how much of that code puts Cassandra at a disadvantage. I'd like to hear from some people that have built such solutions on top of Cassandra what they think about the benchmark and see how that benchmark would evolve.

soundoflight 9y ago

From using InfluxDB (up to v0.10 I think it was), it's a great database but performance REALLY depends on the cardinality of your data.

I can't stress it enough, calculate your cardinality before switching over to it. If your cardinality looks good, InfluxDB is a perfect, logical choice. I really enjoyed it and it is dirt simple to figure out. We had a junior dev just out of college with little experience set it up and get a high level of proficiency in a matter of hours.
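
A rough way to do that calculation, assuming series cardinality means the number of unique measurement plus tag-set combinations. The tag names and counts below are made up for illustration.

```python
from math import prod
from typing import Iterable, Mapping, Set, Tuple


def worst_case_cardinality(distinct_values_per_tag: Mapping[str, int]) -> int:
    """Upper bound for one measurement: every tag value combined with every other."""
    return prod(distinct_values_per_tag.values())


def observed_cardinality(points: Iterable[Tuple[str, Mapping[str, str]]]) -> int:
    """Count the unique (measurement, tag set) pairs actually present in the data."""
    seen: Set[Tuple[str, Tuple[Tuple[str, str], ...]]] = set()
    for measurement, tags in points:
        seen.add((measurement, tuple(sorted(tags.items()))))
    return len(seen)


if __name__ == "__main__":
    # Hypothetical tags: 1000 hosts, 5 regions, 200 container IDs.
    print(worst_case_cardinality({"host": 1000, "region": 5, "container_id": 200}))
```

The worst case is usually far above what you will actually see (a host lives in one region, not five), so running observed_cardinality over a sample of real data gives a more useful number.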

Edit: I should point out, I was doing about 10 million records on my db (hosted on a Mac Mini in development!) a day with a 2 week sliding window. I was pushing the data from InfluxDB into custom D3 visualizations. I would cache certain queries in Redis, so I wasn't always hitting InfluxDB with each read request.
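
The caching setup described above is a cache-aside pattern with an expiry. A minimal sketch follows, using an in-process dict as a stand-in for Redis so it runs without any dependencies; the class name, the key format and the TTL are assumptions, not the original code.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache-aside helper: return a fresh cached result, otherwise run the query."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: the database is not touched
        value = compute()  # cache miss or expired entry: run the real query
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=30.0)

    def run_query() -> list:
        # Placeholder for the actual database query.
        return [("2016-09-08T00:00:00Z", 42.0)]

    print(cache.get_or_compute("mean_cpu_last_hour", run_query))
    print(cache.get_or_compute("mean_cpu_last_hour", run_query))  # served from cache
```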

pauldix 9y ago

We're working on the cardinality problem. Will be resolved in an upcoming release. Moving the index over to a disk based format that will hopefully still be fast and not sacrifice lookup performance.

bsg75 9y ago

Can you explain the cardinality problem in a bit more detail? It's come up more than once in this thread.

soundoflight 9y ago

Good to hear! I have a project coming up soon that I want to use it on.

tychuz 9y ago

Just looking at the domain, it's easy to guess which one will win...

klucar 9y ago

Has anyone successfully compiled their benchmark code? https://github.com/influxdata/influxdb-comparisons

I added code to the data generator to work with Timely (https://nationalsecurityagency.github.io/timely/) but can't get it compiled.

Also, it seemed that ingest and query were separate stages. Queries should be run while ingest is running to get real-world performance, but I understand it is more difficult to test this way.
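
A minimal sketch of what a mixed test could look like: one thread ingests while another keeps running queries and records their latencies. The InMemoryStore class is a hypothetical stand-in for the database under test, and none of this comes from the benchmark repository.

```python
import threading
import time
from typing import List, Tuple


class InMemoryStore:
    """Hypothetical stand-in for the database under test."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._points: List[Tuple[int, float]] = []

    def write(self, ts: int, value: float) -> None:
        with self._lock:
            self._points.append((ts, value))

    def mean(self) -> float:
        with self._lock:
            if not self._points:
                return 0.0
            return sum(v for _, v in self._points) / len(self._points)

    def count(self) -> int:
        with self._lock:
            return len(self._points)


def run_mixed_benchmark(store: InMemoryStore, num_writes: int) -> List[float]:
    """Ingest num_writes points while a second thread queries; return query latencies."""
    done = threading.Event()
    latencies: List[float] = []

    def ingest() -> None:
        for ts in range(num_writes):
            store.write(ts, float(ts))
        done.set()

    def query() -> None:
        while True:
            finished = done.is_set()  # checked first, so one query runs after ingest ends
            start = time.perf_counter()
            store.mean()
            latencies.append(time.perf_counter() - start)
            if finished:
                return

    reader = threading.Thread(target=query)
    writer = threading.Thread(target=ingest)
    reader.start()
    writer.start()
    writer.join()
    reader.join()
    return latencies


if __name__ == "__main__":
    store = InMemoryStore()
    latencies = run_mixed_benchmark(store, num_writes=10_000)
    print(f"{store.count()} writes, {len(latencies)} queries during ingest")
```

Against a real database the two loops would use separate client connections, and you would report latency percentiles rather than a raw list.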

dz0ny 9y ago

It would be interesting to compare memory requirements. I chose InfluxDB because it had 10 times lower memory usage. The dataset was small (a couple of million datapoints)... but still.

dx034 9y ago

That only works when you have one series with a lot of observations. If you have many series with fewer observations (say 50k per series) influxDB uses absurd amounts of memory. I had to switch back to Cassandra because I constantly ran out of memory.

pauldix 9y ago

We're working on solving the high cardinality problem. Hopefully soon

deluvas 9y ago

How much memory are we talking about? How often did you execute queries on the data?

I'm asking because my first impression of Influxdb involved lots of memory gobbling.

LogicX 9y ago

Not sure why this blog post from July made it to the front page now.

Though 1.0 GA is being released today.
