I'm far from a Cassandra fanboy, but this really is just dishonest marketing. Not sure that will work when your product is open source and the target audience is developers.
Some thoughts:
- The reason why Cassandra uses so much more space to store the same data is that they've set up the Cassandra table schema in such a way that Cassandra needs to write the series ID string for each sample (while InfluxDB only needs to write the values). You easily get a 10-100x blowup just from that. There is no superior "compression" technology here but just an apples-to-oranges comparison.
- Then, comparing the queries is even worse, because they are testing a kind of query (aggregation) that Cassandra does not support. To still get a benchmark where they're much faster, they just wrote some code that retrieves all the data from Cassandra into a process and then executes the query within their own process. If anything, they're benchmarking one query tool they've written against another one of their own tools.
- Also, if I didn't miss anything, the article doesn't say what kind of cluster they actually ran this on, or even whether they ran both tests on the same hardware. There definitely are Cassandra clusters handling more than 100k writes/sec in production right now. So I guess they picked a peculiar configuration in which they outperform Cassandra in terms of write ops (given a good distribution of keys, Cassandra is more or less linearly scalable in this dimension).
- A better target to benchmark against would probably be http://opentsdb.net/ or http://prometheus.io/ - both seem to have somewhat similar semantics to InfluxDB (which Cassandra and Elasticsearch do not).
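To illustrate the series ID point from the first bullet, here is a rough back-of-the-envelope sketch. It only counts raw bytes for a made-up layout (8-byte timestamp, 8-byte float value); the series ID string and both function names are hypothetical, and this is neither Cassandra's nor InfluxDB's actual on-disk format, so it does not reproduce the benchmark's numbers.

```python
# Hypothetical series ID, roughly the shape of a monitoring metric key.
series_id = "cpu,hostname=host_0,region=us-west-1,usage_user"

# 1000 fake samples: (unix timestamp, value)
samples = [(1473333629 + i, 0.5 + i) for i in range(1000)]

BYTES_PER_POINT = 8 + 8  # assumed: 8-byte timestamp + 8-byte float


def size_with_repeated_id(series_id, samples):
    """Raw size if the series ID string is written next to every sample."""
    id_len = len(series_id.encode("utf-8"))
    return len(samples) * (id_len + BYTES_PER_POINT)


def size_with_id_once(series_id, samples):
    """Raw size if the series ID is written once, followed by the samples."""
    id_len = len(series_id.encode("utf-8"))
    return id_len + len(samples) * BYTES_PER_POINT


repeated = size_with_repeated_id(series_id, samples)
once = size_with_id_once(series_id, samples)
print(f"repeated ID: {repeated} bytes, ID once: {once} bytes, "
      f"ratio: {repeated / once:.2f}x")
```

The gap grows further once the timestamps and values themselves are compressed, because the repeated ID string then dominates each row.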
DISC: I also work on a distributed database product (https://eventql.io) but it's neither a direct competitor to Cassandra nor InfluxDB nor any of the other products I've mentioned. I hope the comment doesn't come across as too harsh. The article made some very big (and harsh) claims so I think it's fair to respond in kind.
If you are going to benchmark a distributed system, you really need to set up more than 1 server.
(Disclaimer - work at Datastax)
I think what they meant with "1000 nodes" is that the dataset they're using for the benchmark is synthetic monitoring data (where the things being monitored are servers).
And the way they generated the synthetic data set is by having 1000 imaginary servers produce one sample per second (i.e. a script that writes out 1000 * duration_in_sec fake samples; I believe this is the code that does it: https://github.com/influxdata/influxdb-comparisons/tree/mast...)
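For what it's worth, a generator like that could look roughly like the sketch below. This is not the code from the linked repo; the function name, host naming scheme, start timestamp and value range are all made up for illustration.

```python
import random


def generate_samples(num_servers, duration_sec, start_ts=1473333629, seed=0):
    """Yield (host, timestamp, value) tuples: one sample per server per second.

    All names and value ranges here are hypothetical.
    """
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    for t in range(duration_sec):
        for s in range(num_servers):
            yield (f"host_{s}", start_ts + t, rng.random() * 100.0)


# 1000 imaginary servers for 10 seconds -> 1000 * 10 samples
samples = list(generate_samples(1000, 10))
print(len(samples))
```

So "1000 nodes" describes the shape of the data, not the size of the database cluster under test.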
I am under the impression that Cassandra's performance comes from its distribution capabilities.
If you're just going to write a bunch of uint64 keys with float64 values, of course Cassandra will get much faster. It would be trivial to make a time series database that outperforms InfluxDB with those limitations as well.
The point of the comparison is that InfluxDB gives you a ton of functionality out of the box and has great performance.
Again, the point is that if you want to do time series on Cassandra, you're going to write a bunch of the code yourself.
Fair enough. I'm sure InfluxDB is very good/fast at timeseries data (although I have to admit to not actually having tried it out so far). Still, if that was your point, consider removing these statements from the blog.
> InfluxDB outperformed Cassandra by 4.5x when it came to data ingestion.
> InfluxDB outperformed Cassandra by delivering 10.8x better compression.
> InfluxDB outperformed Cassandra by delivering up to 168x better query performance.
I think it would help make the point and not put the reader in a defensive position (when the statements are clearly not based on a fair comparison of the two products and will not hold under most conditions). Just my two cents.
Time series data has also been a particular focus in the Cassandra community. DTCS was too complicated, so they came up with the easier and faster TWCS. I don't think this is on you, but I'd love to see a comparison with the latest stable 3.x and a multi-node cluster.
> There is no superior "compression" technology
Isn't it feasible to employ special encoding for time series data? For example, to encode a series of timestamps like 1473333629, 1473333630, 1473333631 you could encode it as 1473333629, +1, +2 (where +1, +2 are encoded in one byte). And there are many cases of such metrics with adjacent values, like averages, counters.
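A minimal sketch of that idea, assuming the first timestamp is stored as a full 8-byte integer and every later one as a one-byte offset from it (as in the example above). The function names are made up, and real time series stores use more elaborate schemes than this.

```python
import struct


def encode_offsets(timestamps):
    """First timestamp as 8 bytes, each later one as a 1-byte offset from it."""
    base = timestamps[0]
    out = struct.pack("<Q", base)
    for ts in timestamps[1:]:
        offset = ts - base
        if not 0 <= offset <= 255:
            raise ValueError("offset does not fit in one byte")
        out += bytes([offset])
    return out


def decode_offsets(blob):
    """Inverse of encode_offsets."""
    (base,) = struct.unpack_from("<Q", blob, 0)
    return [base] + [base + b for b in blob[8:]]


timestamps = [1473333629, 1473333630, 1473333631]
blob = encode_offsets(timestamps)
assert decode_offsets(blob) == timestamps
print(len(blob), "bytes instead of", 8 * len(timestamps))
```

Offsets from the first value run out of room after 255 seconds; storing the difference to the previous timestamp instead keeps the numbers small for regularly spaced samples.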
(On the other hand, columnar storage also has a bunch of tradeoffs/downsides so it's not a superior choice for every db product.)
My point about no "superior compression technology here" was specific to the linked benchmark. I.e. the lack of this potential optimization in cassandra does not appear to be the reason for the space blowup in the benchmark, but rather that they're duplicating the series ID for each sample.
Prometheus implemented the Gorilla-bits (see https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chu...) and reports getting down to 1.28 B per sample on some workloads, though at a cost of increased query latencies.
Ultimately this benchmark will be heavily influenced by the code written to "emulate" the InfluxDB parts on top of Cassandra and how much of that code puts Cassandra at a disadvantage. I'd like to hear from some people that have built such solutions on top of Cassandra what they think about the benchmark and see how that benchmark would evolve.
I can't stress it enough, calculate your cardinality before switching over to it. If your cardinality looks good, InfluxDB is a perfect, logical choice. I really enjoyed it and it is dirt simple to figure out. We had a junior dev just out of college with little experience set it up and get a high level of proficiency in a matter of hours.
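A rough way to do that calculation, assuming "cardinality" means the number of distinct series, i.e. unique combinations of measurement and tag values. The tag names and counts below are invented, and both helpers are hypothetical, not part of any InfluxDB client.

```python
from math import prod


def max_series_cardinality(tag_value_counts):
    """Upper bound: every tag value combined with every other tag value."""
    return prod(tag_value_counts.values())


def observed_series_cardinality(points):
    """Exact count of distinct (measurement, tag set) pairs in sample data."""
    return len({(m, tuple(sorted(tags.items()))) for m, tags in points})


# Hypothetical schema: 1000 hosts, 4 regions, 9 metric names
print(max_series_cardinality({"host": 1000, "region": 4, "metric": 9}))

points = [
    ("cpu", {"host": "a", "region": "eu"}),
    ("cpu", {"host": "a", "region": "eu"}),  # same series as above
    ("cpu", {"host": "b", "region": "eu"}),
    ("mem", {"host": "a", "region": "eu"}),
]
print(observed_series_cardinality(points))
```

The upper bound is usually far too pessimistic when tags are correlated (a host lives in exactly one region), so counting over a sample of real data gives a better picture.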
Edit: I should point out, I was doing about 10 million records on my db (hosted on a Mac Mini in development!) a day with a 2 week sliding window. I was pushing the data from InfluxDB into custom D3 visualizations. I would cache certain queries in Redis, so I wasn't always hitting InfluxDB with each read request.
I added code to the data generator to work with Timely (https://nationalsecurityagency.github.io/timely/) but can't get it compiled.
Also, it seemed that ingest and query were separate stages. Queries should be run while ingest is running to get real-world performance, but I understand it is more difficult to test this way.
I'm asking because my first impression of Influxdb involved lots of memory gobbling.
Though 1.0 GA is being released today.