4Bn rows/sec query benchmark: ClickHouse vs. QuestDB vs. Timescale (opens in new tab)

(questdb.io)

20 pointsbluestreak3y ago15 comments

15 comments

This table schema: https://github.com/timescale/tsbs/blob/bcc00137d72d889e6059e...

...seems like a quite odd way to store time-series in ClickHouse. If I understood that code correctly (and I am really not sure), they partition their data by some tag value (the first one in a list?), and sort each partition by "tags ID", while timescaledb partitions by time afaik.

Of course there will be large discrepancies if data is sorted one way in one database schema, and another way in another schema. It seems that at least their query of "ORDER BY time LIMIT 10" would greatly benefit from partitioning or sorting the table by time.

Whether that makes sense depends on your usecase. But I don't think a benchmark with completely dfferent schemas, partitioning and primary keys across databases is fair.

Another thing I noticed is that their version of ClickHouse is quite old, at least aroudn the time the test was written. The shown CREATE TABLE syntax is deprecated since a few versions and cannot be found in recent docs, only github: https://github.com/ClickHouse/ClickHouse/blob/v18.16/docs/en...

hodgesrm3y ago

Actually that table syntax partitions by month, which is a little non-intuitve. In modern ClickHouse SQL it would look like the following.

  CREATE TABLE tsbs_modern (
    created_date    Date     DEFAULT today(),
    created_at      DateTime DEFAULT now(),
    time            String,
    tags_id         UInt32,
    additional_tags String   DEFAULT ''
  ) ENGINE = MergeTree
  PARTITION BY toYYYYMM(created_date)
  ORDER BY (tags_id, created_at)

untitaker_3y ago

While the partitioning is okay, the order by is still problematic. See https://telegra.ph/No-QuestDB-is-not-Faster-than-ClickHouse-...

hodgesrm3y ago

That article was a good response.

untitaker_3y ago

Thanks, you're right. Mea culpa

hodgesrm3y ago

But your point that it's out of date is definitely right.

arunmu3y ago

Really disappointing post from QuestDB. I would have expected them to do some research on how to design CH table schema before doing such kind of benchmark. The queries used does not take into account the primary key/order by fields. Based on query to be optimized, once could use Projections or MV. Perhaps a bit more work is needed, but that's the clickhouse way of doing it.

bluestreakOP3y ago

Last year we released QuestDB 6.0 and achieved an ingestion rate of 1.4 million rows per second (per server). We compared those results to popular open source databases [1] and explained how we dealt with out of order ingestion under the hood while keeping the underlying storage model read-friendly. Since then, we focused our efforts on making queries faster, in particular filter queries with WHERE clauses. To do so, we once again decided to make things from scratch and built a JIT (Just-in-Time) compiler for SQL filters, with tons of low-level optimisations such as SIMD. We then parallelized the query execution to improve the execution time even further. In this blog post, we first look at some benchmarks against Clickhouse and TimescaleDB, before digging deeper in how this all works within QuestDB's storage model. Once again, we use the Time Series Benchmark Suite (TSBS) [2], developed by TimescaleDB,: it is an open source and reproducible benchmark.

We'd love to get your feedback!

[1]:https://news.ycombinator.com/item?id=27411307

[2]:https://github.com/timescale/tsbs

ysleepy3y ago

This looks cool, I've been looking at time series DBs lately and mostly landed with timescale because of the ability to have complete freedom querying the dataset with postgres kitchen sink.

The post here really focuses on one query and that is weirdly without a time sort. Would similar queries be also fast? - What about a join, aggregates, lag()-over, subqueries, unions, etc queries

zX41ZdbW3y ago

It does not look really convincing if you look deeper: https://telegra.ph/No-QuestDB-is-not-Faster-than-ClickHouse-...

nhourcard3y ago

You're right, this post focuses on filter queries with the WHERE clause. Aggregates are also pretty fast, a live demo is currently available with three datasets, one of them being more than 1.6 billion rows: https://demo.questdb.io/ On this you can also perform a join, in particular ASOF joins (where timestamps do not exactly match).

NB: The launch of this demo was done some time ago on HN: https://news.ycombinator.com/item?id=23616878

ericb3y ago

It is definitely useful to be able to consume a lot of data quickly, especially high-cardinality data. Inevitably, an infinite flood of data will eventually consume any finite space limitations. I'm wondering what QuestDB's story for data aggregation and cleanup looks like?

bluestreakOP3y ago

Aggregation is also optimised quite a bit via SIMD and map-reduce. They are as fast as the “where” predicates. Multiple field keyed aggregation is not as optimal yet. I would also suggest our demo site (free and fully open) to see how queries that you use work.

Cleanup is semi manual for now. Time partitions can be removed or detached via SQL. We’re working on automating that.

ericb3y ago

> Time partitions can be removed or detached via SQL. We’re working on automating that.

Cool! Will that be continuous queries that can be used for downsampling?

I'm working on load testing and monitoring tools. Since either can produce enough metrics to overflow available storage, the downsampling story ends up as important as write speed for me. I imagine that's true of a lot of metric database scenarios--what happens if they go on...forever?

nojvek3y ago

Very cool blog post.

Looks similar to SingleStore, I wonder how it would perform on the same benchmark. They also use JIT and scan parallelization.

j / k navigate · click thread line to collapse

15 comments

untitaker_3y ago

This table schema: https://github.com/timescale/tsbs/blob/bcc00137d72d889e6059e...

Whether that makes sense depends on your usecase. But I don't think a benchmark with completely dfferent schemas, partitioning and primary keys across databases is fair.

hodgesrm3y ago

Actually that table syntax partitions by month, which is a little non-intuitve. In modern ClickHouse SQL it would look like the following.

  CREATE TABLE tsbs_modern (
    created_date    Date     DEFAULT today(),
    created_at      DateTime DEFAULT now(),
    time            String,
    tags_id         UInt32,
    additional_tags String   DEFAULT ''
  ) ENGINE = MergeTree
  PARTITION BY toYYYYMM(created_date)
  ORDER BY (tags_id, created_at)

untitaker_3y ago

While the partitioning is okay, the order by is still problematic. See https://telegra.ph/No-QuestDB-is-not-Faster-than-ClickHouse-...

hodgesrm3y ago

That article was a good response.

untitaker_3y ago

Thanks, you're right. Mea culpa

hodgesrm3y ago

But your point that it's out of date is definitely right.

arunmu3y ago

bluestreakOP3y ago

We'd love to get your feedback!

[1]:https://news.ycombinator.com/item?id=27411307

[2]:https://github.com/timescale/tsbs

ysleepy3y ago

This looks cool, I've been looking at time series DBs lately and mostly landed with timescale because of the ability to have complete freedom querying the dataset with postgres kitchen sink.

The post here really focuses on one query and that is weirdly without a time sort. Would similar queries be also fast? - What about a join, aggregates, lag()-over, subqueries, unions, etc queries

zX41ZdbW3y ago

It does not look really convincing if you look deeper: https://telegra.ph/No-QuestDB-is-not-Faster-than-ClickHouse-...

nhourcard3y ago

NB: The launch of this demo was done some time ago on HN: https://news.ycombinator.com/item?id=23616878

ericb3y ago

bluestreakOP3y ago

Cleanup is semi manual for now. Time partitions can be removed or detached via SQL. We’re working on automating that.

ericb3y ago

> Time partitions can be removed or detached via SQL. We’re working on automating that.

Cool! Will that be continuous queries that can be used for downsampling?

nojvek3y ago

Very cool blog post.

Looks similar to SingleStore, I wonder how it would perform on the same benchmark. They also use JIT and scan parallelization.

j / k navigate · click thread line to collapse