The order of the columns in the composite index might matter then. We got some nice savings by reordering columns in indexes, changing the join order in queries (our DB wasn't that smart), or adding "useless" filters to the WHERE clause, allowing us to consolidate multiple indexes into one composite one.
[1]: might be using the wrong terminology here; I'm not talking about a partial index[2], but about using only the first N dimensions of an M-dimensional index.
[2]: https://www.postgresql.org/docs/current/indexes-partial.html
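A rough sketch of the consolidation trick, with hypothetical table and column names (the "useless" filter enumerates every possible value of the leading column so the planner can still range-scan the composite index):

```sql
-- Instead of maintaining two indexes:
CREATE INDEX idx_orders_status ON orders (status);
CREATE INDEX idx_orders_region ON orders (region);
-- ...keep a single composite one:
CREATE INDEX idx_orders_status_region ON orders (status, region);

-- A query that only cares about region can add a filter on status
-- that matches all rows, purely to make the composite index usable:
SELECT * FROM orders
WHERE status IN ('new', 'paid', 'shipped')  -- "useless": covers every status
  AND region = 'EU';
```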
IIRC Oracle does support this and calls it a skip scan. If the first column in the index is not very selective, this can be very efficient; if the first column is highly selective, a skip scan becomes very inefficient and the DB might as well scan the whole index. Given that a good choice of index usually puts a fairly selective column first, and that unless storage space is very cramped you are likely better off just defining an extra index without the column you want to skip, a skip scan is helpful less often than one might think. That is one of the reasons DB engine maintainers give for not spending time implementing and maintaining the feature.
There are some use cases where having the option to skip-scan is definitely a significant bonus, but they are rare enough that I understand why most engine maintainers have not implemented it.
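The classic illustration, with hypothetical names, is an index whose leading column has very few distinct values:

```sql
-- Low-selectivity leading column: only a handful of distinct values.
CREATE INDEX idx_emp_gender_salary ON employees (gender, salary);

-- This query omits the leading column entirely:
SELECT * FROM employees WHERE salary > 100000;
-- A skip scan runs one cheap range scan per distinct gender value,
-- avoiding a full index scan. If the leading column instead had
-- thousands of distinct values, that would mean thousands of probes,
-- which degrades toward a full scan of the index.
```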
https://www.postgresql.org/docs/current/indexes-multicolumn....
Also worth noting that for very complex workloads that need to support arbitrary subsets of equality matches over columns, a Bloom filter might work best:
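For example, in PostgreSQL the bloom extension provides a single index that supports equality tests on any subset of the listed columns (with false positives that get rechecked against the heap); table and column names here are hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS bloom;

-- One bloom index instead of many btrees over column combinations:
CREATE INDEX idx_t_bloom ON t USING bloom (c1, c2, c3, c4);

-- Any combination of equality predicates can use the same index:
SELECT * FROM t WHERE c2 = 5 AND c4 = 7;
```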
Maybe eventually PG will also get index skip scans for btrees, which could allow arbitrary column searches in the btree; but we're not there yet by a large margin.
https://dev.mysql.com/doc/refman/8.0/en/range-optimization.h...
create table if not exists test_table
(
    id      UInt64,
    text1   String,
    text2   String,
    int1000 UInt64,
    int100  UInt64,
    int10   UInt64,
    int10_2 UInt64
)
engine = MergeTree()
order by (id)
;

insert into test_table
with
    repeat('b', 1024) as one_kib,
    repeat('b', 255)  as bytes_255
select
    number as id,
    one_kib,
    bytes_255,
    rand() % 1000 as int1000,
    rand() % 100  as int100,
    rand() % 10   as int10,
    rand() % 10   as int10_2
from numbers(10e6)
;
> select count(*) from test_table where int1000 = 1 and int100 = 1;
┌─count()─┐
│ 9949 │
└─────────┘
1 row in set. Elapsed: 0.034 sec. Processed 10.00 million rows, 160.00 MB (290.93 million rows/s., 4.65 GB/s.)
The same table but with 1B rows instead runs in ~1800 ms:

> select count(*) from test_table where int1000 = 1 and int100 = 1;
┌─count()─┐
│ 999831 │
└─────────┘
1 row in set. Elapsed: 1.804 sec. Processed 1.00 billion rows, 16.00 GB (554.24 million rows/s., 8.87 GB/s.)
[1] Converted the table create and insert logic from here: https://github.com/sirupsen/napkin-math/blob/master/newslett...

That first statement about "no indexing being used" is not quite correct if the query is run exactly as you show in your nice example.
ClickHouse performs what is known as PREWHERE processing, which will effectively use the int1000 and int100 columns as indexes. It scans those columns and knocks out any blocks (technically granules, containing by default 8192 rows) that do not contain values matching the filter conditions. It then performs a scan on the remaining blocks to get the actual counts.
PREWHERE is effective because columns are compressed and scans are fast. If there's any pattern to the filter columns (for example, monotonically increasing counters) or their values have high cardinality, PREWHERE processing will remove a large number of blocks. This will make the rest of the scan far faster.
In your dataset it may not be especially efficient because you use random values, which don't necessarily compress well, and the values will appear in many blocks. It works much better in real datasets where data are more correlated.
EDIT: PREWHERE is much faster in cases where you are doing more complex aggregation on many columns. Counts of course don't need to scan any extra values so it's not helpful in this case.
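ClickHouse moves filters into PREWHERE automatically, but it can also be written explicitly (reusing the table from the example above) to control which column is scanned first:

```sql
SELECT count(*)
FROM test_table
PREWHERE int1000 = 1  -- read only this column first; granules with no
WHERE int100 = 1;     -- matching values are skipped before reading the rest
```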
p.s. Scans are ridiculously fast.
This is really the lesson of Solr: full-scan all the things, aggregate as you go, broadcast disk IO to multiple listeners.
Why do a bunch of 4K random IO when you could full-scan at bus speed? Yeah, you can make the 4K random IO super fast, but that's not where hardware is going, and full scans are also scalable/clusterable/shardable, where an RDBMS caps out at one machine and clustering is kinda ugly.
Good point, I should have mentioned this was basically a worst-case scenario for ClickHouse: the data layout is totally random (same approach as OP used in their benchmark), so it can't utilize any granule pruning, sorting, or skip indexing, yet it still achieves such remarkable speeds.
> It scans those columns and knocks out any blocks (technically granules, containing by default 8192 rows) that do not contain values matching the filter conditions
How is that not just a sequential scan? Of course it preemptively filters away entire blocks that don't contain the data, but indexes typically work differently: they're computed at write time, so that they can be queried really fast.

Is there a detail missing here, e.g. Bloom filters being used, or something else that makes it different from a regular sequential scan?
I would think ClickHouse is tuned for analytics workloads, so it will throw plenty of cores at the problem, and not care much for the overhead. Meanwhile, I believe PostgreSQL is more tuned to transactional workloads, where it will not pay the query parallelism overhead, but optimize for multiple parallel workloads.
https://altinity.com/blog/clickhouse-black-magic-skipping-in...
You can also check out the following webinar, which explains how ClickHouse indexes work in general. Here's a link to the discussion of indexes.
https://youtu.be/1TGGCIr6dMY?t=1933
p.s. The blog article is missing some images that WordPress seems to have lost but you'll still get the idea. (Should be fixed shortly.)
Disclaimer: I work for Altinity
[1] https://altinity.com/blog/
[2] https://www.alibabacloud.com/blog/clickhouse-kernel-analysis...
[3] https://www.alibabacloud.com/blog/clickhouse-analysis-of-the...
For these 64-bit index entries we’d expect to have to scan roughly:
index_row_size · rows = 2 · 64 bit · 10^5 = 16 B · 10^5 ≈ 1.5 MiB
Where do the 10^5 rows come from? With a composite index and a point query, doesn't the database scan just the 100 returned rows?

I forgot to move this around when I updated the article's structure. This is only relevant when doing the index merge. The article has been updated.
Composite index (int64, int64): ~70 MiB in Postgres, ~350 MiB in MySQL
Single index (int64): ~70 MiB in Postgres, ~240 MiB in MySQL
If you assume the majority of an index is index entries of (int64, int64, int64), where the third number is some sort of identifier for the record on the heap, each entry is 24 bytes, so for 10M rows you'd expect ~230 MiB. So Postgres does some compression very well here, and MySQL seems to have a bit more overhead for its indexes.
https://www.postgresql.org/docs/current/btree-implementation...
> 67.4.3. Deduplication
> A duplicate is a leaf page tuple (a tuple that points to a table row) where all indexed key columns have values that match corresponding column values from at least one other leaf page tuple in the same index. Duplicate tuples are quite common in practice. B-Tree indexes can use a special, space-efficient representation for duplicates when an optional technique is enabled: deduplication.
> Deduplication works by periodically merging groups of duplicate tuples together, forming a single posting list tuple for each group. The column key value(s) only appear once in this representation. This is followed by a sorted array of TIDs that point to rows in the table. This significantly reduces the storage size of indexes where each value (or each distinct combination of column values) appears several times on average. The latency of queries can be reduced significantly. Overall query throughput may increase significantly. The overhead of routine index vacuuming may also be reduced significantly.
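Deduplication is on by default for B-tree indexes in PostgreSQL 13+, and it can be toggled per index via the deduplicate_items storage parameter; table and column names here are hypothetical:

```sql
CREATE INDEX idx_t_val ON t (val) WITH (deduplicate_items = on);

-- Compare on-disk sizes with it toggled on vs. off:
SELECT pg_size_pretty(pg_relation_size('idx_t_val'));
```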
Or does it decide to use only one index and filter based on the covered values?
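In PostgreSQL you can see which strategy the planner picked with EXPLAIN; a hypothetical sketch, assuming two single-column indexes on t(a) and t(b):

```sql
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2;
-- An index merge shows up as a BitmapAnd node combining two
-- Bitmap Index Scans. If the planner considers one index selective
-- enough on its own, it instead emits a single Index Scan and applies
-- the other predicate as an ordinary row filter.
```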
(1) Table size on the x-axis, and time on the y-axis for index merge vs composite index
(2) Number of columns on the x-axis, and time on the y-axis for both
(3) Number of final matches on the x-axis, and time on the y-axis for both
But I ran out of time and decided to test with a table size of 10M rows and a ~100-row result set. In my experience, that's a decent representation of what you might be doing with a relational database.
Can PostgreSQL answer a query straight from the index, without touching the rows, if the index contains all the necessary fields?
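It can: PostgreSQL calls this an "Index Only Scan", and it kicks in when every referenced column is in the index and the visibility map shows the pages as all-visible. The INCLUDE clause adds payload columns to the index without making them part of the key; a sketch with hypothetical names:

```sql
CREATE INDEX idx_orders_cust ON orders (customer_id) INCLUDE (total);

-- Both columns are available in the index, so the heap need not be
-- touched for all-visible pages:
SELECT customer_id, total FROM orders WHERE customer_id = 42;
```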