The problem is that only btree indexes support uniqueness at the moment. That's the relevant unsupported feature, not the ability to have constraints (a constraint essentially just requires uniqueness support in the underlying index):
postgres[27716][1]# SELECT amname, pg_indexam_has_property(oid, 'can_unique') FROM pg_am;
┌────────┬─────────────────────────┐
│ amname │ pg_indexam_has_property │
├────────┼─────────────────────────┤
│ btree  │ t                       │
│ hash   │ f                       │
│ gist   │ f                       │
│ gin    │ f                       │
│ spgist │ f                       │
│ brin   │ f                       │
└────────┴─────────────────────────┘
(6 rows)
Edit: the different uses of the word "constrain" (to constrain, and a constraint) seemed too confusing.
Consider e.g. the common implementation where buckets are determined by masking out either the lowest or the highest bits of the hash value. If you mask out the highest bits and your values aren't sequential (pretty common), you end up with a lot of collisions. More extremely, if you mask the low bits and shift, and you only have small values, everything ends up in the first bucket. Therefore what you want is a hash function where a one-bit change at "one side" of the input value is likely to affect most of the remaining bits.
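A toy sketch of that failure mode (this is an illustration, not Postgres's actual bucketing code): with 8 buckets, keeping only the high bits of a 32-bit value sends every small key to bucket 0, while keeping the low bits spreads sequential keys evenly.

```python
NBUCKETS = 8  # power of two, so masking works

def bucket_low_bits(value):
    # keep the lowest bits ("mask out the highest bits")
    return value & (NBUCKETS - 1)

def bucket_high_bits(value):
    # keep the highest bits of a 32-bit value ("mask the low bits and shift")
    return (value >> 29) & (NBUCKETS - 1)

small_values = list(range(100, 108))  # small, sequential keys

# High-bit bucketing puts every small value in bucket 0:
print([bucket_high_bits(v) for v in small_values])  # -> [0, 0, 0, 0, 0, 0, 0, 0]
# Low-bit bucketing spreads sequential values across all buckets:
print(sorted(bucket_low_bits(v) for v in small_values))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

A hash function with good "avalanche" behavior makes bucket choice insensitive to which bits of the input actually vary.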
Short version - hash indexes are faster in PG11, but they only apply to "where x = foobar" queries, giving O(1) lookup time. Btree indexes are O(log n).
But hash indexes can't be applied to range clauses, like "where x < 50". You can still use a btree index for those, however.
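The difference can be sketched in a few lines (approximating a btree with a sorted list + binary search): a hash table answers equality probes in O(1) but has no notion of order, while an ordered structure supports both equality and range predicates in O(log n).

```python
import bisect

keys = [87, 3, 50, 12, 99, 41, 7]

hash_index = {k: f"row-{k}" for k in keys}  # stand-in for a hash index
btree_index = sorted(keys)                  # stand-in for a btree index

# Equality ("where k = 50"): both work; the hash index is a single probe.
print(hash_index[50])  # -> row-50

# Range ("where k < 50"): only the ordered index helps; the hash index
# would have to scan every bucket.
cut = bisect.bisect_left(btree_index, 50)
print(btree_index[:cut])  # -> [3, 7, 12, 41]
```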
SO post:
Finally, hash indexes always require that the found row be confirmed in the data table, even for simple existence queries, since the keys themselves aren't stored in the hash table. (This is why hash indexes can't be UNIQUE.) B+trees can often answer such queries without the extra lookup (an "index-only scan"). If your B+tree is so large that its inner nodes spill onto disk (necessitating a 2nd disk seek), chances are the equivalent hash index will as well, which, combined with the consult of the data table, kind of negates the benefit.
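A toy model of that point (an assumed simplification, not the real on-disk format): if the index stores only hash codes and row pointers, two distinct keys can share a code, so every probe must be re-checked against the table, even for pure existence queries.

```python
table = {0: ("alice", 30), 1: ("bob", 25)}  # "heap": row id -> row

def hash4(key):
    return hash(key) % 16  # tiny hash space to force collisions

# Index maps hash code -> candidate row ids (collisions possible).
index = {}
for rowid, (name, _) in table.items():
    index.setdefault(hash4(name), []).append(rowid)

def lookup(name):
    # The index alone can't answer "does 'alice' exist?" -- each
    # candidate row must be confirmed against the heap itself.
    return [rid for rid in index.get(hash4(name), [])
            if table[rid][0] == name]

print(lookup("alice"))  # -> [0]
print(lookup("carol"))  # -> [] (any colliding candidate is filtered out)
```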
A hash index can be a big win for nested loop joins, especially at high concurrency. It is quite common to build a hash table over a subset of the inner table of an equijoin. This is (a) slow to construct and (b) memory-intensive (especially if many of these queries are run concurrently).
With a hash index, a lot of cases that required building a hash table to speed up a query can just use the hash index directly. Furthermore, every concurrent instance of the query can use the same hash index. This is a big win for both performance of a single query (latency) and query scalability.
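The contrast above can be sketched as follows (table and column names are made up for illustration; both "indexes" are plain dicts): without a hash index, every query pays to build its own hash table over the inner relation; with one, all concurrent queries probe the same shared structure.

```python
orders = [(1, "widget"), (2, "gadget"), (3, "widget")]  # (order_id, product)
products = [("widget", 9.99), ("gadget", 19.99)]        # (name, price)

def join_building_hash(outer, inner):
    # Without a hash index: each query constructs this hash table
    # itself -- slow and memory-intensive under concurrency.
    built = {name: price for name, price in inner}
    return [(oid, p, built[p]) for oid, p in outer]

# With a hash index: built once, reused by every concurrent query.
shared_index = {name: price for name, price in products}

def join_with_index(outer):
    return [(oid, p, shared_index[p]) for oid, p in outer]

print(join_with_index(orders))  # -> [(1, 'widget', 9.99), (2, 'gadget', 19.99), (3, 'widget', 9.99)]
```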
So hash indexes really only begin to show benefits for individual lookups when your data is terabyte-scale, and below that can even be harmful for that use case (if you could otherwise benefit from an index-only lookup). But see ankrgyl's comment (sibling to this one) for better reasons to consider hash indexes.
I don't think that's really true - even for btree indexes the heap is accessed to check row visibility. That's necessary because postgres doesn't store visibility information in indexes. There's imo no really big problem making hash indexes support uniqueness - it's "just" work.
Probably to avoid returning the wrong result for statistically-inevitable collisions, right?
The link is from the post http://amitkapila16.blogspot.com/2017/03/hash-indexes-are-fa...
Based on the width of the key, the row count, and the amount of memory available, there may be cases where a disk hit gets replaced with an in-memory cache, which is awesome for spinning-disk-based systems.
> Based on the observed pattern of searches, MySQL builds a hash index using a prefix of the index key. The prefix of the key can be any length, and it may be that only some of the values in the B-tree appear in the hash index. Hash indexes are built on demand for those pages of the index that are often accessed.
> If a table fits almost entirely in main memory, a hash index can speed up queries by enabling direct lookup of any element, turning the index value into a sort of pointer. InnoDB has a mechanism that monitors index searches. If InnoDB notices that queries could benefit from building a hash index, it does so automatically.
I am really excited by some of the PostgreSQL developments lately, in particular parallel query execution. To the best of my knowledge, the only place that currently exists within MySQL (or MariaDB) is MySQL Cluster.
This is what I call trying things "at Indian scale" :D