create index users_email on users(email) where status != 'delete'
If your query's WHERE clause doesn't properly match the WHERE clause of the index, the index will not be used. It is easy to forget this, or to get it wrong in subtle ways. Here is an example from work.
There was an event tracing table whose rows carried a severity_id. Severities 0-6 inclusive are user-facing events; 7 and up are debug events. In practice all debug events were exactly 7 and there were no other values above 7. The table had a partial index with WHERE severity_id < 7. I tracked down a performance regression to an ORM which (due to programmer error) generated WHERE severity_id != 7. The database obviously cannot tell that there will never be any values above 7, so the index was not used, slowing down event handling. Rewriting the query to match < 7 fixed the problem. The database might also be unable to infer that the index applies when prepared statements are involved, e.g. WHERE severity_id < ?. It cannot tell that every binding of ? will satisfy < 7, so it will not use the index (unless you are running PG 12, where that might depend on the setting of plan_cache_mode[1], but I have not tested that yet).
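A hypothetical sketch of the schema and queries (table and index names made up):

```sql
-- Severities 0-6 are user facing; 7 is debug, and nothing above 7 exists.
CREATE INDEX events_severity_idx ON events (severity_id)
  WHERE severity_id < 7;

-- Matches the index predicate: the partial index can be used.
SELECT * FROM events WHERE severity_id < 7;

-- Logically equivalent given our actual data, but the planner can't
-- prove "!= 7" implies "< 7", so it falls back to a sequential scan.
SELECT * FROM events WHERE severity_id != 7;
```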
Another thing is that HOT updates in PostgreSQL can't be performed if the updated column is indexed, and that includes appearing in the WHERE clause of a partial index. So you could have a site like HN and think it would be nice to index stories WHERE vote > 100 to quickly find the more popular stories. That index, however, would rule out the possibility of a HOT update whenever the vote tally is updated. Again, not necessarily a problem, but you need to know the possible drawbacks.
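Roughly, with a hypothetical stories table:

```sql
CREATE TABLE stories (id bigint PRIMARY KEY, title text, vote int);

-- Handy for quickly finding popular stories...
CREATE INDEX stories_popular_idx ON stories (id) WHERE vote > 100;

-- ...but because "vote" now appears in an index predicate, this
-- update can no longer be HOT: every vote bump also has to touch
-- the table's indexes.
UPDATE stories SET vote = vote + 1 WHERE id = 42;
```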
That said, they are great when used for the right purpose. Kudos to the author for a nice article!
You say "obviously", but with up-to-date statistics this is exactly the kind of thing you might expect the planner to know and use to aid index decisions.
I'm a huge fan of Postgres, coming to it around 5 years ago from at least 10 previous years with SQL Server, but I have hit a few things like this in that time. IME the planner is much more fickle about how you specify your predicates than SQL Server is.
You could argue that it should be able to use the index if the table has a check constraint preventing severity_id above 7 from ever being inserted. That is something that could be done; I don't know whether PostgreSQL does it (I doubt it) or how feasible it would be.
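Something like this (whether the planner would actually exploit it is exactly what I'm unsure about): with this constraint in place, for an integer column, severity_id != 7 together with severity_id <= 7 does logically imply severity_id < 7.

```sql
-- Guarantee no severity above 7 can ever be inserted.
ALTER TABLE events
  ADD CONSTRAINT severity_range CHECK (severity_id BETWEEN 0 AND 7);
```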
Is SQL Server able to make an assumption like that purely based on statistics? Genuine question.
On the other hand, with sqlite, the WHERE clause of the query and the partial index must match *literally*. So let's say you have a partial index with WHERE severity_id != 0, and a query with WHERE severity_id = 1. All the rows with severity_id = 1 are already indexed, but the engine is still not able to make use of the partial index. This one bit us hard.
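In SQLite terms, assuming a hypothetical events table:

```sql
-- Partial index covering every non-zero severity.
CREATE INDEX events_nonzero_idx ON events (severity_id)
  WHERE severity_id != 0;

-- Every severity_id = 1 row is in the index, but (at least in the
-- SQLite versions we used) the planner only accepts a literal match
-- with the index's WHERE clause, so this does a full table scan:
SELECT * FROM events WHERE severity_id = 1;
```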
Is PG smart enough to avoid that if the query patterns are frequently or exclusively covered by the index?
That left me wondering: if all indexes are by default partial in Oracle… how does one make an unpartial? nonpartial? index.
https://use-the-index-luke.com/sql/where-clause/null/index
Apparently, you add a computed column to the index that just computes a constant value. A single non-null column then causes the NULLs in the other columns to get indexed; it's only when the whole tuple is composed of NULLs that it gets left out.
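Per the linked article, the trick looks roughly like this in Oracle (the constant expression keeps any index tuple from being all-NULL):

```sql
-- Without the trailing constant, rows whose date_of_birth is NULL
-- would not appear in the index at all (Oracle skips all-NULL
-- index tuples); with it, every row gets indexed.
CREATE INDEX emp_dob ON employees (date_of_birth, '1');
```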
That also seems like a bug waiting to happen; someone inverts a query to find unset (NULL) entries, and now you're doing a table scan.
…but it also seems like a form of brain rot induced by a particular implementation, similar to how I've had MySQL users ask how to make a key on a table, where the "key" in question is an index. MySQL by default uses the word "key" to mean index, instead of … key¹. (The query language even accepts "INDEX" in place of "KEY", but things like "SHOW CREATE TABLE" default to the linguistically, if not programmatically, "wrong" word.) And then you might have to untangle why these are two different concepts, and how they differ. It's very Arrival, in the sense of language (mis-)shaping perception.
¹a key is a set of columns that are sufficient to identify a row. The primary such set of columns is … the primary key. An index can index a key (if more than one exists within a table), but it doesn't have to.
It means basically exporting your database (or tables) and importing it again. Deleted data doesn't necessarily free up space on disk (Oracle sometimes reuses freed space), but it doesn't get exported, so the re-imported copy comes back compacted.
https://www.iri.com/blog/vldb-operations/database-reorgs-why...
https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUEST...
I used PostgreSQL fairly recently (a year or so ago?) and ended up abandoning it after I was forced to do the export/import dance through a few version upgrades.
When did that requirement go away?
> If for some reason you had to stop the rebuild in the middle, the new index will not be dropped. Instead, it will be left in an invalid state and consume space.
Well, that sure sounds like a bug in PostgreSQL to me.
Is there any good reason to keep it? (The fact that it was "created by some user" doesn't seem like much of a reason.)
IMHO, creating an index should be atomic: Either you end up with a valid index, or you end up with nothing.
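For what it's worth, the leftover invalid index is at least easy to find and drop by hand, something like:

```sql
-- List indexes left invalid by an interrupted concurrent (re)build.
SELECT c.relname
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;

-- ...then drop the leftover (REINDEX CONCURRENTLY leaves a "_ccnew"
-- suffixed copy behind) and retry:
-- DROP INDEX CONCURRENTLY some_index_ccnew;
```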
It looks like in v13+ PostgreSQL can create a single leaf tuple for NULL data and just store row pointers within it, which should reduce index sizes at least a bit.
So, deduplication saves some 65% in index size for NULL-only index-tuples, and the further 35% can be saved by using a partial index (so, in this case, deduplication could have saved 13GB).
*note: last time I checked, REINDEX with deduplication enabled packs 50 duplicates into one compressed index tuple. This varies for naturally grown indexes, and changes with column types and update access patterns.
Pre-13 it was 16 bytes per index tuple (20 when compiled for 64-bit), and post-13 it is about 6.32 bytes per heap tuple once deduplication has kicked in.
If I had to guess, I would say that it doesn't accomplish anything (or as much as you'd think) for null values simply because there is no real data to store in either approach, you just have a bunch of pointers either way.
The NULL issue is documented directly -- see the "Note" box here:
https://www.postgresql.org/docs/devel/btree-implementation.h...
Ohh, we've had issues with this. We have this table that's mostly ephemeral data, so rows are constantly inserted and then deleted after a certain amount of time. Due to a bug the deletion didn't work for a while and the db grew very large. Fixed the deletion, but no amount of vacuuming actually allows us to fully reclaim that space so we don't have to pay for it.
At the same time, the extra cost is probably negligible compared to spending more energy fixing it.
Another solution is tombstoning data so you never actually do a DELETE, and partial indexes go a long way to making that scale. It removes the logn cost of all of the dead data on every subsequent insert.
This is a classic case where partitioning shines. Let's say those are logs. You partition the table monthly and want to retain 3 months of data.
- M1
- M2
- M3
When M4 arrives you drop partition M1. This is a very fast operation and the space is returned to the OS. You also don't need to vacuum after dropping it. When you arrive at M5 you repeat the process by dropping M2.
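A sketch with declarative partitioning and a hypothetical logs table:

```sql
CREATE TABLE logs (ts timestamptz NOT NULL, msg text)
  PARTITION BY RANGE (ts);

CREATE TABLE logs_m1 PARTITION OF logs
  FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
-- ...logs_m2, logs_m3 created likewise, one per month.

-- When M4 arrives: near-instant, returns the space to the OS,
-- and leaves nothing behind for vacuum to clean up.
DROP TABLE logs_m1;
```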
> Another solution is tombstoning data so you never actually do a DELETE, and partial indexes go a long way to making that scale. It removes the logn cost of all of the dead data on every subsequent insert.
If you are referring to PostgreSQL then this would actually be worse than outright doing a DELETE. PostgreSQL is copy-on-write, so an UPDATE to an is_deleted column will create a new copy of the record and a new entry in all its indexes. The old one would still need to be vacuumed. You will accumulate bloat faster and vacuums will have more work to do. Additionally, since is_deleted would be part of partial indexes like you said, a deleted record would also incur a copy in every index present on the table.
Compare that to just doing the DELETE, which merely stores the transaction ID of the deleting query in the row's xmax; a subsequent vacuum can then mark the space as reusable by further inserts.
> but no amount of vacuuming actually allows us to fully reclaim that space
A full vacuum would, but it would also lock the table for the duration (which is something pg_repack won't do).
Indexing boolean columns is often only useful if one of the values is uncommon and the index is partial to only include those uncommon rows.
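For example, with hypothetical names:

```sql
-- Useful when, say, <1% of rows are still unprocessed: the index
-- stays tiny, and only the queries hunting for the rare value use it.
CREATE INDEX jobs_pending_idx ON jobs (created_at)
  WHERE NOT processed;
```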
>Re-create the table: Using this method as described above often requires a lot of development, especially if the table is actively being used as it's being rebuilt.
>Vacuum the table: PostgreSQL provides a way to reclaim space occupied by dead tuples in a table using the VACUUM FULL command. Vacuum full requires a lock on the table, and is not an ideal solution for tables that need to be available while being vacuumed:
This is confusing to me. I thought Postgres was supposed to be better than MySQL, yet MySQL has non-locking commands to recreate a table; it has like 3 that would fit here, and they deal with the indexes in one command.
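For reference, the MySQL commands I mean look roughly like this (InnoDB online DDL; the exact locking behavior varies by version and by what the table contains):

```sql
-- Rebuild the table in place; InnoDB performs this as online DDL,
-- allowing concurrent reads and writes for most of the operation.
ALTER TABLE events ENGINE=InnoDB, ALGORITHM=INPLACE, LOCK=NONE;

-- On InnoDB, OPTIMIZE TABLE maps to the same kind of online rebuild.
OPTIMIZE TABLE events;
```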
We have one huge table I want to add some indexes to for specific cases (covering at most 1% of records), but the server will not have enough memory if I add those indexes over all records :/
Annoying, but it should work for most queries I'd expect without too much SQL.
I've definitely "rolled my own indexing" like this in the past, though it's more often been duplicating strings into a custom "collation" or other transformations.
Another solution is simply to split your table in two, with the same columns in both, and the index only on one of the tables. But of course that really depends on your business logic -- queries that need to retrieve data from both tables together can get pretty hairy/slow, and if you've got auto-incrementing PKEY's then avoiding collisions between the two tables can be tricky on its own. So this is definitely the less general solution.
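A rough sketch of the split, with hypothetical table names:

```sql
-- Same columns in both; the index lives only on the "hot" table.
CREATE TABLE events_hot  (id bigint PRIMARY KEY, payload text);
CREATE TABLE events_cold (id bigint PRIMARY KEY, payload text);
CREATE INDEX events_hot_payload_idx ON events_hot (payload);

-- Queries that need both halves get hairier (and slower):
CREATE VIEW events_all AS
  SELECT * FROM events_hot
  UNION ALL
  SELECT * FROM events_cold;
```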
Of course, it certainly would be nicer if MySQL supported partial indexes. It seems so useful, I'm surprised it didn't happen long ago.
> TokuDB has been deprecated by its upstream maintainer. It is disabled from MariaDB 10.5 and has been been removed in MariaDB 10.6 - MDEV-19780. We recommend MyRocks as a long-term migration path.
https://mariadb.com/kb/en/tokudb/
Is MyRocks comparable?
Available storage depends on usage and capacity.
Edit: I meant for this article; of course I believe it is useful to track this in practice.
Available makes sense for knowing when things will just plain break (reaching 0 = write failure for a DB).
>Every few months we get an alert from our database monitoring to warn us that we are about to run out of space.
In this case they were avoiding their DB server breaking. They didn't do this for performance reasons.