Introducing HypoPG, hypothetical indexes for PostgreSQL (opens in new tab)

(postgresql.org)

145 pointsnarfz10y ago34 comments

34 comments

So presumably you could do the following on your entire database:

    for each query in my codebase Q:
        for each possible combination of indexes I for Q:
            create hypothetical index I
            run EXPLAIN query Q
            if Q uses I, create I

If EXPLAIN doesn't actually run a query, and creating hypothetical indexes doesn't actually write anything, the above could presumably done in milliseconds, no?

Why not take this one step further and just enable an autoindexing mode in postgres using the query history as a heuristic?

developer110y ago

For the typical larger project, you'd wind up with hundreds of indexes each used by a single application query. The optimal index arrangement for a database often does not result in all columns being indexed on a query. It's normal and desirable (for index size and performance) to have fewer indexes that may only partially cover your queries. For example, a query with a WHERE of "a AND b AND c" is more often than not better handled with an index on (a) or (a, b) rather than (a, b, c). Especially if you have a second query with "a AND b AND d", you're most likely going to wind up with a single index for (a, b) rather than one for (a, b, c) and another for (a, b, d).

What could be useful would be a generated report for all your queries with possible suggestions, grouping together queries with the same partial prefix columns, and the cardinality that each index would result in. Then one could manually review to look for the most significant improvements possible. I think a fully automated attempt (even built-in heuristics that modified indexes over time) would sometimes result in a bad decision being made that could theoretically crash your application. Perhaps with a few years of perfecting a solution this could become the norm. :)

ufmace10y ago

There's more to deciding whether to create any particular index than just whether it makes any of your queries faster. Each index can speed up SELECTs, but it also slows down any write queries on those tables. Each index also takes up space on disk and memory.

Then there's the question of how often which queries are run, and how important their performance is. Maybe query X is run by some background process, so nobody will notice if it takes 10x longer, but indexing the table to speed it up will slow down the INSERT statements on that table, which are in the user's path and will be noticed. Maybe it would also tax limited disk space on the DB server and require a hardware upgrade or a complex multi-server setup to work.

Coming up with an optimal indexing scheme for any particular database is a series of complex trade-offs which include taking customer needs and business requirements into account. I don't think it will be practical to do it automatically anytime soon.

jedberg10y ago

Combining this with just a bit of programming loops could get you pretty far at automatically creating indexes. Basically take a query plan and then create all possibly indexes hypothetically until you get the most optimal query.

timdorr10y ago

Except PG index usage in the query plan is dependant on stats about the table (row count, width, etc). So, you would need to continually run it against production data and have it forever chase the perfect set of indices.

If you did end up making it, it should be named moby_dick.py.

CaveTech10y ago

Or you tune for the best average execution time. Sample your queries and run them all through each query plan. One has to be optimal, unless all changes are negligible.

2 more replies

dunkelheit10y ago

Cool one! Thankfully postgres allows creating indexes concurrently (without blocking writes) but 'hypothetical indexes' seem more convenient.

What would be even cooler though is the ability to see multiple plans considered (and rejected) by the planner with all associated costs. It often takes a lot of painful guesswork to understand why particular superior plan is not used.

makmanalp10y ago

I'd always wondered why query engines don't create such temporary indices, for example in the case of large subquery queries, where creating and index AND doing the query with an index runs way faster in total than waiting for it to run without the index. Any insights on this? I guess predicting the gain could be difficult.

woodman10y ago

SQLite does this, autoindex[0].

"When no indices are available to aid the evaluation of a query, SQLite might create an automatic index that lasts only for the duration of a single SQL statement. Since the cost of constructing the automatic index is O(NlogN) (where N is the number of entries in the table) and the cost of doing a full table scan is only O(N), an automatic index will only be created if SQLite expects that the lookup will be run more than logN times during the course of the SQL statement."

[0] http://www.sqlite.org/optoverview.html#autoindex

MSM10y ago

There shouldn't be a situation where reading the data + creating the index would be less intensive than just reading the data.

This is from the SQL server perspective so I apologize if all this doesn't translate 100%. Let's say the column you're filtering on is currently not in any index. That field then needs to be pulled out of the clustered index's leaf pages. Since it's unsorted, we'd need to read the entire table- a full scan.

To build the temporary index we'd also need to read in all of the data, so the same amount of work from that perspective, but then we'd have to sort it all, which is very intensive.

If there's a situation where we're doing something analogous to a inner loop join in SQL Server, but it's doing full scans for each iteration, that's a problem with the query optimizer.

daamien10y ago

Temporary index and Hypothetical index are two different things.

Temporary index is a real index that's usually dynamically created by the optimizer to speed up a query. But it's kind of useless because most of the time the cost of creating a one-time index exceed the cost of not using it. Afaik only DB2 as this feature

http://www.centerfieldtechnology.com/PDFs/DB2%20Temp%20Index...

An hypotetical index is just a way to "trick" the planner into thinking that there's an index and see what the query plan would be if that index really existed. The cost of creating this is close to zero.

hawkice10y ago

If you do N self-joins in a row, creating the index on the fly would be a good move, because they'd have to read through the same data multiple times. The real world example that springs to mind is comments that can nest up to e.g. 5 layers, where it kinda makes sense to grab it all in one query like that.

Obvious caveats: I would just have a 'top-level-post-id' so I could grab everything I need and use the foreign keys to arrange them in memory, not for discovering the rows I need, also wowzers doing _any_ joins without an index is silly, forget multiple in one query.

btown10y ago

Very true. That said, there are almost always situations where reading the data + creating the index + using it N times would be less intensive than just reading the data N times, for very small N.

I'd love to see Postgres or MongoDB with an index-suggesting logger - I'd pay the cost of the logging to have it run on every query, then a background thread would figure out what the "hot spot" types of queries are, figure out what indices to create concurrently, and show it to me in a web interface where I can click a button and create the indices I need concurrently during a low-traffic period of time.

Does anything out there like this exist?

3 more replies

rosser10y ago

Your scenario is valid for "index organized tables", where the data for a given row is stored in one of the index's leaf pages. Synthesizing an index on-demand for those involves an extraordinary amount of random IO.

In postgres, a table's data is stored in the "heap" (a disk file representing the table, itself; indexes are separate disk files). Consequently, you can do a sequential pass through the table, and then calculate/build the index in RAM (assuming you have sufficient "maintenance_work_mem", in postgres configuration terms, otherwise you'll spill to disk at this step, too), and then write it out sequentially as well — and only then walk the index depth-first to get to the leaf page that points to the disk page in the heap that contains the row you're selecting.

(And even that random depth-first traversal of the index will probably actually happen in RAM, because postgres significantly defers to the kernel's buffer cache to mitigate IO costs, and the index you're traversing is pretty much guaranteed to be cached at this point.)

EDIT: It would apply to MySQL using InnoDB, however, as IIRC that does store table data on the leave pages of the primary key index. Can someone corroborate that?

pizza23410y ago

I think the parent refers to a specific optimization that MySQL doesn't perform:

SELECT ... FROM ... WHERE field IN ( SELECT field2 FROM ... )

MySQL doesn't acceptably optimize such queries (at v5.6, as far as I remember).

This is a more complex case than the one you're pointing - the field2 records may be an intermediate result, logically and/or physically.

For example of the latter, in a typical mysql execution with a relatively large number of records, the subquery records are likely to reside on a temporary swapped table. In this case, the optimizer would need to be able to create the intermediate table with optimal data structure[s], which (if I remember correctly) version 5.6 is not capable of.

pizza23410y ago

Watch out, the extension does not create "temporary indices". They can't be used in queries.

It only creates virtual indices, whose only purpose is to observe how the query optimizer would behave, if they were real.

I find it quite a curious idea, although an important parameter of the indices is the cardinality, which I don't see customizable.

kbenson10y ago

Yes, in this case hypothetical means conjecture. "Let's get information about how the system would function if there existed an index as such..."

yangyang10y ago

Well that's pretty much what a hash join is doing - scan a table creating a hash table (effectively an in-memory index), then scan the inner table looking up values in the hash table.

A hash table is a more effective data structure than a btree for in-memory search.

ankrgyl10y ago

Most relatively mature database implementations do in fact apply this optimization. These databases have a component which can estimate the cost and benefit of such operations before running the query. The optimization that you mentioned turns out to be a no-brainer in most cases.

thirdsun10y ago

As far as I understand it FileMaker does this based on your searches/queries.

michaelmior10y ago

This is quite excellent for those of us doing research in physical schema design. As far as I know, this is the first "what-if" analysis tool integrated into an open source database. Although if there are others, I'd love to know about them :)

willu10y ago

One of my major hang-ups about transitioning away from SQL Server to Postgres is the productivity that comes from tools like Profiler and Database Engine Tuning Advisor (which uses hypothetical indexes to find performance improvements). Can anyone suggest analogous tools for Postgres?

postila10y ago

You should a couple of years more.

There is a LOT of efforts being made these days (like http://www.pgcon.org/2015/schedule/events/809.en.html for profiling) and eventually we'll see stuff not worse than commercial DBMSs have. I hope it will be soon.

postila10y ago

*wait

j / k navigate · click thread line to collapse

34 comments

ninkendo10y ago

So presumably you could do the following on your entire database:

    for each query in my codebase Q:
        for each possible combination of indexes I for Q:
            create hypothetical index I
            run EXPLAIN query Q
            if Q uses I, create I

If EXPLAIN doesn't actually run a query, and creating hypothetical indexes doesn't actually write anything, the above could presumably done in milliseconds, no?

Why not take this one step further and just enable an autoindexing mode in postgres using the query history as a heuristic?

developer110y ago

ufmace10y ago

jedberg10y ago

timdorr10y ago

If you did end up making it, it should be named moby_dick.py.

CaveTech10y ago

Or you tune for the best average execution time. Sample your queries and run them all through each query plan. One has to be optimal, unless all changes are negligible.

2 more replies

dunkelheit10y ago

Cool one! Thankfully postgres allows creating indexes concurrently (without blocking writes) but 'hypothetical indexes' seem more convenient.

makmanalp10y ago

woodman10y ago

SQLite does this, autoindex[0].

[0] http://www.sqlite.org/optoverview.html#autoindex

MSM10y ago

There shouldn't be a situation where reading the data + creating the index would be less intensive than just reading the data.

To build the temporary index we'd also need to read in all of the data, so the same amount of work from that perspective, but then we'd have to sort it all, which is very intensive.

If there's a situation where we're doing something analogous to a inner loop join in SQL Server, but it's doing full scans for each iteration, that's a problem with the query optimizer.

daamien10y ago

Temporary index and Hypothetical index are two different things.

http://www.centerfieldtechnology.com/PDFs/DB2%20Temp%20Index...

hawkice10y ago

btown10y ago

Very true. That said, there are almost always situations where reading the data + creating the index + using it N times would be less intensive than just reading the data N times, for very small N.

Does anything out there like this exist?

3 more replies

rosser10y ago

EDIT: It would apply to MySQL using InnoDB, however, as IIRC that does store table data on the leave pages of the primary key index. Can someone corroborate that?

pizza23410y ago

I think the parent refers to a specific optimization that MySQL doesn't perform:

SELECT ... FROM ... WHERE field IN ( SELECT field2 FROM ... )

MySQL doesn't acceptably optimize such queries (at v5.6, as far as I remember).

This is a more complex case than the one you're pointing - the field2 records may be an intermediate result, logically and/or physically.

pizza23410y ago

Watch out, the extension does not create "temporary indices". They can't be used in queries.

It only creates virtual indices, whose only purpose is to observe how the query optimizer would behave, if they were real.

I find it quite a curious idea, although an important parameter of the indices is the cardinality, which I don't see customizable.

kbenson10y ago

Yes, in this case hypothetical means conjecture. "Let's get information about how the system would function if there existed an index as such..."

yangyang10y ago

Well that's pretty much what a hash join is doing - scan a table creating a hash table (effectively an in-memory index), then scan the inner table looking up values in the hash table.

A hash table is a more effective data structure than a btree for in-memory search.

ankrgyl10y ago

thirdsun10y ago

As far as I understand it FileMaker does this based on your searches/queries.

michaelmior10y ago

willu10y ago

postila10y ago

You should a couple of years more.

postila10y ago

*wait

j / k navigate · click thread line to collapse