Of course, that's often not an option when you're loading records into a live database that's also serving queries; you usually don't want every query to result in a full table scan.
This was well known 20+ years ago when I was an entry-level DBA, and I assumed it was still well known today.
Here's another tip, at least for MySQL, but possibly for other databases that have memory tables: import into MEMORY tables first, then insert from the memory table into the disk-based table. I took a process that was naively importing data via SQL commands, which took close to 24 hours, down to around 20 minutes by breaking it up: chunking imports into memory tables, then copying those to permanent disk. This was years ago (12?) and MySQL is probably better about insert handling now than it was then, but that approach (plus the drop/recreate indexes trick) turned a 24-hour import cycle into a smallish process.
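The same staging pattern can be sketched with Python's stdlib sqlite3, where an attached `:memory:` database stands in for a MySQL MEMORY table (table and column names here are made up for illustration):

```python
import sqlite3, tempfile, os

# Permanent, disk-based database
db_path = os.path.join(tempfile.mkdtemp(), "perm.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

# Attach an in-memory database as the staging area
# (the analog of a MySQL MEMORY table)
conn.execute("ATTACH DATABASE ':memory:' AS staging")
conn.execute("CREATE TABLE staging.events (id INTEGER, payload TEXT)")

# Import a chunk into the memory-backed stage
chunk = [(i, f"row-{i}") for i in range(10_000)]
conn.executemany("INSERT INTO staging.events VALUES (?, ?)", chunk)

# One set-based copy from memory to the disk table, then clear the stage
conn.execute("INSERT INTO main.events SELECT * FROM staging.events")
conn.execute("DELETE FROM staging.events")
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM main.events").fetchone()[0]
print(total)  # 10000
```

In a real pipeline you'd loop the stage/copy steps per chunk; the win comes from the disk table seeing a few large set-based inserts instead of thousands of small ones.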
Not sure where I first heard that, but it applies here. Essentially it is almost the same thing as saying that computers are often set up to exploit economies of scale.
Thus building an index all at once, after a large set of changes has been made, is more efficient than incrementally updating the index as each change is made.
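The load-first, index-after pattern is a one-liner reordering in code. A minimal sketch, with sqlite3 standing in for any B-tree-indexed store (table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, value REAL)")

# Load all rows first -- no index to maintain during the inserts...
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(i % 100, i * 0.1) for i in range(50_000)],
)

# ...then build the index in a single pass over the finished data,
# instead of 50,000 incremental B-tree insertions with page splits.
conn.execute("CREATE INDEX readings_sensor_idx ON readings (sensor_id)")

index_names = [row[1] for row in conn.execute("PRAGMA index_list('readings')")]
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

The single `CREATE INDEX` can sort the whole column once (roughly one O(n log n) pass) rather than paying a tree insertion per row.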
Kids, many years ago, even before jQuery, software would come with documentation that you could read and it would tell you how to use it effectively.
I know, crazy right? But to this day some of that old software, of which PostgreSQL is an example, still has this documentation that you can read, even before you use the software in a production system.
Yeah, yeah, I know Agile and Docker solved the problem of ever having to document anything, but this is the way things used to be and a few of us are stuck in our ways and still like it.
If you're getting old then I must be ancient! I remember when all the software documentation had to be printed on this white stuff made out of dead trees.
Intellisense has replaced the need to read the docs and Agile has replaced the need to understand what you're doing.
It's no surprise that basic knowledge found in the documentation is later "discovered" when the project is already running in production.
The difference is really in whether you recognize the issue and quietly hope no one finds out how dumb you really are, or whether you make a big celebratory blog post about the secret behind your "pioneering" work, making sure that your title and first and last name are clearly attached. And of course, we can't fail to highlight the further brilliance of accomplishing this marvelous feat by employing "rarely used, low-level" commands from within the framework's ORM.
Hold on to your butts, because next week he's going to learn that you can execute commands directly on the server, without even having to use the "low-level" elements of an ORM! I can't wait for the field to be revolutionized by Lead Developer James Gordon's next discovery.
The entirety of the "Why We Did It" section:
-----
> This improvement was pioneered by James Gordon, the Coalition’s lead developer.
> He drew instruction from PostgreSQL’s official documentation, which reads:
>> [snipping quoted sections from PostgreSQL manual at https://www.postgresql.org/docs/10/static/populate.html#POPU... ]
> Gordon’s code handles this task using rarely utilized, low-level tools in Django’s database manager.
-----
Sadly, in the current day and age, a developer actually taking the time to RTFM may indeed qualify as "pioneering" work!
Perhaps the rest of us need to start trumpeting our accomplishments when we find some clearly-stated performance gain in the manual, rather than hiding our heads in embarrassment for not finding out until we released version 2.2 of our mass DB import tool.
If so, how does that compare, in aggregate, to the time saved in the loads?
Or are you simply not putting the application back into service until the index rebuilds have finished? How long does that take, compared to the time saved?
EDIT: I'm mostly asking these questions to nudge people to think about them in the course of trying this in their own environments. It's my day job to think about these kinds of things; I've worn the PostgreSQL DBA hat for over a decade now.
Maybe we are missing something by getting rid of the DBAs.
Our project saw SIGNIFICANTLY better performance with batched multi-threaded INSERTs. If you can run a few hundred load threads and manage the concurrency correctly (not trivial), it will chew through big loads like a monster.
If I ever have the time/excuse, I want to go back and try a multi-threaded COPY. But if you need speed and have a choice between multi-threaded INSERTs or a single-threaded COPY, go with the INSERTs every time.
On a one-for-one basis, COPY IN will be faster than individual INSERTs:
- COPY uses a special optimization in the access method: instead of running the full insert logic (find a target page, lock it, insert, unlock page) per row, it batches all the rows that will fit on the target page.
- COPY overall has shorter code paths than regular inserts.
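For concreteness, here's how COPY is typically driven from client code. The table name and the psycopg2 connection are assumptions (the live `copy_from` call is left commented out so the snippet runs standalone); the runnable part just builds the tab-separated buffer that `COPY ... FROM STDIN` consumes in text format:

```python
import io

# Rows to load; in a real pipeline these would come from the CSV parser.
rows = [(1, "alice"), (2, "bob"), (3, "carol")]

# COPY's text format: tab-separated columns, one newline-terminated
# line per row.
buf = io.StringIO("".join(f"{rid}\t{name}\n" for rid, name in rows))

# With a live psycopg2 connection and an existing users(id, name)
# table (both assumed here), the whole load is one statement:
#
#   with conn.cursor() as cur:
#       cur.copy_from(buf, "users", columns=("id", "name"))
#   conn.commit()

first_line = buf.getvalue().splitlines()[0]
```

All the rows travel in a single stream, which is exactly what lets the server batch page-level work instead of running the full per-row insert path.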
None of you read all the performance "tricks" for Postgres before writing your first SQL statement.
Every day, somebody's born who doesn't know how to boost the performance of COPY by dropping indexes.
Whoever wrote the Django bit didn't really do a good job on the defaults.
But that's not what they're dealing with. They're dealing with CSV, presumably from some external source. It'd also be faster if they were dealing with pre-formed database files that they could just rsync. But they're not.