Of course, that's often not an option when you're loading records into a live database that's also serving queries; you usually don't want every query to result in a full table scan.
This was well known 20+ years ago when I was an entry-level DBA, and I assumed it was still well known today.
Here's another tip, at least for MySQL, but possibly for other databases that have memory tables: import into MEMORY tables first, then insert from the memory table into the disk-based table. I took a process that was naively importing data via SQL commands, which took close to 24 hours, down to around 20 minutes by breaking it up: chunking imports into memory tables, then copying those to permanent disk. This was years ago (12?) and MySQL is probably better about insert handling now than it was then, but that approach (plus the drop/recreate indexes trick) turned a 24-hour import cycle into a smallish process.
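The same staging pattern can be sketched with Python's stdlib sqlite3, where an attached `:memory:` database stands in for a MySQL MEMORY table (table and column names here are made up for illustration):

```python
import sqlite3, tempfile, os

# Permanent, disk-based database
db_path = os.path.join(tempfile.mkdtemp(), "perm.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

# Attach an in-memory database as the staging area
# (the analog of a MySQL MEMORY table)
conn.execute("ATTACH DATABASE ':memory:' AS staging")
conn.execute("CREATE TABLE staging.events (id INTEGER, payload TEXT)")

# Import a chunk into the memory-backed stage
chunk = [(i, f"row-{i}") for i in range(10_000)]
conn.executemany("INSERT INTO staging.events VALUES (?, ?)", chunk)

# One set-based copy from memory to the disk table, then clear the stage
conn.execute("INSERT INTO main.events SELECT * FROM staging.events")
conn.execute("DELETE FROM staging.events")
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM main.events").fetchone()[0]
print(total)  # 10000
```

In a real pipeline you'd loop the stage/copy steps per chunk; the win comes from the disk table seeing a few large set-based inserts instead of thousands of small ones.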
Not sure where I first heard that, but it applies here. Essentially it is almost the same thing as saying that computers are often set up to exploit economies of scale.
Thus building an index all at once, after a large set of changes has been made, is more efficient than incrementally updating the index as each change is made.
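The load-first, index-after pattern is a one-liner reordering in code. A minimal sketch, with sqlite3 standing in for any B-tree-indexed store (table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, value REAL)")

# Load all rows first -- no index to maintain during the inserts...
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(i % 100, i * 0.1) for i in range(50_000)],
)

# ...then build the index in a single pass over the finished data,
# instead of 50,000 incremental B-tree insertions with page splits.
conn.execute("CREATE INDEX readings_sensor_idx ON readings (sensor_id)")

index_names = [row[1] for row in conn.execute("PRAGMA index_list('readings')")]
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

The single `CREATE INDEX` can sort the whole column once (roughly one O(n log n) pass) rather than paying a tree insertion per row.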
Kids, many years ago, even before jQuery, software would come with documentation that you could read and it would tell you how to use it effectively.
I know, crazy right? But to this day some of that old software, of which PostgreSQL is an example, still has this documentation that you can read, even before you use the software in a production system.
Yeah, yeah, I know Agile and Docker solved the problem of ever having to document anything, but this is the way things used to be and a few of us are stuck in our ways and still like it.
If you're getting old then I must be ancient! I remember when all the software documentation had to be printed on this white stuff made out of dead trees.
Intellisense has replaced the need to read the docs and Agile has replaced the need to understand what you're doing.
It's no surprise that basic knowledge found in the documentation is later "discovered" when the project is already running in production.
The difference is really in whether you recognize the issue and quietly hope no one finds out how dumb you really are, or whether you make a big celebratory blog post about the secret behind your "pioneering" work, making sure that your title and first and last name are clearly attached. And of course, we can't fail to highlight the further brilliance of accomplishing this marvelous feat by employing "rarely used, low-level" commands from within the framework's ORM.
Hold on to your butts, because next week he's going to learn that you can execute commands directly on the server, without even having to use the "low-level" elements of an ORM! I can't wait for the field to be revolutionized by Lead Developer James Gordon's next discovery.
The entirety of the "Why We Did It" section:
-----
> This improvement was pioneered by James Gordon, the Coalition’s lead developer.
> He drew instruction from PostgreSQL’s official documentation, which reads:
>> [snipping quoted sections from PostgreSQL manual at https://www.postgresql.org/docs/10/static/populate.html#POPU... ]
> Gordon’s code handles this task using rarely utilized, low-level tools in Django’s database manager.
-----
Sadly, in the current day and age, a developer actually taking the time to RTFM may indeed qualify as "pioneering" work!
Perhaps the rest of us need to start trumpeting our accomplishments when we find some clearly-stated performance gain in the manual, rather than hiding our heads in embarrassment for not finding out until we released version 2.2 of our mass DB import tool.
If so, how does that compare, in aggregate, to the time saved in the loads?
Or are you simply not putting the application back into service until the index rebuilds have finished? How long does that take, compared to the time saved?
EDIT: I'm mostly asking these questions to nudge people to think about them in the course of trying this in their own environments. It's my day job to think about these kinds of things; I've worn the PostgreSQL DBA hat for over a decade now.
Maybe we are missing something by getting rid of the DBAs.
Our project saw SIGNIFICANTLY better performance with batched multi-threaded INSERTs. If you can run a few hundred load threads and manage the concurrency correctly (not trivial), it will chew through big loads like a monster.
If I ever have the time/excuse, I want to go back and try a multi-threaded COPY. But if you need speed and have a choice between multi-threaded INSERTs or a single-threaded COPY, go with the INSERTs every time.
On a one-for-one basis, COPY IN will be faster than individual INSERTs:
- COPY uses a special optimization in the access method: instead of running the full insert logic (find a target page, lock it, insert, unlock page) per row, it batches all the rows that will fit on the target page.
- COPY overall has shorter code paths than regular inserts.
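For concreteness, here's how COPY is typically driven from client code. The table name and the psycopg2 connection are assumptions (the live `copy_from` call is left commented out so the snippet runs standalone); the runnable part just builds the tab-separated buffer that `COPY ... FROM STDIN` consumes in text format:

```python
import io

# Rows to load; in a real pipeline these would come from the CSV parser.
rows = [(1, "alice"), (2, "bob"), (3, "carol")]

# COPY's text format: tab-separated columns, one newline-terminated
# line per row.
buf = io.StringIO("".join(f"{rid}\t{name}\n" for rid, name in rows))

# With a live psycopg2 connection and an existing users(id, name)
# table (both assumed here), the whole load is one statement:
#
#   with conn.cursor() as cur:
#       cur.copy_from(buf, "users", columns=("id", "name"))
#   conn.commit()

first_line = buf.getvalue().splitlines()[0]
```

All the rows travel in a single stream, which is exactly what lets the server batch page-level work instead of running the full per-row insert path.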
None of you read all the performance "tricks" for Postgres before writing your first SQL statement.
Every day, somebody's born who doesn't know how to boost the performance of COPY by dropping indexes.
Whoever wrote the Django bit didn't really do a good job on the defaults.
But that's not what they're dealing with. They're dealing with CSV, presumably from some external source. It'd also be faster if they were dealing with pre-formed database files that they could just rsync. But they're not.