The issue with this approach is that, over time, it leads to lots of "dead rows": deleted rows that still take up space in your DB. Postgres has a background job called the "vacuum" which effectively garbage collects all deleted rows. Depending on your settings, the vacuum can be a pretty expensive job, and even with the vacuum you can still wind up using a lot more space than you actually need.
Zheap addresses these problems by using a different approach. When performing an update, instead of marking the old row as deleted and inserting a new one, Postgres will replace the old row with the new one and write a copy of the old row to a separate file. This means the main table file doesn't need to be larger than necessary to keep track of the dead rows.
Zheap does introduce lots of tricky scenarios. If for any reason you need to access the old copy of a row, you have to fetch it from the separate file. If the transaction that performed the update is rolled back, you need to replace the new version of the row with the old version. This sounds straightforward, but gets really tricky really fast. For example: say row X takes up 1kb. I replace it with row Y that takes up 500b, then write row Z, taking another 500b, right after Y. If the original transaction rolls back, row X no longer fits in its original spot, because row Z is now taking up part of the space it used to occupy.
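The space problem can be sketched with a toy fixed-size page. This is a hypothetical layout purely for illustration, not zheap's actual page format:

```python
# Toy model of a page where tuples are packed into a fixed amount of
# space. Hypothetical layout, only to illustrate the rollback problem.

PAGE_SIZE = 2048

class Page:
    def __init__(self):
        self.slots = []  # (name, size) pairs, packed in order

    def free(self):
        return PAGE_SIZE - sum(size for _, size in self.slots)

    def insert(self, name, size):
        if size > self.free():
            raise ValueError(f"{name} ({size}b) does not fit")
        self.slots.append((name, size))

    def replace(self, old_name, new_name, new_size):
        i = next(j for j, (n, _) in enumerate(self.slots) if n == old_name)
        old = self.slots[i]
        self.slots[i] = (new_name, new_size)  # in-place update
        return old  # the old version would go to the undo file

page = Page()
page.insert("X", 1024)
page.insert("filler", 924)           # other rows; only 100b left free
old_x = page.replace("X", "Y", 500)  # the update frees 524b in the page
page.insert("Z", 500)                # another insert reuses the freed space

# Rolling back must put X (1024b) back, but only Y's 500b plus the
# 124b still free are available -- X no longer fits in the page.
space_for_x = 500 + page.free()
assert space_for_x == 624 and space_for_x < 1024
```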
This is pretty much what Oracle is doing.
I can completely imagine the corner cases this ends up touching, as there must be an insane amount of logic in Postgres that assumes immutability of data on disk: if I read data block X for transaction version 1234, it will always remain the same.
It’s a very interesting approach though. Once all the corner cases are solved and dealt with, I wonder: are there any reasons not to choose zheap over the standard storage engine?
Most of the code having such low level assumptions has been centralized for PG12, as part of the introduction of pluggable table storage. Also note that that assumption as stated isn't actually true - we don't expect on-disk data to be immutable. There is on-access "pruning" of dead rows, there's VACUUM, there's new row versions on the same page etc.
I guess both options would be good to have, and it's interesting that recent SQL Server can do it both ways at the user's option (see "Accelerated Database Recovery", something that sounds conceptually a bit like PostgreSQL vacuum instead of ARIES style rollback where you have to undo changes).
I thought there was a fixed amount of space allocated per row, with variable-length blobs stored elsewhere; is my understanding wrong?
Only really large values that don't fit into a page are stored elsewhere, and only if they can't be compressed and made to fit. I think it's > 3kb. But I don't remember why that number comes to mind. Pages are 8kb.
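For reference, the documented default is a bit lower: out-of-line ("TOAST") storage is considered once a tuple exceeds roughly a quarter of an 8kb page, so about 2kb. A rough sketch of the decision, with the constants taken from those documented defaults (the real heap code is considerably subtler):

```python
# Rough sketch of the TOAST decision. Constants follow the documented
# defaults (~quarter of a page); the actual code compresses column by
# column and is more involved than this.
PAGE_SIZE = 8192
TOAST_TUPLE_THRESHOLD = PAGE_SIZE // 4  # ~2kb: when TOASTing is considered
TOAST_TUPLE_TARGET = PAGE_SIZE // 4     # try to shrink the tuple to this

def storage_plan(tuple_size, compressed_size):
    if tuple_size <= TOAST_TUPLE_THRESHOLD:
        return "inline"              # stays in the main table page
    if compressed_size <= TOAST_TUPLE_TARGET:
        return "inline, compressed"  # compression made it fit
    return "out of line"             # moved to the separate TOAST table

assert storage_plan(1500, 1500) == "inline"
assert storage_plan(6000, 1800) == "inline, compressed"
assert storage_plan(6000, 5000) == "out of line"
```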
I don't remember the name but there was a NoSQL DB on HN a couple weeks ago. I had to Google and read different websites for a solid 10 minutes before I got that it was indeed a NoSQL DB and what was the differentiating factor from other DBs.
I wonder if there is a logical overlap in the ideas between temporal table handling and this low level storage engine optimization.
Of course, support doesn't require Postgres-style MVCC. Oracle has time travel, which it calls Flashback ("SELECT AS OF"), even though it uses a different type of concurrency control.
For true bitemporalism you need more than current-row/historical-row separations, though.
But this can only happen if replacing X with Y and adding Z occur in the same transaction. If you roll back the transaction, then you start by removing Z, then replace Y with X. What am I missing here?
Z can be in a different transaction than X and Y. If two transactions run concurrently, one that replaces X with Y and a second one that inserts Z, the above scenario can happen.
Just curious why it’s being done now, as it sounds like a major design decision that would have been considered a long time ago.
IIRC, one of the big reasons for implementing pluggable storage was for Zheap.
FWIW, the existing implementation has worked really well for most use cases.
Andy Pavlo's class has a lot of good general information on the topic: https://15721.courses.cs.cmu.edu/spring2020/slides/03-mvcc1....
Does it do this in the reverse order? If not, how does it prevent data loss in case the main table update succeeds but writing to the "old copy" table fails?
I would guess so, but I haven't looked at the implementation, so I'm not sure. There are a ton of race conditions that can come up depending on the exact order in which you write things, so I'm sure the actual implementation is pretty messy.
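One plausible ordering, assuming the usual write-ahead discipline that ARIES-style systems follow (an assumption about undo-based designs in general, not a claim about zheap's actual code), is to make the undo record durable before overwriting the row:

```python
# Sketch of write-ahead ordering for an in-place update: the undo record
# is made durable *before* the row is overwritten. All names here are
# hypothetical stand-ins, not zheap's API.

undo_log = []                    # stands in for the separate undo file
table = {"row1": "old-value"}    # stands in for the main table

def durable_append(record):
    undo_log.append(record)      # imagine an fsync happening here

def update_in_place(key, new_value):
    # Step 1: persist the old version first.
    durable_append((key, table[key]))
    # Step 2: only now overwrite in place. A crash between the steps
    # leaves a redundant-but-harmless undo record; the reverse order
    # could lose the old version entirely.
    table[key] = new_value

def rollback():
    key, old = undo_log.pop()
    table[key] = old

update_in_place("row1", "new-value")
rollback()
assert table["row1"] == "old-value"
```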
> 12-10-2020: Most regression tests are passing, but write-speeds are still low.
This is basically what Oracle does with UNDO and it is obviously very scalable. Vacuum on the other hand has its limits, there is only so much you can squeeze out of a single threaded (at table level) process.
What I look most forward to is that it should enable FLASHBACK queries.
Disclosure: I work for VMware, who are sponsoring zedstore development via Greenplum.
Any word on when you are expecting stability? I'd love to see this in RDS.
I would be wary of this. Innodb, for example, also has an optimistic (in-place) UPDATE mode, and a pessimistic UPDATE mode.
Repeatedly updating the same row under the pessimistic mode would end up stalling the database, even at a rather low QPS.
https://bugs.mysql.com/bug.php?id=53825 was originally reported by Facebook 10 years ago, and is still not fixed in 8.0.21 with VATS / CATS scheduling.
And then there is also the performance gap between the ideal case where undo log is still in memory, versus the pathological case where undo log needs to be fetched from disk.
The last thing postgres needs is something that looks good on paper / in benchmark, but has a bunch of issues in production.
With the new pluggable storage API, are there any new storage engines other than zheap?
https://speakerdeck.com/macdice/transactions-in-postgresql-a...
Last year I helped a friend diagnose an issue they had with Postgres. A database table with a scheduled full DELETE/INSERT had slowed to the point of failure. It turns out, having slightly less IO than needed led the auto-VACUUM process to get further and further behind each time it ran.
My friend simply provisioned more IO and moved on. Another option would be to rewrite the process to naturally produce fewer dead rows. It would be great to have a third feasible option.
Make the new table unlogged while it’s being populated as well for that performance improvement if you want, since the atomic replace guarantees your consistency anyway.
Alternatively, if it’s not a full rebuild, move from delete/insert to INSERT .. ON CONFLICT DO UPDATE where possible. I do this for cache tables that store materialized summary views updated by pgAgent jobs.
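The upsert pattern looks something like this. The example runs against SQLite, whose ON CONFLICT syntax matches Postgres here; the table and column names are made up for illustration:

```python
# Runnable illustration of the upsert pattern. SQLite (>= 3.24) accepts
# the same ON CONFLICT syntax as Postgres; table/columns are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary_cache (key TEXT PRIMARY KEY, total INT)")

def upsert(key, total):
    # One statement that updates the existing row when the key already
    # exists, instead of a DELETE followed by an INSERT.
    conn.execute(
        "INSERT INTO summary_cache (key, total) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET total = excluded.total",
        (key, total),
    )

upsert("daily", 10)
upsert("daily", 25)  # second call updates the row rather than failing
row = conn.execute(
    "SELECT total FROM summary_cache WHERE key = 'daily'"
).fetchone()
print(row[0])  # 25
```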
If you are deleting “old” data, then maybe use partitioned tables so you can just DROP the partition containing the data you want gone.
1- https://thebuild.com/presentations/uber-perconalive-2017.pdf
Previous discussion from 2018 https://news.ycombinator.com/item?id=16526623
https://postgr.es/m/CAH2-Wz=CEKFa74EScx_hFVshCOn6AA5T-ajFAST...
This can be used without changing anything in the heap. It's less than a thousand lines of C. You could say that it's complementary to zheap, actually.
zheap makes it possible to update the same row many times without requiring new index entries, even when a long-running transaction holds back VACUUM. However, it does not avoid the problem of requiring a whole new set of index entries for all indexes whenever even one indexed column is modified by an update.
Strictly speaking my patch doesn't "fix" that problem, either, but it comes pretty close. It teaches the indexes to "fight back" against version churn caused by updates that cannot use the HOT optimization. It makes non-HOT updates responsible for cleaning up their own mess -- no more negative externalities. In practice this seems to more or less fix the exact thing that the Uber blog post complained about, which was the "write amplification" incurred in all indexes when only one indexed column was changed by an update.
(I am omitting some subtleties here, but that's the general thrust of it.)