The issue with this approach is that, over time, it leads to lots of "dead rows": deleted rows that still take up space in your DB. Postgres has a background job called the "vacuum" which effectively garbage collects all deleted rows. Depending on your settings, the vacuum can be a pretty expensive job, and even with the vacuum you can still wind up using a lot more space than you actually need.
Zheap addresses these problems by using a different approach. When performing an update, instead of marking the old row as deleted and inserting a new one, Postgres will replace the old row with the new one and write a copy of the old row to a separate file. This means the main table file doesn't need to be larger than necessary to keep track of the dead rows.
Zheap does introduce lots of tricky scenarios. If for any reason you need to access the old copy of a row, you have to fetch it from the separate file. If the transaction that performed the update is rolled back, you need to replace the new version of the row with the old version. This sounds straightforward, but gets really tricky really fast. For example: say row X takes up 1kb. I replace it with row Y that takes up 500b, then write row Z, taking another 500b, right after Y. If the original transaction rolls back, row X no longer fits in its original spot, because row Z is now taking up part of the space it used to occupy.
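The space problem can be sketched with a toy fixed-size page. This is a hypothetical layout purely for illustration, not zheap's actual page format:

```python
# Toy model of a page where tuples are packed into a fixed amount of
# space. Hypothetical layout, only to illustrate the rollback problem.

PAGE_SIZE = 2048

class Page:
    def __init__(self):
        self.slots = []  # (name, size) pairs, packed in order

    def free(self):
        return PAGE_SIZE - sum(size for _, size in self.slots)

    def insert(self, name, size):
        if size > self.free():
            raise ValueError(f"{name} ({size}b) does not fit")
        self.slots.append((name, size))

    def replace(self, old_name, new_name, new_size):
        i = next(j for j, (n, _) in enumerate(self.slots) if n == old_name)
        old = self.slots[i]
        self.slots[i] = (new_name, new_size)  # in-place update
        return old  # the old version would go to the undo file

page = Page()
page.insert("X", 1024)
page.insert("filler", 924)           # other rows; only 100b left free
old_x = page.replace("X", "Y", 500)  # the update frees 524b in the page
page.insert("Z", 500)                # another insert reuses the freed space

# Rolling back must put X (1024b) back, but only Y's 500b plus the
# 124b still free are available -- X no longer fits in the page.
space_for_x = 500 + page.free()
assert space_for_x == 624 and space_for_x < 1024
```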
This is pretty much what Oracle is doing.
I can completely imagine the corner cases this ends up touching, as there must be an insane amount of logic in Postgres that assumes immutability of data on disk: if I read data block X for transaction version 1234, it will always remain the same.
It’s a very interesting approach though. Once all the corner cases are solved and dealt with, I wonder: are there any reasons not to choose zheap over the standard storage engine?
Most of the code having such low level assumptions has been centralized for PG12, as part of the introduction of pluggable table storage. Also note that that assumption as stated isn't actually true - we don't expect on-disk data to be immutable. There is on-access "pruning" of dead rows, there's VACUUM, there's new row versions on the same page etc.
I guess both options would be good to have, and it's interesting that recent SQL Server can do it both ways at the user's option (see "Accelerated Database Recovery", something that sounds conceptually a bit like PostgreSQL vacuum instead of ARIES style rollback where you have to undo changes).
I thought there was a fixed amount of space allocated per row, with variable-length blobs stored elsewhere; is my understanding wrong?
Only really large values that don't fit into a page are stored elsewhere, and only if they can't be compressed and made to fit. I think it's > 3kb. But I don't remember why that number comes to mind. Pages are 8kb.
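For reference, the documented default is a bit lower: out-of-line ("TOAST") storage is considered once a tuple exceeds roughly a quarter of an 8kb page, so about 2kb. A rough sketch of the decision, with the constants taken from those documented defaults (the real heap code is considerably subtler):

```python
# Rough sketch of the TOAST decision. Constants follow the documented
# defaults (~quarter of a page); the actual code compresses column by
# column and is more involved than this.
PAGE_SIZE = 8192
TOAST_TUPLE_THRESHOLD = PAGE_SIZE // 4  # ~2kb: when TOASTing is considered
TOAST_TUPLE_TARGET = PAGE_SIZE // 4     # try to shrink the tuple to this

def storage_plan(tuple_size, compressed_size):
    if tuple_size <= TOAST_TUPLE_THRESHOLD:
        return "inline"              # stays in the main table page
    if compressed_size <= TOAST_TUPLE_TARGET:
        return "inline, compressed"  # compression made it fit
    return "out of line"             # moved to the separate TOAST table

assert storage_plan(1500, 1500) == "inline"
assert storage_plan(6000, 1800) == "inline, compressed"
assert storage_plan(6000, 5000) == "out of line"
```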
I don't remember the name but there was a NoSQL DB on HN a couple weeks ago. I had to Google and read different websites for a solid 10 minutes before I got that it was indeed a NoSQL DB and what was the differentiating factor from other DBs.
I wonder if there is a logical overlap in the ideas between temporal table handling and this low level storage engine optimization.
Of course, support doesn't require Postgres-style MVCC. Oracle has time travel, which it calls Flashback ("SELECT AS OF"), even though it uses a different type of concurrency control.
For true bitemporalism you need more than current-row/historical-row separations, though.
But this can only happen if replacing X with Y and adding Z occur in the same transaction. If you roll back the transaction, then you start by removing Z, then replace Y with X. What am I missing here?
Z can be in a different transaction than X and Y. If two transactions run concurrently, one that replaces X with Y and a second one that inserts Z, the above scenario can happen.
Just curious why it’s being done now, as it sounds like a major design decision that would have been considered a long time ago.
IIRC, one of the big reasons for implementing pluggable storage was for Zheap.
FWIW, the existing implementation has worked really well for most use cases.
Andy Pavlo's class has a lot of good general information on the topic: https://15721.courses.cs.cmu.edu/spring2020/slides/03-mvcc1....
Does it do this in the reverse order? If not, how does it prevent data loss in case the main table update succeeds but writing to the "old copy" table fails?
I would guess so, but I haven't looked at the implementation, so I'm not sure. There are a ton of race conditions that can come up depending on the exact order in which you write things, so I'm sure the actual implementation is pretty messy.
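One plausible ordering, assuming the usual write-ahead discipline that ARIES-style systems follow (an assumption about undo-based designs in general, not a claim about zheap's actual code), is to make the undo record durable before overwriting the row:

```python
# Sketch of write-ahead ordering for an in-place update: the undo record
# is made durable *before* the row is overwritten. All names here are
# hypothetical stand-ins, not zheap's API.

undo_log = []                    # stands in for the separate undo file
table = {"row1": "old-value"}    # stands in for the main table

def durable_append(record):
    undo_log.append(record)      # imagine an fsync happening here

def update_in_place(key, new_value):
    # Step 1: persist the old version first.
    durable_append((key, table[key]))
    # Step 2: only now overwrite in place. A crash between the steps
    # leaves a redundant-but-harmless undo record; the reverse order
    # could lose the old version entirely.
    table[key] = new_value

def rollback():
    key, old = undo_log.pop()
    table[key] = old

update_in_place("row1", "new-value")
rollback()
assert table["row1"] == "old-value"
```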
> 12-10-2020: Most regression tests are passing, but write-speeds are still low.
This is basically what Oracle does with UNDO and it is obviously very scalable. Vacuum on the other hand has its limits, there is only so much you can squeeze out of a single threaded (at table level) process.
What I look most forward to is that it should enable FLASHBACK queries.
Disclosure: I work for VMware, who are sponsoring zedstore development via Greenplum.
Any word on when you are expecting stability? I'd love to see this in RDS.
I would be wary of this. Innodb, for example, also has an optimistic (in-place) UPDATE mode, and a pessimistic UPDATE mode.
Repeatedly updating the same row under the pessimistic mode would end up stalling the database, even at a rather low QPS.
https://bugs.mysql.com/bug.php?id=53825 was originally reported by Facebook 10 years ago, and is still not fixed in 8.0.21 with VATS / CATS scheduling.
And then there is also the performance gap between the ideal case where undo log is still in memory, versus the pathological case where undo log needs to be fetched from disk.
The last thing postgres needs is something that looks good on paper / in benchmark, but has a bunch of issues in production.
With the new pluggable storage API, are there any new storage engines other than zheap?
https://speakerdeck.com/macdice/transactions-in-postgresql-a...
Last year I helped a friend diagnose an issue they had with Postgres. A database table with a scheduled full DELETE/INSERT had slowed to the point of failure. It turns out, having slightly less IO than needed led the auto-VACUUM process to get further and further behind each time it ran.
My friend simply provisioned more IO and moved on. Another option would be to rewrite the process to naturally produce fewer dead rows. It would be great to have a third feasible option.
Make the new table unlogged while it’s being populated as well for that performance improvement if you want, since the atomic replace guarantees your consistency anyway.
Alternatively, if it’s not a full rebuild, move from delete/insert to INSERT .. ON CONFLICT DO UPDATE where possible. I do this for cache tables that store materialized summary views updated by pgAgent jobs.
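The upsert pattern looks something like this. The example runs against SQLite, whose ON CONFLICT syntax matches Postgres here; the table and column names are made up for illustration:

```python
# Runnable illustration of the upsert pattern. SQLite (>= 3.24) accepts
# the same ON CONFLICT syntax as Postgres; table/columns are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary_cache (key TEXT PRIMARY KEY, total INT)")

def upsert(key, total):
    # One statement that updates the existing row when the key already
    # exists, instead of a DELETE followed by an INSERT.
    conn.execute(
        "INSERT INTO summary_cache (key, total) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET total = excluded.total",
        (key, total),
    )

upsert("daily", 10)
upsert("daily", 25)  # second call updates the row rather than failing
row = conn.execute(
    "SELECT total FROM summary_cache WHERE key = 'daily'"
).fetchone()
print(row[0])  # 25
```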
If you are deleting “old” data, then maybe use partitioned tables so you can just DROP the partition containing the data you want gone.
1- https://thebuild.com/presentations/uber-perconalive-2017.pdf
Previous discussion from 2018 https://news.ycombinator.com/item?id=16526623
https://postgr.es/m/CAH2-Wz=CEKFa74EScx_hFVshCOn6AA5T-ajFAST...
This can be used without changing anything in the heap. It's less than a thousand lines of C. You could say that it's complementary to zheap, actually.
zheap makes it possible to update the same row many times without requiring new index entries, even when a long-running transaction holds back VACUUM. However, it does not avoid the problem of requiring a whole new set of index entries for all indexes whenever even one indexed column is modified by an update.
Strictly speaking my patch doesn't "fix" that problem, either, but it comes pretty close. It teaches the indexes to "fight back" against version churn caused by updates that cannot use the HOT optimization. It makes non-HOT updates responsible for cleaning up their own mess -- no more negative externalities. In practice this seems to more or less fix the exact thing that the Uber blog post complained about, which was the "write amplification" incurred in all indexes when only one indexed column was changed by an update.
(I am omitting some subtleties here, but that's the general thrust of it.)