We've found that for most use cases it's worth switching to a vanilla JDBC Kafka connector with frequent polling. This also allows for cases such as emitting joined data.
Debezium aside, Postgres + Kafka + Kafka Connect builds a pretty stable system for sending data around all our different DBs, apps, and data lakes.
Agreed, though, that monitoring should be in place so you're notified of failed connectors early on. This could be based on the heartbeat topic, but there are also JMX metrics that can be exposed to Prometheus/Grafana, and/or health checks can be set up based on the connector's status as exposed via the Kafka Connect REST API.
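For the REST API route, a minimal health-check sketch could look like this. The `/connectors/<name>/status` endpoint and its JSON shape are standard Kafka Connect; the base URL and connector name below are placeholders:

```python
# Sketch of a connector health check against the Kafka Connect REST API.
# GET /connectors/<name>/status returns a connector state plus per-task states.
import json
from urllib.request import urlopen

def failed_states(status: dict) -> list[str]:
    """Collect any FAILED states from a connector status payload."""
    states = [status["connector"]["state"]]
    states += [t["state"] for t in status.get("tasks", [])]
    return [s for s in states if s == "FAILED"]

def connector_healthy(base_url: str, name: str) -> bool:
    """True if the connector and all its tasks are healthy."""
    with urlopen(f"{base_url}/connectors/{name}/status") as resp:
        return not failed_states(json.load(resp))

# Example payload in the shape the status endpoint returns:
sample = {
    "name": "inventory-connector",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [{"id": 0, "state": "FAILED", "trace": "..."}],
}
print(failed_states(sample))  # -> ['FAILED'], i.e. fire an alert
```

Wiring `connector_healthy("http://localhost:8083", "my-connector")` into whatever alerting you already have (Prometheus blackbox, cron + pager, etc.) is then straightforward.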
On the matter of disk growth, there's no silver bullet here. Some people will want to make 100% sure that events are never missed, which implies the replication slot must hold onto those WAL segments while they aren't read (this is not specific to Debezium, btw.). Others may be willing to accept missing events if the slot isn't read for long enough, so those WAL segments can be discarded. In recent Postgres versions (13+), a maximum size can be configured for a replication slot (max_slot_wal_keep_size), so it's a matter of configuration which behavior you want.
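As a sketch, the size-capped variant on Postgres 13+ looks like this (the 10GB value is just an example; slots that fall further behind are invalidated rather than filling the disk):

```sql
-- Cap how much WAL replication slots may retain (Postgres 13+).
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();

-- Check how close each slot is to the limit; wal_status becomes
-- 'unreserved' and eventually 'lost' as a slot falls behind.
SELECT slot_name, wal_status, safe_wal_size
FROM pg_replication_slots;
```

A slot whose wal_status reaches 'lost' has to be recreated and the consumer re-snapshotted, so this is exactly the "accept missing events" trade-off described above.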
In any case, connector downtime of longer than, say, a few hours is something that should show up as an alert, allowing you to take action.
Disclaimer: I work on Debezium
Background processes like vacuum and analyze also write to WAL.
btw. a good idea is to keep these WAL files, or at least archive them along with a base backup, because that enables PITR (point-in-time recovery, https://www.postgresql.org/docs/9.3/continuous-archiving.htm...) via wal-g or wal-e.
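A continuous-archiving setup in postgresql.conf might look like the following sketch (this assumes wal-g is installed and already configured with its storage credentials; the exact command is up to you):

```
# postgresql.conf
archive_mode = on
archive_command = 'wal-g wal-push %p'   # %p = path of the WAL segment to archive
```

Postgres then invokes the command for each completed segment, and a failed command causes the segment to be retained and retried, so the archive stays gapless.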
btw. the bigger mistake here is that a lot of people don't monitor their database size at all, which is a huge problem.
edit: ah, and WAL files are always 16 MB (unless the server was built with a different segment size)
> There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 files. Each segment file is normally 16 MB (though this size can be altered when building the server). You can use this to estimate space requirements for WAL. Ordinarily, when old log segment files are no longer needed, they are recycled (renamed to become the next segments in the numbered sequence). If, due to a short-term peak of log output rate, there are more than 3 * checkpoint_segments + 1 segment files, the unneeded segment files will be deleted instead of recycled until the system gets back under this limit.
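As a sanity check, plugging the 9.x-era defaults into that formula (checkpoint_segments = 3 and checkpoint_completion_target = 0.5 were the defaults back then; both are assumptions here):

```python
# Normal upper bound on WAL size from the quoted formula:
# (2 + checkpoint_completion_target) * checkpoint_segments + 1 segments.
checkpoint_segments = 3            # pre-9.5 default
checkpoint_completion_target = 0.5 # default
segment_mb = 16                    # default segment size

normal_max = (2 + checkpoint_completion_target) * checkpoint_segments + 1
print(normal_max, normal_max * segment_mb)  # 8.5 segments -> ~136 MB
```

So with defaults, steady-state WAL should sit in the low hundreds of megabytes, which is why tens of gigabytes of pg_wal is a red flag.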
It went from effectively 0 to 97 GB in less than 6 days, which is 673 MB/hr, 11 MB/min, or 58 MB per checkpoint timer.
If we consider that most, if not all, of this traffic would be based on checkpoint-related full-page writes, that's still some 7000 distinct pages being modified every 5 minutes. Even when I consider pg_stat_statements recording and writing out all activities, modifying ~7000 blocks is a lot for what I can expect to be repeating and templated maintenance tasks.
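Those numbers roughly check out; here's the back-of-envelope arithmetic, assuming the default 8 KB page size and a 5-minute checkpoint_timeout (both assumptions, taken from the figures quoted above):

```python
# Back-of-envelope check of the WAL growth figures above.
GB = 1000 ** 3
MB = 1000 ** 2
PAGE_SIZE = 8192   # default Postgres block size in bytes

total_wal_bytes = 97 * GB
hours = 6 * 24

per_hour_mb = total_wal_bytes / hours / MB      # ~674 MB/hr
per_min_mb = per_hour_mb / 60                   # ~11 MB/min
per_checkpoint_mb = per_min_mb * 5              # ~56 MB per 5-min checkpoint
pages_per_checkpoint = per_checkpoint_mb * MB / PAGE_SIZE  # ~6900 pages

print(round(per_hour_mb), round(per_min_mb, 1),
      round(per_checkpoint_mb), round(pages_per_checkpoint))
```

With "less than 6 days" the real rate is a bit higher, which closes the gap to the quoted 58 MB/checkpoint and ~7000 pages.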
Is that what happened?
You'd have to resync the followers/secondaries; big deal. It's way better than the primary going down because its disk filled up. That failure mode is awful. On RDS it's relatively painless because you can snap your fingers and have more disk, but if you're running it yourself? Good luck.
In practice, MongoDB's oplog mechanism, for example, which acts as a circular buffer with a set size, is a much more tolerant implementation. If the oplog rolls over before you've read it, you just resync, but at most it's taken up 10% of your disk.
In other words, replication without using replication slots. Postgres has had it for a long time, although the limit is set in disk usage rather than time.
(unless specifically logical replication and/or debezium don't support this?)
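Right, for physical replication you can skip slots entirely and just retain a fixed amount of WAL. A sketch for Postgres 13+ (older versions use wal_keep_segments, counted in 16 MB segments rather than bytes; the 1GB value is just an example):

```sql
-- Retain up to 1GB of WAL for standbys that connect without a slot.
-- A standby that falls further behind must be re-synced from a base backup.
ALTER SYSTEM SET wal_keep_size = '1GB';
SELECT pg_reload_conf();
```

This gives the oplog-style bounded-disk behavior: the cap is hard, and the cost of falling behind is a resync rather than a full primary disk.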