We’re also working on account portability (MSC4014) that would eventually support active/active at the Matrix layer, but it’s not ready yet.
PostgreSQL replication is hard for me. I know how to set up "classic" replication, but automatic failover and recovery that can solve split-brain issues is, sadly, not something I've figured out. And if it's not the machine but the network that goes down (which happens a few times a year on consumer-grade connections), that's a problem I hit in practice. I have learned how to repair nodes by hand, but I don't want to do this.
So far, I've managed to avoid thinking much about it by using distributed databases (CockroachDB and Consul, in particular) that already have all the magic built in, so I just have to be careful with the settings.
Maybe I just need to learn how to deal with PostgreSQL...
(if you have a third-party commercial PG extension with clustering, like EnterpriseDB or CitusDB, ask them - they'll tell you how they want you to do it.)
there is no inherent clustering support in postgres. You can't scatter tables across servers, etc. It's not supported. There is always one master server - if you need more storage you can attach it via SAS/Fibre Channel/etc. of course, but there is one postgres instance that's in charge.
All of the postgres replication idioms rely on that model: the WAL is reliable up to the point it's been received and verified. If you've faithfully replayed the WAL throughout your existence, everyone's storage is in the exact same state, and the only thing that matters is who has the most recent WAL position (LSN) from the primary.
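To make the "who's most recent" part concrete, here's a small sketch of comparing WAL positions during failover: Postgres reports LSNs as text like `16/B374D848` (e.g. from `pg_last_wal_replay_lsn()`), and the standby with the highest one is the promotion candidate. The standby names and LSN values below are made up; real tooling like Patroni automates this.

```python
# Sketch: during failover, promote the standby that has replayed the most WAL.
# LSNs are in Postgres's "XXXXXXXX/XXXXXXXX" hex text form.

def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN like '16/B374D848' into a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def pick_promotion_candidate(standbys: dict[str, str]) -> str:
    """Return the standby with the highest replayed LSN."""
    return max(standbys, key=lambda name: lsn_to_int(standbys[name]))

# Hypothetical replica fleet: replica-a is the furthest along.
standbys = {
    "replica-a": "16/B374D848",
    "replica-b": "16/B374D830",
    "replica-c": "15/FFFFFFFF",
}
print(pick_promotion_candidate(standbys))  # replica-a
```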
This also covers bringup strategies: if you have an atomically checksummed snapshot of the dataset (all storage, plus pg_wal, at the same instant), then you can also spawn new nodes from it. ZFS can do that. And with incremental copy-on-write sends you can replicate the base state, then replicate only the changes (which is much faster), and then fail over quickly. That's popular for other applications that don't have sophisticated ACID machinery; for Postgres, the higher-layer replication at the application level probably makes more sense if it fits your use case.
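The "replicate the base, then only the changes" idea can be shown with a toy block-diff sketch: split the data into fixed-size blocks, hash each, and ship only the blocks whose hash changed. ZFS incremental send does this at the filesystem record level far more efficiently; everything here (block size, hashing, the sample bytes) is purely illustrative.

```python
# Toy incremental replication: base copy once, then ship only changed blocks.
import hashlib

BLOCK = 8  # tiny block size for the demo; real systems use KB-sized records

def block_hashes(data: bytes) -> list[str]:
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def delta(old: bytes, new: bytes) -> dict[int, bytes]:
    """Return {block_index: new_bytes} for blocks that differ from the base."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    return {i: new[i * BLOCK:(i + 1) * BLOCK]
            for i in range(len(new_h))
            if i >= len(old_h) or new_h[i] != old_h[i]}

def apply_delta(old: bytes, d: dict[int, bytes]) -> bytes:
    """Rebuild the new state from the base plus the changed blocks."""
    blocks = [old[i:i + BLOCK] for i in range(0, len(old), BLOCK)]
    for i, b in sorted(d.items()):
        if i < len(blocks):
            blocks[i] = b
        else:
            blocks.append(b)
    return b"".join(blocks)

base = b"aaaaaaaabbbbbbbbcccccccc"
new = b"aaaaaaaaXXXXXXXXcccccccc"
d = delta(base, new)
print(len(d))                         # 1: only the middle block changed
print(apply_delta(base, d) == new)    # True
```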
Anyway, short version: I'd think about a simple blue-green (or round-robin) system where you set an environment flag, and blue frontend nodes simply direct traffic to the green backend's pgbouncer (or similar) when they see it. Then you flip web traffic over, shut down the blue pg instance, and things fail over gracefully. The "healthy", "progressing", and "migrated" states, if that makes sense. And there is always one healthy node, even if you need to bring it up from backups. Yes, you have to check whether the other side is still healthy, but that is now a k8s/pod problem and not an app one.
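The flag-flip itself is tiny. A minimal sketch, assuming the flag is an environment variable (the `ACTIVE_BACKEND` name and the pgbouncer DSNs are made up): every frontend reads the flag and routes its DB traffic to that color's pgbouncer, so flipping one value flips all traffic.

```python
# Sketch of the blue/green flag: frontends route to whichever side is active.
import os

# Hypothetical pgbouncer endpoints for each color.
BACKENDS = {
    "blue": "postgresql://app@pgbouncer-blue:6432/app",
    "green": "postgresql://app@pgbouncer-green:6432/app",
}

def backend_dsn() -> str:
    """Pick the DSN for the currently active color; default to blue."""
    color = os.environ.get("ACTIVE_BACKEND", "blue")
    return BACKENDS[color]

os.environ["ACTIVE_BACKEND"] = "green"  # the cutover: one flag flip
print(backend_dsn())
```

In practice you'd read the flag from something watchable (a k8s ConfigMap, Consul key, etc.) rather than a static env var, so running frontends pick up the flip without a restart.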
If you can live with one instance in the US and one in Europe (or similar), and perhaps a few occasional seconds of "they're down?"/"am I down?" failover downtime, a simple blue/green failover is probably easiest. Someone is master until they've been down for 30+ seconds; the other server watches for failover failure by testing connectivity to a couple of canary servers via ping/etc., and if it realizes it's the one that's offline, it shoots itself in the head and lets the succession happen. As soon as any other server becomes aware a failover happened (say, a header announcement from appserver instances), shoot yourself in the head. It's just a "make sure there's no split brain" procedure, trading away some availability. That's the tradeoff an RDBMS makes: it's atomic/consistent and partition-tolerant, it's just not always available.
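The "shoot yourself in the head" decision boils down to a quorum check against the canaries. A sketch, where `reach` stands in for a real ping/TCP probe and the canary hostnames are invented: if a node can't reach a majority of canaries, it assumes *it* is the partitioned side and fences itself rather than risk split brain.

```python
# Self-fencing sketch: stop accepting writes unless a majority of canaries answer.
from typing import Callable

def should_self_fence(canaries: list[str], reach: Callable[[str], bool]) -> bool:
    """True if we can't reach a strict majority of canary hosts."""
    reachable = sum(1 for host in canaries if reach(host))
    return reachable <= len(canaries) // 2

canaries = ["canary-1", "canary-2", "canary-3"]

# Simulated probe: only canary-1 answers, so we're likely the isolated side.
print(should_self_fence(canaries, lambda h: h == "canary-1"))  # True

# All canaries answer: we're fine, keep serving.
print(should_self_fence(canaries, lambda h: True))  # False
```

Using an odd number of canaries spread across networks is what makes "I can't see a majority" a reasonable proxy for "I'm the one who's offline".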
Cassandra is a fantastic match for your problem, though, especially if you can define availability zones ("I want each of these regions to be available on its own") - but then you have to deal with split brain. And queues/Kafka can potentially solve that if you want to, or can, prove some lower bound on queue iteration position (globally or per follower).
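One way to read that "lower bound on queue position" idea: across all followers, the minimum acknowledged offset is the point up to which everyone has definitely consumed the log, so any state derived from entries below it can't diverge between regions. A sketch with hypothetical Kafka-style integer offsets:

```python
# Sketch: the global low-water mark across followers of a shared queue.

def committed_lower_bound(follower_offsets: dict[str, int]) -> int:
    """Everything below this offset has been consumed by every follower."""
    return min(follower_offsets.values())

# Hypothetical per-region consumer positions.
offsets = {"eu-1": 1042, "eu-2": 1040, "us-1": 998}
print(committed_lower_bound(offsets))  # 998
```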
It's all about which parts of your system you're willing to let be unreliable. Using signatures is an interesting crutch for "availability", though - proving a key was issued by a trusted server is fine, and you can propagate revocations (session and cert) relatively quickly and hopefully trust the revocation. And that cert tree doesn't require a lot of deterministic state, if your PKI structure can scale and issue certs regularly. Rolling server cert issue/revocation (as a liveness check) is an interesting model, imo. It's all stateless, but you can check that (short-term/ephemeral) cert X was actually signed by keyserver Y, and cert X signed Z, etc. - and we know X was actually revoked at time T before that, etc. That's always seemed like a reasonably decent "best-effort" propagation: handle things like revocation through redis sync/etc. and just check the state for inconsistencies with the known state of the world.
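The chain-plus-revocation check above can be modeled with a toy walk from a cert back to the trust anchor, failing if any link was revoked before the time in question. There's no real cryptography here - signatures are just trusted fields in a dict, and all names and timestamps are invented - the point is only that the check itself is stateless given the cert records and revocation times.

```python
# Toy model: verify a cert chain (Z signed by X signed by keyserver Y),
# rejecting chains where any link was revoked at or before the check time.

# Hypothetical cert records: name -> (signer, revoked_at or None).
certs = {
    "keyserver-Y": (None, None),     # trust anchor
    "cert-X": ("keyserver-Y", 100),  # revoked at time T=100
    "cert-Z": ("cert-X", None),
}

def chain_valid(name: str, at_time: int, trusted: str = "keyserver-Y") -> bool:
    """Walk signer links back to the anchor; fail on unknown or revoked links."""
    while name != trusted:
        if name not in certs:
            return False
        signer, revoked_at = certs[name]
        if revoked_at is not None and revoked_at <= at_time:
            return False
        name = signer
    return True

print(chain_valid("cert-Z", at_time=50))   # True: chain intact before revocation
print(chain_valid("cert-Z", at_time=150))  # False: cert-X was revoked at T=100
```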