Did you ever consider envoy xDS?
There are a lot of really cool things in envoy like outlier detection, circuit breakers, load shedding, etc…
What we (think we) know won't work is a topologically centralized database that uses distributed consensus algorithms to synchronize. Running consensus transcontinentally is very painful, and keeping the servers central (so that update proposals are local and the protocol can run quickly) subjects large portions of the network to partition risk. The natural response (what I think a lot of people do, in fact) is just to run multiple consensus clusters, but our UX includes a global namespace for customer workloads.
> Running consensus transcontinentally is very painful
You don’t necessarily have to do that: you can keep your quorum nodes (let’s assume we are talking about etcd) far enough apart to be in separate failure domains (fires, power loss, natural disasters), but close enough that network latency between the replicas isn’t unbearably high.
I have seen the following scheme work for millions of workloads:
1. Etcd quorum across 3 close, but independent regions
2. On startup, the app registers itself under a prefix shared by all other replicas of that app
3. All clients of that app issue etcd watches on that prefix and are notified almost instantly when there is a change. This is baked into the gRPC clients as a plugin.
4. A custom gRPC resolver is used to do lookups by service name
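A toy sketch of steps 2–4, with an in-memory class standing in for etcd (the names here are illustrative, not etcd's actual API; a real deployment would use etcd v3 leases, prefix watches, and a gRPC resolver plugin):

```python
from collections import defaultdict

class Registry:
    """In-memory stand-in for etcd: a key/value store with prefix watches."""

    def __init__(self):
        self.store = {}                    # key -> endpoint
        self.watchers = defaultdict(list)  # prefix -> list of callbacks

    def watch_prefix(self, prefix, callback):
        # Clients watch a shared prefix and get notified on any change (step 3).
        self.watchers[prefix].append(callback)

    def register(self, key, endpoint):
        # On startup, a replica registers under the shared prefix (step 2).
        self.store[key] = endpoint
        self._notify(key, "PUT", endpoint)

    def deregister(self, key):
        endpoint = self.store.pop(key, None)
        self._notify(key, "DELETE", endpoint)

    def lookup(self, prefix):
        # What a custom resolver does: list endpoints by service prefix (step 4).
        return sorted(v for k, v in self.store.items() if k.startswith(prefix))

    def _notify(self, key, event, endpoint):
        for prefix, callbacks in self.watchers.items():
            if key.startswith(prefix):
                for cb in callbacks:
                    cb(event, key, endpoint)

registry = Registry()
events = []
registry.watch_prefix("/services/web/", lambda *e: events.append(e))

registry.register("/services/web/replica-1", "10.0.0.1:8080")
registry.register("/services/web/replica-2", "10.0.0.2:8080")
print(registry.lookup("/services/web/"))  # both replicas visible to the resolver
registry.deregister("/services/web/replica-1")
print(registry.lookup("/services/web/"))  # only replica-2 remains
```

The point of the watch is that clients don't poll: the store pushes changes to them, which is why updates propagate "almost instantly."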
Two other details that are super important here:
This is a public cloud. There is no real correlation between apps/regions and clients. Clients are public Internet users. When you bring an app up, it just needs to work, for completely random browsers on completely random continents. Users can and do move their instances (or, more likely, reallocate instances) between regions with no notice.
The second detail is that no matter what DX compromise you make to scale global consensus up, you still need reliable, realtime updates about instances going down. Not knowing about a new instance that just came up isn't that big a deal: you just get less optimal routing for the request. Not knowing that an instance went down is a very big deal: you end up routing requests to dead instances.
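The usual mechanism for this is lease-based liveness, which etcd supports natively: a registration expires unless the instance keeps heartbeating, so a dead instance falls out of routing instead of silently receiving traffic. A minimal sketch of the idea (the class and TTL values are made up for illustration):

```python
class LeasedRegistry:
    """Registrations carry a TTL lease; entries vanish if heartbeats stop."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (endpoint, time of last heartbeat)

    def register(self, key, endpoint, now):
        self.entries[key] = (endpoint, now)

    def heartbeat(self, key, now):
        # A live replica periodically renews its lease.
        endpoint, _ = self.entries[key]
        self.entries[key] = (endpoint, now)

    def live_endpoints(self, now):
        # Routing only ever sees entries whose lease hasn't expired.
        return sorted(ep for ep, beat in self.entries.values()
                      if now - beat < self.ttl)

reg = LeasedRegistry(ttl_seconds=5)
reg.register("/svc/a", "10.0.0.1:80", now=0)
reg.register("/svc/b", "10.0.0.2:80", now=0)
reg.heartbeat("/svc/a", now=4)    # a keeps heartbeating; b went silent
print(reg.live_endpoints(now=6))  # b's lease expired: only 10.0.0.1:80 remains
```

The tradeoff is the TTL itself: a short lease detects death quickly but tolerates less network jitter before evicting a healthy instance.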
The deployment strategy you're describing is in fact what we used to do! We had a Consul cluster in North America and ran the global network off it.
Many people will read a comment like this and cargo-cult an implementation (“millions of workloads”, you say?!) without knowing how they are going to handle the many different failure modes that can result, or even at what scale the solution will break down. Then, when the inevitable happens, panic and potentially data loss will ensue. Or, the system will eventually reach scaling limits that will require a significant architectural overhaul to solve.
TL;DR: There isn’t a one-size-fits-all solution for most distributed consensus problems, especially ones that require global consistency and fault tolerance, and on top of that have established upper bounds on information propagation latency.