From my literal conversation I'm having right now, 'try to run one big cluster to capture everything' is our active state. I've brought up federation a bunch of times and it's fallen on deaf ears. :)
We are probably past the size of the entirety of fly.io for reference, and maintenance is very painful. It works because we are doing really strange things with Consul (batch txn cross-cluster updates of static entries) on really, really big servers (4gbps+ filesystems, 1tb memory, 100s of big and fast cores, etc).