tptacek 4mo ago

It's also frequency of changes and granularity of state, when sizing workloads. My understanding is that most Hashi shops would federate workloads of our size/global distribution; it would be weird to try to run one big cluster to capture everything.

chucky_z 4mo ago

From my literal conversation I'm having right now, 'try to run one big cluster to capture everything' is our active state. I've brought up federation a bunch of times and it's fallen on deaf ears. :)

We are probably past the size of the entirety of fly.io for reference, and maintenance is very painful. It works because we are doing really strange things with Consul (batch txn cross-cluster updates of static entries) on really, really big servers (4gbps+ filesystems, 1tb memory, 100s of big and fast cores, etc).
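For readers unfamiliar with the "batch txn" approach mentioned above, here is a minimal sketch of what chunking static entries into Consul transactions might look like. It assumes the entries are plain KV pairs sent to Consul's transaction endpoint (`PUT /v1/txn`) and that each transaction is capped at 64 operations, as Consul's documentation describes; check the limit for your version. The cluster names and helper functions are hypothetical, and the sketch only builds payloads, it does not send anything or reflect the commenter's actual tooling.

```python
import base64
import json
from typing import Dict, Iterator, List

# Assumption: Consul's documented cap on operations per transaction.
MAX_OPS_PER_TXN = 64


def kv_set_op(key: str, value: str) -> dict:
    """Build one KV 'set' operation; the txn API expects base64-encoded values."""
    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return {"KV": {"Verb": "set", "Key": key, "Value": encoded}}


def build_txn_batches(
    entries: Dict[str, str], batch_size: int = MAX_OPS_PER_TXN
) -> Iterator[List[dict]]:
    """Split entries into lists of operations, each small enough for one txn."""
    ops = [kv_set_op(k, v) for k, v in sorted(entries.items())]
    for start in range(0, len(ops), batch_size):
        yield ops[start:start + batch_size]


def plan_cross_cluster_update(
    clusters: List[str], entries: Dict[str, str]
) -> Dict[str, List[str]]:
    """Return, per (hypothetical) cluster name, the JSON bodies to PUT to /v1/txn."""
    payloads = [json.dumps(batch) for batch in build_txn_batches(entries)]
    return {cluster: list(payloads) for cluster in clusters}


if __name__ == "__main__":
    static_entries = {f"static/service-{i}": f"10.0.0.{i % 255}" for i in range(130)}
    plan = plan_cross_cluster_update(["dc-east", "dc-west"], static_entries)
    for cluster, bodies in plan.items():
        print(cluster, "needs", len(bodies), "transactions")
```

Each transaction is atomic on its own, but a set of entries split across several transactions is not, so a real rollout would need to handle a partial failure between batches and between clusters.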

schmichael 4mo ago

“Who orchestrates the orchestrators?” is the question we’ve never answered at HashiCorp. We tried expanding Consul’s variety of tenancy features, but if anything it made the blast radius problem worse! Nomad has always kept its federation lightweight, which is nice for avoiding correlated failures… but we also never built much cluster management into federated APIs. So handling cluster sprawl is an exercise left to the operator. “Just rub some terraform on it” would be more compelling if our own products were easier to deploy with terraform! Ah well, we’ll keep chipping away at it.
