0 points | atonse | 3y ago | 0 comments

I did love the simplicity of nomad.

And in general, nomad worked pretty well for us but our consul cluster kept mysteriously failing. I think that caused our nomad cluster to fail because it was backed by Consul.

The one complaint I did have about nomad (same as consul) was that the recovery process was manual: you had to generate a peers.json by hand.
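
For anyone who hasn't had the pleasure, here is a rough sketch of what "generating a peers.json" amounts to. It assumes the Raft protocol v3 layout as I remember it from Consul's outage recovery guide (a list of objects with `id`, `address`, and `non_voter`), and it assumes Consul's default server RPC port of 8300 (Nomad's default is 4647). The node IDs and IPs below are made up, and `build_peers` is a hypothetical helper, not part of any HashiCorp tooling.

```python
import json


def build_peers(servers, port=8300):
    """Build the peers.json entries for the surviving servers.

    servers: list of (node_id, ip) pairs. The node IDs and IPs used
    below are hypothetical; real ones come from each surviving server.
    port: server RPC port (assumed default: 8300 for Consul).
    """
    return [
        {"id": node_id, "address": f"{ip}:{port}", "non_voter": False}
        for node_id, ip in servers
    ]


# Hypothetical survivors of a 3-node cluster.
survivors = [
    ("11111111-1111-1111-1111-111111111111", "10.0.1.10"),
    ("22222222-2222-2222-2222-222222222222", "10.0.1.11"),
]

peers_json = json.dumps(build_peers(survivors), indent=2)
print(peers_json)
```

The output then has to be dropped into the raft directory under each server's data dir while the servers are stopped, which is exactly the hand-holding I'm complaining about.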

I was shocked when I saw that. Truly one of the "finding out how the sausage is made" moments even though I've managed Linux servers for two decades – I always assumed it would use zeroconf/bonjour/multicast DNS (remember cloud auto-join?) or something similarly elegant to auto-discover other nodes in the network and just reconnect and rebuild a cluster. I mean what's the point of all this stuff if it can't be used to recover a cluster and just Do The Right Thing™? The shiny new experience is stellar (like sales, or setting up a new cluster), but the flip side (when things go wrong) is a mess. That's why we eventually said "nope!" to all the custom stuff and went with boring, plain vanilla ECS, which is itself too much now that we've started using fly.

Don't ever want to even think about having to hand-write a peers.json file to recover a cluster, boot things up, and pray to the ancient gods that it works.

We don't have time for that nonsense. Please, take my money, Fly/Render/everyone else. Your costs are a margin of error compared to what I had to pay a devops person to build our own stack. (I'm not even exaggerating. It was six figures. DevOps people are worth every penny but they cost many, many pennies.) Ultimately, we never used the infra.

I want to focus on building solutions for my customers and not fiddling with weird server stuff.

chucky_z | 3y ago

how long ago did you use nomad? nomad integrates with consul but isn't backed by it. it's also pretty trivial to run consul in quite a shitty network environment by bumping up some of the settings (they should probably change their 'production' suggestions).

atonse (OP) | 3y ago

We decided to retire the whole infra about 8 months ago. A lot of our consul complications happened about 18 months ago.

The consul clusters would keep failing in QA (they were running on t2.nanos – but that should be plenty of bandwidth for raft not to blow up every couple weeks, same happened with t2.micros too).

Before we pulled the plug we had started seeing something about EC2 health checks failing and the autoscaling groups yanking servers out and replacing them with new servers. But this is exactly the kind of case where consul should've just added the new machine right in. Instead, the 3-node cluster (now 2 nodes) would just sit there saying "hey I can't find a leader... aaaah. I can't find a leader" – well, to paraphrase Mike Myers on SNL, TALK AMONGST YUHSELVES and figure it out, there are two of you remaining.
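
For reference, the arithmetic behind that gripe, as a minimal sketch: Raft needs a strict majority of the configured voters to elect a leader. The function names here are just for illustration.

```python
def quorum(configured_voters: int) -> int:
    """Smallest strict majority of the configured voters."""
    return configured_voters // 2 + 1


def can_elect_leader(configured_voters: int, reachable_voters: int) -> bool:
    """True if the reachable voters form a majority of the configured set."""
    return reachable_voters >= quorum(configured_voters)


print(quorum(3))               # 2
print(can_elect_leader(3, 2))  # True
print(can_elect_leader(3, 1))  # False
```

By that math two out of three should have been enough, so a cluster stuck without a leader suggests something else was going on. One guess, which I never confirmed, is that dead servers were still counted as voters after the autoscaling group replaced them, which raises the number needed for a majority.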