
0 points · latchkey · 2y ago · 0 comments

It always deployed. It was eventually consistent. Any failure would automatically be resolved after a period of time.


Interesting. At any point in time, I had errors from hardware, software and networking. Even the racks would be getting overwhelmed at certain times. Simply being able to ssh into every host wasn't guaranteed. I'm not sure how you did it.

kuchenbecker · 2y ago

+1 to this. We see a 0.1% hardware failure rate every time we do a rolling restart (40-50k nodes). In the best case some nodes just never come back; in the worst case they actively misbehave. If a node is unresponsive, we remove it from the cluster and fix it async.

latchkey (OP) · 2y ago

If the daemon was running, it would ping a central server on a schedule and report its status. The server's response said whether a new version was available, and if so it included the binary. Combining ping and update into one service really cut down on overall traffic and on failures.

If the machine had crashed, it would start up, start my daemon, and that daemon would start the ping/update process all over again.
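A minimal sketch of one round of that ping/update cycle, to make the shape concrete. The response fields, the `Daemon` class, the injected `send` callable, and the checksum check are all assumptions for illustration, not the actual service. The point is that every failure path just returns and waits for the next scheduled ping, which is what makes the whole thing eventually consistent.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PingResponse:
    # Hypothetical response shape: the latest version, plus the binary
    # (and its checksum) only when the daemon is behind.
    version: str
    binary: Optional[bytes] = None
    sha256: Optional[str] = None


@dataclass
class Daemon:
    version: str
    installed: Optional[bytes] = None

    def ping_once(self, send: Callable[[dict], PingResponse]) -> bool:
        """Report status; install a new binary if the server sent one.

        Returns True if an update was applied. Any failure leaves the
        daemon unchanged, so the next scheduled ping simply retries.
        """
        try:
            resp = send({"version": self.version, "status": "ok"})
        except Exception:
            return False  # network or server error: retry next tick
        if resp.binary is None or resp.version == self.version:
            return False  # already up to date
        if hashlib.sha256(resp.binary).hexdigest() != resp.sha256:
            return False  # corrupt download: retry next tick
        self.installed = resp.binary
        self.version = resp.version
        return True
```

A real daemon would call `ping_once` from a timer and swap the binary on disk before restarting itself; here `send` is injected so the logic can be exercised without a network.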

A large portion of the machines were iPXE booted, so simply rebooting was one option, and everything would start from scratch again.

Yes, some of the boxes had flaky power supplies or would lose an SSD, and then a technician had to go out and fix things manually.

I found it critical to think of everything as eventually consistent, because my hardware was boxes with 12 GPUs each, and the GPUs were flaky and would randomly crash the whole box. I got used to boxes rebooting hundreds of times. My process also auto-tuned the GPUs for stability, changing clock/power settings until each individual card became stable and stopped crashing.
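A rough sketch of what that kind of per-card tuning loop could look like. Everything here is assumed for illustration: the `GpuSettings` fields, the step sizes and floors, the `is_stable` callback (which would really mean "run the workload for a while and see if the card crashes"), and the choice to back off clock first and power second.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class GpuSettings:
    core_clock_mhz: int
    power_limit_w: int


def autotune(
    start: GpuSettings,
    is_stable: Callable[[GpuSettings], bool],
    clock_step: int = 25,
    power_step: int = 5,
    min_clock: int = 1000,
    min_power: int = 80,
) -> Optional[GpuSettings]:
    """Back off clock, then power, until the card passes is_stable.

    Returns the first stable settings found, or None if the card is
    still unstable at the floor (a candidate for manual repair).
    """
    settings = start
    while True:
        if is_stable(settings):
            return settings
        if settings.core_clock_mhz - clock_step >= min_clock:
            settings = replace(
                settings, core_clock_mhz=settings.core_clock_mhz - clock_step
            )
        elif settings.power_limit_w - power_step >= min_power:
            settings = replace(
                settings, power_limit_w=settings.power_limit_w - power_step
            )
        else:
            return None
```

In practice the "stability test" spans reboots, so the current settings would need to be persisted somewhere that survives a crash, and the loop would resume from there on the next boot.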

The only time I had problems was when the daemon was dead. I had a dashboard where I could see which machines hadn't reported their status. It was easy to pick those off by hand.
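The dashboard check boils down to "which hosts have not pinged recently". A small sketch, assuming the server keeps a last-seen timestamp per host; the `stale_hosts` name, the dictionary layout, and the threshold are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_hosts(
    last_seen: Dict[str, datetime], now: datetime, max_age: timedelta
) -> List[str]:
    """Return hosts whose last status report is older than max_age."""
    return sorted(
        host for host, seen in last_seen.items() if now - seen > max_age
    )
```

With a ping interval of a few minutes, a `max_age` of several intervals avoids flagging hosts that are merely mid-reboot.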
