If the machine had crashed, it would start up, start my daemon, and that daemon would start the ping/update process all over again.
A large portion of the machines were iPXE-booted, so simply rebooting was one option, and everything would start from scratch again.
Yes, some of the boxes had flaky power supplies or would lose an SSD, and then a technician had to go out and fix things manually.
I found it was critical to think of everything as eventually consistent, because my hardware was boxes with 12 GPUs each, and they were flaky enough to crash the whole box at random. I got used to boxes rebooting hundreds of times. My process would also auto-tune the GPUs for stability, changing clock and power settings until each individual card became stable and stopped crashing.
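To give a rough idea of the auto-tune loop, here is a minimal sketch, not the code I actually ran. The names (`GpuSettings`, `back_off`, `tune`), the step sizes, and the floor values are all made up for illustration, and `is_stable` is a stand-in callback. In real life an unstable card takes the whole box down, so the current settings would have to be persisted and picked up again after the reboot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSettings:
    # Hypothetical fields; real tuning knobs depend on the card and driver.
    core_clock_mhz: int
    power_limit_w: int


def back_off(s, clock_step=25, power_step=10, min_clock=1000, min_power=80):
    """Return slightly more conservative settings, clamped at assumed floors."""
    return GpuSettings(
        core_clock_mhz=max(min_clock, s.core_clock_mhz - clock_step),
        power_limit_w=max(min_power, s.power_limit_w - power_step),
    )


def tune(initial, is_stable, max_rounds=20):
    """Back off step by step until is_stable(settings) is True.

    Returns the first stable settings found, or None if the floor is
    reached or max_rounds is exhausted without finding any.
    """
    s = initial
    for _ in range(max_rounds):
        if is_stable(s):
            return s
        nxt = back_off(s)
        if nxt == s:  # already at the floor, nothing left to try
            return None
        s = nxt
    return None
```

Something like `tune(GpuSettings(1500, 200), card_survived_burn_in)` would then walk one card down until it stops crashing, where `card_survived_burn_in` is whatever stability check you trust.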
The only time I had problems was when the daemon was dead. I had a dashboard where I could see which machines hadn't reported their status. It was easy to pick those off by hand.
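The dashboard check boils down to something like the sketch below. This is an illustration under my own assumptions: the `stale_machines` name, the `last_seen` mapping of hostname to last report time, and the 10-minute threshold are hypothetical, not what the real dashboard used.

```python
from datetime import datetime, timedelta, timezone


def stale_machines(last_seen, now, max_age=timedelta(minutes=10)):
    """Return hostnames whose last status report is older than max_age.

    last_seen maps hostname -> timezone-aware datetime of the last report.
    """
    return sorted(host for host, ts in last_seen.items() if now - ts > max_age)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "rig-a": now - timedelta(minutes=2),   # reporting normally
    "rig-b": now - timedelta(hours=1),     # daemon probably dead
    "rig-c": now - timedelta(minutes=11),  # just past the threshold
}
print(stale_machines(last_seen, now))  # ['rig-b', 'rig-c']
```

Anything on that list is a machine where the daemon itself is probably dead, which is exactly the case the reboot loop cannot fix on its own.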