I am sorry, but I disagree. You are trying to make it sound as though your cloud provider's downtime has something to do with how you manage your workload in your code.
Debugging __any__ distributed system is difficult; this is why monitoring and tracing should be first-class citizens in your deployments. It seems they are not for you.
Yeah, monitoring told us it was down, and eventually we figured out it was an AWS issue we could do nothing about until they patched it. My main point there is actually that for many use cases, this doesn't have to be a distributed computing problem, and thus the non-distributed version is superior to the distributed version.