Today, a SaaS I’m familiar with that runs ~10 Aurora clusters in us-east-2 with 2-3 nodes each (1 writer, 1-2 readers) in different AZs had prolonged issues.
At least 1 cluster had a node on “affected” hardware (per AWS). Aurora failed to failover properly and the cluster ended up in a weird error state, requiring intervention from AWS. Could not write to the db at all. This took several hours to resolve.
All that to say that it’s never straightforward. In today’s event, it was pure luck of the draw as to whether a multi-AZ Aurora cluster was going to have >60 seconds of pain.
That SaaS has been running Aurora for years and has never experienced anything similar. I was very surprised when I heard the cluster was in a non-customer-fixable state and required manual intervention. I’ve shilled Aurora hard. Now I’m unsure.
Thank goodness they had an enterprise support deal or who knows if they’d still have issues now.