For general cloud, avoiding getting screwed might mean going multi-cloud. But for LLMs, there’s only one option at the highest level of quality for now.
People tend to over-focus on resilience (minimizing the probability of breaking) and neglect the plan for recovery when things do break.
I can’t tell you how weirdly foreign this is to many people, or how many meetings I’ve been in where I ask what the plan is when it fails and someone starts explaining RAID6 or BGP or something, with no actual plan other than “it’s really unlikely to fail”, which old dogs know isn’t true.
I guess the point is, for now, we’re all de facto plug-in authors.