It's also unclear how this actually solves the problem. Now if S3 in _either_ region is unavailable, they'll start to fail 50% of uncached requests. I'm guessing they're using Route 53 health checks with some CloudWatch alarm to cut over to one region when they think the other is unhealthy. Presumably this is covered in Part 2, which isn't available yet.
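To make the 50% figure concrete, here's a back-of-the-envelope sketch. It assumes cache misses are split evenly across the two buckets and that nothing cuts traffic over when a region goes bad; the region names and the function itself are made up for illustration, not taken from the post.

```python
def uncached_failure_rate(weights, unhealthy):
    """Fraction of cache-miss requests routed to an unhealthy region.

    weights: hypothetical routing weight per region (an assumption,
             the post does not state how traffic is split).
    unhealthy: set of regions whose S3 is currently failing.
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one region needs a non-zero weight")
    failing = sum(w for region, w in weights.items() if region in unhealthy)
    return failing / total


# Assumed 50/50 split between two illustrative regions.
even_split = {"us-east-1": 1, "eu-west-1": 1}
print(uncached_failure_rate(even_split, {"us-east-1"}))
```

With a health-checked cutover the weight of the bad region would drop to zero, which is why the detection piece matters so much here.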
I'm mildly skeptical that this is worth the increased risks plus the increased cost from running Lambda@Edge on cache misses.
It's not impossible, of course. Some kind of control plane error could probably knock the whole global service offline. But I'd rather bet on a multi-region service than have all my eggs in one regional basket.
Cellular Architecture was largely a reaction to the S3 outage [0]. I agree that one is still bound to fail due to unknown unknowns or unpatchable known unknowns, but reducing the blast radius [1] to not be globally unavailable [2] is a step in the right direction.
[0] https://www.youtube-nocookie.com/embed/swQbA4zub20
[1] https://blog.acolyer.org/2016/09/12/on-designing-and-deployi...
[2] https://blog.acolyer.org/2015/05/07/large-scale-cluster-mana...
I don't advocate for ever-more-complicated solutions as a rule. For example, I think multi-cloud setups are probably way more trouble than they're worth for most companies.
I certainly agree that graceful degradation where possible and not too expensive is ideal. For example, if S3 is having problems in one region, being able to fall back (gracefully degrade) into read-only mode might be a nice thing to have.
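Roughly what I mean by falling back to read-only, as a minimal sketch. Everything here is hypothetical: the in-memory store stands in for a bucket, and a real version would wrap an S3 client and decide what counts as "region down".

```python
class RegionDown(Exception):
    """Raised by a store whose region is unavailable (hypothetical)."""


class ReadOnlyMode(Exception):
    """Raised when a write is refused because the primary is down."""


class InMemoryStore:
    """Stand-in for a bucket; a real one would wrap an S3 client."""

    def __init__(self):
        self.objects = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise RegionDown(key)
        return self.objects[key]

    def put(self, key, value):
        if not self.available:
            raise RegionDown(key)
        self.objects[key] = value


class DegradingStore:
    """Reads fall back to the replica; writes require the primary."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def get(self, key):
        try:
            return self.primary.get(key)
        except RegionDown:
            # Primary region is out: serve the replicated copy instead.
            return self.replica.get(key)

    def put(self, key, value):
        try:
            self.primary.put(key, value)
        except RegionDown as exc:
            # Replication is one-way, so refuse writes rather than diverge.
            raise ReadOnlyMode("primary region down; writes refused") from exc
```

Reads keep working from the replica while writes fail loudly, so callers can show a "read-only for now" message instead of an error page.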
(In this particular case having a secondary region also probably helps with disaster recovery, which is pretty much mandatory in B2B, for better or worse.)
Specifically Lambda@Edge though, which changes the math a little.
That said, currently the better solution is CloudFront Origin Groups.
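For anyone who hasn't seen one, an origin group is just a primary origin, a secondary origin, and a list of status codes that trigger failover. Here's a sketch that builds that fragment as a plain dict. The field names follow the `OriginGroups` shape of the CloudFront distribution config as I understand it; the IDs and the helper function are made up, so check it against the docs before using it.

```python
def origin_group(group_id, primary_origin_id, secondary_origin_id,
                 failover_status_codes=(500, 502, 503, 504)):
    """Build one origin group entry (helper name is hypothetical)."""
    codes = list(failover_status_codes)
    return {
        "Id": group_id,
        # CloudFront retries against the second member on these codes.
        "FailoverCriteria": {
            "StatusCodes": {"Quantity": len(codes), "Items": codes},
        },
        # Order matters: the first member is the primary origin.
        "Members": {
            "Quantity": 2,
            "Items": [
                {"OriginId": primary_origin_id},
                {"OriginId": secondary_origin_id},
            ],
        },
    }


# Illustrative IDs; they would have to match origins defined elsewhere
# in the same distribution config.
origin_groups = {
    "Quantity": 1,
    "Items": [origin_group("assets-failover", "s3-primary", "s3-replica")],
}
```

The cache behavior would then target the group ID instead of a single origin, and no Lambda@Edge is involved on the failover path.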
Presumably Part 2 of this post will address this limitation, or maybe their product isn't affected by it. (I've never looked into Contentful, though maybe I will now -- blog post purpose achieved?)
I'm also not sure if "active-active" is the best name for this setup, since objects can't be written to the 2nd bucket (replication only goes one direction). Generally I associate "active-active" with "writes can happen anywhere", though maybe I'm wrong?
[0] https://aws.amazon.com/blogs/aws/s3-replication-update-repli...
Full disclosure: I've never used it, but I'm pretty sure this feature was created for the scenario you're trying to solve.
I have not used origin failover either, though I'm pretty sure you're right that this is its exact use case.
Source: I work at Contentful.
A single region can go down (and has in the past), no matter how reliable S3 as a whole is. If your business wants to avoid downtime, this is a simple way to further reduce the risk of it.
Just because other sites might go down doesn't mean you have to accept it for your own.
Disclaimer: I work at Contentful.
Adding another vendor to your stack often comes with non-engineering complexity (e.g. data protection forms, contractual requirements from customers, another vendor you need to work with, etc.), so this is an alternative that lets you stay in AWS when you already use it.
Disclaimer: I work at Contentful.