Making S3 More Resilient Using Lambda Edge

(contentful.com)

64 points · thelonelygod · 6y ago · 23 comments

sciurus · 6y ago

Before, they'd be affected by Route 53 outages, CloudFront outages, and S3 outages. Now they can add Lambda outages to that list too.

It's also unclear how this actually solves the problem. Now if S3 in _either_ region is unavailable they'll start to fail 50% of uncached requests. I'm guessing they're using Route 53 health checks with some CloudWatch alarm to cut over to one region when they think the other is unhealthy. Presumably this is covered in part 2, which isn't available yet.

I'm mildly skeptical that this is worth the increased risks plus the increased cost from running Lambda@Edge on cache misses.
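
To put a number on the 50% figure, here is a minimal sketch. It assumes the Lambda@Edge function picks one of two buckets uniformly at random on every cache miss, with no health check and no retry; the article's actual selection logic may differ. The function names and the cache hit ratio are hypothetical.

```python
def uncached_failure_rate(regions_down: int, total_regions: int = 2) -> float:
    """Fraction of cache misses that fail when the origin is picked uniformly at random."""
    if not 0 <= regions_down <= total_regions:
        raise ValueError("regions_down must be between 0 and total_regions")
    return regions_down / total_regions


def overall_failure_rate(cache_hit_ratio: float, regions_down: int,
                         total_regions: int = 2) -> float:
    """Fraction of all requests that fail, assuming cache hits never touch S3."""
    miss_ratio = 1.0 - cache_hit_ratio
    return miss_ratio * uncached_failure_rate(regions_down, total_regions)


if __name__ == "__main__":
    # One of two regions down, 90% of requests served from the CloudFront cache.
    print(uncached_failure_rate(1))
    print(overall_failure_rate(0.9, 1))
```

Under these assumptions, one failed region breaks half of the cache misses, so the share of all requests that fail depends on how much traffic CloudFront can serve from cache.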

ReidZB · 6y ago

I think it's still a reduction in risk overall. In the old model, they were vulnerable to S3 failing in one region, a thing that's happened many times. Now they've mitigated the S3-failure-in-one-region issue, at least mostly (though as you point out, how they do so is unknown), and in exchange they've picked up a dependency on Lambda@Edge. But Lambda@Edge, like CloudFront, is a global service distributed across many regions, and to my knowledge AWS has never had a global Lambda@Edge outage.

It's not impossible, of course. Some kind of control plane error could probably knock the whole global service offline. But I'd rather bet on a multi-region service than have all my eggs in one regional basket.

harikb · 6y ago

The most famous S3 outage was caused by operator error from a well-meaning privileged user. The fact that it hasn’t happened for Lambda is just betting on luck. Shit happens, and we can’t go designing ever more complicated solutions. Maybe our services should have some graceful degradation when shit happens instead of trying to create a big bang and spawn an alternate universe.

ignoramous · 6y ago

> The fact that it hasn’t happened for Lambda is just betting on luck.

Cellular Architecture was largely a reaction to the S3 outage [0]. I agree that one is still bound to fail due to unknown unknowns or unpatchable known unknowns, but reducing the blast radius [1] to not be globally unavailable [2] is a step in the right direction.

[0] https://www.youtube-nocookie.com/embed/swQbA4zub20

[1] https://blog.acolyer.org/2016/09/12/on-designing-and-deployi...

[2] https://blog.acolyer.org/2015/05/07/large-scale-cluster-mana...

ReidZB · 6y ago

I mean, I agree in spirit, but everyone has a different sense of cost/complexity vs. return.

I don't advocate for ever-more-complicated solutions as a rule. e.g. I think multi-cloud setups are probably way more trouble than they're worth for most companies.

I certainly agree that graceful degradation where possible and not too expensive is ideal. For example, if S3 is having problems in one region, being able to fall back (gracefully degrade) into read-only mode might be a nice thing to have.

(In this particular case having a secondary region also probably helps with disaster recovery, which is pretty much mandatory in B2B, for better or worse.)

paulddraper · 6y ago

> Now they can add Lambda outages to that list too.

Specifically Lambda@Edge though, which changes the math a little.

That said, currently the better solution is CloudFront Origin Groups.

ReidZB · 6y ago

If the "Cross-region replication" line in the picture is talking about the native S3 cross-region replication (as I assume it is), beware the replication latency in this setup. AWS recently released "replication with an SLA" for S3 [0], but at "99.99% of the objects will be replicated within 15 minutes", it's not a good enough SLA to rely on in setups like this.

Presumably Part 2 of this post will address this limitation, or maybe their product isn't affected by it. (I've never looked into Contentful, though maybe I will now -- blog post purpose achieved?)

I'm also not sure if "active-active" is the best name for this setup, since objects can't be written to the 2nd bucket (replication only goes one direction). Generally I associate "active-active" with "writes can happen anywhere", though maybe I'm wrong?

[0] https://aws.amazon.com/blogs/aws/s3-replication-update-repli...
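
A back-of-the-envelope sketch of what that SLA leaves uncovered. It assumes the 99.99% figure applies uniformly to newly written objects; the function name and the one-million-object workload are made up for illustration.

```python
def objects_possibly_late(new_objects: int, sla_fraction: float = 0.9999) -> float:
    """Expected number of new objects NOT covered by the 15-minute replication SLA."""
    if not 0.0 <= sla_fraction <= 1.0:
        raise ValueError("sla_fraction must be between 0 and 1")
    return new_objects * (1.0 - sla_fraction)


if __name__ == "__main__":
    # Hypothetical workload: one million objects written to the primary bucket.
    print(round(objects_possibly_late(1_000_000)))
```

Even the objects inside the SLA can take up to 15 minutes to appear in the second bucket, so a request for a fresh object that is routed to the replica during that window may not find it.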

rynop · 6y ago

Confused - why not use CloudFront Origin Groups? https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope...

Full disclosure: I've never used it, but I'm pretty sure this feature was created for the scenario you are trying to solve.
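
For reference, a minimal sketch of the shape an origin group takes in the `OriginGroups` section of a CloudFront distribution config, built here as a plain Python dict. The helper name and origin IDs are hypothetical, and the list of status codes that trigger failover is an assumption to tune for your setup.

```python
def origin_group(group_id, primary_origin_id, secondary_origin_id,
                 failover_codes=(500, 502, 503, 504)):
    """Build an OriginGroups fragment with one primary and one secondary origin."""
    codes = list(failover_codes)
    return {
        "Quantity": 1,
        "Items": [
            {
                "Id": group_id,
                # CloudFront retries against the secondary origin when the
                # primary responds with one of these status codes.
                "FailoverCriteria": {
                    "StatusCodes": {"Quantity": len(codes), "Items": codes}
                },
                # Order matters: the first member is the primary origin.
                "Members": {
                    "Quantity": 2,
                    "Items": [
                        {"OriginId": primary_origin_id},
                        {"OriginId": secondary_origin_id},
                    ],
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical origin IDs that would be defined in the distribution's Origins list.
    print(origin_group("assets-failover", "s3-primary", "s3-replica"))
```

The cache behavior would then target the group's `Id` instead of a single origin, which moves the failover decision into CloudFront itself rather than custom code.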

ReidZB · 6y ago

That's a relatively new feature (from November 2018... wow, has it been a year already?). My guess is that they implemented this stuff before that existed or maybe near to its release.

I have not used origin failover either, though I'm pretty sure you're right that this is its exact use-case.

johanneswu · 6y ago

You are correct, this was implemented before Origin Groups were released. They might be a viable alternative, but we haven’t tested them yet.

Source: I work at Contentful.

rynop · 6y ago

My concern is that you (Contentful) are leading folks down a bad path: adding complexity, code, cost, and a larger surface area of services that need to be up.

Zaheer · 6y ago

Although it may make sense for this company, in the _majority_ of companies this would be over-engineering. S3 availability is some of the best in the business. If S3 is down, a good chunk of the internet is down with it.

NathanWilliams · 6y ago

I disagree.

A single region can go down (and has in the past), no matter how reliable S3 as a whole is. If your business wants to avoid downtime, this is a simple solution to further reduce risks that cause downtime.

Just because other sites might go down, doesn't mean you have to accept it for your own.

advisedwang · 6y ago

Google Cloud Storage has multi-region storage classes. Does S3 not have an equivalent of this?

reilly3000 · 6y ago

They have cross-region replication.

m0rphling · 6y ago

I think it's worthwhile to look more into GCP's multi-region bucket implementation and how nice it is. It pretty much removes the need to explicitly set up cross-region replication of objects and offers a single endpoint from which to serve objects in the nearest/most available region.

knodi · 6y ago

Sorry, I can't condone the use of AWS Lambda@Edge. There's no central log aggregation or alerting in the event of an issue.

johanneswu · 6y ago

Those are also our biggest pain points, together with slow deployments (CloudFront distributions can take 10-20 minutes to update, which is required to roll out a new version) and no support for Lambda aliases. We forwarded that to our contacts at AWS; please do the same :-)

Disclaimer: I work at Contentful.

zackbloom · 6y ago

It's worth pointing out you can just point Cloudflare Load Balancing at two S3 buckets and call it a day.

johanneswu · 6y ago

If you are already using Cloudflare, that is correct; if you aren't, that is an alternative.

Adding another vendor to your stack often comes with non-engineering complexity (e.g. data protection forms, contractual requirements from customers, another vendor you need to work with, etc.), so this is an alternative that lets you stay in AWS when you already use it.

Disclaimer: I work at Contentful.

jugg1es · 6y ago

The architecture described here is pretty simple. The article states the fix was 20 lines of code. If this is the hardest problem you have to solve at work, I envy you.
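
For a sense of scale, here is a sketch of what a roughly 20-line Lambda@Edge origin-request handler could look like. This is not Contentful's code: the bucket domain names are hypothetical, and picking a bucket at random on every cache miss is an assumption about how the traffic is split.

```python
import random

# Hypothetical (domain, region) pairs for the primary bucket and its replica.
BUCKETS = [
    ("example-assets-primary.s3.us-east-1.amazonaws.com", "us-east-1"),
    ("example-assets-replica.s3.eu-west-1.amazonaws.com", "eu-west-1"),
]


def handler(event, context, choose=random.choice):
    """Origin-request handler: send each cache miss to one of two S3 buckets."""
    request = event["Records"][0]["cf"]["request"]
    domain, region = choose(BUCKETS)

    # Rewrite the S3 origin that CloudFront is about to contact.
    s3_origin = request["origin"]["s3"]
    s3_origin["domainName"] = domain
    s3_origin["region"] = region

    # S3 expects the Host header to match the bucket endpoint.
    request["headers"]["host"] = [{"key": "Host", "value": domain}]
    return request
```

The `choose` parameter exists only so the selection can be swapped out in a test. As written there is no health check or retry, so a regional S3 outage would still fail about half of the cache misses.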
