Summary of the Amazon Kinesis Event in the Northern Virginia (US-East-1) Region (opens in new tab)

(aws.amazon.com)

355 pointscodesparkle5y ago146 comments

146 comments

Running out of file handles and other IO limits is embarrassing and happens at every company, but I’m surprised that AWS was not monitoring this.

I’m also surprised at the general architecture of Kinesis. What appears to be their own hand rolled gossip protocol (that is clearly terrible compared to raft or paxos, a thread per cluster member? Everyone talking to everyone? An hour to reach consensus?) and the front end servers being stateful period breaks a lot of good design choices.

The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there.

solatic5y ago

> The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there.

I see where you're coming from with this, but you really have to wonder. It sounds more like the original architects made implicit assumptions regarding scale that, likely due to original architects and engineers moving on, were not re-evaluated by the current engineers on Kinesis as Kinesis grew. While it may take an hour now for the front-end cache to sync, I find it highly unlikely that it needed that much time when Kinesis first launched.

The process failure here is organizational, one where Amazon failed to audit its current systems in a complete and current manner such that sufficient attention and resources could be paid to a re-architecture of a critical service before it caused the service to fail. Even now, vertically scaling the front-end cache fleet is just a band-aid - eventually, that won't be possible anymore. Sadly, the postmortem doesn't seem to identify the organizational failure that was the true root cause of the outage.

edoceo5y ago

Oof. My little company is refactoring some five year old architecture design choices. Ugly. Process isn't visible outside the refactor and the work is tedious. Can't imagine what a service refactor is like at A. I bet it sucks

2 more replies

pentlander5y ago

The hand rolled gossip protocol (DFDD) is not used for consensus, it's just used for cluster membership and health information. It's used by pretty much every foundational AWS service. There's a separate internal service for consensus that uses paxos.

The thread per frontend member definitely sounds like a problematic early design choice. It wouldn't be the first time I heard of an AWS issue due to "too many threads". Unlike gRPC, the internal RPC framework defaults to a thread per request rather than an async model. The async way was pretty painful and error prone.

joneholland5y ago

Are they still using Coral and Codigo as the RPC stack?

3 more replies

amf125y ago

Yep, that was my understanding as well. It doesn't seem to be for consensus.

Although, for Frontend servers which just do auth, routing, etc - why is P2P gossip necessary for building shard map? Possibly because retrieval of configuration information directly from the vending service may be a bottleneck - but then why not gossip with a subset of peers than every peer and the vending service which is a source of truth.

ignoramous5y ago

Are you sure Kinesis uses DFDD [0]?

[0] Seems like a relic of years gone by https://patents.justia.com/patent/9838240

2 more replies

justicezyx5y ago

I led the storage engine prototyping for Kinesis in 2012 (the best time in my career so far).

Kinesis uses Chain Replication, a dead simple fault tolerante storage algorithm: machines formed a chain, data flow from head to tail in one direction, writes always start at head, and read at tail, new nodes always join at tail, but nodes can be kicked out at any position.

The membership management of chain node is done through a paxos-based consensus service like chubby or zookeeper. Allan [2] (the best engineer I personally worked with so far, way better than anyone I encountered) wrote that system. The Java code quality shows itself after the first glance. Not mentioning the humbleness and openness in sharing his knowledge during early design meetings.

I am not sure what protocol is actually used now. But I would be surprised it's different, given the protocol's simplicity and performance.

[1] https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf [2] https://www.linkedin.com/in/allan-vermeulen-58835b/

ryanworl5y ago

Can you explain why the sequence numbers are so giant? I've never understood that.

1 more reply

marcinzm5y ago

I don't think it's about growing fast so much as, from those I talked to, Amazon now has a fairly bad reputation in the tech community. You only go to work there if you don't have a better option (Google, Facebook, etc) or have some specialty skill they're willing to pay for. Pay is below other FAANG companies and the work culture isn't great (toxic even some would say).

edit: They also had the most disorganized and de-centralized interview approach from all the FAANG companies I talked with. Which isn't growing pains this far in, it's just bad management and process.

ActorNightly5y ago

Just as a general reminder to anyone reading this: forum comments are incredibly biased and hardly ever represent reality accurately.

1 more reply

imajoredinecon5y ago

Interesting re interview experience

I interviewed as a new grad SWE and the process was totally straightforward, and way lower friction (albeit much less human interaction, which made it feel even more impersonal) than almost everywhere else I applied: initial online screen, online programming task, and then a video call with an engineer where you explained your answer to the programming task.

2 more replies

amf125y ago

> I don't think it's about growing fast so much as, from those I talked to, Amazon now has a fairly bad reputation in the tech community.

My personal observation having known quite a few Amazon SWEs and interviewed them.

The bad rep is only for the junior roles. SWEs who work at AWS and are high L5+ are pretty solid.

1 more reply

nixass5y ago

Very anecdotal

1 more reply

jknoepfler5y ago

It irks me to this day that AWS all-hands meetings (circa 2015) celebrated an exponential hiring curve (as in the graph was greeted with applause and touted as evidence of success by the speaker). The next plot would be an exponential revenue curve with approximately the same shape. Meanwhile the median lifespan of an engineer was ~10 months. I don't know, I just couldn't square that one in my head.

jen205y ago

> What appears to be their own hand rolled gossip protocol (that is clearly terrible compared to raft or paxos)

Raft and Paxos are not gossip protocols - they are consensus protocols.

joneholland5y ago

Fair. What I meant to say is “hand rolling a way to have consensus on a shared piece of data” by implementing it with a naive gossip system.

bonfire5y ago

If you want to eat in a restaurant it's better not to look in the kitchen :-|

incognito_limey5y ago

As someone who is in another high growth start up, one of the fastest in the world (not hyperbole) I wish I could upvote this specific comment more: "The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there."

I came to the realization about a year ago, that there are definitely talent tiers and unless you are working super hard at recruiting and paying top dollar, the cliff edge approaches fast and is very very steep.

hintymad5y ago

AWS frontend services are usually implemented in Java. If Kinesis' frontend does too, then it's surprising that the threads created by a frontend service would exceed the OS limit. This tells three possibilities: 1. Kinesis did not impose a max thread count in their app, which is a gross omission; 2. Or there was a resource leak in their code. 3. Each of their frontend instances stored all the placement information of backend servers, which means their frontend was not scalable by backend size.

joneholland5y ago

My understanding is that every front end server has at least one connection (on a dedicated thread) to every other front end server.

Assuming they have say, 5000 front end instances, thats 5000 file descriptors being used just for this, before you are even talking about whatever threads the application needs.

It’s not surprising that they bumped into ulimits, though as part of OS provisioning, you typically have those tuned for workload.

More concerning is the 5000 x 5000 amount of open tcp sessions across their network to support this architecture. This has to be a lot of fun on any stateful firewall it might cross.

1 more reply

fafner5y ago

Yep. Having each front end needing to scale with the overall size of the front end sounds is obviously going to hit some scaling limit. It's not clear to me from the summary why they are doing that. If it's for the shard-map or cache? Maybe if the front end is stateful that's a way to do stickiness? Seems we can only guess.

karmakaze5y ago

Kinesis was the worst AWS tech I've ever used. Splitting a stream into shards doesn't increase throughput if you still need to run the same types/number of consumers on every shard. The suggested workaround at the time was to use larger batches and add latency to the processing pipeline.

hedora5y ago

I’ve noticed a strong tendency for older systems to accumulate “spaghetti architecture”, where newer employees add new subsystems and tenured employees are blind to the early design mistakes they made. For instance, in this system, it sound like they added a complicated health check mechanism at some point to bounce faulty nodes.

Now, they don’t know how it behaves, so they’re afraid to take corrective actions in production.

They built that before ensuring that they logged the result of each failed system call. The prioritization seems odd, but most places look at logging as a cost center, and the work of improving it as drudgery, even though it’s far more important than shiny things like automatic response to failures, and also takes a person with more experience to do properly.

throwaway1892625y ago

I can't remember which db, but somebody a while back claimed that one of Amazon's "infinitely scalable" dbs was tons of abstraction on top a massive heap of MySQL instances.

I don't trust anything outside core services on AWS. Regardless of whether the rumor I heard is true, it's clear they appreciate quantity over quality.

rubiquity5y ago

Disclosure: I work at AWS, possibly near the system you're describing. Opinions are my own.

If we're talking about the same thing then I think casting stones just because it is based on MySQL is severely misguided. MySQL has decades of optimizations and this particular system at Amazon has solved scaling problems and brought reliability to countless services without ever being the direct cause of an outage (to the best of my knowledge).

Indeed, MySQL is not without its flaws but many of these are related to its quirks in transactions and replication which this system completely solves. The cherry on top is that you have a rock solid database with a familiar querying language and a massive knowledge base to get help from when needed. Oh, and did I mention this system supports multiple storage engines besides just MySQL/InnoDB?

I for one wish we would open source this system though there are a ton of hurdles both technical and not. I think it would do wonders for the greater tech community by providing a much better option as your needs grow beyond a single node system. It has certainly served Amazon well in that role and I've heard Facebook and YouTube have similar systems based on MySQL.

To further address your comment about Amazon/AWS lacking quality: this system is the epitome of our values of pragmatism and focusing our efforts on innovating where we can make the biggest impact. Hand rolling your own storage engines is fun and all but countless others have already spent decades doing so for marginal gains.

1 more reply

WookieRushing5y ago

This is really common and works pretty awesomely. MySQL is extremely battle tested and there’s lots of experts out there for it.

FB built a similar system to maintain their graph: https://blog.yugabyte.com/facebooks-user-db-is-it-sql-or-nos...

It’s a ton of tiny DBs that look like one massive eventually consistent DB

newscom595y ago

What’s the problem with this? Anything that’s scalable is “just an abstraction” on top of a heap of shards/processes/nodes/data centers/whatever.

ignoramous5y ago

I believe someone allegedly from AWS said DynamoDB was written on top of MySQL (on top of InnoDB, really) [0] which would be similar to what Uber and Pinterest did as well. [1]

[0] https://news.ycombinator.com/item?id=18426096

[1] https://news.ycombinator.com/item?id=12605026

blandflakes5y ago

At more than one of my jobs, this has been exactly the right way to horizontally scale relational workloads and has gone very well.

derwiki5y ago

Sounds like Dynamo

rantwasp5y ago

if this is dynamo, this is just a bunch of BS. you cannot get the latencies DDB has with a ton of abstractions on top of MySql.

ris5y ago

The one thing I want to know in cases like this is: why did it affect multiple Availability Zones? Making a resource multi-AZ is a significant additional cost (and often involves additional complexity) and we really need to be confident that typical observed outages would actually have been mitigated in return.

talawahtech5y ago

Multi-AZ doesn't protect against a software/OS issue like this, Multi-AZ would be relevant if it was an infrastructure failure (e.g. underlying EC2 instances or networking).

The relevant resiliency pattern in this case would be what they refer to as cell-based architecture, where within an AZ services are broken down into smaller independent cells to minimize the blast radius.

They specifically mention in the write-up that this was a gap they plan to address, the "backend" portion of Kinesis was already cellularized but that step had not yet been completed on the "frontend".

Celluarization in combination with workload partitioning would have helped, e.g. don't run Cloudwatch, Cognito and Customer workloads on the same set of cells.

It is also important to note that celluarization only helps in this case if they limit code deployment to a limited number of cells at a time.

This YouTube video[1] of a re:invent presentation does a great job of explaining it. The cell-based stuff, starts around minute 20.

1. https://youtu.be/swQbA4zub20

talawahtech5y ago

Another relevant point made in the video is that they restrict cells to a maximum size which then makes it easier to test behavior at that size. This would have also helped avoid this specific issue since the number of threads would have been tied to the number of instances in a cell.

I definitely recommend checking out the video. Even if you have seen it before, rewatching it in the context of this post-mortem really makes it hit home.

1 more reply

brown9-25y ago

but why does a Kinesis outage due to a capacity increase affect multiple AZs, if one assumes the capacity increase (and the frontend servers impacted by it) are in a single zone?

EdwardDiego5y ago

Indeed. We're paying (and designing our systems to work on multiple AZs) to reduce the risk of outages, but then their back-end services are reliant on services in a sole region?

otterley5y ago

(Disclaimer: I work for AWS but opinions are my own. I also do not work with the Kinesis team.)

Nearly all AWS services are regional in scope, and for many (if not most) services, they are scaled at a cellular level within a region. Accounts are assigned to specific cells within that region.

There are very, very few services that are global in scope, and it is strongly discouraged to create cross-regional dependencies -- not just as applied to our customers, but to ourselves as well. IAM and Route 53 are notable exceptions, but they offer read replicas in every region and are eventually consistent: if the primary region has a failure, you might not be able to make changes to your configuration, but the other regions will operate on read-only replicas.

This incident was regional in scope: us-east-1 was the only impacted region. As far as I know, no other region was impacted by this event. So customers operating in other regions were largely unaffected. (If you know otherwise, please correct me.)

As a Solutions Architect, I regularly warn customers that running in multiple Availability Zones is not enough. Availability Zones protect you from many kinds of physical infrastructure failures, but not necessarily from regional service failures. So it is super important to run in multiple regions as well: not necessarily active-active, but at least in a standby mode (i.e. "pilot light") so that customers can shed traffic from the failing region and continue to run their workloads.

2 more replies

qz25y ago

Correct.

I, as many people have, discovered this when something broke in one of the golden regions. In my case cloudfront and ACM.

Realistically you can’t trust one provider at all if you have high availability requirements.

The justification is apparently that the cloud is taking all this responsibility away from people but from personal experience running two cages of kit at two datacenters the TCO was lower and the reliability and availability higher. Possibly the largest cost is navigating Harry-Potter-esque pricing and automation laws. The only gain is scaling past those two cages.

Edit: I should point out however that an advantage of the cloud is actually being able to click a couple of buttons and get rid of two cages worth of DC equipment instantly if your product or idea doesn't work out!

3 more replies

frankietaylr5y ago

AWS should make this more transparent so that we make better design choices.

tnolet5y ago

This is a pretty damn decent post mortem so soon after the outage. Also gives an architectural analysis of how Kinesis works which is something they had not have to do at all.

lytigas5y ago

> During the early part of this event, we were unable to update the Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event.

Poetry.

Then, to be fair:

> We have a back-up means of updating the Service Health Dashboard that has minimal service dependencies. While this worked as expected, we encountered several delays during the earlier part of the event in posting to the Service Health Dashboard with this tool, as it is a more manual and less familiar tool for our support operators. To ensure customers were getting timely updates, the support team used the Personal Health Dashboard to notify impacted customers if they were impacted by the service issues.

I'm curious if anyone here actually got one of these.

newscom595y ago

The PHD is always updated first, long before the global status page is updated. Every single one of my clients that use AWS got updates on the PHD literally hours before the status page was even showing any issues, which is typical. It’s the entire point of the PHD.

Through reading Reddit and HN during this event I learned that most people apparently aren’t even aware of the existence of the PHD and rely solely on the global status page, despite the fact that there is a giant “View my PHD” button at the very top of the global status page, and additionally there is a notification icon on the header of every AWS console page that lights up and links you directly to the PHD whenever there is an issue.

The PHD is always where you should look first. It is, by design, updated long before the global status page is.

robbiemitchell5y ago

> despite the fact that there is a giant “View my PHD” button at the very top of the global status page

If you don’t know what the PHD is, a big button pointing to it won’t do anything. People ignore big boxes of irrelevant stuff all the time.

AWS user of ~8 years and I’ve never heard of the PHD nor this sequencing of updating it first.

0x115y ago

I can't say for sure that the company I work for didn't, but it certainly didn't make it's way to me and there are only 8 of us.

ufmace5y ago

My employer is a pretty big spender with AWS. I didn't hear anything about anybody getting status updates from a "Personal Health Dashboard" or anywhere else. I can't be 100% sure such an update would have made its way to me, but given the amount of buzzing, it's hard to believe that somebody had info like that and didn't share it.

mwarkentin5y ago

Yes, we had some messages coming through in our PHD.

vishnugupta5y ago

This won't be a first. The status page was hosted in S3. It is hilarious in the hindsight, but understandable.

capableweb5y ago

> but understandable

Is it really? I get the value of eating your own dogfood, it improves things a lot.

But your status page? Such a high importance, low difficulty thing to build that dogfeeding it gives you small amount of benefits (dogfeed something bigger/more complex instead) in the good case, and high amount of drawback when things go wrong (like when your infrastructure goes down, so does your status page). So what's the point?

2 more replies

loriverkutya5y ago

I can confirm we got the Personal Health Dashboard notifications.

freeone30005y ago

The failure to update the Service Health Dashboard was due to reliance on internal services to update. This also happened in March 2017[0]. Perhaps a general, instead of piecemeal, approach to removing dependencies on running services from the dashboard would be valuable here?

0: https://aws.amazon.com/message/41926/

roman_sf5y ago

But this time it was a _different_ dependency. They just want to make sure all dependencies are ruled out before migrating to GCP :)

codesparkleOP5y ago

From the postmortem:

At 9:39 AM PST, we were able to confirm a root cause [...] the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration.

rswail5y ago

Minor detail, but is anyone else irritated by the use of the word "learnings" instead of "lessons"? "To learn" is a verb. Nouning verbs seems to be an unnecessary operationalization.

bithavoc5y ago

They’re calling it an “Event”, title should say “Summary of the Amazon Kinesis Outage...”

lend0005y ago

Even today I had a few minutes of intermittent networking outages around 9:30am EST (which started on the day of the incident), and compared to other regions, I frequently get timeouts when calling S3 from us-east-1 (although that has been happening since forever).

karmakaze5y ago

Seems to me that the root problem could also be fixed by not using presumably blocking application threads talking to each of the other servers. Any async or poll mechanism wouldn't require N^2 threads across the pool.

why-el5y ago

I wonder if the new wonders coming out from linux (io_uring...) would have made this a better design, but that work in the kernel is still in active development.

ipsocannibal5y ago

So the cause of outage boils down to not having a metric on total file descriptors with an alarm if usage gets within 10% of the Max and a faulty scaling plan that should of said "for every N backend hosts we add we must add X frontend hosts". One metric and a couple of lines in a wiki could have saved Amazon what is probably millions in outage related costs. One wonders if Amazon retail will start hedging its bets and go multicloud to prevent impacts on the retail customers from AWS LSE's.

ignoramous5y ago

root-cause tldr:

...[adding] new capacity [to the front-end fleet] had caused all of the servers in the [front-end] fleet to exceed the maximum number of threads allowed by an operating system configuration [number of threads spawned is directly proportional to number of servers in the fleet]. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.

fixes:

...moving to larger CPU and memory servers [and thus fewer front-end servers]. Having fewer servers means that each server maintains fewer threads.

...making a number of changes to radically improve the cold-start time for the front-end fleet.

...moving the front-end server [shard-map] cache [that takes a long time to build, up to an hour sometimes?] to a dedicated fleet.

...move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet.

...accelerate the cellularization [0] of the front-end fleet to match what we’ve done with the back-end.

[0] https://www.youtube.com/watch?v=swQbA4zub20 and https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c...

frankietaylr5y ago

I wonder how many of them are already logged engineering tasks which never got prioritized because of the aggressive push to add features.

terom5y ago

Unsurprising to see such outages also tickling bugs/issues in the fallback behavior of dependent services that were intended to tolerate outages. There must be some classic law of cascading failures caused by error handling code :)

> Amazon Cognito uses Kinesis Data Streams [...] this information streaming is designed to be best effort. Data is buffered locally, allowing the service to cope with latency or short periods of unavailability of the Kinesis Data Stream service. Unfortunately, the prolonged issue with Kinesis Data Streams triggered a latent bug in this buffering code that caused the Cognito webservers to begin to block on the backlogged Kinesis Data Stream buffers.

> And second, Lambda saw impact. Lambda function invocations currently require publishing metric data to CloudWatch as part of invocation. Lambda metric agents are designed to buffer metric data locally for a period of time if CloudWatch is unavailable. Starting at 6:15 AM PST, this buffering of metric data grew to the point that it caused memory contention on the underlying service hosts used for Lambda function invocations, resulting in increased error rates.

londons_explore5y ago

One requirement on my "production ready" checklist is that any catastrophic system failure can be resolved by starting a completely new instance of the service, and it be ready to serve traffic inside 10 minutes.

That should be tested at least quarterly (but preferably automatically with every build).

If Amazon did that, this outage would have been reduced to 10 mins, rather than the 12+ hours that some super slow rolling restarts took...

WookieRushing5y ago

This only works for stateless services. If you’ve got frontends that take longer than 10 mins to serve traffic then you have a problem.

But if you’re running a DB or a storage system, 10 mins is a blink of an eye. Storage systems in particular can run a few hundred TB per node and moving that data to another node can take over an hour.

In this case, the frontends have a shard map which is definitely not stateless. This is typically okay if you have a fast load operation which blocks other traffic until shard map is fully loaded

londons_explore5y ago

It's possible (albeit much harder) for stateful services too.

It basically boils down to "We must be able to restore the minimum necessary parts of a full backup in under 10 minutes".

Take wikipedia as an example. I'd expect them to be able to restore a backup of the latest version of all pages in 10 minutes. It's 20GB of data, and I assume it's sharded at least 10 ways. That means each instance will have to grab 2GB from the backups. Very do-able.

As a service gets bigger, you typically scale horizontally, so the problem doesn't get harder.

Restoring all the old page versions and re enabling editing might take longer, but that's less critical functionality.

why-el5y ago

The same OS limits would apply to new instances, unless they knew the root cause and forced new instances to be configured with larger descriptor limits, which is....well, hindsight is 20/20, no?

WJW5y ago

Kinesis probably runs well over 100k instances. Restarting it might not be so trivial that you can do it in 10 minutes.

steelframe5y ago

> Cellularization is an approach we use to isolate the effects of failure within a service, and to keep the components of the service (in this case, the shard-map cache) operating within a previously tested and operated range. This had been under way for the front-end fleet in Kinesis, but unfortunately the work is significant and had not yet been completed.

Translation: The eng team knew that they had accumulated tech debt by cutting a corner here in order to meet one of Amazon's typical and insane "just get the feature out the door" timelines. Eng warned management about it, and management decided to take the risk and lean on on-call to pull heroics to just fix any issues as they come up. Most of the time yanking a team out of bed in the middle of the night works, so that's the modus operandi at Amazon. This time, the actual problem was more fundamental and wasn't effectively addressable with middle-of-the-night heroics.

Management rolled the "just page everyone and hope they can fix it" dice yet again, as they usually do, and this time they got snake eyes.

I guarantee you that the "cellularization" of the front-end fleet wasn't actually under way, but the teams were instead completely consumed with whatever the next typical and insane "just get the feature out the door" thing was at AWS. The eng team was never going to get around to cellularizing the front-end fleet because they were given no time or incentive to do so by management. During/after this incident, I wouldn't be surprised if management didn't yell at the eng team, "Wait, you KNEW this was a problem, and you're not done yet?!?" Without recognizing that THEY are the ones actually culpable for failing to prioritize payments on tech debt vs. "new shiny" feature work, which is typical of Amazon product development culture.

I've worked with enough former AWS engineers to know what goes on there, and there's a really good reason why anybody who CAN move on from AWS will happily walk away from their 3rd- and 4th-year stock vest schedules (when the majority of your promised amount of your sign-on RSUs actually starts to vest) to flee to a company that fosters a healthy product development and engineering culture.

(Not to mention that, this time, a whole bunch of peoples' Thanksgiving plans were preempted with the demand to get a full investation and post-mortem written up, including the public post, ASAP. Was that really necessary? Couldn't it have waited until next Wednesday or something?)

reducesuffering5y ago

Ow, this was traumatizing to relive.

Yes, this is exactly how product development works at many (if not most) places within Amazon for engineers. It can be this toxic.

Disclaimer: Amazon engineer

roman_sf5y ago

Hahaha, well this time Jessy got paged so yeah... the summary got priority over turkey.

fafner5y ago

From the summary I don't understand why front end servers need to talk to each other ("continuous processing of messages from other Kinesis front-end servers"). It sounds like this is part of building the shard map or the cache. Well in the end an unfortunate design decision. #hugops for the team handling this. Cascading failures are the worst.

tmk11085y ago

How does the architecture of Kinesis compare to Kafka? If you scale up the number of Kafka brokers can you hit similar problem? Or does Kafka not rely on creating threads to connect to each other broker

aloknnikhil5y ago

Kafka uses a thread pool for request processing. Both the brokers and the consumer clients use the same request processing loop.

This goes a bit more in-depth: https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafk...

zxcvbn40385y ago

They didn’t really discuss their remediation plans but maybe having one fleet of servers for everything isn’t the best setup. I’d love to know which OS setting they ran into. In their defense this is exactly the sort of change that never shows up in testing because the dev and qa environments are always smaller then production.

I’m wondering how many people Amazon fired over this incident - that seems to be their goto answer to everything.

temp08265y ago

us-east-1 is AWS’s dirty secret. If ddb had gone down there, there would likely be a worldwide and multi-service interruption.

pps435y ago

> the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. [...] We didn’t want to increase the operating system limit without further testing

Is it because operating system configuration is managed by a different team within the organization?

mcqueenjordan5y ago

Nope. It's just a case of "stop the bleeding before starting the surgery."

sitharus5y ago

More likely they need to understand what effect changing the thread limit would have - for example it could increase kernel memory usage or increase scheduler latency. It’s not something you want to mess with in an outage.

sudhirj5y ago

I’ve heard AWS follows a you build it, you run it policy, so that seems unlikely. Just seems prudent to not mess with OS settings in a hurry.

Androider5y ago

If you start haphazardly changing things while firefighting without testing, you might make things even worse. And there's worse things than downtime, for instance if the system appears to work but you're actually silently corrupting customer data.

jaikant775y ago

"and it turned out this wasn’t driven by memory pressure. Rather, the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration."

An auto scaling irony for AWS! We seem to be back to the late 1990s :)

hintymad5y ago

A tangential question, why would AWS even use the term "microservice"? A service is a service, right? I'm not sure what the term "microservice" signifies here.

arduinomancer5y ago

It’s because service can be confused with “AWS Service” which is not the same as a microservice (a component of a full service)

metaedge5y ago

I would have started the response with:

First of all, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.

Then move on to explain...

sigzero5y ago

What they did was fine.

j / k navigate · click thread line to collapse

146 comments

joneholland5y ago

Running out of file handles and other IO limits is embarrassing and happens at every company, but I’m surprised that AWS was not monitoring this.

The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there.

solatic5y ago

> The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there.

edoceo5y ago

2 more replies

pentlander5y ago

joneholland5y ago

Are they still using Coral and Codigo as the RPC stack?

3 more replies

amf125y ago

Yep, that was my understanding as well. It doesn't seem to be for consensus.

ignoramous5y ago

Are you sure Kinesis uses DFDD [0]?

[0] Seems like a relic of years gone by https://patents.justia.com/patent/9838240

2 more replies

justicezyx5y ago

I led the storage engine prototyping for Kinesis in 2012 (the best time in my career so far).

I am not sure what protocol is actually used now. But I would be surprised it's different, given the protocol's simplicity and performance.

[1] https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf [2] https://www.linkedin.com/in/allan-vermeulen-58835b/

ryanworl5y ago

Can you explain why the sequence numbers are so giant? I've never understood that.

1 more reply

marcinzm5y ago

ActorNightly5y ago

Just as a general reminder to anyone reading this: forum comments are incredibly biased and hardly ever represent reality accurately.

1 more reply

imajoredinecon5y ago

Interesting re interview experience

2 more replies

amf125y ago

> I don't think it's about growing fast so much as, from those I talked to, Amazon now has a fairly bad reputation in the tech community.

My personal observation having known quite a few Amazon SWEs and interviewed them.

The bad rep is only for the junior roles. SWEs who work at AWS and are high L5+ are pretty solid.

1 more reply

nixass5y ago

Very anecdotal

1 more reply

jknoepfler5y ago

jen205y ago

> What appears to be their own hand rolled gossip protocol (that is clearly terrible compared to raft or paxos)

Raft and Paxos are not gossip protocols - they are consensus protocols.

joneholland5y ago

Fair. What I meant to say is “hand rolling a way to have consensus on a shared piece of data” by implementing it with a naive gossip system.

bonfire5y ago

If you want to eat in a restaurant it's better not to look in the kitchen :-|

incognito_limey5y ago

hintymad5y ago

joneholland5y ago

My understanding is that every front end server has at least one connection (on a dedicated thread) to every other front end server.

Assuming they have say, 5000 front end instances, thats 5000 file descriptors being used just for this, before you are even talking about whatever threads the application needs.

It’s not surprising that they bumped into ulimits, though as part of OS provisioning, you typically have those tuned for workload.

More concerning is the 5000 x 5000 amount of open tcp sessions across their network to support this architecture. This has to be a lot of fun on any stateful firewall it might cross.

1 more reply

fafner5y ago

karmakaze5y ago

hedora5y ago

Now, they don’t know how it behaves, so they’re afraid to take corrective actions in production.

throwaway1892625y ago

I can't remember which db, but somebody a while back claimed that one of Amazon's "infinitely scalable" dbs was tons of abstraction on top a massive heap of MySQL instances.

I don't trust anything outside core services on AWS. Regardless of whether the rumor I heard is true, it's clear they appreciate quantity over quality.

rubiquity5y ago

Disclosure: I work at AWS, possibly near the system you're describing. Opinions are my own.

1 more reply

WookieRushing5y ago

This is really common and works pretty awesomely. MySQL is extremely battle tested and there’s lots of experts out there for it.

FB built a similar system to maintain their graph: https://blog.yugabyte.com/facebooks-user-db-is-it-sql-or-nos...

It’s a ton of tiny DBs that look like one massive eventually consistent DB

newscom595y ago

What’s the problem with this? Anything that’s scalable is “just an abstraction” on top of a heap of shards/processes/nodes/data centers/whatever.

ignoramous5y ago

I believe someone allegedly from AWS said DynamoDB was written on top of MySQL (on top of InnoDB, really) [0] which would be similar to what Uber and Pinterest did as well. [1]

[0] https://news.ycombinator.com/item?id=18426096

[1] https://news.ycombinator.com/item?id=12605026

blandflakes5y ago

At more than one of my jobs, this has been exactly the right way to horizontally scale relational workloads and has gone very well.

derwiki5y ago

Sounds like Dynamo

rantwasp5y ago

if this is dynamo, this is just a bunch of BS. you cannot get the latencies DDB has with a ton of abstractions on top of MySql.

ris5y ago

talawahtech5y ago

Multi-AZ doesn't protect against a software/OS issue like this, Multi-AZ would be relevant if it was an infrastructure failure (e.g. underlying EC2 instances or networking).

Celluarization in combination with workload partitioning would have helped, e.g. don't run Cloudwatch, Cognito and Customer workloads on the same set of cells.

It is also important to note that celluarization only helps in this case if they limit code deployment to a limited number of cells at a time.

This YouTube video[1] of a re:invent presentation does a great job of explaining it. The cell-based stuff, starts around minute 20.

1. https://youtu.be/swQbA4zub20

talawahtech5y ago

I definitely recommend checking out the video. Even if you have seen it before, rewatching it in the context of this post-mortem really makes it hit home.

1 more reply

brown9-25y ago

but why does a Kinesis outage due to a capacity increase affect multiple AZs, if one assumes the capacity increase (and the frontend servers impacted by it) are in a single zone?

EdwardDiego5y ago

Indeed. We're paying (and designing our systems to work on multiple AZs) to reduce the risk of outages, but then their back-end services are reliant on services in a sole region?

otterley5y ago

(Disclaimer: I work for AWS but opinions are my own. I also do not work with the Kinesis team.)

Nearly all AWS services are regional in scope, and for many (if not most) services, they are scaled at a cellular level within a region. Accounts are assigned to specific cells within that region.

2 more replies

qz25y ago

Correct.

I, as many people have, discovered this when something broke in one of the golden regions. In my case cloudfront and ACM.

Realistically you can’t trust one provider at all if you have high availability requirements.

3 more replies

frankietaylr5y ago

AWS should make this more transparent so that we make better design choices.

tnolet5y ago

This is a pretty damn decent post mortem so soon after the outage. Also gives an architectural analysis of how Kinesis works which is something they had not have to do at all.

lytigas5y ago

> During the early part of this event, we were unable to update the Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event.

Poetry.

Then, to be fair:

I'm curious if anyone here actually got one of these.

newscom595y ago

The PHD is always where you should look first. It is, by design, updated long before the global status page is.

robbiemitchell5y ago

> despite the fact that there is a giant “View my PHD” button at the very top of the global status page

If you don’t know what the PHD is, a big button pointing to it won’t do anything. People ignore big boxes of irrelevant stuff all the time.

AWS user of ~8 years and I’ve never heard of the PHD nor this sequencing of updating it first.

0x115y ago

I can't say for sure that the company I work for didn't, but it certainly didn't make it's way to me and there are only 8 of us.

ufmace5y ago

mwarkentin5y ago

Yes, we had some messages coming through in our PHD.

vishnugupta5y ago

This won't be a first. The status page was hosted in S3. It is hilarious in the hindsight, but understandable.

capableweb5y ago

> but understandable

Is it really? I get the value of eating your own dogfood, it improves things a lot.

2 more replies

loriverkutya5y ago

I can confirm we got the Personal Health Dashboard notifications.

freeone30005y ago

0: https://aws.amazon.com/message/41926/

roman_sf5y ago

But this time it was a _different_ dependency. They just want to make sure all dependencies are ruled out before migrating to GCP :)

codesparkleOP5y ago

From the postmortem:

rswail5y ago

Minor detail, but is anyone else irritated by the use of the word "learnings" instead of "lessons"? "To learn" is a verb. Nouning verbs seems to be an unnecessary operationalization.

bithavoc5y ago

They’re calling it an “Event”, title should say “Summary of the Amazon Kinesis Outage...”

lend0005y ago

karmakaze5y ago

why-el5y ago

I wonder if the new wonders coming out from linux (io_uring...) would have made this a better design, but that work in the kernel is still in active development.

ipsocannibal5y ago

ignoramous5y ago

root-cause tldr:

fixes:

...moving to larger CPU and memory servers [and thus fewer front-end servers]. Having fewer servers means that each server maintains fewer threads.

...making a number of changes to radically improve the cold-start time for the front-end fleet.

...moving the front-end server [shard-map] cache [that takes a long time to build, up to an hour sometimes?] to a dedicated fleet.

...move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet.

...accelerate the cellularization [0] of the front-end fleet to match what we’ve done with the back-end.

[0] https://www.youtube.com/watch?v=swQbA4zub20 and https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c...

frankietaylr5y ago

I wonder how many of them are already logged engineering tasks which never got prioritized because of the aggressive push to add features.

terom5y ago

londons_explore5y ago

That should be tested at least quarterly (but preferably automatically with every build).

If Amazon did that, this outage would have been reduced to 10 mins, rather than the 12+ hours that some super slow rolling restarts took...

WookieRushing5y ago

This only works for stateless services. If you’ve got frontends that take longer than 10 mins to serve traffic then you have a problem.

In this case, the frontends have a shard map which is definitely not stateless. This is typically okay if you have a fast load operation which blocks other traffic until shard map is fully loaded

londons_explore5y ago

It's possible (albeit much harder) for stateful services too.

It basically boils down to "We must be able to restore the minimum necessary parts of a full backup in under 10 minutes".

As a service gets bigger, you typically scale horizontally, so the problem doesn't get harder.

Restoring all the old page versions and re enabling editing might take longer, but that's less critical functionality.

why-el5y ago

The same OS limits would apply to new instances, unless they knew the root cause and forced new instances to be configured with larger descriptor limits, which is....well, hindsight is 20/20, no?

WJW5y ago

Kinesis probably runs well over 100k instances. Restarting it might not be so trivial that you can do it in 10 minutes.

steelframe5y ago

Management rolled the "just page everyone and hope they can fix it" dice yet again, as they usually do, and this time they got snake eyes.

reducesuffering5y ago

Ow, this was traumatizing to relive.

Yes, this is exactly how product development works at many (if not most) places within Amazon for engineers. It can be this toxic.

Disclaimer: Amazon engineer

roman_sf5y ago

Hahaha, well this time Jessy got paged so yeah... the summary got priority over turkey.

fafner5y ago

tmk11085y ago

aloknnikhil5y ago

Kafka uses a thread pool for request processing. Both the brokers and the consumer clients use the same request processing loop.

This goes a bit more in-depth: https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafk...

zxcvbn40385y ago

I’m wondering how many people Amazon fired over this incident - that seems to be their goto answer to everything.

temp08265y ago

us-east-1 is AWS’s dirty secret. If ddb had gone down there, there would likely be a worldwide and multi-service interruption.

pps435y ago

Is it because operating system configuration is managed by a different team within the organization?

mcqueenjordan5y ago

Nope. It's just a case of "stop the bleeding before starting the surgery."

sitharus5y ago

sudhirj5y ago

I’ve heard AWS follows a you build it, you run it policy, so that seems unlikely. Just seems prudent to not mess with OS settings in a hurry.

Androider5y ago

jaikant775y ago

An auto scaling irony for AWS! We seem to be back to the late 1990s :)

hintymad5y ago

A tangential question, why would AWS even use the term "microservice"? A service is a service, right? I'm not sure what the term "microservice" signifies here.

arduinomancer5y ago

It’s because service can be confused with “AWS Service” which is not the same as a microservice (a component of a full service)

metaedge5y ago

I would have started the response with:

Then move on to explain...

sigzero5y ago

What they did was fine.

j / k navigate · click thread line to collapse