Is it because they are ahead of the curve and don't make "any" changes to their system, as opposed to other companies like us that are still maturing?
> Visa, for example, uses the mainframe to process billions of credit and debit card payments every year.
> According to some estimates, up to $3 trillion in daily commerce flows through mainframes.
https://www.share.org/blog/mainframe-matters-how-mainframes-...
https://blog.syncsort.com/2018/06/mainframe/9-mainframe-stat...
https://www.ibm.com/it-infrastructure/z/transaction-processi...
https://www.ft.com/content/1fd2a066-860f-11e8-a29d-73e3d4545...
I suspect they sometimes 'fail open' (ie. allow all payments through and reconcile later) too.
When a dispute is in play, it's a hot potato that no one wants to hold among the merchant, processor, ISO, sales agent, and bank. The card networks have been smart to eliminate themselves from that step.
Stolen/lost cards, for example, can simply be flagged in a master db/table and rejected quickly.
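The "master table" check described above amounts to a fast flag lookup before any dispute machinery gets involved. A minimal sketch, with purely hypothetical names and an in-memory set standing in for the replicated table:

```python
# Hypothetical sketch of a hot-card check: lost/stolen cards are flagged
# centrally, and authorizations against them are rejected immediately.

hot_cards = set()  # in practice: a replicated DB table, not a local set

def flag_card(pan):
    """Mark a card number as lost/stolen."""
    hot_cards.add(pan)

def authorize(pan, amount_cents):
    """Fast pre-check that runs before any dispute process."""
    if pan in hot_cards:
        return "DECLINED_HOT_CARD"
    return "APPROVED"

flag_card("4111111111111111")
assert authorize("4111111111111111", 500) == "DECLINED_HOT_CARD"
assert authorize("4000056655665556", 500) == "APPROVED"
```

The point is that this rejection path is cheap and unambiguous, unlike the dispute path.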
For example:
* My credit card statement should have links to the merchant, the address, a list of the things I bought, a link to the returns process, etc.
* Why can't my statement also have the total number of calories I've purchased in the last month, or grams of carbon in fuel I've put in the truck?
* Why can't I use my mastercard to pay another mastercard user directly?
* Why hasn't mastercard produced a '2 factor' for card payments rather than forcing every bank to implement their own?
* Why can't I buy a dual Mastercard/Visa/Other card, which works with merchants who are picky and will only accept one or the other?
* Why are we still issuing bits of plastic in the digital age anyway?
* Why don't the cards have a microusb plug on one edge, or NFC to plug into a phone or computer to log in, to act as an identity card, to authenticate or make payments, or anything else other companies issue smartcards for?
* Why doesn't Mastercard work with mobile providers to issue cards that you can spend your pay-as-you-go balance with, turning a mobile provider into a bank?
It seems Mastercard's business is 'stuck'; there are opportunities to innovate all around them, but they won't.
There's a 20-minute gap between investigation and "rollback". Why did they roll back if the service was back to normal? How could they decide on, and document, the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly not all variables were considered.
To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.
Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many which touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes.
In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however, we lacked an efficient way to fully validate this change on the order of minutes. We're investing in tooling to increase the robustness of our rapid-response mechanisms and to help responding engineers understand the potential impact of configuration changes or other remediation efforts they're pushing through an accelerated process.
I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.
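The staged, fleetwide blue/green rollout mentioned above can be sketched roughly as follows. This is purely illustrative, not Stripe's actual tooling; the stage fractions and function names are made up:

```python
# Hedged sketch of a staged rollout: deploy to a growing fraction of the
# fleet, health-check after each stage, and stop widening on bad health.

def staged_rollout(fleet, deploy, healthy, stages=(0.01, 0.10, 0.50, 1.0)):
    """Deploy to growing fractions of `fleet`, aborting if health degrades."""
    done = 0
    for frac in stages:
        target = int(len(fleet) * frac)
        for node in fleet[done:target]:
            deploy(node)          # e.g. switch node to the "green" version
        done = target
        if not all(healthy(n) for n in fleet[:done]):
            return ("rolled_back", done)  # a real system would revert here
    return ("complete", done)

fleet = [f"node-{i}" for i in range(100)]
result = staged_rollout(fleet, deploy=lambda n: None, healthy=lambda n: True)
assert result == ("complete", 100)
```

The value of the staging is that a bad change is caught while it only affects a small slice of the fleet.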
I hope that lessons are learned from this operational event, and that you invest in building metrics and tooling that allow you, first, to prevent issues and, second, to shorten outage/mitigation times in the future.
I'm happy you guys are being open about the issue, and taking feedback from people outside your company. I definitely applaud this.
That seems like a lot of change in a week, or does "deploys" mean something else, like customer websites being deployed?
From the description/comments it also sounds like the database operates directly on files rather than on file leases, since there's no notion of a separate, local (cluster-scoped) byte-level replication layer below it. That makes it harder to shoot a stateful node. It also sounds like it's tricky to externally cross-check various rates, i.e. to monitor replication RPCs and notice that certain nodes are drifting from the expected numbers, without depending on the health of the nodes themselves.
Hopefully the database doesn't also mix geo-replication for local-access/sovereignty requirements into the same mechanisms, rather than separating that out into aggregation layers above purely cluster-scoped zones!
Of course, this is all far far easier said than done given the available open source building blocks. Fun problems while scaling like crazy :)
Bryan Cantrill has a great talk[0] about dealing with fires where he says something to the effect of:
> Now you will find out if you are more operations or development - developers will want to leave things be to gather data and understand, while operations will want to rollback and fix things as quickly as possible
[0] Debugging Under Fire: Keep your Head when Systems have Lost their Mind - Bryan Cantrill: https://www.youtube.com/watch?v=30jNsCVLpAE
Mitigation is your top priority: bringing the system back to a good state.
If follow-up actions are needed, take the less impactful steps to prevent another wave.
If there was a deployment, roll back.
My concern here is that the deployment had been made months ago, and many other changes that could make things worse had been introduced since, which is the case here. Taking an extra 10-20 minutes to make sure everything is fine, versus making a hot call and causing another outage, makes a big difference.
I'm just asking questions based on the documentation provided; I do not have more insights.
I am happy Stripe is being open about the issue; that way the industry learns and matures regarding software-caused outages. Cloudflare's outage documentation is really good as well.
However, it's hard to say whether this is a poor decision unless we know that they didn't analyze the path and determine that it would most likely be fine. If they did do that, then it's just a mistake and those happen. 20 minutes is enough time to make that call for the team that built it.
The full report is here if you're curious: https://www.sec.gov/litigation/admin/2013/34-70694.pdf
If you have a good CM system, you should have a timeline of changes that you can correlate against incidents. Most incidents are caused by changes, so you can narrow down most incidents to a handful of changes.
Then the question is, if you have a handful of changes that you could roll back, and rollbacks are risk free, then does it make sense to delay rolling back any particular change until the root cause is understood?
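The correlation described above is mechanically simple once changes are timestamped. A sketch, with hypothetical change records and a made-up lookback window:

```python
# Sketch of change/incident correlation: given a timeline of change
# records, narrow an incident down to the changes that landed shortly
# before it began. IDs, times, and the 2-hour window are illustrative.

from datetime import datetime, timedelta

changes = [
    {"id": "CM-101", "at": datetime(2019, 7, 10, 16, 5)},
    {"id": "CM-102", "at": datetime(2019, 7, 10, 16, 30)},
    {"id": "CM-103", "at": datetime(2019, 7, 9, 12, 0)},   # day before
]

def suspects(changes, incident_start, window=timedelta(hours=2)):
    """Return IDs of changes deployed within `window` before the incident."""
    return sorted(
        c["id"] for c in changes
        if incident_start - window <= c["at"] <= incident_start
    )

incident = datetime(2019, 7, 10, 16, 36)
assert suspects(changes, incident) == ["CM-101", "CM-102"]
```

Of course, this only surfaces candidates; it says nothing about whether rolling any of them back is actually safe.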
How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.
This was a focus in our after-action review. The nodes responded as healthy to active checks while silently dropping updates on their replication lag; together, this created the impression of a healthy node. The missing bit was verifying the absence of lag updates. (Which we have now.)
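The missing check described here amounts to alerting on the staleness of a metric, not just its value. A minimal sketch of that idea, assuming a per-node "last lag report" timestamp (in Prometheus terms, this is roughly what `absent()` or staleness handling covers):

```python
# Sketch: alert not only on HIGH replication lag, but on the ABSENCE of
# fresh lag reports from a node. The 60s silence threshold is made up.

import time

def lag_report_is_stale(last_report_ts, now, max_silence_s=60.0):
    """True if a node has stopped reporting replication lag entirely."""
    return (now - last_report_ts) > max_silence_s

now = time.time()
assert not lag_report_is_stale(now - 10, now)   # fresh report: fine
assert lag_report_is_stale(now - 300, now)      # silent node: alert
```

A node that passes active health checks but trips this staleness check is exactly the "looks healthy, isn't" case described above.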
"[Three months prior to the incident] We upgraded our databases to a new minor version that introduced a subtle, undetected fault in the database’s failover system."
Could this have been prevented if you had stopped upgrading minor versions, i.e. frozen on one specific version and not even applied security fixes, instead relying on containing it as a "known" vulnerable database?
The reason I ask is that I've heard of ATMs still running Windows XP or the like. But if it's not networked, could it be that it actually has a bigger uptime than anything you can do on Windows 7 or 10?
What I mean is, even though it's hilariously out of date to be using Windows XP, still, by any measure it's had a billion device-days to expose its failure modes.
When you upgrade to the latest minor version of a database, don't you sacrifice the known bad for an unknown good?
excuse my ignorance on this subject.
I kinda lost count of how many times Nagios barfed itself and reported an error while the application was fine.
Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes.
Having a few nodes down is perfectly acceptable. I guess they would have had an alert if the number of down nodes exceeded some threshold.
The article said that the node stalled in a way that was unforeseen, which may have caused standard recovery mechanisms to silently fail.
It would be excellent if Stripe published a truly technical RCA, perhaps for distribution via their tech blog, so that folks like us could get a more complete understanding and what-not-to-do lesson (if the failing systems were based on non-proprietary technologies, that is).
What's fun for a software person is that there's a lot of interesting digressions and stuff to learn in the Cloudflare one. The whole explanation of the regexp at the end is something that no one cares about from the business side, but is an interesting read in and of itself.
It's worth noting that yours came out a bit more than a week faster than theirs, which jgrahamc clearly spent a lot of time writing. No idea if anyone cares about the speed with which these things are released...
The first question is who this is written for: it lacks the detail I would write for an incident-review-meeting audience, while also lacking a simpler story for non-technical readers. As it stood at the time I read it, I don't think it serves any audience very well.
I understand that the level of detail of the internal report might be excessive for a public document, but if technical readers are the target, some more details would have helped. For example, the monitoring details that Will described in another thread are a key missing piece that, if anything, would make Stripe look better, as problems like that happen all the time. I bet there are more details that are equally useful, that would be in an internal report, and that would not reveal sensitive information. In general, the only reason I could follow the document well is that I remember how the Stripe storage system worked last year, and I could handwave a year's worth of changes. Since this part of the Stripe infrastructure is relatively unique, it's difficult to understand from the outside, and the document looks as if it doesn't have enough information.
In particular, the remediations say very little that is understandable from the outside: most of the text could apply to pretty much any incident on a storage or queuing subsystem I was ever a part of: more alerts, an extra chart in an ever-growing dashboard, some circuit breakers to deal with the specific failure shape... It's all real, but without details, it says very little.
I understand why you might not want to divulge that level of detail, though. If we want fewer details, then the article could cut all kinds of low-information sections and instead focus more on the response and on the things that will change in the future. The most interesting bit is the quick version rollback, which, in retrospect, might not have been the right call. A more detailed view of the alternatives, and of why the actions that ultimately led to the second incident were taken, would be enlightening and would humanize the piece.
Thank you for not just providing a public root cause analysis, but also coming here to discuss it on HN.
If anyone on HN knows anyone who has the sort of interesting life story where they both know what can cause a cluster election to fail and like writing about that sort of thing, we would eagerly like to make their acquaintance.
The remediation part is quite cautious/generic but overall it seems like a good faith effort by someone constrained by corporate rules.
Also, does your company's engineering decisions change based on other companies' post-mortems?
That's a reasonable question. We wrote this RCA to help our users understand what had happened and to help inform their own response efforts. Because a large absolute number of requests with stateful consequences (including e.g. moving money IRL) succeeded during the event, we wanted to avoid customers believing that retrying all requests would be necessarily safe. For example, users (if they don’t use idempotency keys in our API) who simply decided to re-charge all orders in their database during the event might inadvertently double charge some of their customers. We hear you on the transparency point, though, and will likely describe events of similar magnitude as an "outage" in the future - thank you for the feedback.
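The double-charge risk described here is exactly what idempotency keys prevent: a retried request with the same key must not create a second charge. A sketch of the mechanism, with the server side simulated by a dict so the behavior is visible without a network call (names are illustrative, not Stripe's implementation):

```python
# Sketch of idempotent charge creation: the same key, replayed after a
# timeout or during an outage, returns the original result instead of
# charging the customer again.

import uuid

charges = []   # "charges" actually created
seen = {}      # idempotency key -> prior result

def create_charge(amount_cents, idempotency_key):
    if idempotency_key in seen:            # replayed request: no new charge
        return seen[idempotency_key]
    charge = {"id": f"ch_{len(charges)}", "amount": amount_cents}
    charges.append(charge)
    seen[idempotency_key] = charge
    return charge

key = str(uuid.uuid4())
first = create_charge(2000, key)
retry = create_charge(2000, key)           # client retried after a timeout
assert retry is first and len(charges) == 1   # charged exactly once
```

The critical client-side discipline is reusing the same key across retries of one logical operation; generating a fresh key per attempt defeats the whole mechanism.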
I don't understand why people demand the usage of incorrect language.
On the other hand, if for a significant number of users the site was completely unusable for some period of time, then I think it's fair to use the word "outage". (Even if it's not a complete outage affecting all users.)
I don't know whether other people would interpret these terms the same way I do, nor do I think there's enough information in this blog post to determine for sure which label is more accurate for this particular incident. So personally, I'm not going to be too picky about the wording.
The fact that you needed to qualify “outage” with “complete” clearly means the word on its own is not incorrect for cases where a system was “only” mostly unavailable rather than completely so.
> I don't understand why people demand the usage of incorrect language.
The irony.
> Stripe splits data by kind into different database clusters and by quantity into different shards.
So in theory any request that didn't interact with the problematic database should have been OK (I don't know if the offending DB was in the critical path of _every_ request).
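The "split by kind, then by quantity" layout quoted above implies requests route to a (cluster, shard) pair, so a stalled shard should only affect the keys that hash to it. An illustrative sketch; the kinds, shard counts, and hash choice are all made up:

```python
# Sketch of kind+hash routing: different data kinds live in different
# clusters, and within a cluster a key hashes to one of N shards.

import zlib

CLUSTERS = {"charges": 16, "customers": 8}   # kind -> shard count (made up)

def route(kind, key):
    """Return the (cluster, shard) pair responsible for this key."""
    shard = zlib.crc32(key.encode()) % CLUSTERS[kind]
    return (kind, shard)

cluster, shard = route("charges", "ch_12345")
assert cluster == "charges" and 0 <= shard < 16
# A stalled shard only hurts the fraction of keys that hash to it,
# unless that shard sits on the critical path of every request.
```

Which is why the parenthetical matters: if the offending DB was consulted on every request (e.g. for auth or rate limiting), the blast radius stops being one shard's worth of keys.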
Damn, what a mess. Sounds like y'all are rolling out way too many changes too quickly, with little to no time for integration testing.
It's a somewhat amateur move to assume you can just arbitrarily roll back without consequence, without testing, etc.
One solution I don't see mentioned: never upgrade to new minor versions. And create a dependency matrix so that if you do roll back, you also roll back all the other things that depend on the thing you're rolling back.
Doing a large rollback based on a hunch seems like an overreaction.
It's totally normal for engineers to make these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures, and automation are in place to reduce operator errors.
Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group (2-3) of experienced engineers sharing information in real-time and coordinating the response could have reacted better.
Of course, I wasn't there so I could be completely off.
idk, the suits have a very different viewpoint; 30 minutes of downtime for a large financial system isn't fine. It can be very costly.