Is it because they are ahead of the curve and don't make "any" changes to their system, as opposed to other companies like us that are still maturing?
> Visa, for example, uses the mainframe to process billions of credit and debit card payments every year.
> According to some estimates, up to $3 trillion in daily commerce flows through mainframes.
https://www.share.org/blog/mainframe-matters-how-mainframes-...
https://blog.syncsort.com/2018/06/mainframe/9-mainframe-stat...
https://www.ibm.com/it-infrastructure/z/transaction-processi...
https://www.ft.com/content/1fd2a066-860f-11e8-a29d-73e3d4545...
I suspect they sometimes 'fail open' (ie. allow all payments through and reconcile later) too.
When a dispute is in play, it's a hot potato that no one wants to hold among the merchant, processor, ISO, sales agent, and bank. The card networks have been smart to eliminate themselves from that step.
Stolen/lost cards, for example, can simply be flagged in a master db/table and rejected quickly.
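The "master table" check described above amounts to a fast flag lookup before any dispute machinery gets involved. A minimal sketch, with purely hypothetical names and an in-memory set standing in for the replicated table:

```python
# Hypothetical sketch of a hot-card check: lost/stolen cards are flagged
# centrally, and authorizations against them are rejected immediately.

hot_cards = set()  # in practice: a replicated DB table, not a local set

def flag_card(pan):
    """Mark a card number as lost/stolen."""
    hot_cards.add(pan)

def authorize(pan, amount_cents):
    """Fast pre-check that runs before any dispute process."""
    if pan in hot_cards:
        return "DECLINED_HOT_CARD"
    return "APPROVED"

flag_card("4111111111111111")
assert authorize("4111111111111111", 500) == "DECLINED_HOT_CARD"
assert authorize("4000056655665556", 500) == "APPROVED"
```

The point is that this rejection path is cheap and unambiguous, unlike the dispute path.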
For example:
* My credit card statement should have links to the merchant, the address, a list of the things I bought, a link to the returns process, etc.
* Why can't my statement also have the total number of calories I've purchased in the last month, or grams of carbon in fuel I've put in the truck?
* Why can't I use my mastercard to pay another mastercard user directly?
* Why hasn't mastercard produced a '2 factor' for card payments rather than forcing every bank to implement their own?
* Why can't I buy a dual Mastercard/Visa/Other card, which works with merchants who are picky and will only accept one or the other?
* Why are we still issuing bits of plastic in the digital age anyway?
* Why don't the cards have a microusb plug on one edge, or NFC to plug into a phone or computer to log in, to act as an identity card, to authenticate or make payments, or anything else other companies issue smartcards for?
* Why doesn't Mastercard work with mobile providers to issue cards that you can spend your pay-as-you-go balance with, turning a mobile provider into a bank?
It seems Mastercard's business is 'stuck'; there are opportunities to innovate all around them, but they won't.
There's a 20-minute gap between investigation and "rollback". Why did they roll back if the service was back to normal? How could they decide on, and document, the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly not all variables were considered.
To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.
Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many which touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes.
In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however, we lacked an efficient way to fully validate this change on the order of minutes. We're investing in tooling to increase the robustness of our rapid-response mechanisms and to help responding engineers understand the potential impact of configuration changes or other remediation efforts they're pushing through an accelerated process.
I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.
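The staged, fleetwide blue/green rollout mentioned above can be sketched roughly as follows. This is purely illustrative, not Stripe's actual tooling; the stage fractions and function names are made up:

```python
# Hedged sketch of a staged rollout: deploy to a growing fraction of the
# fleet, health-check after each stage, and stop widening on bad health.

def staged_rollout(fleet, deploy, healthy, stages=(0.01, 0.10, 0.50, 1.0)):
    """Deploy to growing fractions of `fleet`, aborting if health degrades."""
    done = 0
    for frac in stages:
        target = int(len(fleet) * frac)
        for node in fleet[done:target]:
            deploy(node)          # e.g. switch node to the "green" version
        done = target
        if not all(healthy(n) for n in fleet[:done]):
            return ("rolled_back", done)  # a real system would revert here
    return ("complete", done)

fleet = [f"node-{i}" for i in range(100)]
result = staged_rollout(fleet, deploy=lambda n: None, healthy=lambda n: True)
assert result == ("complete", 100)
```

The value of the staging is that a bad change is caught while it only affects a small slice of the fleet.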
I hope that lessons are learned from this operational event, and that you invest in building metrics and tooling that allow you, first, to prevent issues and, second, to shorten outage/mitigation times in the future.
I'm happy you guys are being open about the issue, and taking feedback from people outside your company. I definitely applaud this.
That seems like a lot of change in a week, or does "deploys" mean something else, like customer websites being deployed?
From the description/comments it also sounds like the database operates directly on files rather than on file leases, since there's no notion of a separate, local (cluster-scoped) byte-level replication layer below it. That makes it harder to shoot a stateful node. It also sounds like it's tricky to externally cross-check various rates, i.e. to monitor replication RPCs and notice that certain nodes are drifting from the expected numbers, without depending on the health of the nodes themselves.
Hopefully the database doesn't also mix geo-replication for local-access/sovereignty requirements into the same mechanisms, rather than separating that out into aggregation layers above purely cluster-scoped zones!
Of course, this is all far far easier said than done given the available open source building blocks. Fun problems while scaling like crazy :)
Bryan Cantrill has a great talk[0] about dealing with fires where he says something to the effect of:
> Now you will find out if you are more operations or development - developers will want to leave things be to gather data and understand, while operations will want to rollback and fix things as quickly as possible
[0] Debugging Under Fire: Keep your Head when Systems have Lost their Mind - Bryan Cantrill: https://www.youtube.com/watch?v=30jNsCVLpAE
Mitigation is your top priority: bringing the system back to a good state.
If follow-up actions are needed, take the less impactful steps to prevent another wave.
If there was a deployment, roll back.
My concern here is that the deployment had been made months ago, and many other changes that could make things worse had been introduced since, which is the case here. Taking an extra 10-20 minutes to make sure everything is fine, versus making a hot call and causing another outage, makes a big difference.
I'm just asking questions based on the documentation provided; I do not have more insights.
I am happy Stripe is being open about the issue; that way the industry learns and matures regarding software-caused outages. Cloudflare's outage documentation is really good as well.
However, it's hard to say whether this is a poor decision unless we know that they didn't analyze the path and determine that it would most likely be fine. If they did do that, then it's just a mistake and those happen. 20 minutes is enough time to make that call for the team that built it.
The full report is here if you're curious: https://www.sec.gov/litigation/admin/2013/34-70694.pdf
If you have a good CM system, you should have a timeline of changes that you can correlate against incidents. Most incidents are caused by changes, so you can narrow down most incidents to a handful of changes.
Then the question is, if you have a handful of changes that you could roll back, and rollbacks are risk free, then does it make sense to delay rolling back any particular change until the root cause is understood?
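The correlation described above is mechanically simple once changes are timestamped. A sketch, with hypothetical change records and a made-up lookback window:

```python
# Sketch of change/incident correlation: given a timeline of change
# records, narrow an incident down to the changes that landed shortly
# before it began. IDs, times, and the 2-hour window are illustrative.

from datetime import datetime, timedelta

changes = [
    {"id": "CM-101", "at": datetime(2019, 7, 10, 16, 5)},
    {"id": "CM-102", "at": datetime(2019, 7, 10, 16, 30)},
    {"id": "CM-103", "at": datetime(2019, 7, 9, 12, 0)},   # day before
]

def suspects(changes, incident_start, window=timedelta(hours=2)):
    """Return IDs of changes deployed within `window` before the incident."""
    return sorted(
        c["id"] for c in changes
        if incident_start - window <= c["at"] <= incident_start
    )

incident = datetime(2019, 7, 10, 16, 36)
assert suspects(changes, incident) == ["CM-101", "CM-102"]
```

Of course, this only surfaces candidates; it says nothing about whether rolling any of them back is actually safe.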
How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.
This was a focus in our after-action review. The nodes responded as healthy to active checks while silently dropping updates on their replication lag; together, this created the impression of a healthy node. The missing bit was verifying the absence of lag updates. (Which we have now.)
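The missing check described here amounts to alerting on the staleness of a metric, not just its value. A minimal sketch of that idea, assuming a per-node "last lag report" timestamp (in Prometheus terms, this is roughly what `absent()` or staleness handling covers):

```python
# Sketch: alert not only on HIGH replication lag, but on the ABSENCE of
# fresh lag reports from a node. The 60s silence threshold is made up.

import time

def lag_report_is_stale(last_report_ts, now, max_silence_s=60.0):
    """True if a node has stopped reporting replication lag entirely."""
    return (now - last_report_ts) > max_silence_s

now = time.time()
assert not lag_report_is_stale(now - 10, now)   # fresh report: fine
assert lag_report_is_stale(now - 300, now)      # silent node: alert
```

A node that passes active health checks but trips this staleness check is exactly the "looks healthy, isn't" case described above.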
"[Three months prior to the incident] We upgraded our databases to a new minor version that introduced a subtle, undetected fault in the database’s failover system."
Could this have been prevented if you had stopped upgrading minor versions, i.e. frozen on one specific version and not even applied security fixes, instead relying on containing it as a "known" vulnerable database?
The reason I ask is that I've heard of ATMs still running Windows XP or the like. But if it's not networked, could it be that it actually has a bigger uptime than anything you can do on Windows 7 or 10?
What I mean is, even though it's hilariously out of date to be using Windows XP, still, by any measure it's had a billion device-days to expose its failure modes.
When you upgrade to the latest minor version of a database, don't you sacrifice the known bad for an unknown good?
excuse my ignorance on this subject.
I kinda lost count of how many times Nagios barfed itself and reported an error while the application was fine.
Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes.
Having a few nodes down is perfectly acceptable. I guess they would have had an alert if the number of down nodes exceeded some threshold.
The article said that the node stalled in a way that was unforeseen, which may have caused standard recovery mechanisms to silently fail.
It would be excellent if Stripe published a truly technical RCA, perhaps for distribution via their tech blog, so that folks like us could get a more complete understanding and what-not-to-do lesson (if the failing systems were based on non-proprietary technologies, that is).
What's fun for a software person is that there's a lot of interesting digressions and stuff to learn in the Cloudflare one. The whole explanation of the regexp at the end is something that no one cares about from the business side, but is an interesting read in and of itself.
It's worth noting that yours came out a bit more than a week faster than theirs, which jgrahamc clearly spent a lot of time writing. No idea if anyone cares about the speed with which these things are released...
The first question is who this is written for: it lacks the detail I would write for an incident-review-meeting audience, while also lacking a simpler story for non-technical readers. As it stood at the time I read it, I don't think it serves any audience very well.
I understand that the level of detail of the internal report might be excessive for a public document, but if technical readers are the target, some more details would have helped. For example, the monitoring details that Will described in another thread are a key missing piece that, if anything, would make Stripe look better, as problems like that happen all the time. I bet there are more details that are equally useful, that would be in an internal report, and that would not reveal sensitive information. In general, the only reason I could follow the document well is that I remember how the Stripe storage system worked last year, and I could handwave a year's worth of changes. Since this part of the Stripe infrastructure is relatively unique, it's difficult to understand from the outside, and the document looks as if it doesn't have enough information.
In particular, the remediations say very little that is understandable from the outside: most of the text could apply to pretty much any incident on a storage or queuing subsystem I was ever a part of: more alerts, an extra chart in an ever-growing dashboard, some circuit breakers to deal with the specific failure shape... It's all real, but without details, it says very little.
I understand why you might not want to divulge that level of detail, though. If we want fewer details, then the article could cut all kinds of low-information sections and instead focus more on the response and on the things that will change in the future. The most interesting bit is the quick version rollback, which, in retrospect, might not have been the right call. A more detailed view of the alternatives, and of why the actions that ultimately led to the second incident were taken, would be enlightening and would humanize the piece.
Thank you for not just providing a public root cause analysis, but also coming here to discuss it on HN.
If anyone on HN knows anyone who has the sort of interesting life story where they both know what can cause a cluster election to fail and like writing about that sort of thing, we would eagerly like to make their acquaintance.
The remediation part is quite cautious/generic but overall it seems like a good faith effort by someone constrained by corporate rules.
Also, does your company's engineering decisions change based on other companies' post-mortems?
That's a reasonable question. We wrote this RCA to help our users understand what had happened and to help inform their own response efforts. Because a large absolute number of requests with stateful consequences (including e.g. moving money IRL) succeeded during the event, we wanted to avoid customers believing that retrying all requests would be necessarily safe. For example, users (if they don’t use idempotency keys in our API) who simply decided to re-charge all orders in their database during the event might inadvertently double charge some of their customers. We hear you on the transparency point, though, and will likely describe events of similar magnitude as an "outage" in the future - thank you for the feedback.
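The double-charge risk described here is exactly what idempotency keys prevent: a retried request with the same key must not create a second charge. A sketch of the mechanism, with the server side simulated by a dict so the behavior is visible without a network call (names are illustrative, not Stripe's implementation):

```python
# Sketch of idempotent charge creation: the same key, replayed after a
# timeout or during an outage, returns the original result instead of
# charging the customer again.

import uuid

charges = []   # "charges" actually created
seen = {}      # idempotency key -> prior result

def create_charge(amount_cents, idempotency_key):
    if idempotency_key in seen:            # replayed request: no new charge
        return seen[idempotency_key]
    charge = {"id": f"ch_{len(charges)}", "amount": amount_cents}
    charges.append(charge)
    seen[idempotency_key] = charge
    return charge

key = str(uuid.uuid4())
first = create_charge(2000, key)
retry = create_charge(2000, key)           # client retried after a timeout
assert retry is first and len(charges) == 1   # charged exactly once
```

The critical client-side discipline is reusing the same key across retries of one logical operation; generating a fresh key per attempt defeats the whole mechanism.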
I don't understand why people demand the usage of incorrect language.
On the other hand, if for a significant number of users the site was completely unusable for some period of time, then I think it's fair to use the word "outage". (Even if it's not a complete outage affecting all users.)
I don't know whether other people would interpret these terms the same way I do, nor do I think there's enough information in this blog post to determine for sure which label is more accurate for this particular incident. So personally, I'm not going to be too picky about the wording.
The fact that you needed to qualify “outage” with “complete” clearly means the word on its own is not incorrect for cases where a system was “only” mostly unavailable rather than completely so.
> I don't understand why people demand the usage of incorrect language.
The irony.
> Stripe splits data by kind into different database clusters and by quantity into different shards.
So in theory any request that didn't interact with the problematic database should have been OK (I don't know if the offending DB was in the critical path of _every_ request).
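The "split by kind, then by quantity" layout quoted above implies requests route to a (cluster, shard) pair, so a stalled shard should only affect the keys that hash to it. An illustrative sketch; the kinds, shard counts, and hash choice are all made up:

```python
# Sketch of kind+hash routing: different data kinds live in different
# clusters, and within a cluster a key hashes to one of N shards.

import zlib

CLUSTERS = {"charges": 16, "customers": 8}   # kind -> shard count (made up)

def route(kind, key):
    """Return the (cluster, shard) pair responsible for this key."""
    shard = zlib.crc32(key.encode()) % CLUSTERS[kind]
    return (kind, shard)

cluster, shard = route("charges", "ch_12345")
assert cluster == "charges" and 0 <= shard < 16
# A stalled shard only hurts the fraction of keys that hash to it,
# unless that shard sits on the critical path of every request.
```

Which is why the parenthetical matters: if the offending DB was consulted on every request (e.g. for auth or rate limiting), the blast radius stops being one shard's worth of keys.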
Damn, what a mess. Sounds like y'all are rolling out way too many changes too quickly, with little to no time for integration testing.
It's a somewhat amateur move to assume you can just arbitrarily roll back without consequence, without testing, etc.
One solution I don't see mentioned: never upgrade to new minor versions. And create a dependency matrix so that if you do roll back, you also roll back all the other things that depend on the thing you're rolling back.
Doing a large rollback based on a hunch seems like an overreaction.
It's totally normal for engineers to make these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures, and automation are in place to reduce operator errors.
Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group (2-3) of experienced engineers sharing information in real-time and coordinating the response could have reacted better.
Of course, I wasn't there so I could be completely off.
idk, the suits have a very different viewpoint; 30 minutes of downtime for a large financial system isn't fine. It can be very costly.