Mitigation is your top-priority. Bringing the system back to a good shape.
If there needs to be follow-up actions, take the less-impactful steps to prevent another wave.
If there was a deployment, roll-back.
My concern here is, a deployment have been made months ago, and many other changes that could make things worse were introduced. This is the case. The difference between taking an extra 10-20 minutes to make sure everything is fine, versus taking a hot call and causing another outage makes a big difference.
I'm just asking questions based on the documentation provided; I do not have more insights.
I am happy Stripe is being open about the issue, that way many the industry learns and matures regarding software-caused outages. Cloudflare's outage documentation is really good as well.