undefined | Better HN

0 pointsSilhouette6y ago0 comments

In that case you would probably still roll back to prevent further data corruption and restore the corrupted records from backups.

OK, but then what if it's new data being stored in real time, so there isn't any previous backup with the data in the intended form? In this case, we're talking about Stripe, which is presumably processing a high volume of financial transactions even in just a few minutes. Obviously there is no good option if your choice is between preventing some or all of your new transactions or losing data about some of your previous transactions, but it doesn't seem unreasonable to do at least some cursory checking about whether you're about to cause the latter effect before you roll back.

0 comments

londons_explore6y ago

I think you guys are considering this from the wrong angle...

Rollbacks should always be safe. They should always be automatically tested. So a software release should do a gradual rollout (ie. 1, 10, 100, 1000 servers), but it should also restart a few servers with the old software version just to check a rollback still works.

The rollout should fail if health checks (including checking business metrics like conversion rates) on the new release or old release fails.

If only the new release fails, a rollback should be initiated automatically.

If only the old release fails, the system is in a fragile but still working state for a human to decide what to do.

SilhouetteOP6y ago

This is one of those ideas that looks simple enough until you actually have to do it, and then you realise all the problems with it.

For example, in order to avoid any possibility of data loss at all using such a system, you need to continue running all of your transactions through the previous version of your system as well as the new version until you're happy that the performance of the new version is satisfactory. In the event of any divergence you probably need to keep the output of the previous version but also report the anomaly to whoever should investigate it.

But then if you're monitoring your production system, how do you make that decision about the performance of the new version being acceptable? If you're looking at metrics like conversion rates, you're going to need a certain amount of time to get a statistically significant result if anything has broken. Depending on your system and what constitutes a conversion, that might take seconds or it might take days. And you can only make a single change, which can therefore be rolled back to exactly the previous version without any confounding factors, during that whole time.

And even if you provide a doubled-up set of resources to run new versions in parallel and you insist on only rolling out a single change to your entire system during a period of time that might last for days in case extended use demonstrates a problem that should trigger an automatic rollback, you're still only protecting yourself against problems that would show up in whatever metric(s) you chose to monitor. The real horror stories are very often the result of failure modes that no-one anticipated or tried to guard against.

j / k navigate · click thread line to collapse