Gandi goes into some detail on the recovery process and on ways to fix the issue in the future. But, apart from some hand-waving, they don't have any specifics about how they'll communicate expectations better with their customers in the future.
Imagine the counterfactual: Gandi's docs clearly communicate "this service has no backups, you can take a snapshot through this api, you're on your own." Of course customers with data loss would've complained, but, at the end of the day, the message from both Gandi and the community would've been "well, next time buy a service with backups?" Yet there's no explicit plan to improve documentation.
I have a bad feeling someone is going to read their write up and tweet at them, “Why didn’t you use -xyz switch, it fixes exactly this issue in 12 seconds”.
Indeed it appears that the option they needed existed, but only in a later version of ZFS than they were running, and part of the fix was moving the broken array to a system that could run a newer version of ZFS, which apparently was itself not trivial.
I have not read this post-mortem yet, but I can attest that this is a viable strategy.
As many know, rsync.net is built entirely on ZFS.
While we have never come close to a blown array (we use extremely conservatively configured raidz3 vdevs), what we have seen are weird corner cases where a 'zfs destroy', or even a common 'rm' deletion of hundreds of millions of files, will either take forever (years) or halt the (FreeBSD) system.
In one of these cases, after several days of degraded performance and intermittent outages, we did an alternate boot to a newer FreeBSD version with a newer, production, release version of ZFS, and the operation completed in a timely and graceful manner.
---
What we continue to learn, decade after decade, from UFS2 through to ZFS, is that extremely simple infrastructure configuration is resilient and fails in predictable and boring ways.
We could gain so much "efficiency" and save a lot of money if we did common sense things like bridge zpools across multiple JBODs or run larger vdevs, etc. - but then we'd find ourselves with fascinating failures instead of boring ones.
Issues occur from time to time, and I can assure you that those times are very stressful. I am grateful to rely on ZFS, because so far I have never lost anyone's data (datasets are often around 10 TiB).
An old high-rise is filled with tens of thousands of old, second-hand server blades, floor after floor of equipment prolifically producing waste heat. A sure recipe for disaster?
Sure!
An incorrectly installed fuse on one phase in the building caused that phase to burn out prematurely. I saw a picture of the breaker equipment; it looked archaeological. They fixed that.
However, the missing phase destroyed the compressor motors of their cooling systems. The temperature crept higher and higher, and they had to turn off whole floors of servers. When they believed they had fixed the problem, they turned servers back on row by row. Renters then frantically tried to copy whatever they had on the servers, which overtaxed the half-repaired cooling system, and they had to turn servers off again.
Edit: made some details more specific.
Drive failures and HVAC failures due to bad power are not "black swan" events. These are very common problems for DCs, and a good design takes them into account.
However, a "bad" design is cheap, and hopefully the savings are passed on to the customer.
You can't really fault them for the ZFS version being so old that the feature they needed wasn't yet implemented, because the machine was literally part of the last batch to be upgraded. The root cause is just a random hardware failure that can't be anticipated.
Just bad luck. Beyond radically changing how their core infrastructure works, doesn't seem like there was a lot they could have done to prevent this. Kudos for releasing the post mortem though, at least they've been fairly honest and direct about it.
It's not like the backups have to be customer-facing; use them to increase availability and decrease MTTR. In this situation, even with only a daily snapshot, they could have had customers up and running with yesterday's data while they took their time recovering the old system, rather than moving boxes around and bypassing safeties for speed. How much did five days of panic cost them? Their customers? Their brand?
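To make the point concrete, here is a minimal sketch of the kind of internal, non-customer-facing safety net I mean. The dataset `tank/customers`, the host `backup1`, and the snapshot naming are all invented for illustration, not Gandi's actual setup:

```shell
#!/bin/sh
# Sketch of a nightly internal backup job (hypothetical names throughout).
today="nightly-$(date +%F)"

# Snapshots are nearly free thanks to copy-on-write, so take one every night.
zfs snapshot "tank/customers@${today}"

# Replicate it to an independent machine, so recovery does not depend on
# the original pool surviving. The first run is a full send; after that,
# 'zfs send -i <previous-snapshot> <today's-snapshot>' ships only the delta.
zfs send "tank/customers@${today}" | ssh backup1 zfs receive -F tank/customers
```

With something like this in place, "yesterday's data on fresh hardware" is a `zfs clone` or `zfs rollback` away on the backup box, instead of a five-day recovery on the broken one.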
I feel like they read something about how S3 has at least three copies of everything, and then did that locally with ZFS, instead of accounting for all the other failures that can happen that the S3 design accounts for.
You are right, there isn't a whole lot that could have been done without radically changing their infrastructure, but they're clearly at the scale and have the hardware available to make better choices than they have.
Intangibles. Whenever you go to talk to management or even co-workers about this stuff, they look at you like you are crazy. I think it is just human nature to not even think that something could go wrong, let alone make decisions based on this.
Designing a really robust system to failures like this is a very difficult problem. You can see this in the complexity of systems like S3 and Google's Colossus[1]. Colossus in particular is probably one of Google's single greatest competitive advantages, especially considering none of it is open sourced[2].
Comparing these guys to AWS/S3 is perhaps not entirely fair given the assumption that they have very different levels of resources. For a medium size shop and the constraints they've defined, I think this is a fair outcome of the situation. I agree though in that it could have been mitigated by making the decision to actually store backups.
[1]https://www.wired.com/2012/07/google-colossus/
[2]https://cloud.google.com/files/storage_architecture_and_chal...
If only there were a cloud storage provider that you could 'zfs send', over SSH, to ...
If only ...
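For anyone who missed the reference: rsync.net accepts exactly this, and the mechanism really is a one-liner. Dataset, snapshot, and account names below are placeholders:

```shell
# Offsite ZFS replication over plain SSH (all names are placeholders).
zfs snapshot tank/data@offsite-2020-01-15
zfs send tank/data@offsite-2020-01-15 | ssh user@host.example zfs receive tank/data
```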
From all of the cases I've read where people were not idiots (not using snapshots and overwriting a dataset...), it's by far the safest filesystem I've seen during my 12 years working with it, and I've yet to lose a single file.
Sure, performance can suffer and RAM is pricey, but safety of the data is more important.
Considering this is a hardware fault, I think Gandi.net did their best. However, they should offer clients optional ZFS replication as an extra measure.
The take-away here is clear: don't trust Gandi with anything you care about.
I don't know if I expect a postmortem to say "sorry", and I think you are being needlessly harsh. But I agree this level of service doesn't seem up to current best in class, like Amazon etc. (which of course still have unexpected outages very occasionally, although a 5-day time to recovery would certainly be... unusual).
But this partially shows how much expectations/standards have risen in the past few to ten years. When an unacceptable, not-up-to-par level of reliability still involves no data loss, we're doing pretty well. And I think "don't trust Gandi with anything you care about" is probably an exaggerated response. But yes, they don't seem to be providing mega-cloud-service-provider levels of service.
See this thread for the support at the time:
https://twitter.com/andreaganduglia/status/12152827193300664...
What about any data that would have accumulated in those 5 days? This was storage for their IAAS and PAAS products, so anyone using those lost access for 5 days?
> The take-away here is clear: don't trust Gandi with anything you care about.
The take-away is not that one. It's: back up anything you care about.
> "Snapshots allow you to create a backup copy of a volume"
https://pbs.twimg.com/media/EN2UZ6TX4AAMe-H?format=png&name=...
They are doing a lot of preaching about backups when failing to do internal backups (not customer facing backups) of their own products.
They also failed to address their abysmal responses on Twitter that essentially belittled and poked fun at the affected users.
>We’re very sorry for this truly unfortunate incident and we offer our sincere apologies to anyone impacted.
https://news.gandi.net/en/2020/01/major-incident-on-our-host... (linked in the Postmortem)
If you have a single point of failure for data and "snapshots" then you should explain that very clearly to customers. Moreover, as I understand it, competitors like AWS do not have such a single point of failure (ie: EBS Snapshots are on S3 and not EBS) so using the same terminology/workflow is going to cause confusion.
Case in point.
Am I reading this right? This works out to just ~3.8TiB.
So much drama over, basically, one HDD worth of data?
They probably thought their no-BS morally-right stance was supposed to comfort me, and it's not like I host anything that would likely meet that criteria. But who is to say a blog post has cursing in it that they decide, at their sole discretion, to be bad? Or speak up on abortion in a way they don't agree with? Or any of the other morally-charged topics out there? I'm not hosting with the thought police and it's always made me wonder how others felt comfortable with them.
Are they using ECC RAM?
This also means that other data would likely have been corrupted too; running ZFS without ECC RAM is frequently warned against.
If you want to learn from such an outage, you have to do a fault analysis that leads to parameters you can control.
Sure, there can be faulty hardware and software, but you are the ones selecting and running and monitoring them.
If recovery takes ages, you might want to practice recovery and improve your tooling.
And so on.
Blaming ZFS and faulty hardware and old software all cries "we didn't do anything wrong", so no improvements in sight.