Postmortem of the failure of one hosting storage unit on Jan. 8, 2020 (opens in new tab)

(news.gandi.net)

120 pointsnachtigall6y ago77 comments

77 comments

It's interesting -- they ID the actual cause of the problem up top, and then just zip right past it. The problem wasn't the hardware failure, or the lack of backups, it was that customers expected them to have backups.

Gandi goes into some detail on the recovery process and on ways to fix the issue in the future. But, apart from some hand-waving, they don't have any specifics about how they'll communicate expectations better with their customers in the future.

Imagine the counterfactual: Gandi's docs clearly communicate "this service has no backups, you can take a snapshot through this api, you're on your own." Of course customers with data loss would've complained, but, at the end of the day, the message from both Gandi and the community would've been "well, next time buy a service with backups?" Yet there's no explicit plan to improve documentation.

speeder6y ago

People posted an excerpt of their manual, it explicitly called "snapshots" a "backup" and customers were surprised when they asked them to restore the snapshots and only now they explained the snapshots were not backups...

kaizendad6y ago

Absolutely! And yet no explicit planning for how they'd reword the manual in the future. That is the root cause of the real problem here -- not that there weren't backups, but that there weren't and the manual said that there were, and there's no plan to fix that!

zaroth6y ago

I’m no ZFS expert, but it must have been incredibly stressful, if not mildly terrifying, going that far down the rabbit hole with customer data on the line.

I have a bad feeling someone is going to read their write up and tweet at them, “Why didn’t you use -xyz switch, it fixes exactly this issue in 12 seconds”.

Indeed it appears that the option they needed existed, but only in a later version of ZFS than they were running, and part of the fix was moving the broken array to a system that could run a newer version of ZFS, which apparently was itself not trivial.

rsync6y ago

"Indeed it appears that the option they needed existed, but only in a later version of ZFS than they were running, and part of the fix was moving the broken array to a system that could run a newer version of ZFS, which apparently was itself not trivial."

I have not read this post-mortem yet, but I can attest that this is a viable strategy.

As many know, rsync.net is built entirely on ZFS.

While we have never come close to a blown array (we use extremely conservatively configured raidz3 vdevs) what we have seen are weird corner cases where suddenly a 'zfs destroy' or even a common 'rm' deletion of hundreds of millions of files will either take forever (years) or will halt the (FreeBSD) system.

In one of these cases, after several days of degraded performance and intermittent outages, we did an alternate boot to a newer FreeBSD version with a newer, production, release version of ZFS, and the operation completed in a timely and graceful manner.

---

What we continue to learn, decade after decade, from UFS2 through to ZFS, is that extremely simple infrastructure configuration is resilient and fails in predictable and boring ways.

We could gain so much "efficiency" and save a lot of money if we did common sense things like bridge zpools across multiple JBODs or run larger vdevs, etc. - but then we'd find ourselves with fascinating failures instead of boring ones.

piti_6y ago

I don't have hundreds customers, but I handle hundreds of TiB of data (for a science lab).

Issues occur from time to time, and I can assure that these times are very stressful. I am grateful to rely on ZFS, because yet I have never lost any data from people (datasets are often around 10TiB).

toomuchtodo6y ago

No offsite backups? Backblaze B2 is exceedingly cheap for example.

doublerabbit6y ago

The animation studio I worked for had almost a petabyte of data. It may be cheap to buy the storage but transferring is costly. It's very easy to saturate a MPLS circuit with data, even rSync on a 10Gbit internal connection takes a long while.

Really Gandi should of had backups from day one. If your hosting data you should always have backups ready and tested on day one.

3 more replies

mattkrause6y ago

My lab generates similar-sized data sets and the transfer, more than the at-rest storage, is tough.

If our internet (or Box's datacenter) were slow, we could easily collect data faster than we could send it to our collaborators.

piti_6y ago

We do have backups, but getting back your hundred's TiB is really long: you want to keep it where they are living.

_nalply6y ago

Just bad luck. A different story: I had a cheap dedicated host in Atlanta. Their failure was epic. You get what you pay for.

An old highrise is filled with ten thousands of old second hand server blades, floors by floors of equpiment prolifically producing waste heat. A sure recipe for disaster?

Sure!

A wrongly installed fuse at one phase in the building made one phase burn out too early. I saw a picture of an archeological breaker equipment. They fixed that.

However the missing phase destroyed the compressor motors of their cooling systems. Temperature crept up higher and higher. They had to turn off whole floors of servers. When they believed they fixed the problem, they turned on row by row. Renters then frantically tried to copy what they had on the servers and the half repaired cooling system was overtaxed and they had to turn off servers again.

Edit: made some details more specific.

robbyt6y ago

Not bad luck, bad design. Same for Gandi, and for the situation you just described.

Drive failures and HVAC failures due to bad power are not "black swawn" events. These are very common problems for DCs, and a good design takes these problems into account.

However, a "bad" design is cheap, and hopefully the savings is passed to the customer.

_nalply6y ago

Hopefully... My dedicated hosting was very cheap, I paid $20 a month.

halfeatenpie6y ago

It's Delimiter, wasn't it? They weren't a very good hosting organization to begin with, but you get what you pay for.

1 more reply

ZeroCool2u6y ago

This is basically the stuff of nightmares.

You can't really fault them for the zfs version being so old the feature they needed wasn't yet implemented, because the machine was literally part of the last batch to be upgraded. The root cause is just some random hardware failure that can't be anticipated.

Just bad luck. Beyond radically changing how their core infrastructure works, doesn't seem like there was a lot they could have done to prevent this. Kudos for releasing the post mortem though, at least they've been fairly honest and direct about it.

escardin6y ago

There was one, obvious thing they could have done. Backups. ZFS even makes it easy. They state they have triple redundancy on their servers. Pull 1/2 of the drives and have backups. ZFS supports streaming snapshots (maybe not on this particularly old system). It sounds like they have multiple ZFS servers per datacenter, so given 2 servers, they could use 50% of the storage on each nodes as a backup of the other node.

It's not like the backups have to be customer available, use them to increase availability and decrease MTTR. In this situation, even with a daily snapshot they could have had customers up and running with yesterday's data while they took their time recovering the old system and not moving boxes around and bypassing safeties for speed. How much did five days of panic cost them? Their customers? Their brand?

I feel like they read something about how S3 has at least three copies of everything, and then did that locally with ZFS, instead of accounting for all the other failures that can happen that the S3 design accounts for.

You are right, there isn't a whole lot that could have been done without radically changing their infrastructure, but they're clearly at the scale and have the hardware available to make better choices than they have.

generalpass6y ago

> How much did five days of panic cost them? Their customers? Their brand?

Intangibles. Whenever you go to talk to management or even co-workers about this stuff, they look at you like you are crazy. I think it is just human nature to not even think that something could go wrong, let alone make decisions based on this.

RantyDave6y ago

Oh, I think they recognise the damage to their brand.

1 more reply

ZeroCool2u6y ago

They do mention in the postmortem that they explicitly do not provide backups, and say so on their product page, but perhaps that it could be better communicated to customers.

Designing a really robust system to failures like this is a very difficult problem. You can see this in the complexity of systems like S3 and Google's Colossus[1]. Colossus in particular is probably one of Google's single greatest competitive advantages, especially considering none of it is open sourced[2].

Comparing these guys to AWS/S3 is perhaps not entirely fair given the assumption that they have very different levels of resources. For a medium size shop and the constraints they've defined, I think this is a fair outcome of the situation. I agree though in that it could have been mitigated by making the decision to actually store backups.

[1]https://www.wired.com/2012/07/google-colossus/

[2]https://cloud.google.com/files/storage_architecture_and_chal...

escardin6y ago

Do not provide backups and don't have backups aren't the same thing. If you have triply redundant local disks, you can probably afford take half of them and use them as backups for other systems and achieve better availability results (I'm assuming it's not triply redundant for performance).

While I did say S3, what I was really thinking about was Ceph. I don't think it's a silver bullet (almost certainly way more maintenance than a bunch of ZFS nodes), but if you're big enough to have multiple storage nodes with 100's of customers each (and again, triply redundant disks), then you could have built around the eventual failure of a node with what you already have. I'm not expecting them to hit S3's 11 9 availability, just taking a glance at what they have said about their design and proposing that basic changes to how they allocate what they already have would have avoided their problem in the first place.

I don't know what their exact situation looks like, or how they got into this situation. I see a post-mortem that says they spent 5 days trying desperately to recover customer data because they don't have backups, and they're not going to change anything about how they do things to eliminate the problem, even though it appears they have the raw storage capacity to have a backup. A sister comment says that brand damage, customer costs and recovery costs are just hypotheticals. They were, right up until this incident. Hopefully their internal postmortem has more details about what the costs were.

Clearly if they're trying to recover the customer data, it was important enough to the business to do so, and maybe it's time to re-evaluate 'no backups'.

rsync6y ago

"Beyond radically changing how their core infrastructure works, doesn't seem like there was a lot they could have done to prevent this."

If only there were a cloud storage provider that you could 'zfs send', over SSH, to ...

If only ...

tmikaeld6y ago

This actually makes me glad to use ZFS (FreeBSD and ZOL) on all servers, a broken RAID on a different filesystem could have meant complete data-loss.

From all of the cases I've read where people where not idiots (Not using snapshots and overwriting a dataset..), it's by far the safest filesystem I've seen during my 12 years working with it and I've yet to loose a single file.

Sure, performance can suffer and RAM is pricey, but safety of the data is more important.

Considering this is a hardware fault, I think Gandi.net did their best. However, they should offer clients optional ZFS-Replication as an extra measure.

RantyDave6y ago

But would this replicate the broken metadata? And exactly how do they think it got there in the first place? TBH this sounds like a problem with an ancient version of Solaris that's been enhanced by a relatively small company and it's finally just bitten them.

gizmo6y ago

They don't say they're sorry, because they're not. Instead they minimize their actions by: 1) stating how few customers customers were affected, 2) how it's not really their fault because it was a hardware error, 3) it's not really their fault because they had already planned to upgrade the server, 4) it's not really their fault the restore procedure took so long because they had to make backups first, 5) the restore took so long because spinning disks are slow, and they really had no way to know this in advance. And to top it all off they point out they're not contractually obligated to provide working snapshots at all, so really it's the customers who are at fault here.

The take-away here is clear: don't trust Gandi with anything you care about.

jrochkind16y ago

No data was lost though, true?

I don't know if I expect a postmortem to say "sorry", and I think you are being needlessly harsh. But I agree this level of service doesn't seem up to current best in class. Like Amazon etc. (Which of course still have unexpeted outages very occasionally, although a 5 day time to recovery would certainly be... unusual).

But this partially shows how much expectations/standards have raised in the past few-10 years. When an unacceptable not up to par level of reliably still involves no data loss, we're doing pretty good. And I think "don't trust Gandi with anything you care about" is probably an exagerated response. But yes, they don't seem to be providing mega-cloud-service-provider level of service.

FemmeAndroid6y ago

Honestly, I too was looking for a straightforward 'We screwed up, sorry.' I wouldn't care nearly as much if they'd just had 5 days without snapshots. But the way they poorly handled support deserves to be addressed in a postmortem.

See this thread for the support at the time:

https://twitter.com/andreaganduglia/status/12152827193300664...

Mashimo6y ago

> Honestly, I too was looking for a straightforward 'We screwed up, sorry.'

They do so a bit here: https://news.gandi.net/en/2020/01/major-incident-on-our-host...

>We’re very sorry for this truly unfortunate incident and we offer our sincere apologies to anyone impacted.

jrochkind16y ago

Fair enough. That probably is bad marketing if nothing else, and maybe something else. what you're saying about the major failure being in support/customer-management, even more than the technical issue, seems potentially reasonable. (I am not a Gandi customer, so it's not personal for me).

I still think the fact that there was no data loss, and we're still on the edge of calling it unacceptable incompetence, is worth noting, as to how far our expectations and standards have come. Which is good of course.

1 more reply

cptskippy6y ago

> No data was lost though, true?

What about any data that would have accumulated in those 5 days? This was storage for their IAAS and PAAS products, so anyone using those lost access for 5 days?

bpizzi6y ago

Well, it's a technical post-portem, not a love letter. I'm not affiliated to Gandi in any way, but I find the finger-pointing a bit too pedantic.

> The take-away here is clear: don't trust Gandi with anything you care about.

The take-away is not this one. Its: backup anything you care about.

bradly6y ago

They called snapshots backups in their web interface when viewing snapshots. From their docs:

> "Snapshots allow you to create a backup copy of a volume"

https://pbs.twimg.com/media/EN2UZ6TX4AAMe-H?format=png&name=...

They are doing a lot of preaching about backups when failing to do internal backups (not customer facing backups) of their own products.

bpizzi6y ago

As it's said in the postmortem, they are agreeing that they should have stated in a more obvious way that the backups availability was not contractually assured.

1 more reply

hyper_reality6y ago

I for one am glad they released a factual account and timeline of what went wrong. I don't see it as an attempt to minimize their actions. They even admit that they have no clear explanation of the original issue, when they could easily have committed to a stronger theory to make themselves look more competent. Overall I'd much rather read this than a massaged PR apology that keeps us in the dark of what actually happened.

cdubzzz6y ago

This is the messaged PR “postmortem”. It’s basically a shoulder shrug emoji and takes zero responsibility for the incident.

They also failed to address their abysmal responses on Twitter that essentially belittled and poked fun at the affected users.

E.g. https://news.ycombinator.com/item?id=22002258

Mashimo6y ago

>They don't say they're sorry,

>We’re very sorry for this truly unfortunate incident and we offer our sincere apologies to anyone impacted.

https://news.gandi.net/en/2020/01/major-incident-on-our-host... (linked in the Postmortem)

marcinzm6y ago

>But contractually, we don’t provide a backup product for customers. That may have not been explained clearly enough in our V5 documentation.

If you have a single point of failure for data and "snapshots" then you should explain that very clearly to customers. Moreover, as I understand it, competitors like AWS do not have such a single point of failure (ie: EBS Snapshots are on S3 and not EBS) so using the same terminology/workflow is going to cause confusion.

AaronFriel6y ago

Are S3 and EBS not hosted on the same underlying storage subsystems?

generalpass6y ago

I realize everyone here seems focused on the file system, but one of the things stressed by the OpenBSD project is that difficulty in upgrades are the root cause of the biggest problems.

Case in point.

vahomu6y ago

> As disks are read at 3M/s, we estimate the duration of the operation to be up to 370 hours.

Am I reading this right? This works out to just ~3.8TiB.

So much drama over, basically, one HDD worth of data?

robin_reala6y ago

They said that in the end only 414 users were affected, and it was their simple hosting package. Honestly, I’m almost surprised it was that much.

the84726y ago

If they had some spare SSD capacity lying around they could have done a linear copy of the HDD to SSD and then done the import, that could have sped up the random-access scans.

phonon6y ago

Absolutely pathetic on their end.

loa_in_6y ago

Their services otherwise run flawlessly for me. I appreciate the transparency.

BrentOzar6y ago

And now their postmortem blog post is down with a 503 error. Doesn’t exactly fill me with confidence about their abilities.

polyvisual6y ago

Their handling of the issue on Twitter was enough for to decide to move my domain names away from them when their renewal is due.

slig6y ago

Just a heads up, and forgive me if this is obvious, but you can move right away, the expire date will be the same anyway. You certainly don't want a failed migration too close to the expiration date.

polyvisual6y ago

yes, good point!

fooqux6y ago

The last I looked at them for either hosting or domain, they had a provision in their TOS that basically said they could terminate my account at any time if I did anything they felt was morally wrong. I emailed them and asked if what I was reading was true and they confirmed it. I never looked at them again.

They probably thought their no-BS morally-right stance was supposed to comfort me, and it's not like I host anything that would likely meet that criteria. But who is to say a blog post has cursing in it that they decide, at their sole discretion, to be bad? Or speak up on abortion in a way they don't agree with? Or any of the other morally-charged topics out there? I'm not hosting with the thought police and it's always made me wonder how others felt comfortable with them.

Mashimo6y ago

> they had a provision in their TOS that basically said they could terminate my account at any time if I did anything they felt was morally wrong.

Don't most modern service providers have a clause like that?

Mashimo6y ago

Any recommendations? Preferred within the EU.

acd6y ago

Once had a data loss on a test ZFS system with iSCSI on top. What I learnt from that is that you need to schedule scrubs your ZFS pools regularly. Its always easy to be wise afterwards but harder to predict before. Not sure if that would have helped.

iicc6y ago

> We think it may be due to a hardware problem linked to the server RAM.

Are they using ECC RAM?

tmikaeld6y ago

Sounds like they didn't and the metadata logs got corrupted..

This does also mean that other data would be corrupted too, running ZFS without ECC RAM is frequently warned against.

vel0city6y ago

Running any resilient storage system without ECC RAM is warned against, people just really make a big deal about it with ZFS. If your data in RAM is corrupted before it makes it to the hard drive, pretty much any file system is going to write corrupted data to the drive.

tmikaeld6y ago

Indeed, ECC should be _default_ these days!

It becomes even worse when RAM Is usually only tested when the computer is built.

I've had several cases where RAM becomes faulty a couple of years down the road.

Recently I had a very weird case of two stick of four went bad due to only moving the computer from one corner to the next without even opening the case.

cdubzzz6y ago

Previously: https://news.ycombinator.com/item?id=22001822

perlgeek6y ago

As a postmortem, this does not inspire confidence. It's a very technical piece, but doesn't even try to take a customer's perspective.

If you want to learn from such an outage, you have to do a fault analysis that leads to parameters you can can control.

Sure, there can be faulty hardware and software, but you are the ones selecting and running and monitoring them.

If recovery takes ages, you might want to practice recovery and improve your tooling.

And so on.

Blaming ZFS and faulty hardware and old software all cries "we didn't do anything wrong", so no improvements in sight.

_eht6y ago

Being a Gandi customer must be terrifying, generally.

speedgoose6y ago

They used to have a great reputation in France about 20 to 15 years ago. It went downhill since, the company got sold and they started to sell expensive and slow cloud services, a bit like Amazon with less success.

gdm856y ago

Would it be possible to backup the precious metadata separately to mitigate the issue?

tomatocracy6y ago

Sounds like data was on one pool with a 3 disk mirror setup and nowhere else. This is 'RAID is not a backup' territory. Much better would have been to duplicate the volumes themselves somewhere else (eg using zfs send/receive to a different host) and ideally the contents of those volumes too.

RantyDave6y ago

I don't think it's a problem of the data corrupting while on disk, I think the problem has occurred in ram and then been written to disk.

dmh20006y ago

After a glance I thought, 'why a storage unit?, where do they get power, how do they cool it, its not physically secure, etc'. Then, oh, that kind of storage unit. Yes, I'm dumb.

j / k navigate · click thread line to collapse

77 comments

kaizendad6y ago

speeder6y ago

kaizendad6y ago

zaroth6y ago

I’m no ZFS expert, but it must have been incredibly stressful, if not mildly terrifying, going that far down the rabbit hole with customer data on the line.

I have a bad feeling someone is going to read their write up and tweet at them, “Why didn’t you use -xyz switch, it fixes exactly this issue in 12 seconds”.

rsync6y ago

I have not read this post-mortem yet, but I can attest that this is a viable strategy.

As many know, rsync.net is built entirely on ZFS.

---

What we continue to learn, decade after decade, from UFS2 through to ZFS, is that extremely simple infrastructure configuration is resilient and fails in predictable and boring ways.

piti_6y ago

I don't have hundreds customers, but I handle hundreds of TiB of data (for a science lab).

toomuchtodo6y ago

No offsite backups? Backblaze B2 is exceedingly cheap for example.

doublerabbit6y ago

Really Gandi should of had backups from day one. If your hosting data you should always have backups ready and tested on day one.

3 more replies

mattkrause6y ago

My lab generates similar-sized data sets and the transfer, more than the at-rest storage, is tough.

If our internet (or Box's datacenter) were slow, we could easily collect data faster than we could send it to our collaborators.

piti_6y ago

We do have backups, but getting back your hundred's TiB is really long: you want to keep it where they are living.

_nalply6y ago

Just bad luck. A different story: I had a cheap dedicated host in Atlanta. Their failure was epic. You get what you pay for.

An old highrise is filled with ten thousands of old second hand server blades, floors by floors of equpiment prolifically producing waste heat. A sure recipe for disaster?

Sure!

A wrongly installed fuse at one phase in the building made one phase burn out too early. I saw a picture of an archeological breaker equipment. They fixed that.

Edit: made some details more specific.

robbyt6y ago

Not bad luck, bad design. Same for Gandi, and for the situation you just described.

Drive failures and HVAC failures due to bad power are not "black swawn" events. These are very common problems for DCs, and a good design takes these problems into account.

However, a "bad" design is cheap, and hopefully the savings is passed to the customer.

_nalply6y ago

Hopefully... My dedicated hosting was very cheap, I paid $20 a month.

halfeatenpie6y ago

It's Delimiter, wasn't it? They weren't a very good hosting organization to begin with, but you get what you pay for.

1 more reply

ZeroCool2u6y ago

This is basically the stuff of nightmares.

escardin6y ago

generalpass6y ago

> How much did five days of panic cost them? Their customers? Their brand?

RantyDave6y ago

Oh, I think they recognise the damage to their brand.

1 more reply

ZeroCool2u6y ago

They do mention in the postmortem that they explicitly do not provide backups, and say so on their product page, but perhaps that it could be better communicated to customers.

[1]https://www.wired.com/2012/07/google-colossus/

[2]https://cloud.google.com/files/storage_architecture_and_chal...

escardin6y ago

Clearly if they're trying to recover the customer data, it was important enough to the business to do so, and maybe it's time to re-evaluate 'no backups'.

rsync6y ago

"Beyond radically changing how their core infrastructure works, doesn't seem like there was a lot they could have done to prevent this."

If only there were a cloud storage provider that you could 'zfs send', over SSH, to ...

If only ...

tmikaeld6y ago

This actually makes me glad to use ZFS (FreeBSD and ZOL) on all servers, a broken RAID on a different filesystem could have meant complete data-loss.

Sure, performance can suffer and RAM is pricey, but safety of the data is more important.

Considering this is a hardware fault, I think Gandi.net did their best. However, they should offer clients optional ZFS-Replication as an extra measure.

RantyDave6y ago

gizmo6y ago

The take-away here is clear: don't trust Gandi with anything you care about.

jrochkind16y ago

No data was lost though, true?

FemmeAndroid6y ago

See this thread for the support at the time:

https://twitter.com/andreaganduglia/status/12152827193300664...

Mashimo6y ago

> Honestly, I too was looking for a straightforward 'We screwed up, sorry.'

They do so a bit here: https://news.gandi.net/en/2020/01/major-incident-on-our-host...

>We’re very sorry for this truly unfortunate incident and we offer our sincere apologies to anyone impacted.

jrochkind16y ago

1 more reply

cptskippy6y ago

> No data was lost though, true?

What about any data that would have accumulated in those 5 days? This was storage for their IAAS and PAAS products, so anyone using those lost access for 5 days?

bpizzi6y ago

Well, it's a technical post-portem, not a love letter. I'm not affiliated to Gandi in any way, but I find the finger-pointing a bit too pedantic.

> The take-away here is clear: don't trust Gandi with anything you care about.

The take-away is not this one. Its: backup anything you care about.

bradly6y ago

They called snapshots backups in their web interface when viewing snapshots. From their docs:

> "Snapshots allow you to create a backup copy of a volume"

https://pbs.twimg.com/media/EN2UZ6TX4AAMe-H?format=png&name=...

They are doing a lot of preaching about backups when failing to do internal backups (not customer facing backups) of their own products.

bpizzi6y ago

As it's said in the postmortem, they are agreeing that they should have stated in a more obvious way that the backups availability was not contractually assured.

1 more reply

hyper_reality6y ago

cdubzzz6y ago

This is the messaged PR “postmortem”. It’s basically a shoulder shrug emoji and takes zero responsibility for the incident.

They also failed to address their abysmal responses on Twitter that essentially belittled and poked fun at the affected users.

E.g. https://news.ycombinator.com/item?id=22002258

Mashimo6y ago

>They don't say they're sorry,

>We’re very sorry for this truly unfortunate incident and we offer our sincere apologies to anyone impacted.

https://news.gandi.net/en/2020/01/major-incident-on-our-host... (linked in the Postmortem)

marcinzm6y ago

>But contractually, we don’t provide a backup product for customers. That may have not been explained clearly enough in our V5 documentation.

AaronFriel6y ago

Are S3 and EBS not hosted on the same underlying storage subsystems?

generalpass6y ago

I realize everyone here seems focused on the file system, but one of the things stressed by the OpenBSD project is that difficulty in upgrades are the root cause of the biggest problems.

Case in point.

vahomu6y ago

> As disks are read at 3M/s, we estimate the duration of the operation to be up to 370 hours.

Am I reading this right? This works out to just ~3.8TiB.

So much drama over, basically, one HDD worth of data?

robin_reala6y ago

They said that in the end only 414 users were affected, and it was their simple hosting package. Honestly, I’m almost surprised it was that much.

the84726y ago

If they had some spare SSD capacity lying around they could have done a linear copy of the HDD to SSD and then done the import, that could have sped up the random-access scans.

phonon6y ago

Absolutely pathetic on their end.

loa_in_6y ago

Their services otherwise run flawlessly for me. I appreciate the transparency.

BrentOzar6y ago

And now their postmortem blog post is down with a 503 error. Doesn’t exactly fill me with confidence about their abilities.

polyvisual6y ago

Their handling of the issue on Twitter was enough for to decide to move my domain names away from them when their renewal is due.

slig6y ago

Just a heads up, and forgive me if this is obvious, but you can move right away, the expire date will be the same anyway. You certainly don't want a failed migration too close to the expiration date.

polyvisual6y ago

yes, good point!

fooqux6y ago

Mashimo6y ago

> they had a provision in their TOS that basically said they could terminate my account at any time if I did anything they felt was morally wrong.

Don't most modern service providers have a clause like that?

Mashimo6y ago

Any recommendations? Preferred within the EU.

acd6y ago

iicc6y ago

> We think it may be due to a hardware problem linked to the server RAM.

Are they using ECC RAM?

tmikaeld6y ago

Sounds like they didn't and the metadata logs got corrupted..

This does also mean that other data would be corrupted too, running ZFS without ECC RAM is frequently warned against.

vel0city6y ago

tmikaeld6y ago

Indeed, ECC should be _default_ these days!

It becomes even worse when RAM Is usually only tested when the computer is built.

I've had several cases where RAM becomes faulty a couple of years down the road.

Recently I had a very weird case of two stick of four went bad due to only moving the computer from one corner to the next without even opening the case.

cdubzzz6y ago

Previously: https://news.ycombinator.com/item?id=22001822

perlgeek6y ago

As a postmortem, this does not inspire confidence. It's a very technical piece, but doesn't even try to take a customer's perspective.

If you want to learn from such an outage, you have to do a fault analysis that leads to parameters you can can control.

Sure, there can be faulty hardware and software, but you are the ones selecting and running and monitoring them.

If recovery takes ages, you might want to practice recovery and improve your tooling.

And so on.

Blaming ZFS and faulty hardware and old software all cries "we didn't do anything wrong", so no improvements in sight.

_eht6y ago

Being a Gandi customer must be terrifying, generally.

speedgoose6y ago

gdm856y ago

Would it be possible to backup the precious metadata separately to mitigate the issue?

tomatocracy6y ago

RantyDave6y ago

I don't think it's a problem of the data corrupting while on disk, I think the problem has occurred in ram and then been written to disk.

dmh20006y ago

After a glance I thought, 'why a storage unit?, where do they get power, how do they cool it, its not physically secure, etc'. Then, oh, that kind of storage unit. Yes, I'm dumb.

j / k navigate · click thread line to collapse