Better HN

derefr · 8y ago

And even if that wasn't true, network-attached storage (unlike local storage) has no semantics for communicating a "partially completed" write of a block. Your server either manages to send an iSCSI packet to the SAN with a completed checksum, or it doesn't. Which means that—for the problems that would arise from a sudden power-cut to a VM (let's say from unexpected hypervisor failure)—using a journalling filesystem on your network disks would perfectly compensate for those problems.


4 comments · 2 top-level

fulafel · 8y ago

Common filesystems only journal metadata, so your file contents are not protected by this. As an exception, ext3 and ext4 support a data journaling mode via the data=journal mount option.

Even if you had data journaling, it wouldn't give you consistency between different files. This post used GitLab as an example, and git will break if some files in its database are updated but others are not. Git doesn't fsync by default to enforce their update order, and I don't know whether GitLab enables it or whether the performance hit is reasonable.
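To make the ordering point concrete, here is a minimal sketch of what "using fsync to enforce update order" could look like for two dependent files on a POSIX system. It is not how Git itself is implemented; `write_durably`, `fsync_dir`, `store_object_then_ref`, and the file names are all hypothetical, chosen only for illustration.

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and fsync it before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # flush file contents to stable storage
    finally:
        os.close(fd)


def fsync_dir(dirpath: str) -> None:
    """fsync a directory so new or renamed entries are durable (POSIX assumption)."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def store_object_then_ref(repo_dir: str, object_name: str,
                          payload: bytes, ref_name: str) -> None:
    """Persist the object first, and only then publish a ref that names it."""
    # Step 1: the object must be durable before anything points at it.
    write_durably(os.path.join(repo_dir, object_name), payload)
    fsync_dir(repo_dir)

    # Step 2: write the ref to a temp file, then rename it into place.
    ref_path = os.path.join(repo_dir, ref_name)
    tmp_path = ref_path + ".tmp"
    write_durably(tmp_path, object_name.encode() + b"\n")
    os.replace(tmp_path, ref_path)
    fsync_dir(repo_dir)
```

Without the fsync calls between the two steps, the filesystem is free to persist the ref before the object, which is the cross-file inconsistency that journaling alone does not prevent. The cost is at least one extra flush per update, which is the performance hit in question.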

solatic · 8y ago

Partially completed write of a block, sure. But partially completed write of a file?

I can imagine (cough) an application where the application is trying to write some binary blob to disk, doesn't finish before shutdown, and upon reboot, tries to load the binary blob back into memory, fails because the binary blob isn't consistent, doesn't handle the failure well, and refuses to boot.

App's fault? Sure. Does the customer care at 2 am? Nope.
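For what it's worth, a sketch of how an app could defend against that failure mode, assuming a POSIX-like filesystem where a rename within one directory is atomic: write the blob to a temp file with a checksum prefix, fsync, rename into place, and treat a missing or corrupt blob as "no saved state" rather than a fatal error. `save_blob`, `load_blob`, and the on-disk layout are hypothetical, not taken from any particular application.

```python
import hashlib
import os
from typing import Optional

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def save_blob(path: str, payload: bytes) -> None:
    """Write checksum + payload to a temp file, then atomically rename it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(hashlib.sha256(payload).digest() + payload)
        f.flush()
        os.fsync(f.fileno())  # contents are on disk before the rename
    # Readers see either the old complete file or the new complete file.
    os.replace(tmp_path, path)


def load_blob(path: str) -> Optional[bytes]:
    """Return the payload, or None if the file is missing or fails its checksum."""
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except FileNotFoundError:
        return None
    digest, payload = raw[:DIGEST_SIZE], raw[DIGEST_SIZE:]
    if len(digest) != DIGEST_SIZE or hashlib.sha256(payload).digest() != digest:
        return None  # torn or corrupted write: caller falls back to defaults
    return payload
```

The caller would then start from a clean state when `load_blob` returns `None` instead of refusing to boot. A directory fsync after the rename is left out here for brevity.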

colechristensen · 8y ago

Then all you're saying, over and over, is that in your imagination, not using a long-running instance is very dangerous because rebooting exposes the fragility of your app.

Honestly, it's much safer in that circumstance to have a frequently rebooting instance because it will quickly expose your app's fragility during normal operations instead of that fragility being exposed in a disaster.

solatic · 8y ago

> it's much safer in that circumstance to have a frequently rebooting instance

I actually happen to agree with you in principle on this, and it's at the root of my current side project.

But sometimes you just don't have the flexibility to fix or replace the app. Ops engineering, like any other kind of engineering, is about dealing with real-world constraints and making the most of the resources you have. Most apps, on some notion of a fragility spectrum, are far closer to fragile than to antifragile, because fragile is the default, and extensive stress-testing to understand and plan for all failure modes before a production deployment isn't typically feasible. At that point, if you can't fix it, you have to work around it.
