It's a good point that hashing is a better method when you have access to the original files.
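For the case where you do have both sides, hash comparison is easy to script. Here's a rough sketch in Python (the function names are my own, just for illustration, not from any particular tool):

```python
import hashlib
from pathlib import Path

def file_sha256(path):
    """Hash a file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def compare_trees(source, backup):
    """Yield relative paths whose backup copy is missing or differs from the source."""
    source, backup = Path(source), Path(backup)
    for src in source.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = backup / rel
        if not dst.is_file() or file_sha256(src) != file_sha256(dst):
            yield rel
```

Of course this only proves the backup matches the source *right now*; it says nothing if the source itself was already corrupted.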
Aren't all bets off at this point? I mean, validating the backup seems like skipping a step if you're not validating the source. Scrolling through thumbnails is better than nothing, sure, but it's really prone to false negatives: corrupted images can look fine in a thumbnail, and your eyes might miss even glaring corruption if you scroll too fast. And if it's not an image file, it gets more challenging still.
You seem to have one of those corner cases where basically no automated method can solve your problem, but the volume of data is just low enough that a bit of manual intervention can cover the gap.
As far as I can tell, the method described in the article doesn't really validate the backups in any way; it just produces some statistics that will fail in very plausible ways.
And of course, if the data is important to you and there are special circumstances that could affect the process, nothing beats an actual restore test.
You're correct that the methods I described are a far cry from actually guaranteeing that the backup has no errors. In the same way that a unit test doesn't prove code is error-free, but _can_ justify increased confidence in the code, I'm interested in techniques that can justify increased confidence in my backups. Particularly in cases where I don't have direct access to the original data, and where exhaustively checking the data manually is too time-consuming to be worth it.
Then I wrote software for backing up VMs automatically (disclaimer: this is a commercial product I sell).
There are options for getting an email on success, failure, or both. The VM files are all hashed.
VMs are easy to restore, so an actual restore test is pretty easy to do without risking overwriting the original. If a file hash doesn't match on restore, my software will complain but continue the restore anyway.
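That warn-but-continue behavior looks roughly like this (a generic sketch in Python with made-up names, not the actual product's code):

```python
import hashlib
import shutil
from pathlib import Path

def restore_with_check(backup_file, dest, expected_sha256):
    """Copy a backed-up file to dest; warn on hash mismatch but finish the restore."""
    actual = hashlib.sha256(Path(backup_file).read_bytes()).hexdigest()
    if actual != expected_sha256:
        # Complain loudly, but a partially-trusted restore is still more
        # useful than no restore at all.
        print(f"WARNING: hash mismatch for {backup_file}, restoring anyway")
    shutil.copy2(backup_file, dest)
    return actual == expected_sha256
```

The return value lets the caller decide whether the restored VM should be trusted or inspected further.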
FWIW, all my code etc. is also in source control, so I'm not relying on a single layer for that.