Ok, I really wasn't expecting this to land at the top of HN. I'd love to stick around to answer any questions people have, but it's 10PM and my toddler decided to go to bed at 5PM... so if I'm lucky I can get about 4 hours of sleep before she decides that it's time to get up. I'll check in and answer questions in the morning.
God bless you Colin, but reading this, it appears you're the only one in charge of the infrastructure for this service. I'm glad you're clear about no SLA, but this seems like a big liability between me and my backups.
I know you didn’t ask me, but I don’t think Colin can answer that any way other than saying he is training a family member or friend to take over if needed.
Here’s more: https://news.ycombinator.com/item?id=7514753 (this is also linked there: http://mail.tarsnap.com/tarsnap-users/msg00846.html)
Very old threads, but I am not sure much has changed there: https://www.tarsnap.com/contact.html
Why would you use it instead of restic? Well, for pricing in picodollars ;-)
And because it has a functional GUI with a tiny system footprint, and there really aren’t many such solutions out there.
Hence the toddler.
(FWIW, S3 can be somewhat straightforwardly configured so that old data is effectively immutable. Google Cloud Storage’s similarly named versioning feature appears to be far weaker.)
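One way to get that effective immutability (a minimal sketch; the bucket name is a placeholder, and an account root user could still remove the policy, hence "effectively" immutable): enable bucket versioning, then attach a bucket policy denying s3:DeleteObjectVersion, so deletes and overwrites only add new versions instead of destroying old data. The policy document itself is just JSON:

```python
import json

def immutable_versions_policy(bucket: str) -> dict:
    """Bucket policy denying permanent deletion of old object versions.

    With bucket versioning enabled, overwrites and deletes then only
    create new versions; old data stays recoverable. You would apply
    this with S3's PutBucketPolicy call (e.g. boto3's put_bucket_policy).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyVersionDeletion",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:DeleteObjectVersion",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }

print(json.dumps(immutable_versions_policy("my-backup-bucket"), indent=2))
```

S3 Object Lock is the heavier-duty option for the same goal, but it has to be enabled on the bucket itself.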
Wasabi does $7/TB with no ingress/egress fees. My NAS is set up to rclone to it about once a day and I've yet to have any problems
A lot of 'lessons learned' analysis boils down to this: in order to prevent a recurrence of X, we introduced complex subsystem Y, the unexpected effects of which you can read about in our next post-mortem.
"Our simple model that fails gracefully did so and was simple to recover"
Redundancies and failsafes are not free - they add complexity.
99.9% availability fails in boring ways.
99.999% availability fails in fascinating ways.
The main lesson learned was "rehearse this process at least once a year".
> at the present time it is possible — but quite unlikely — that a hardware failure would result in the Tarsnap service becoming unavailable until a new EC2 instance can be launched and the Tarsnap server code can be restarted ... So far such an outage has never occurred
I read the postmortem as saying that a hardware failure did cause it to be unavailable and the code could not be restarted; a new server had to be built.
If that is correct, then as well as writing up the lessons learned (as Jacques mentions), this page could be updated with outage information -- or even info on changes made to reduce the risk of repetition.
For what it's worth, one outage of a single day in fifteen years is impressive. If my ballpark math is correct, that's roughly 99.98% uptime, ie between three and four nines.
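Checking the ballpark (a quick sketch, assuming a single 24-hour outage over 15 years of operation):

```python
# Availability for a single 24-hour outage over ~15 years of operation.
hours_total = 15 * 365 * 24   # 131,400 hours
hours_down = 24
uptime = 1 - hours_down / hours_total
print(f"{uptime:.4%}")        # ~99.98%, i.e. between three and four nines
```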
Have been having some luck reading https://www.amazon.com/No-Cry-Sleep-Solution-Toddlers-Presch... - available everywhere libraries (blockbuster for books!) are found.
I too had a few EC2 instances go down with signs of being severed from EBS in the past couple of weeks; mine were in eu-west.
- Set up nightly automatic snapshots of EBS volumes (this is supported natively in AWS now under Data Lifecycle Manager).
- Use EBS volumes of the new GP3 type, and perhaps use provisioned IOPS.
- Set up an auto-scaling group with automatic failover. This increases cost, of course, but it should be able to fail over automatically to a standby EC2 instance (assuming all the code handles this automatically, which the blog post indicates is not currently the case).
What prevents you from distributing load among other regions?
(Also: did you ever think about abandoning AWS?)
- The use of “I” raises the question: what’s the “bus factor” of Tarsnap? If you were unavailable, temporarily or permanently, what are the contingency plans?
- Will you be making any other changes to improve the recovery time, or did the system mostly function as designed? For example having a hot spare central server?
This speaks volumes to me about what kind of person Percival is; that credit would appear to be generously on the "make customer whole" side of the fence, and unlike the major cloud providers, he didn't make each customer come and individually grovel for it. And a clearly written, technical, detailed PM, too. This is how it ought to be done, and done everywhere. Thanks for being a beacon of light in the dark.
That's well put.
It makes me very happy to live in a world where tarsnap exists and is priced in picodollars.
Also, I would suggest thinking about the business long term and seeing if you can increase revenue enough to hire a part-timer, who could be a great help if a similar event happens again.
We are also a small cloud solution provider (we focus on ML APIs), and over the years it has become clear to us that when you use cloud hardware (either dedicated or virtual), outages happen from time to time. RAM, disks, or other parts of the hardware can malfunction at any moment. So this is something which 100% needs to be taken into consideration when running any high-availability online service over the long term.
For example in both trains and cars, thanks to anti-lock braking, the correct way to stop the vehicle ASAP is to brake just like normal but as hard as you can, the computers will automatically solve the much trickier problem of turning your input into maximum deliverable braking force by periodically releasing brakes on sticking wheels.
If you run a fire drill, it's surprisingly difficult to get employees to use fire doors that they're used to finding alarmed and unusable. Even though intellectually they know that, say, the door at the bottom of the stairwell is a fire door, with crash bars and leads directly to the outside world, and this is a fire drill, they are likely to (for example) exit on a higher floor and go through a chokepoint lobby, as they would normally, instead of following this safer path that is emergency only. Sadly it is hard to fix buildings after construction if they were designed with such "unused" emergency exits.
For a backup process, having restoring machine images be a service that is sometimes, though not constantly, used anyway for some other reason is a good way to be comfortable with how it works, that it works, etc. At work, for example, we routinely test upgrades on test servers restored from a recent backup. Restore serviceA to testA, apply the upgrade, discover the upgrade completely ruins the service, throw testA away and report that this upgrade is garbage. But in the process we gained confidence in the restore process: when things go badly wrong, infrastructure people aren't trying to recall something they only ever did in a drill; they're very used to the procedure because they do it "all the time".
Rehearsing this annually is definitely going to be a high priority.
I personally would go with the simpler solution because in my experience you need an awful lot of extra complexity before you get to the same level of reliability that you have with the simpler system. Most complexity is just making things worse.
You can see this clearly when it comes to clustering servers. A single server with a solid power supply and network hookup will be more reliable than any attempt at making that service redundant until you get to something like 5x more costly and complex. Then maybe you'll have the same MTBF as you had with the single server. Beyond that you can have actual improvements. YMMV, and you may be able to get better reliability at the same level of performance in some cases, but on average it always first gets far more complex, costly and fragile before you see any real improvements.
I strongly believe that the best path to real reliability is simplicity (which is: as simple as possible) and good backups. For stuff that needs to be available 24x7 and 365 days per year this limits your choices in available technologies considerably.
This is Colin's job. Colin has his name attached to it. It's really important to Colin.
You're not going to get the same kind of service from BigBackupCorp. Their employees are replaceable, their management is replaceable, and to be honest, you as a customer are replaceable, if they decide to move in a different direction and become BigFlowerArrangementShippingCorp.
The neat thing about a small business is that it runs entirely on its own profits. There are no stock price games or VC jiggery-pokery or anything like that. If it's a profitable business, there will be somebody to come along and take it over and make it their job with their name attached to it. I think the open Internet benefits a lot from this sort of thing.
They should take separate buses to ______.
Better to have multiple layers of backup, of which tarsnap and friends are only one, and verify regularly.
Recommend writing a TLA+ model to catch stuff like this
(People here asking about the low Bus Factor: you don't keep your backups in one service/location, eh? You use Tarsnap and Restic with Backblaze, Rsync.net, S3, etc. right? "Backups are a tax you pay for the luxury of restore.")
I have been using Tarsnap for a decade, and not only have there been minimal availability issues, there have been almost no issues of any kind that I can recall.
>> So far such an outage has never occurred; but over time Tarsnap will become more tolerant of failures in order to minimize the probability that such an outage occurs in the future.
Neglecting the pricing, does Tarsnap have any advantage over Restic?
Restic also deduplicates, so it uses little storage.
I mean.. you could purchase a cheaper service and also donate to various efforts. Bonus: Then you'd also be able to pick those efforts.
Tarsnap makes a lot of sense when you benefit from the encryption and (especially) de-duplication features that it offers. For me, all of my most important personal and business data, from multiple decades, compresses-and-deduplicates down to around 6GiB. Considering the high value of the data I store in it, tarsnap's pricing actually feels absurdly low.
Can you provide more detail why you think so? I don't believe there is any use case in which tarsnap makes sense, other than maybe some Plan-C backup solution which you fall back on in the highly unlikely event that neither Plan-A nor Plan-B worked.
Concretely, what benefits does tarsnap offer over restic or borg in combination with rsync.net, to make up for the substantial downsides (such as insanely slow restore, complete lack of wetware redundancy or being written in C[1])?
Tarsnap : $0.25 / GB storage, $0.25 / GB bandwidth cost
rsync.net : $0.015 / GB storage, no bandwidth cost
s3 : $0.023 / GB storage, some complicated bandwidth pricing
If tarsnap is built on top of s3, they're charging 10 times for the storage cost. Easy money from the uninformed?
Tarsnap is a wonderful piece of software. You're paying for that.
That said, is the value of "Tarsnap" worth the price difference from "Borg+rsync.net"? (Or Restic, I've been meaning to look into Restic). I'm not so sure. These days I'm a customer of rsync.net, not of Tarsnap.
But I still firmly disagree with the "Colin's just exploiting the uninformed" angle.
Geez, that's really not improving the comparison with Tarsnap.
Backblaze: $0.005 / GB storage, $0.01 / GB download.
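Quick arithmetic with the per-GB storage rates quoted in this thread, for a hypothetical 100 GB of stored data (storage only, ignoring bandwidth, and ignoring that Tarsnap bills on post-compression, post-deduplication bytes, which can shrink the effective rate dramatically):

```python
# Monthly storage cost for 100 GB at the per-GB rates quoted in this thread.
rates_per_gb = {
    "tarsnap":   0.25,
    "rsync.net": 0.015,
    "s3":        0.023,
    "backblaze": 0.005,
}
gb_stored = 100
for service, rate in rates_per_gb.items():
    print(f"{service:10s} ${gb_stored * rate:6.2f}/month")
```

So at face value Tarsnap is 10-50x the raw storage price, but for data that deduplicates well, the gap narrows considerably.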
I don't think so. Anyone who can use this software I'm sure knows what other options exist.
The 120GB is the contents of my OneDrive and local repository trees. This is everything I've ever done that I want to keep, and it is approximately 115GB of photos and not a lot else!
That's pretty much any SaaS... look at the various log or metrics gathering solutions, where you pay serious multipliers of what it would cost to run the same software on your own instance.
I've been using Tarsnap for 10+ years. There's some Linux stuff getting backed up, configs and such. It costs next to nothing for this kind of usage.
While on the price: patio11 (Patrick) wrote an article about tarsnap's issues more than nine years ago (April 2014). One of the suggestions was to raise prices, IIRC. It's a long post, but you can read it [1] and the HN post [2] from that time.
[1] In case of an emergency, you will always be able to get back your data from tarsnap at a blazing rate of 50kB/s https://github.com/Tarsnap/tarsnap/issues/333.
How many of the world's best and brightest are doing all sorts of busywork? At least Colin has some time to do whatever he wants to do while running tarsnap.
Colin, could the website be updated to the 2010s? :P
This is entirely Safari's fault for not having good compatibility with a common existing webpage format.
Anyway, if you're the intended audience (someone using tarsnap), you also received a copy to your email address, where you can read the text with your email reader of choice.
<p> is far more appropriate
That isn’t apple’s problem, nor mine.
It’s not impossible to read, and it’s likely just a fault of whatever mailing-list software is used, but it could be better, and it’s nice if people let you know, right?
I assumed the parent did not know how to do that; I tried locally and it seemed to work, but I did not pay attention to the text.
original:
On the left side of the URL input field you'll find "AA" (the first letter smaller than the second); tap that.
Then, near the bottom of the pop-up menu, you'll find "Show Reader"; tap that.
If you're not happy with the text as displayed, you can go back to the "AA" menu and change the options.
Far be it from me to tell anyone how to write software, but why build a database on top of S3 when you can just chuck the metadata into RDS with however much replication you want?
The backups themselves should be in S3, but using S3 as a NoSQL append-only database seems unwise.
This would benefit from being further from the metal.
On a less technical note: Always avoid the fancy option when it makes sense. (From a veteran of building and maintaining large scale high performance high availability systems)
S3 is not the problem here. The problem is building a database on top of S3, and having to reimplement all the consistency, atomicity, transactions etc. on top.
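A minimal sketch of the kind of machinery that has to be rebuilt: an append-only commit log with optimistic concurrency, done via conditional "create only if absent" writes (real S3 gained this via the If-None-Match header on PUT; the in-memory class below merely stands in for it). This is essentially the put-if-absent pattern libraries like Delta Lake use for their commit protocol:

```python
class ObjectStore:
    """Toy stand-in for S3 with a conditional 'create if absent' PUT."""

    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key: str, body: str) -> bool:
        # Atomic "create only if the key does not already exist":
        # this is the primitive that makes the commit log safe.
        if key in self._objects:
            return False
        self._objects[key] = body
        return True


def commit(store: ObjectStore, payload: str, max_retries: int = 10) -> str:
    """Append payload as the next numbered log entry, retrying on collision."""
    for _ in range(max_retries):
        # A real implementation would track the log tail instead of scanning.
        next_id = sum(1 for k in store._objects if k.startswith("log/"))
        key = f"log/{next_id:010d}.json"
        if store.put_if_absent(key, payload):
            return key  # we won the race for this slot
        # Another writer claimed the slot first; re-read and try the next one.
    raise RuntimeError("too much write contention")
```

Two writers racing for the same slot means one PUT fails and retries at the next index; readers only ever see fully written commit objects, which is what gives atomic, ordered commits on top of plain object storage. Everything beyond this sketch (transactions, checkpointing, compaction) is more code you now own instead of Postgres.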
>no thought to a schema, no migrations to manage
There is, in fact, always a schema. Some people just choose to ignore that it's there, to their detriment.
>Always avoid the fancy option when it makes sense.
It's not the 1980s. Postgres is not fancy, and Greenspunning it is a mistake.
>Almost guaranteed it's cheaper.
Cheaper than a 26-hour outage?
Cost and reliability?
* Using S3 as a simple database is generally going to be much cheaper than RDS.
* If you turn on point in time restore, then losing data stored in S3 is not a possibility worth worrying about on a practical level for most people. RDS replication is easy enough to use, but adds more cost and a little bit of extra infra complexity.
It's a bad trade. Thousands of hours of a high human capital computer scientist vs. a few tens of dollars a month for RDS.
>Reliability
Empirically false: none of this would have happened if Tarsnap used Postgres instead of a home-spun database.
There are client libraries, like Delta Lake, that implement ACID on S3.
Much of the Grafana stack uses S3 for storage (Mimir/metrics, Loki/logs, Tempo/traces).
That said, I'm not sure about the implementation Tarsnap uses--whether it's completely ad hoc or based on other patterns/libraries.
How, exactly, is that a good thing?