With that said - Backblaze is optimized for working documents - and the default "exclusion" list makes it clear they don't want to be backing up your "wab~,vmc,vhd,vo1,vo2,vsv,vud,vmdk,vmsn,vmsd,hdd,vdi,vmwarevm,nvram,vmx,vmem,iso,dmg,sparseimage,sys,cab,exe,msi,dll,dl_,wim,ost,o,log,m4v" files. They also don't want to back up your /applications, /library, /etc, and so on. And they make it clear that backing up a NAS is not the target case for their service.
I can live with that - because, honestly, it's $4/month, and my goal is to keep my working files backed up. For system image backups, I've been using SuperDuper to a $50 external hard drive.
Glacier + a product like http://www.haystacksoftware.com/arq/ means I get the best of both worlds - Amazon will be fine with me dropping my entire 256 gigabyte drive onto Glacier (total cost: $2.56/month) and I get the benefit of off-site backup.
The world is about to get a whole lot simpler (and cheaper) for backups.
Arq is the pinnacle of my rather large backup pyramid, which also includes Dropbox, SuperDuper, CrashPlan, rsync, SVN/Git, and more.
I will gladly pay more money for Arq again.
That said, I am reminded that I should not forget about the Arq backups and do a few test restores sometime. :)
A home user is fine with a 3.5-4 hour window before their backup becomes available for download (as it will probably take them days to download it anyway).
In a corporate environment, I don't want to wait around for 3.5-4 hours before my data even becomes available for restore in a disaster recovery situation.
Seems good for archive-only in a corporate environment (as the name implies).
In a true disaster recovery (Building burned down or is otherwise unavailable) - it usually takes most businesses a week or so just to find new office facilities.
But - agreed, there will be some customers for whom Glacier wouldn't work well for all use cases.
Now - a blended S3/Glacier offering might be very attractive.
I am guessing you work at a company with a good IT department, then; I am guessing this is not the average. At many companies I have worked for, 4 hours would be a miracle - a 1-3 day operation is the minimum.
And don't forget that the X00MB-type size limits many IT departments put everywhere are not there because a TB hard drive isn't cheap (it is), but because all of the extra backups add to the cost of each new MB. Having another extremely cheap way to back up large amounts of data (encrypted?) would help to reduce the cost of each extra GB.
In that case, 3-4 hours would be more than acceptable.
ding!
Already has Amazon support: http://git-annex.branchable.com/tips/using_Amazon_S3/
Very friendly with very large files!
http://www.haystacksoftware.com/support/arqforum/topic.php?i...
Dropbox should work here, but it's simply too expensive. My photo library is 175GB. That isn't excessive considering I store the digital negatives and this represents over a decade.
I don't mind not being able to access it for a few hours, I'm thinking disaster recovery of highly sentimental digital memories here.
If my flat burns down, destroying my local copy, and my personal off-site backup (an HDD at an old friend's house) is also destroyed... then nothing would be lost if Amazon has a copy.
In fact, I very much doubt anyone I know who isn't a techie actually keeps all of their data backed up even in that manner.
I find myself already wondering: my 12TB NAS, of which 4TB is used... could I back all of it up remotely? It's crazy that this approaches being feasible. It's under 30 GBP per month for Ireland storage for all of the data on my NAS.
To be able to say, "All of my photos are safe for years and it's easy to append to the backup." That would be something.
A service offering a simple consumer interface for this could really do well.
edit: emphasis formatting, niceifying.
That said, people DO use DropBox as backup.
If you took a walk around the British Library and asked every PhD student working there how they "back up" their research and work in progress, I bet every single person who believes they have a backup would say "Dropbox", and the only exceptions would be a few who don't really have a backup at all.
I know that because I ensured my girlfriend does have a real backup solution in place that is tested. Not one of her peers seems to.
Dropbox is used for backup because they've made file sync so damn easy that most people can be convinced that if a file exists in many places, it is backed up.
My whole point is that now that storage for long-term backup is priced in a way that's affordable to most, consumer services may emerge that offer true backup and can successfully migrate people away from lesser solutions (Dropbox, stacks of CD-ROMs, etc).
One of the things about backup is that it needs to be easy. Currently the size of our data and the cost of storing it make backups expensive, and the only way to reduce the cost makes them difficult (local HDD copies stored at a friend's house, for example).
By reducing the cost, perhaps we can finally increase the ease... and then a day may come in which most people have a real backup solution.
For $50 in software (Arq + SuperDuper), $100 for an external HD, and less than $25/month ($4 Backblaze, $10 Dropbox, $10 Glacier), you have a backup system that is next to airtight for a terabyte of data and a working set (on Dropbox) of 100 gigabytes.
I have my MacBook and a Linux server linked to my Dropbox account, so changes in my documents are synced to the Linux box.
The Linux box runs three cron jobs: one daily, one weekly, and one monthly. The command is:
s3cmd sync --delete-removed ~/Dropbox/documents/ s3://backup-daily/
There are buckets for weekly and monthly too.
Note: the command is not exactly like that, check the man page.
That way I have all my documents backed up very cheaply.
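For illustration, the three crontab entries might look something like this (a sketch - the bucket names are made up, and as noted, check the s3cmd man page for the exact flags):

    # daily at 02:00, weekly on Sundays at 03:00, monthly on the 1st at 04:00
    0 2 * * * s3cmd sync --delete-removed ~/Dropbox/documents/ s3://backup-daily/
    0 3 * * 0 s3cmd sync --delete-removed ~/Dropbox/documents/ s3://backup-weekly/
    0 4 1 * * s3cmd sync --delete-removed ~/Dropbox/documents/ s3://backup-monthly/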
Also, please don't use uppercase for emphasis, as mentioned in the guidelines [1]. If you want to emphasize a word or phrase, put asterisks around it and it will get italicized.
Why doesn't Dropbox enable this use case? It would surely be very easy to implement. I guess Packrat does do this in a way, but it seems like overkill.
I'd like something along the lines of duplicating all files but requiring confirmation of deletions and overwrites.
"In the coming months, Amazon Simple Storage Service (Amazon S3) plans to introduce an option that will allow you to seamlessly move data between Amazon S3 and Amazon Glacier using data lifecycle policies."
Deleting data from Amazon Glacier is free if the archive being deleted has been stored for three months or longer. If an archive is deleted within three months of being uploaded, you will be charged an early deletion fee. In the US East (Northern Virginia) Region, you would be charged a prorated early deletion fee of $0.03 per gigabyte deleted within three months.
What kind of system has Amazon most likely built that takes 3-4 hours to perform retrieval? What are some examples of similar systems, and where are they installed?
There'll be a near-line HDD array. This is for recent content and content they profile as commonly accessed.
Then there'll be a robotic tape library. Any restore request will go in a queue, and when an arm/tape drive becomes free they'll seek to the data and read it into the HDD array.
Waiting for a slot with the robot arm and tape drive is what will take 4 hours.
EMC (kinda), Fujitsu, etc. make these.
http://en.wikipedia.org/wiki/Tape_library
http://www.theregister.co.uk/2012/06/26/emc_tape_sucks_no_mo...
First, no tape. The areal storage density of tape is lower than hard disks. Too many moving parts involved. Too hard to perform integrity checks on in a scalable, automated fashion without impacting incoming work.
Second, in order to claim the durability that they do (99.999999999%), that means every spot along the pipe needs to meet those requirements. That means the "near-line HDD array" for warm, incoming data needs to meet those requirements. Additionally, if the customer has specified that the data be encrypted, it needs to be encrypted during this staging period as well. It also needs to be able to scale to tens if not hundreds of thousands of concurrent requests per second (though, for something like Glacier, this might be overkill).
They've already built something that does all that. It's called S3. The upload operations likely proxy to S3 internally (with a bit of magic), and use that as staging space.
After that, the bottleneck is likely I/O to Glacier's underlying storage - but again, not tapes. See this post for deets: http://news.ycombinator.com/item?id=4416065
Or, in case they are using regular hard drives, you might want this kind of time limit in order to pool requests going to a specific set of drives. This would let them power down the drives for longer periods of time.
The 3-4 hour estimate may also be artificial. Even if you can in most cases retrieve the data faster, it would be good to give an estimate you can always meet. They might also want to differentiate this more clearly from standard S3.
And we should not forget that it does take time to transfer, say, a terabyte of data over the network (even at 100 Mbit/s, that's over 22 hours).
What would be absolutely fascinating is a pay-before-you-go storage service — data cryonics.
Paying $12 to store a gigabyte of data for 100 years seems like a pretty intriguing deal as we emerge from an era of bit rot.
I'm not sure what kind of organisation I'd actually trust to store data for that length of time. A commercial organisation is probably going to be more effective at providing service, but what commercial organisation would you trust to provide consistent service for 100 years? A Swiss bank, perhaps? Governments of stable countries are obviously capable of this (clearly they store data for much longer), but they aren't set up to provide customer service.
The Stora Kopparberg mining company has existed since it was granted a charter from King Magnus IV in 1347.
A few banks tend to last for a long time [1]. Banca Monte dei Paschi di Siena has existed for about 540 years.
Beretta, the Italian firearms company, has existed for 486 years (and has been family owned the entire time).
East and West Jersey were owned by a land proprietorship for around 340 years starting from King Charles II bestowing the land to his brother James in 1664. [2]
At first I thought multinational corporations would be more stable because they could move from land to land to avoid wars and such. But apparently they haven't lasted nearly as long as their single-nation counterparts.
The Knights Templar were granted a multi-national tax exemption by Pope Innocent II in 1139, and lasted almost 200 years until most of their leadership was killed off in 1307.
The Dutch East India Trading Company was one of the first [modern] multinational corporations, spanning almost 200 years from 1602-1798.
However, the longest-lasting companies have been family owned and operated. [3] [4]
It appears that almost all companies that have lasted a long time owe it to two factors: dealing in basic goods and services that all humans need, and looking ahead to change with the times.
[1] https://en.wikipedia.org/wiki/List_of_oldest_banks [2] https://docs.google.com/viewer?a=v&q=cache:nYQU4NpfD74J:... [3] http://www.businessweek.com/stories/2008-05-14/centuries-old... [4] http://www.bizaims.com/content/the-100-oldest-companies-worl...
I don't consider that obvious. I live in Berlin, the capital of what most would consider a stable country, but my apartment (which is even older) has been part of 5 different countries in the last 100 years (German Empire, Weimar Republic, Nazi Germany, East Germany and, finally, the Federal Republic of Germany).
Having a long history obviously isn't a predictor of future stability. According to the Long Now Foundation site [2], a Japanese company that had existed since 578 CE went bust in 2007.
[1] http://www.bizaims.com/content/the-100-oldest-companies-worl... [2] http://blog.longnow.org/02008/06/13/the-100-oldest-companies...
If you can figure out a way to convince a few dozen people every decade that the best way to glorify God is to isolate themselves off somewhere maintaining your archival data, you'll be set for centuries.
Because as we answer issues of cost and availability, a logical thing to wonder is "how long can I really depend on it though?" As quickly as cloud services (where "lifetimes" are measured at six years) have entered our economy, that's a question begging to be answered.
Amazon at least seems to be an "eventually durable" datastore, though. Meaning that if you are told in the future that it will go offline, you have an excellent chance to make other suitable arrangements. Say there's a 0.01% chance of this product being discontinued next year, up to a 10% chance 5 years from now. I have to think there will almost certainly be other services you can move your data to, on similar terms, for a long time.
That's assuming you're around at all, and nothing reeeeeally bad happens even so. Making data last after your death (or even after you stop paying!) is a lot harder in this environment, and achieving true 100-year durability is a tough nut indeed.
I like your bank idea, since the preservation of a bank account is just a specialized simple case of data preservation. Data preservation seems rather more reliable when the data is directly attached to money. Then again, maybe banks themselves are on their way out for this purpose — Dropbox could become the new safety-deposit box.
But there are 100-year domain registrations, after all. Maybe we're ready for organizations to at least offer 100-year storage, too.
You could have a foundation established to choose and pay an array of commercial organizations that do the archiving.
As long as that data is decodable and, more importantly, findable (out of all the GBs frozen for 100 years, why would you want to look at any particular one of them?).
I'd store my pictures there. Finding old pictures of grandparents when they were little, or even older stuff, is amazing. Wouldn't it be cool if my descendants could still look at pictures of my family in 100 years?
Provided that downloading from this 100 year store is something I could do X times per year, and so long as I could append more data to it over time, it's an interesting business model.
"Long Data, LLC... We secure your data for the long-term".
Amazon simultaneously stands for ecommerce and web infrastructure, depending on the context. E.g. "Hey, I want to host my server." "Why don't you try Amazon?" "Do you know where I can get a fairly priced laptop?" "Check Amazon."
Is there any other brand that has done this successfully?
Edit: I should have specified internet brand.
Mitsubishi and Samsung spring to mind as two of the best-known ones internationally - their brands are known in multiple markets worldwide, though many of their businesses are less known outside Asia (e.g. Mitsubishi's bank is Japan's largest). Then there are any number of other Asian conglomerates.
ITT used to fall into that category back in the day: fridges, PCs, hotels, insurance, schools, telecoms and lots more. I remember we at one point had both an ITT PC and an ITT fridge. The name was well known in many of its markets.
Large, sprawling, unfocused conglomerates have fallen a bit out of favor in Europe and the US. ITT was often criticized for its lack of focus even back in the 80's, and has since broken itself into more and more pieces and renamed and/or sold off many of them (e.g. the hotel group is now owned by Starwood).
The Japanese keiretsu[2], from what I have read, are similar. The companies are somewhat loosely connected, but connected nevertheless.
Virgin has already been mentioned, but it is a really interesting example. It really is just a brand - there is no single controlling company.
Glacier is an archive product. It's for data you don't really see yourself ever needing to access in the general course of business ever again.
If you're a company and you have lots of invoice/purchase transactional information that's 2+ years old that you never use for anything, but you still have to keep it for 5 - 10 years for compliance reasons, Glacier is the perfect product for you.
Even its pricing is designed around the assumption that the average use case is to access only small portions of the total archive store (5% prorated free, per the pricing page).
I'm often creating pretty big media assets, so Dropbox doesn't necessarily offer enough space, or is - for me - too expensive in the 500GB version (i.e. $50 a month).
Glacier would be $10 a month for 1 terabyte. Fantastic.
+ the $120 or so per TB to transfer it out of AWS if you need the whole thing back as fast as possible. Still likely to be very cheap as long as you treat it as a disaster recovery backup, though. Will definitely consider it.
(An alternative for you is a service like CrashPlan, which also gives you very easy access to past file revisions via a Java app, can be very cheap, and allows "peer to peer" backups with your friends/family. The downside with CrashPlan is that it can be slow to complete a full initial backup to their servers, or to get fully backed up again if you move large chunks of data around.)
The salt is used to prevent Amazon from just keeping the hashes around to report that all is well.
To avoid abuse, restrict the number of free verification requests per month.
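One hypothetical way to do that client-side (a sketch - Glacier exposes no such verification API; you would stash salted digests before upload and spend one per check):

    import hashlib, os

    def make_challenges(data, n=100):
        # Before uploading, precompute n (salt, digest) pairs and keep them locally.
        # A fresh random salt per check forces the provider to re-read the actual
        # stored bytes -- a precomputed hash of the data alone can't answer it.
        return [(salt, hashlib.sha256(salt + data).hexdigest())
                for salt in (os.urandom(32) for _ in range(n))]

    def verify(digest, provider_response):
        # provider_response should be the provider's sha256(salt + stored_data).
        return provider_response == digest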
Also - upload/recovery times are problematic when you are talking tens of terabytes. Right now, the equation favors archiving tapes at that level (even presuming you store multiple copies for redundancy/safety).
Glacier is for people wanting to archive in the sub-ten-terabyte range - they can avoid the hassle/cost of purchasing tape drives, tapes, and software, and just have their online archive.
The needle will move - in 10 years Glacier might make sense for people wanting to store sub-100 terabytes, and tapes will be for the multi-petabyte people.
[1] http://en.wikipedia.org/wiki/Linear_Tape-Open
[2] http://www.ironmountain.com/Solutions/~/media/9F17511FA1A741...
Amazon Glacier is designed for use cases where data is retained for months, years, or decades. Deleting data from Amazon Glacier is free if the archive being deleted has been stored for three months or longer. If an archive is deleted within three months of being uploaded, you will be charged an early deletion fee. In the US East (Northern Virginia) Region, you would be charged a prorated early deletion fee of $0.03 per gigabyte deleted within three months
There are any number of reasons why deletes would be discouraged. One is packing: if your objects are "tarred" together in a compiled object, discouraging early deletes makes it more cost-effective to optimistically pack early.
This is why the cost of retrieval is so high: every time they need to pull data the drives need to be spun back up (including drives holding data for people other than you), accessed, pulled from, then spun back down and put to sleep. Doing this frequently will put more wear and tear on the components and cost Amazon money in power utilization.
As is, Glacier should be extremely cheap for AWS to operate, regardless of the total amount of data stored in it. Beyond the initial cost of purchasing, installing, and configuring the hard drives, most of the usual ongoing maintenance and power requirements go away.
The retrieval fee for 3TB could be as high as $22,082 based on my reading of their FAQ [1].
It's not clear to me how they calculate the hourly retrieval rate. Is it based on how fast you download the data once it's available, how much data you request divided by how long it takes them to retrieve it (3.5-4.5 hours), or the size of the archives you request for retrieval in a given hour?
This last case seems most plausible to me [6] -- that the retrieval rate is based solely on the rate of your requests.
In that case, the math would work as follows:
After uploading 3TB (3 * 2^40 bytes) as a single archive, your retrieval allowance would be 153.6 GB/mo (3TB * 5%), or 5.12 GB/day (3TB * 5% / 30). Assuming this one retrieval was the only retrieval of the day, and as it's a single archive you can't break it into smaller pieces, your billable peak hourly retrieval would be 3072 GB - 5.12 GB = 3066.88 GB.
Thus your retrieval fee would be 3066.88 * 720 * .01 = $22,081.54 (719x your monthly storage fee).
That would be a wake-up call for someone just doing some testing.
--
[1] http://aws.amazon.com/glacier/faqs/#How_will_I_be_charged_wh...
[2] After paying that fee, you might be reminded of S4: http://www.supersimplestorageservice.com/
[3] How do you think this interacts with AWS Export? It seems that AWS Export would maximize your financial pain by making retrieval requests at an extraordinarily fast rate.
[(edit) 4] Once you make a retrieval request the data is only available for 24 hours. So even in the best case, that they charge you based on how long it takes you to download it (and you're careful to throttle accurately), the charge would be $920 ($0.2995/GB) -- that's the lower bound here. Which is better, of course, but I wouldn't rely on it until they clarify how they calculate. My calculations above represent an upper bound ("as high as"). Also note that they charge separately for bandwidth out of AWS ($368.52 in this case).
[(edit) 5] Answering an objection below, I looked at the docs and it doesn't appear that you can make a ranged retrieval request. It appears you have to grab an entire archive at once. You can make a ranged GET request, but that only helps if they charge based on the download rate and not based on the request rate.
[(edit) 6] I think charging this way is more plausible because they incur their cost during the retrieval regardless of whether or how fast you download the result during the 24 hour period it's available to you (retrieval is the dominant expense, not internal network bandwidth). As for the other alternative, charging based on how long it takes them to retrieve it would seem odd as you have no control over that.
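For what it's worth, the upper-bound reading above as a small calculator (a sketch of my reading of the FAQ, not Amazon's confirmed billing logic):

    # Sketch: retrieval fee under the "request rate" reading of the FAQ.
    # Free allowance: 5% of stored data per month, prorated daily.
    def retrieval_fee(stored_gb, retrieved_gb, hours=1.0, tier=0.01):
        free_per_day = stored_gb * 0.05 / 30
        billable_peak_gb_per_hr = max(0.0, retrieved_gb - free_per_day) / hours
        return billable_peak_gb_per_hr * 720 * tier

    print(retrieval_fee(3072, 3072, hours=1))  # ~22081.54: the worst case above
    print(retrieval_fee(3072, 3072, hours=4))  # ~5520.38: if retrieval is spread over 4 hours

The 4-hour case matches the figure computed downthread for the "it takes them 4 hours" reading.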
If you're not an Iron Mountain customer, this product probably isn't for you. It wasn't built to back up your family photos and music collection.
Regarding other questions about transfer rates - using something like AWS Import/Export will have a limited impact. While the link between your device and the service will be much fatter, the reason Glacier is so cheap is the custom hardware. They've optimized for low-power, low-speed, which leads to cost savings from both energy savings and increased drive life.

I'm not sure how much detail I can go into, but I will say that they've contracted a major hardware manufacturer to create custom low-RPM (and therefore low-power) hard drives that can programmatically be spun down. These custom HDs are put in custom racks with custom logic boards, all designed to be very low-power. The upper limit on how much I/O they can perform is surprisingly low - only so many drives can be spun up to full speed on a given rack. I'm not sure how they stripe their data, so the perceived throughput may be higher based on parallel retrievals across racks, but if they're using the same erasure coding strategy that S3 uses, and writing those fragments sequentially, it doesn't matter - you'll still have to wait for the last usable fragment to be read.
I think this will be a definite game-changer for enterprise customers. Hopefully the rest of us will benefit indirectly - as large S3 customers move archival data to Glacier, S3 costs could go down.
My backup "wouldn't it be cool if", unlike the above reasonableness, is a joke: imagining 108 USB hard drives chained to a poor PandaBoard ES, running a fistful at a time: https://plus.google.com/113218107235105855584/posts/BJUJUVBh...
The Marvell ARM chipsets at least have SATA built in, but I'm not sure if you can keep chaining out port expanders ad-infinitum the same way you can USB. ;)
Thanks so much for your words. I'm nearly certain the custom logic boards you mention are done with far more vision, panache, and big-scale bottom-line foresight than these ideas; even some CPLD multiplexers hotswapping drives would be a sizable power win over SATA port expanders and USB hubs. Check out the port expanders on Open Compute Vault 1.0, and their burly aluminium heat sinks: https://www.facebook.com/photo.php?fbid=10151285070574606...
But at its price points, with most US families living under pretty nasty data cap or overage regimes, it sounds superb, with of course the appropriate front ends.
There's no good (reliable), easy, and cheap way to store digital movies; e.g. DVD recordable media is small by today's standards and much worse than CD-Rs for data retention (I haven't been following Blu-ray recordable media, I must confess; I bought an LTO drive instead, but I'm of course unusual). And the last time I checked, very few people made a point of buying the most reliable media in any of these formats.
In case of disk failure, fire, or tornado (http://www.ancell-ent.com/1715_Rex_Ave_127B_Joplin/images/ ... and rsync.net helped save the day), for this use case you don't care about quick recovery so much as knowing your data is safe (hopefully AWS has been careful enough about common-mode failures) and knowing you can eventually get it all back. Plus a clever front end will allow for some prioritizing.
An important rule learned from Clayton Christensen's study of disruptive innovations (where the hardest data comes from the history of disk drives...) is that you - or rather AWS here - can't predict how your stuff will be used. So if they're pricing it according to their costs, as you imply, they're doing the right thing. Me, I've got a few thousand Taiyo Yuden CD-Rs whose data is probably going to find a second home on Glacier.
ADDED: Normal CDs can rot, and getting them replaced after a disaster is a colossal pain even if your insurance company is the best in the US (USAA... and I'm speaking from experience, with a 400+ line item claim that could have been 10 times as bad, since most of my media losses were due to limited water problems), so this is also a good solution for backing them up. Will have to think about DVDs...
Right now we sell 10TB blocks for $9500/year[1].
This works out to 7.9 cents/GB per month, so 7.9x the Glacier pricing. However, our pricing model is much simpler, as there is no charge at all for bandwidth/transfer/usage/"gets" - the 7.9 cents is it.
7.9x is a big multiplier. OTOH, users of these 10TB blocks get two free instances of physical data delivery (mailing disks, etc.) per year, as well as honest to god 24/7 hotline support. And there's no integration - it just works right out of the box on any unix-based system.
We had this same kind of morning a few years ago when S3 was first announced, and always kind of worried about the "gdrive" rumors that circulated on and off for 4 years there...
2013 will be interesting :)
The wonky pricing on retrieval makes this inordinately complex to price out for the average consumer who will be doing restores of large amounts of data.
The lack of easy consumer flexibility on restores is also problematic for the use case of "Help, I've lost my 150 GB Aperture library / 1 TB hard drive".
The 4-hour retrieval time makes it a non-starter for those of us who frequently recover files (sometimes from a different machine) off the website.
The cost is too much for >50 terabyte archives - those users will likely be doing multi-site Iron Mountain backups on LTO-5 tapes. After 100 terabytes, the cost of the drives is quickly amortized and ROI on the tapes is measured in about a month.
The new business model that Amazon may have created overnight, though - one that beats everyone on price and convenience - is "Off-Site Archiving of Low-Volume, Low-Value Documents". Think family pictures. Your average shutterbug probably has on the order of 50 GB of photos (give or take) - is it worth $6/year for them to keep a safe off-site archive? Every single one of those people should be signing up for the first software package that gives them a nice consumer-friendly GUI to back up their Picasa/iPhoto/Aperture/Lightroom photo library.
Let's all learn a lesson from Mat Honan.
0.79 * 720 * 0.01 ≈ $5.69
Giving me a little less than $6.
Now, do you think Amazon is likely to think they can get away with selling a service that charges you $22k for a 3TB retrieval?
Second, you have ranged GETs and tape headers; use them to avoid transferring all of your data out of the system at once. [Edit: looks like ranged GETs are on job data, not on archival retrieval itself. My bad.]
The most obvious way to me would be to assume it is based on the actual amount of data transferred in an hour less the free allowance they give you. Which is actually what they say:
"we determine the hour during those days in which you retrieved the most amount of data for the month."
This also ties in with what the cost is to them, the amount of bandwidth you're using.
In your example you would need transfer rates of 3TB/hr. Given the nature of the service, I don't think they are offering that amount of bandwidth to begin with. (I'm sure they get good transfer rates to other Amazon cloud services, but customers could be downloading that data to a home PC, at which point they won't get anything even close to those rates.)
At that point a bigger issue might be how long it takes to get the data out rather than the cost.
At an overly generous download speed (residential cable) of 10GB/hr your 3TB archive would take over 12 days to download.
Probably based on the speed and the number of arms that the robot has that will grab the right tapes for you :-)
I'm not joking.
If it takes them 4 hours to retrieve your 3TB, then your peak hourly retrieval rate would be 768GB / hour (3072 GB / 4 hours). Your billable hourly retrieval rate would be 768GB - 1.28GB (3072 * .05 / 30 / 4 hours).
Total retrieval fee: 766.72 * 720 * .01 = $5520.38 (~180x your monthly storage fee)
The pricing appears to not be optimized for retrieving all your data in one fell swoop. This particular example appears to be a worst case scenario for restoration because you haven't split up your data into multiple archives (doing so would allow you to reduce your peak hourly retrieval by spacing out your requests for each archive) and you want to restore all your data (the free 5% of your data stored doesn't help as much when you want to restore all your data).
[1] https://forums.aws.amazon.com/message.jspa?messageID=374065#...
I also must say that the way you calculate the retrieval fee really looks like black magic at first sight. I hope they will add a simple calculator to evaluate some scenarios, and publish the expected bandwidth available from Glacier to an EC2 instance.
'Update: An Amazon spokesperson says “For a single request the billable peak rate is the size of the archive, divided by four hours, minus the pro-rated 5% free tier.”'
This seems to imply the cost is closer to $4k instead of $22k. However, the spokesperson's statement seems to describe intended system performance, not prescribe the resulting price. So if it actually does take them an hour to retrieve your data, you might still owe them $22k.
0.01·S + 1.80·R·max(0, 1 − 0.0017·S/D)
S is number of GB stored.
R is the biggest retrieval of the month. Parallel retrievals are summed, even if overlap is only partial.
D is the amount of data retrieved on the peak day (≥R)
e.g., for 10TB storage, max. 50GB per retrieval, and max. 200GB retrieval per day: $2188.20 / year
http://www.wolframalpha.com/input/?i=0.01*10000%2B1.80*50*ma...
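The same formula as runnable code, checked against the example above (a sketch; the constants come from the formula, not from Amazon):

    # Monthly cost = storage + retrieval, per the closed form above.
    def monthly_cost(s_gb, r_gb, d_gb):
        return 0.01 * s_gb + 1.80 * r_gb * max(0.0, 1 - 0.0017 * s_gb / d_gb)

    print(12 * monthly_cost(10000, 50, 200))  # ~2188.20 $/year, matching the example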
Let's run 100 GB, call it X. Allowance limit: 5% of 100 GB is 5 GB/mo, or per day 100 GB/(20 × 30) = 0.1666 GB/day; that's X/600.
The hourly rate needed for a sustained 24-hour retrieval of 100 GB is 100 GB/24 hr, or 4.1666 GB/hr; X/24. That's your peak hourly rate.
"To determine the amount of data you get for free, we look at the amount of data retrieved during your peak day and calculate the percentage of data that was retrieved during your peak hour. We then multiply that percentage by your free daily allowance."
To begin, all that's stated here is: spread your data retrieval out over the day. Their example:
"you retrieved 24 gigabytes during the day and 1 gigabyte at the peak hour, which is 1/24 or ~4% of your data during your peak hour."
We're doing 4.1666 GB in the peak hour out of 100 GB in the peak day, or ~4%.
(X/24)/X = 1/24 = ~4.1666%, if you don't fuck your metering up.
"We multiply 4% by your daily free allowance, which is 20.5 gigabytes each day. This equals 0.82 gigabytes [ed: free allowance hourly]. We then subtract your free allowance from your peak usage to determine your billable peak."
Free allowance hourly rate: 4.1666% × 0.1666 = 0.00694 GB, or (X/600)/24 = X/14400. (Amazon rounds the peak-hour percentage to 4%, which gives X/15000; their example is (12 × 1024)/15000 = 0.8192 GB free, which verifies.)
Billable peak hourly is then the hourly peak rate minus the free rate: 4.1666 − 0.0069 = 4.1597, or (X/24) − (X/600)/24 = (X − X/600)/24 = (599X/600)/24. Or simply: billable peak hourly will always be, for sufficiently non-incompetent implementations, ~0.0415972X. Always.
Let's check: 100 GB × 0.041597 = 4.1597. Can't compare that to Amazon directly, because their example calculates a 24 GB download out of a 12 TB archive, but their 1 − 0.8192 still checks out. For their entire set it would be 511.1466 GB/hr, or with their rounding (12 × 1024)/24 − 0.8192 = 511.1808 GB/hr peak hourly (nice pipes, kids).
The retrieval fee is then 0.041597X × 720 × tier pricing. The origin of the tier pricing I really don't understand at all, but all the examples seem to use $0.01. So: $29.95 per 100 GB. For 12 TB, say hello to a $3680.26 transfer fee. 3 TB is $920.06.
That is, 720 × (599X/600)/24/100: for pulling out your entire set of X GB, spread evenly across the day, you will be charged (599X/600) × (3/10) dollars -
$0.2995/GB to pull data out in a day.
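A quick sanity check of those numbers (a sketch, assuming the $0.01 tier used in all the examples):

    # Dollars to pull a full X GB archive out, spread evenly over one day.
    def day_spread_fee(x_gb, tier=0.01):
        billable_peak = (x_gb - x_gb / 600) / 24  # (599X/600)/24 GB/hr
        return billable_peak * 720 * tier

    print(day_spread_fee(100))        # 29.95
    print(day_spread_fee(12 * 1024))  # ~3680.26
    print(day_spread_fee(3 * 1024))   # ~920.06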
http://news.ycombinator.com/item?id=3560952
Rotating hard drives on the NAS in my attic is going to get a LOT simpler...
that would certainly be very nice. cperciva, what do you think?
Also, with Tarsnap's average block size (~ 64 kB uncompressed, typically ~ 32 kB compressed) the 50 microdollar cost per Glacier RETRIEVAL request means that I'd need to bump the pricing for tarsnap downloads up to about $1.75 / GB just to cover the AWS costs.
I may find a use for Glacier at some point, but it's not something Tarsnap is going to be using in the near future.
it's unfortunate, because some backups happen to just lie around for very long. it would be nice to take advantage of (the low cost of) glacier for that.
that said, if it's not possible with tarsnap now, it's not possible now. :D. if you find a satisfying possibility to incorporate it in the new backend(s) design (if that's fixable in the backend(s) alone), i'd surely be pleased.
Any other theories on how this works on the backend while still being profitable?
Drawback is "jobs typically complete in 3.5 to 4.5 hours."
Seeing as how people tend to be pack rats, I can see this being huge.
Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years....
They make so many backups so quickly that there is only a 0.00000000001% (I didn't count the zeros) chance of this occurring.
Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives.
[1]https://aws.amazon.com/glacier/faqs/#How_am_I_charged_for_de...
CrashPlan+ Unlimited is USD 2.92/month if you take the 4-year package. If I upload 300GB to Amazon, I pay 0.01 * 300 = USD 3/month - and Amazon would be even more expensive for larger amounts of data.
Is there some fine print I'm missing with Crashplan unlimited?
What's exciting about this is that Amazon doesn't care _how_ much data you send them - presumably they've priced this so it's profitable at any level you wish to use. It's a sustainable model. Services like Tarsnap/Arq will likely adopt this new service (possibly offering tiered backup/archival services?).
I have (close to) zero doubt that Amazon's Glacier archival storage will be available 5 years from now at (probably less than) $0.01/gigabyte/month. They are a (reasonably) safe archival choice. Now that light users (<300 gigabytes) have a financial incentive to move off of CrashPlan onto Amazon, it further exacerbates the challenges that "unlimited" backup providers will face. All their least costly/most profitable customers may leave (or, at the very least, the new ones may choose Amazon first).
With that said - I love Backblaze (been a user since 2008) for working-data backups and rapid online (free) restores, and I will continue to use them, but I wouldn't plan on archiving a terabyte of data with them for the next 20 years.
http://blog.kozubik.com/john_kozubik/2009/11/flat-rate-stora...
They put the provider in an adversarial relationship with the user and give them an incentive to keep you from storing data there. They will make it hard for you to use their service.
Could some good soul tell me how much it would cost to:
Store 150 GB as one big file for 5 years, adding 10 GB (also as one file) every year, and let's say I will need to retrieve the whole archive (original file + additions) at the end of years 2 and 5.
How much will it cost?
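Nobody quoted numbers here, so a back-of-envelope sketch using the rates discussed elsewhere in the thread ($0.01/GB/month storage; ~$0.2995/GB if a full retrieval is spread over a day), ignoring request fees, early-deletion fees, and AWS's separate egress bandwidth charge:

    # 150 GB up front, +10 GB at the end of each year; full restores after years 2 and 5.
    storage = sum(12 * 0.01 * (150 + 10 * y) for y in range(5))  # ~$102 over five years
    restores = 0.2995 * (170 + 200)  # ~170 GB after year 2, ~200 GB after year 5: ~$110.82
    print(storage + restores)        # ~$213 total, before egress bandwidth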
There'd be no better way to ensure that information would eventually be made public.
And while not super-duper expensive, S3 provides much more than I really need, and hence a more limited (but cheaper) service would definitely be appreciated.
If there is anything with the ease of use of s3cmd to accompany this service, I will be switching in a heartbeat.
Also, prior comments mentioned they were using some sort of robotic tape devices, but according to this blog:
http://www.zdnet.com/amazon-launches-glacier-cloud-storage-h...
it's using "commodity hardware components". So that's why I thought maybe they are loss-leading on the storage and making it up on the retrieval prices?
It's definitely an interesting product, and I love how there's a reason they called it Glacier. AMZN is a wild boar going after everyone!
Sure hope transferring 10 or 20 gigabytes of data from S3 to Glacier is easy.
A lot of the consumer-level services refuse any liability for any data loss. Does Amazon do the same for this?
Realistically, you'd want to have at least two diverse cloud backup systems - I doubt you'd be happy with Service Credits if your data went missing.
If not, to any developers reading this: there's money in them thar Glacier.
0.01·S + max(0, 7.20·(R − 0.0017·S)/4)
S is the number of GB stored
R is the biggest retrieval in the month
4 is the average number of hours a retrieval takes
For an example with 10TB storage (replace 10000 to change): http://fooplot.com/plot/4pu7u2gpox
x is biggest retrieval in GB, y is $/month
I might create a script that uploads everything to Glacier and just keeps a couple of the latest backups on S3 though.
1. http://www.allthingsdistributed.com/2012/08/amazon-glacier.h...
Or is the only way to encrypt it yourself, and then transfer it?
That's pretty impressive. I wonder how many bytes they lost to lose that .0000000001% of data.
I assume they will support Glacier soonish.