One of the things I remember from my time at AWS was conversations about how 1-in-a-billion events end up being a daily occurrence when you're operating at S3 scale. Things that you'd normally write off as so wildly improbable they're not worth worrying about have to be considered, and handled.
Glad to read about ShardStore, and especially the formal verification, property based testing etc. The previous generation of services were notoriously buggy, a very good example of the usual perils of organic growth (but at least really well designed such that they'd fail "safe", ensuring no data loss, something S3 engineers obsessed about).
Yeah! With S3 averaging over 100M requests per second, 1 in a billion happens every ten seconds. And it's not just S3. For example, for Prime Day 2022, DynamoDB peaked at over 105M requests per second (just for the Amazon workload): https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...
In the post, Andy also talks about Lightweight Formal Methods and the team's adoption of Rust. When even extremely low probability events are common, we need to invest in multiple layers of tooling and process around correctness.
One in a billion would be if keys were ~30 bits. Luckily it isn't.
And that's if you're going completely random and not taking care to try to reduce collisions.
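The arithmetic in the comments above checks out, and it's quick to sketch (the numbers come straight from the thread):

```python
# Back-of-envelope check on the claims above.

# At 100M requests/second, a one-in-a-billion event occurs, on average,
# once per billion requests:
seconds_between = 1_000_000_000 / 100_000_000
print(seconds_between)   # 10.0 -- "every ten seconds"

# And the key-space point: ~30 bits is where one-in-a-billion lives,
# since 2^30 is just over a billion.
print(2**30)             # 1073741824
```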
And the benefits of sharing disk IOPS with untold numbers of other customers are hard to overstate. I hadn't heard the term "heat" as it's used in the article, but it's incredibly hard to mitigate on a single system. For our co-located hardware clusters, we had to customize the batch systems to treat IO as an allocatable resource, the same as RAM or CPU, in order to manage it correctly across large jobs. S3 and GCP are super expensive, but the performance can be worth it.
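The "IO as an allocatable resource" idea can be sketched as a toy bin-packing scheduler. All names and capacities here are invented for illustration; the point is just that a job only schedules if its IOPS demand fits, exactly like CPU and RAM:

```python
# Toy sketch: treat IOPS as a schedulable resource alongside CPU and RAM.
# Numbers and job names are made up.
NODE_CAPACITY = {"cpu": 64, "ram_gb": 256, "iops": 20_000}

def fits(needs, free):
    """A job only schedules if *every* resource fits, IOPS included."""
    return all(needs[r] <= free[r] for r in free)

def schedule(jobs):
    free = dict(NODE_CAPACITY)
    placed = []
    for job in jobs:
        if fits(job["needs"], free):
            placed.append(job["name"])
            for r in free:
                free[r] -= job["needs"][r]
    return placed

jobs = [
    {"name": "etl",   "needs": {"cpu": 8,  "ram_gb": 32,  "iops": 15_000}},
    {"name": "scan",  "needs": {"cpu": 4,  "ram_gb": 16,  "iops": 8_000}},
    {"name": "train", "needs": {"cpu": 32, "ram_gb": 128, "iops": 2_000}},
]
print(schedule(jobs))  # ['etl', 'train'] -- 'scan' would oversubscribe disk
```

Without the `iops` dimension, `scan` would have been admitted on CPU/RAM alone and then stolen disk bandwidth from everything else on the node, which is exactly the "heat" problem.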
This sort of article is some of the best of HN, IMHO.
If you're interested, go search for some of the published work from "Coho Data"; they had some great USENIX presentations, IIRC. That was the previous company Andy Warfield was at, and they had an emphasis on effective tracking and prediction of IO workloads across very large datasets.
The files that are not streamed and need random access are often better kept on local ephemeral SSDs, or in RAM after a fetch of the, say, 50GB hash table, or whatever it is.
At least, that's my experience: streams and in-RAM pre-processed DBs are >99% of file IO.
3.5 9s is incredible on large stores. S3 and GCS are just amazing machines. I have nothing but admiration for the people that make this happen.
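Taking "3.5 9s" in its common reading of 99.95% availability (an assumption on my part; the exact figure isn't stated here), the allowed downtime works out to only a few hours a year:

```python
# What "3.5 nines" (read here as 99.95% availability) allows per year.
availability = 0.9995
hours_per_year = 24 * 365
downtime_hours = (1 - availability) * hours_per_year
print(round(downtime_hours, 2))  # 4.38 hours of unavailability per year
```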
Doing that right now is monumentally difficult. I built an entire CLI app just for solving the "issue AWS credentials that can only access this specific bucket" problem, but I really don't want to have to talk my users through installing and running something like that: https://s3-credentials.readthedocs.io/en/stable/
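The core of the "credentials that can only access this specific bucket" problem is building a scoped-down policy. Here's a minimal sketch: the helper name and bucket name are mine, but the policy shape is standard IAM JSON, and a document like this can be passed as a session policy (e.g. the `Policy` argument to STS `AssumeRole` via boto3) to mint temporary credentials limited to one bucket:

```python
import json

def bucket_scoped_policy(bucket: str) -> str:
    """Build an IAM policy JSON string that only grants access to one bucket."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",    # the bucket itself (for ListBucket)
                f"arn:aws:s3:::{bucket}/*",  # every object in it
            ],
        }],
    })

policy = bucket_scoped_policy("my-example-bucket")
```

The friction the comment describes is everything around this: deciding between session policies, dedicated IAM users, or federation tokens, and then walking a non-expert user through any of it.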
It really is such a shame that all the projects that tried/are trying to create data sovereignty for users became weird crypto.
https://docs.aws.amazon.com/cognito/latest/developerguide/co...
edit: I think I misread your comment. I understood it as your app wanting to delegate access to a user's data to the client, but it seems like you want the user to delegate access to their own data to your app? Different use-cases.
> Storage Capacity: 3.75 MB
> Cost: ~$9,200/terabyte
Those specs can't possibly be correct. If you multiply the cost by the storage, the cost of the drive works out to 3¢.
This site[1] states,
> It stored about 2,000 bits of data per square inch and had a purchase price of about $10,000 per megabyte
So perhaps the specs should read $9,200 / megabyte? (Which would put the drive's cost at $34,500, which seems more plausible.)
[1]: https://www.historyofinformation.com/detail.php?entryid=952
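The arithmetic in the two readings above, for the record:

```python
# Checking the drive-cost arithmetic from the comments above.
capacity_mb = 3.75

# At the listed $9,200/terabyte, the drive would have cost ~3 cents:
cost_cents = capacity_mb / 1e6 * 9200 * 100
print(round(cost_cents, 2))   # 3.45 (cents) -- clearly wrong for the era

# Read as $9,200/megabyte instead, the price is plausible:
print(capacity_mb * 9200)     # 34500.0 (dollars)
```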
Since you license a fixed amount, there were projects at the company looking at running batch/non-time-sensitive jobs on the mainframe, since it was effectively free off-peak (I guess power cost was trivial compared to licensing).
In distributed systems authorization is incredibly difficult. At the scale of AWS it might as well be magic. AWS has a rich permissions model with changes to authorization bubbling through the infrastructure at sub-millisecond speed - while handling probably trillions of requests.
This and logging/accounting for billing are the two magic pieces of AWS that I'd love to see an article about.
Note that S3 does AA differently than other services, because the permissions are on the resource. I suspect that's for speed?
It's likely persisted largely because removing the old model would be a difficult task without potentially breaking a lot of customers' setups.
I heard that AA is done via ASICs, but resource-level permissions imply that authorization is done locally for S3. To me that implies the system extracts S3 permissions from IAM and sends them downstream to S3, where they get merged with the permissions S3 manages itself.
I guess that occurs when permissions are saved in the IAM world. At some point those need to be joined against a principal somewhere, as roles can exist without assignment.
Again, it'd be so interesting to see how this is done IRL.
"I learned that to really be successful in my own role, I needed to focus on articulating the problems and not the solutions, and to find ways to support strong engineering teams in really owning those solutions."
I love this. Reminds me of the Ikea effect to an extent. Based on this, to get someone to be enthusiastic about what they do, you have to encourage ownership. And a great way is to have it be 'their idea'.
Fortunately not every problem is like this. But if you look at, say, discussions around Python's "packaging problem" (and find people in fact describing like 6 different problems in very different ways), you can see this play out pretty nastily.
This is sort of like:
* writing an exam question so the person taking the exam is likely to get the answer you want
* guiding someone in a code interview that isn't going so well, without giving away the answer
* being in the back seat while pair programming, except you're not allowed to take a turn at the keyboard
One advantage of focusing on describing the problem is that it naturally lets you have an impact on what you believe to be the important parts of the solution.
If Andy Warfield is reading, and I bet he is, I have a question. When developing a problem, how valuable is it to sketch possible solutions? Articulating the problem well probably brings a few possible solutions to mind. Is it worth sharing those possible solutions to help kickstart the gears for potential owners? Or is it better to focus only on the problem and leave the solution space fully green?
Additionally, anyone have further reading for this type of “very senior IC” operation?
So I started doing an experiment where I'd write that same doc, including the ideas I had on the shape of the work we should do, but then I'd delete my solution before sharing it. To your question: I'd still totally write my solution ideas down. Partially because I can't help myself, and honestly it was a helpful way to think things through. But when I deleted it and shared a doc with just a problem statement, I'd get feedback on the problem statement. It's pretty obvious, but it was also a pretty surprising result: all of a sudden I was in conversations where we were all on the same side of the table. Feedback was either refining the problem (which was awesome) or proposing solutions. And when the person reading your problem statement starts trying to solve it, it's really cool... because they totally start getting invested and the conversations are great.
Like everything, none of this is actually either/or. There are points in between, like including a sketch of the shape of a solution, or properties that a solution would have to have. But the overall thing of separating the problem and the end state of where you want to get to, from the solution and the plan on how to get there is a pretty effective tool from a sharing ownership perspective.
I interpret it as if they are saying "You plebe! I don't have time for your issues. I can't get promoted from your work if you only bring problems."
Being able to solve the problem is being able to understand the problem and admit it exists first. <smacksMyDamnHead>
However, if it’s used to legitimately say “don’t just complain, fix”, then I think it’s a positive. An organization where everyone is constantly negative and complaining about every little issue, but not working to implement improvements/fixes, is essentially a failed company. Successful companies are full of people who actively fix the high impact problems, while also being realists, who can accept that the low impact problems aren’t worth the effort to fix, and aren’t worth endlessly complaining about.
Because absent pre-established perceived authority or expertise (which is the context most day-to-day problems surface within), holding forth and hogging the entire two-way discussion channel with your long, detailed, carefully articulated description of the problem is going to make you sound like someone who wants to do all the talking and none of the work, or the kind of person who doesn't want to share in finding a solution together with others.
Some people disagree though. It’s still an unknown.
The original Glacier was very clearly tape, but given the instant retrieval capabilities the newer S3-Glacier tiers are most likely just low-margin HDDs, maybe with some dynamic powering on and off of drives/servers.
One of the tag line ideas we had was "8 out of 10 customers say they prefer the feel of their data after it is restored"
I remember installing 20+ fully configured IBM 3494 tape libraries for AT&T in the mid-2000s. These things were 20+ frames long with dual accessors (robots) in each. The robots were able to push a dead accessor out of the way into a "garage" and continue working in the event one of them died (and this actually worked). Someone will have to invent a cheaper medium of storage than tape before tape will ever die.
Another side effect was that the error rate went from steady ~1% to days without any errors. Consequently we updated the alerts to be much stricter. This was around 2009 or so.
Also came from an academic background, UM, but instead of getting my PhD I joined S3. It even rhymes :).
Examples:
iDrive has E2, Digital Ocean has Object Storage, Cloudflare has R2, Vultr has Object Storage, Backblaze has B2
Edit: I looked it up and apparently no, Azure does not have one :-/
The author has made a lot of great points, but one that stuck with me was:
> I consciously spend a lot more time trying to develop problems, and to do a really good job of articulating them, rather than trying to pitch solutions.
I haven’t thought of it in this way, but this is an excellent way of motivating someone to “own” a problem.
I’d go even further: at this scale, it’s essential in order to develop these kinds of projects with any sort of velocity.
Large organizations ship their communication structure by design. The alternative is engineering anarchy.
> Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure
They know they'll almost inevitably ship their org chart. And they'll encounter tons of process-based friction if they don't.
The solution: Change your org chart to match what you want to ship
An even more cynical take is that it makes it difficult to compare performance with past performance.
Does anyone have the list of papers?
> we managed to kind of “industrialize” verification, taking really cool, but kind of research-y techniques for program correctness, and get them into code where normal engineers who don’t have PhDs in formal verification can contribute to maintaining the specification, and that we could continue to apply our tools with every single commit to the software
Is any of this open source?
There is apache ozone https://ozone.apache.org/
“Ownership carries a lot of responsibility, but it also carries a lot of trust – because to let an individual or a team own a service, you have to give them the leeway to make their own decisions about how they are going to deliver it.”
This is a lesson a lot of software people haven't yet learned. Bad UI, bad operational experiences, insufficient logging to resolve issues, un-fixable code because it's too complicated, and so on. But they use git.
The other term of art for this concept is "system engineering", in the aerospace sense. There are a lot of good texts and courses.
One example: Wesson: System Analysis Design and Development, Wiley, 2005. ISBN-10 0-471-39333-9
In large systems (albeit smaller than S3) the way this works is that you slurp some performance metrics out of the storage system to identify your hot spots, and then feed that into a service that actively moves stuff around (below the namespace of the filesystem, though, so it'll be fs-dependent). You have some higher-performance disk pools at your disposal, and obviously that would be NVMe storage today.
So in practice, it's likely proprietary vendor code chewing through performance data out of a proprietary storage controller and telling a worker job on a mounted filesystem client to move the hot data to the high-performance disk pool, constantly rebalancing and moving data back out of the fast pool once it cools off. For S3 this is obviously happening at the object level, using their own in-house code.
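The rebalancing loop described above can be sketched in a few lines. Everything here is invented for illustration (the decay factor, the thresholds, the object names); the idea is just: track per-object "heat" with exponential decay, promote hot objects to the fast pool, and demote them once they cool off:

```python
# Toy heat-based tiering: promote hot objects, demote them as they cool.
DECAY = 0.5          # heat halves on each rebalance pass
PROMOTE_AT = 10.0    # move to the fast pool above this heat
DEMOTE_AT = 2.0      # move back to the slow pool below this heat

heat: dict[str, float] = {}
fast_tier: set[str] = set()

def record_access(obj: str) -> None:
    heat[obj] = heat.get(obj, 0.0) + 1.0

def rebalance() -> None:
    for obj in list(heat):
        heat[obj] *= DECAY
        if obj not in fast_tier and heat[obj] > PROMOTE_AT:
            fast_tier.add(obj)       # "move" hot data to the fast pool
        elif obj in fast_tier and heat[obj] < DEMOTE_AT:
            fast_tier.discard(obj)   # cooled off: move it back out

# A burst of reads makes one object hot...
for _ in range(30):
    record_access("hot.obj")
record_access("cold.obj")
rebalance()
print(sorted(fast_tier))   # ['hot.obj']

# ...and with no further traffic, decay demotes it again.
for _ in range(4):
    rebalance()
print(sorted(fast_tier))   # []
```

A real system adds the hard parts this sketch skips: sampling heat cheaply at scale, moving data without interrupting reads, and not thrashing objects back and forth across the threshold.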
wow