One of the things I remember from my time at AWS was conversations about how 1-in-a-billion events end up being a daily occurrence when you're operating at S3 scale. Things that you'd normally write off as so wildly improbable they're not worth worrying about have to be considered, and handled.
Glad to read about ShardStore, and especially the formal verification, property based testing etc. The previous generation of services were notoriously buggy, a very good example of the usual perils of organic growth (but at least really well designed such that they'd fail "safe", ensuring no data loss, something S3 engineers obsessed about).
Yeah! With S3 averaging over 100M requests per second, 1 in a billion happens every ten seconds. And it's not just S3. For example, for Prime Day 2022, DynamoDB peaked at over 105M requests per second (just for the Amazon workload): https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...
In the post, Andy also talks about Lightweight Formal Methods and the team's adoption of Rust. When even extremely low probability events are common, we need to invest in multiple layers of tooling and process around correctness.
One in a billion would be if keys were ~30 bits. Luckily it isn't.
And that's if you're going completely random and not taking care to try to reduce collisions.
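The arithmetic in the comments above checks out, and it's quick to sketch (the numbers come straight from the thread):

```python
# Back-of-envelope check on the claims above.

# At 100M requests/second, a one-in-a-billion event occurs, on average,
# once per billion requests:
seconds_between = 1_000_000_000 / 100_000_000
print(seconds_between)   # 10.0 -- "every ten seconds"

# And the key-space point: ~30 bits is where one-in-a-billion lives,
# since 2^30 is just over a billion.
print(2**30)             # 1073741824
```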
And the benefits of sharing disk IOPS with untold numbers of other customers are hard to overstate. I hadn't heard the term "heat" as it's used in the article, but it's incredibly hard to mitigate on a single system. For our co-located hardware clusters, we had to customize the batch systems to treat IO as an allocatable resource, the same as RAM or CPU, in order to manage it correctly across large jobs. S3 and GCP are super expensive, but the performance can be worth it.
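The "IO as an allocatable resource" idea can be sketched as a toy bin-packing scheduler. All names and capacities here are invented for illustration; the point is just that a job only schedules if its IOPS demand fits, exactly like CPU and RAM:

```python
# Toy sketch: treat IOPS as a schedulable resource alongside CPU and RAM.
# Numbers and job names are made up.
NODE_CAPACITY = {"cpu": 64, "ram_gb": 256, "iops": 20_000}

def fits(needs, free):
    """A job only schedules if *every* resource fits, IOPS included."""
    return all(needs[r] <= free[r] for r in free)

def schedule(jobs):
    free = dict(NODE_CAPACITY)
    placed = []
    for job in jobs:
        if fits(job["needs"], free):
            placed.append(job["name"])
            for r in free:
                free[r] -= job["needs"][r]
    return placed

jobs = [
    {"name": "etl",   "needs": {"cpu": 8,  "ram_gb": 32,  "iops": 15_000}},
    {"name": "scan",  "needs": {"cpu": 4,  "ram_gb": 16,  "iops": 8_000}},
    {"name": "train", "needs": {"cpu": 32, "ram_gb": 128, "iops": 2_000}},
]
print(schedule(jobs))  # ['etl', 'train'] -- 'scan' would oversubscribe disk
```

Without the `iops` dimension, `scan` would have been admitted on CPU/RAM alone and then stolen disk bandwidth from everything else on the node, which is exactly the "heat" problem.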
This sort of article is some of the best of HN, IMHO.
If you're interested, go search for some of the published work from "Coho Data"; they had some great USENIX presentations, IIRC. That was the previous company Andy Warfield was at, and they had an emphasis on effective tracking and prediction of IO workloads across very large datasets.
The files that are not streamed and need random access are often better kept on local ephemeral SSDs, or in RAM after a fetch of the, say, 50GB hash table, or whatever it is.
At least, that's my experience: streams and in-RAM pre-processed DBs are >99% of file IO.
3.5 9s is incredible on large stores. S3 and GCS are just amazing machines. I have nothing but admiration for the people that make this happen.
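Taking "3.5 9s" in its common reading of 99.95% availability (an assumption on my part; the exact figure isn't stated here), the allowed downtime works out to only a few hours a year:

```python
# What "3.5 nines" (read here as 99.95% availability) allows per year.
availability = 0.9995
hours_per_year = 24 * 365
downtime_hours = (1 - availability) * hours_per_year
print(round(downtime_hours, 2))  # 4.38 hours of unavailability per year
```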
Doing that right now is monumentally difficult. I built an entire CLI app just for solving the "issue AWS credentials that can only access this specific bucket" problem, but I really don't want to have to talk my users through installing and running something like that: https://s3-credentials.readthedocs.io/en/stable/
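The core of the "credentials that can only access this specific bucket" problem is building a scoped-down policy. Here's a minimal sketch: the helper name and bucket name are mine, but the policy shape is standard IAM JSON, and a document like this can be passed as a session policy (e.g. the `Policy` argument to STS `AssumeRole` via boto3) to mint temporary credentials limited to one bucket:

```python
import json

def bucket_scoped_policy(bucket: str) -> str:
    """Build an IAM policy JSON string that only grants access to one bucket."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",    # the bucket itself (for ListBucket)
                f"arn:aws:s3:::{bucket}/*",  # every object in it
            ],
        }],
    })

policy = bucket_scoped_policy("my-example-bucket")
```

The friction the comment describes is everything around this: deciding between session policies, dedicated IAM users, or federation tokens, and then walking a non-expert user through any of it.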
It really is such a shame that all the projects that tried/are trying to create data sovereignty for users became weird crypto.
https://docs.aws.amazon.com/cognito/latest/developerguide/co...
edit: I think I misread your comment. I understood it as your app wanting to delegate access to a user's data to the client, but it seems like you want the user to delegate access to their own data to your app? Different use-cases.
> Storage Capacity: 3.75 MB
> Cost: ~$9,200/terabyte
Those specs can't possibly be correct. If you multiply the cost by the storage, the cost of the drive works out to 3¢.
This site[1] states,
> It stored about 2,000 bits of data per square inch and had a purchase price of about $10,000 per megabyte
So perhaps the specs should read $9,200 / megabyte? (Which would put the drive's cost at $34,500, which seems more plausible.)
[1]: https://www.historyofinformation.com/detail.php?entryid=952
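The arithmetic in the two readings above, for the record:

```python
# Checking the drive-cost arithmetic from the comments above.
capacity_mb = 3.75

# At the listed $9,200/terabyte, the drive would have cost ~3 cents:
cost_cents = capacity_mb / 1e6 * 9200 * 100
print(round(cost_cents, 2))   # 3.45 (cents) -- clearly wrong for the era

# Read as $9,200/megabyte instead, the price is plausible:
print(capacity_mb * 9200)     # 34500.0 (dollars)
```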
Since you license a fixed amount, there were projects at the company looking at running batch/non-time-sensitive jobs on the mainframe, since it was effectively free off-peak (I guess power cost was trivial compared to licensing).
In distributed systems authorization is incredibly difficult. At the scale of AWS it might as well be magic. AWS has a rich permissions model with changes to authorization bubbling through the infrastructure at sub-millisecond speed - while handling probably trillions of requests.
This and logging/accounting for billing are the two magic pieces of AWS that I'd love to see an article about.
Note that S3 does AA differently than other services, because the permissions are on the resource. I suspect that's for speed?
It's likely persisted largely because removing the old model would be a difficult task without potentially breaking a lot of customers' setups.
I heard that AA is done via ASICs, but resource-level permissions imply that authorization is done locally for S3. To me that implies the system extracts S3 permissions from IAM and sends them downstream to S3, where they get merged with the permissions S3 manages itself.
I guess that occurs when permissions are saved in the IAM world. At some point those need to be joined against a principal somewhere, as roles can exist without assignment.
Again, it'd be so interesting to see how this is done IRL.
"I learned that to really be successful in my own role, I needed to focus on articulating the problems and not the solutions, and to find ways to support strong engineering teams in really owning those solutions."
I love this. Reminds me of the Ikea effect to an extent. Based on this, to get someone to be enthusiastic about what they do, you have to encourage ownership. And a great way is to have it be 'their idea'.
Fortunately not every problem is like this. But if you look at, say, discussions around Python's "packaging problem" (and find people in fact describing like 6 different problems in very different ways), you can see this play out pretty nastily.
This is sort of like:
* writing an exam question so the person taking the exam is likely to get the answer you want
* guiding someone in a code interview that isn't going so well, without giving away the answer
* being in the back seat while pair programming, except you're not allowed to take a turn at the keyboard
One advantage of focusing on describing the problem is that it naturally lets you have an impact on what you believe to be the important parts of the solution.
If Andy Warfield is reading, and I bet he is, I have a question. When developing a problem, how valuable is it to sketch possible solutions? Articulating the problem well probably brings a few possible solutions to mind. Is it worth sharing those possible solutions to help kickstart the gears for potential owners? Or is it better to focus only on the problem and leave the solution space fully green?
Additionally, anyone have further reading for this type of “very senior IC” operation?
So I started doing an experiment where I'd write that same doc, including the ideas I had on the shape of the work we should do, but then I'd delete my solution before sharing it. To your question: I'd still totally write my solution ideas down. Partially because I can't help myself, and honestly it was a helpful way to think things through. But when I deleted it and shared a doc with just a problem statement, I'd get feedback on the problem statement. It's pretty obvious, but it was also a pretty surprising result: all of a sudden I was in conversations where we were all on the same side of the table. Feedback was either refining the problem (which was awesome) or proposing solutions. And when the person reading your problem statement starts trying to solve it, it's really cool... because they totally start getting invested and the conversations are great.
Like everything, none of this is actually either/or. There are points in between, like including a sketch of the shape of a solution, or properties that a solution would have to have. But the overall thing of separating the problem and the end state of where you want to get to, from the solution and the plan on how to get there is a pretty effective tool from a sharing ownership perspective.
I interpret it as if they are saying "You plebe! I don't have time for your issues. I can't get promoted from your work if you only bring problems."
Being able to solve the problem is being able to understand the problem and admit it exists first. <smacksMyDamnHead>
However, if it’s used to legitimately say “don’t just complain, fix”, then I think it’s a positive. An organization where everyone is constantly negative and complaining about every little issue, but not working to implement improvements/fixes, is essentially a failed company. Successful companies are full of people who actively fix the high impact problems, while also being realists, who can accept that the low impact problems aren’t worth the effort to fix, and aren’t worth endlessly complaining about.
Because absent pre-established perceived authority or expertise (which is the context most day-to-day problems surface within), holding forth and hogging the entire two-way discussion channel with your long, detailed, carefully articulated description of the problem is going to make you sound like someone who wants to do all the talking and none of the work, or the kind of person who doesn't want to share in finding a solution together with others.
Some people disagree though. It’s still an unknown.
The original Glacier was very clearly tape, but given the instant retrieval capabilities the newer S3-Glacier tiers are most likely just low-margin HDDs, maybe with some dynamic powering on and off of drives/servers.
One of the tag line ideas we had was "8 out of 10 customers say they prefer the feel of their data after it is restored"
I remember installing 20+ fully configured IBM 3494 tape libraries for AT&T in the mid-2000s. These things were 20+ frames long with dual accessors (robots) in each. The robots were able to push a dead accessor out of the way into a "garage" and continue working in the event one of them died (and this actually worked). Someone will have to invent a cheaper medium of storage than tape before tape will ever die.
Another side effect was that the error rate went from steady ~1% to days without any errors. Consequently we updated the alerts to be much stricter. This was around 2009 or so.
Also came from an academic background, UM, but instead of getting my PhD I joined S3. It even rhymes :).
Examples:
iDrive has E2, Digital Ocean has Object Storage, Cloudflare has R2, Vultr has Object Storage, Backblaze has B2
Edit: I looked it up and apparently no, Azure does not have one :-/
The author has made a lot of great points, but one that stuck with me was:
> I consciously spend a lot more time trying to develop problems, and to do a really good job of articulating them, rather than trying to pitch solutions.
I haven’t thought of it in this way, but this is an excellent way of motivating someone to “own” a problem.
I’d go even further: at this scale, it’s essential in order to develop these kinds of projects with any sort of velocity.
Large organizations ship their communication structure by design. The alternative is engineering anarchy.
> Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure
They know they'll almost inevitably ship their org chart. And they'll encounter tons of process-based friction if they don't.
The solution: Change your org chart to match what you want to ship
An even more cynical take is that it makes it difficult to compare performance with past performance.
Does anyone have the list of papers?
> we managed to kind of “industrialize” verification, taking really cool, but kind of research-y techniques for program correctness, and get them into code where normal engineers who don’t have PhDs in formal verification can contribute to maintaining the specification, and that we could continue to apply our tools with every single commit to the software
Is any of this open source?
There is apache ozone https://ozone.apache.org/
“Ownership carries a lot of responsibility, but it also carries a lot of trust – because to let an individual or a team own a service, you have to give them the leeway to make their own decisions about how they are going to deliver it.”
This is a lesson a lot of software people haven't yet learned. Bad UI, bad operational experiences, insufficient logging to resolve issues, un-fixable code because it's too complicated, and so on. But they use git.
The other term of art for this concept is "system engineering", in the aerospace sense. There are a lot of good texts and courses.
One example: Wesson: System Analysis Design and Development, Wiley, 2005. ISBN-10 0-471-39333-9
In large systems (albeit smaller than S3) the way this works is that you slurp some performance metrics out of the storage system to identify your hot spots, and then feed that into a service that actively moves stuff around (below the namespace of the filesystem, though, so it'll be fs-dependent). You have some higher-performance disk pools at your disposal, and obviously that would be NVMe storage today.
So in practice, it's likely proprietary vendor code chewing through performance data out of a proprietary storage controller and telling a worker job on a mounted filesystem client to move the hot data to the high-performance disk pool, constantly rebalancing and moving data back out of the fast pool once it cools off. For S3 this is obviously happening at the object level, using their own in-house code.
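The rebalancing loop described above can be sketched in a few lines. Everything here is invented for illustration (the decay factor, the thresholds, the object names); the idea is just: track per-object "heat" with exponential decay, promote hot objects to the fast pool, and demote them once they cool off:

```python
# Toy heat-based tiering: promote hot objects, demote them as they cool.
DECAY = 0.5          # heat halves on each rebalance pass
PROMOTE_AT = 10.0    # move to the fast pool above this heat
DEMOTE_AT = 2.0      # move back to the slow pool below this heat

heat: dict[str, float] = {}
fast_tier: set[str] = set()

def record_access(obj: str) -> None:
    heat[obj] = heat.get(obj, 0.0) + 1.0

def rebalance() -> None:
    for obj in list(heat):
        heat[obj] *= DECAY
        if obj not in fast_tier and heat[obj] > PROMOTE_AT:
            fast_tier.add(obj)       # "move" hot data to the fast pool
        elif obj in fast_tier and heat[obj] < DEMOTE_AT:
            fast_tier.discard(obj)   # cooled off: move it back out

# A burst of reads makes one object hot...
for _ in range(30):
    record_access("hot.obj")
record_access("cold.obj")
rebalance()
print(sorted(fast_tier))   # ['hot.obj']

# ...and with no further traffic, decay demotes it again.
for _ in range(4):
    rebalance()
print(sorted(fast_tier))   # []
```

A real system adds the hard parts this sketch skips: sampling heat cheaply at scale, moving data without interrupting reads, and not thrashing objects back and forth across the threshold.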
wow