A bit light on technical details but very fun, very exciting. Kind of sad that such amazing work is no longer quite so public, no longer something that, say, Intel will talk up in endless detail at a product launch. A huge amount of the work & innovation here is extremely specific, extremely private: all this Elastic Fabric Adapter-related stuff is advanced systems engineering, close integration of systems, that's Amazon's & Amazon's alone.
Anyhow. This article pairs very well with "Scaling Kafka at Honeycomb"[1], which I found to be a delightful read on adapting & evolving a huge workload to ever-improving AWS hardware.
[1] https://www.honeycomb.io/blog/scaling-kafka-observability-pi... https://news.ycombinator.com/item?id=29396319 (38 minutes ago, 13 points)
I can't see a reason to buy or use this product.
(hi Jeff! Hope you're well :D)
* What does the Nitro accelerator look like to the host? Does it present as NVMe devices to the host OS, or as something more custom? Does it use SR-IOV or something else to present as many per-drive PCIe devices, a single PCIe device, no PCIe devices at all, or something else entirely (and if so, what)? Are there custom virtio drivers powering the VMs? How much change has gone into these interfaces in the newest iterations, or have these interface channels remained stable?
* What is the over-the-wire communication? Related to the above: ultimately the VMs see NVMe, & how far down the stack/across the network does that go? Is what's on the wire NVMe-based, or something else; is it custom? What trade-offs were there, and what protocols inspired the teams? At launch it seemed like there was a custom remote protocol[1]; has that stayed? What drove the protocol's evolution over time? What's new & changed?
* What do the storage arrays look like; are they also PC-based? Or do the flash arrays connect via accelerators too? Are these FPGA-based or hard silicon? Are standard flash controllers in use, or is this custom? How many channels of flash does one accelerator have connected to it? How much has the storage-array architecture changed since Nitro was first introduced? Do the latest-gen Nitro & older EBS storage share the same implementation, or is newer EBS storage evolving more freely now?
* On a PC, an SSD is really an abstraction hiding dozens of flash channels. There have been efforts like Open-Channel SSDs and now Zoned Namespaces to give hosts more direct access to the individual channels. Does the Nitro accelerator connect to a single "endpoint" per EBS volume, or is the accelerator fanning out, connecting to multiple endpoints or multiple channels & doing some interleaving itself?
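For intuition on the interleaving question: the simplest scheme a controller could use is round-robin striping of logical blocks across channels. A toy sketch (purely illustrative — the real Nitro/EBS layout is not public):

```python
# Toy model of flash-channel striping -- illustrative only; nothing here
# reflects actual Nitro/EBS internals. Sequential logical blocks are spread
# round-robin across independent channels so they can be served in parallel.

def stripe(lba: int, num_channels: int) -> tuple[int, int]:
    """Map a logical block address to (channel, block-within-channel)."""
    return lba % num_channels, lba // num_channels

# Four consecutive LBAs land on four different channels:
print([stripe(lba, num_channels=8) for lba in range(4)])
# -> [(0, 0), (1, 0), (2, 0), (3, 0)]
```

The win is that sequential I/O keeps every channel busy at once — which is exactly why it matters whether that fan-out happens in the drive, in the accelerator, or across the network.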
* What are some of the flash-translation optimizations & wins that the team/teams have found?
And simply: * How on earth can hosts have so much networking/Nitro throughput available to them?! It feels like there have got to be multiple 400Gbit connections going to hosts today. And all connected via Nitro accelerators?
It's just incredibly exciting stuff; there's so much super interesting work going on, & I am so full of questions! I was a huge fan of the SeaMicro accelerators of yore, an early integrated network-attached-device accelerator. Getting to work at such scale, to build such high-performance, well-integrated systems, seems like it has so many fascinating subproblems to it.
You speak my mind: https://news.ycombinator.com/item?id=19162376 (from 3yrs ago)
For example, why not have a file/object/whatever storage service; and a price matrix that lets you select key metrics like latency, throughput, and variability of either?
I don’t particularly care if my ultra fast ultra low latency is derived from SSDs, spinning rust, RAM, l2 cache, or acoustic ripples. But I’m not super in tune with cloud services to begin with.
This custom SSD hardware family is what powers the EBS (cloud SAN) service, which lets you pay for the performance level you need; the new hardware gives EBS both higher absolute performance and [now] better worst-case latency.
This announcement is saying that you can now get your own instances with the same performance characteristics for situations where you need better performance than a SAN can deliver and/or the robustness benefits of using per-node storage rather than a separate networked service.
The other part of this announcement is the implicit message it sends about the competition: they're telling everyone that their storage performance is more consistent than their competitors and increasing the number of areas where they can say they have an option which a competitor does not. Noting that this was driven by the EBS storage team is also a reminder that they have more people working on lower-level infrastructure problems than you likely do.
They also have things like Lightsail if you don't care about the details and just want the packaged solution.
Comparing to life before the cloud: you would have to choose the vendor, the size of your drives, and how many drives, then go through the procurement process, months in advance.
Tl;dr: They now have custom SSD firmware that avoids latency spikes.
Especially in the "Up to X per second" networking instances, which is basically all of them except the huge ones.
The activation of throttles is NOT well exposed in metrics, nor is the remaining burst allowance, nor whether bursting is currently occurring.
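AWS doesn't publish the mechanism behind the "Up to X" allowances, but the observable behavior is roughly that of a token bucket: you can burst at the ceiling while accumulated credit lasts, then get clamped to a much lower baseline. A toy sketch, with all numbers invented:

```python
# Toy token-bucket model of "Up to X" bandwidth -- illustrative only;
# AWS does not document the actual algorithm, and all numbers are invented.

class TokenBucket:
    def __init__(self, baseline_gbps: float, burst_gbps: float, bucket_gb: float):
        self.rate = baseline_gbps    # credit refill rate == sustainable baseline
        self.burst = burst_gbps      # the advertised "Up to X" ceiling
        self.capacity = bucket_gb    # maximum credit you can accumulate
        self.tokens = bucket_gb      # start with a full bucket

    def send(self, demand_gb: float, seconds: float = 1.0) -> float:
        """How many gigabits actually get through in this interval."""
        self.tokens = min(self.capacity, self.tokens + self.rate * seconds)
        allowed = min(demand_gb, self.burst * seconds, self.tokens)
        self.tokens -= allowed
        return allowed

# Sustained demand at the ceiling: full speed at first, then throttled
# down to the baseline once the credit bucket drains.
tb = TokenBucket(baseline_gbps=5.0, burst_gbps=25.0, bucket_gb=50.0)
print([tb.send(25.0) for _ in range(5)])  # -> [25.0, 25.0, 10.0, 5.0, 5.0]
```

And because neither the bucket level nor the throttle state is surfaced as a metric, you typically only discover the clamp when your throughput suddenly drops.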
It is all somewhat shady IMO, with AWS trying to hide problems with their platform, or hide that you're getting charged in lots of sneaky ways.
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-opti...
Assuming this means something similar to QLDB, did they put a centralized blockchain in the firmware? Pretty cool.
I think this is probably more along the lines of the Postgres WAL (write-ahead log) and _journaling_ file systems.
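The log-then-apply pattern that WAL analogy refers to can be sketched in a few lines (a toy, not anything from the actual firmware): every mutation is appended to a durable log before being applied, so state lost in a crash can be rebuilt by replay.

```python
# Minimal write-ahead-log sketch -- illustrative only, not Nitro firmware.
# Mutations are appended to a durable log *before* being applied, so a
# crash mid-update can be repaired by replaying the log.

class TinyWAL:
    def __init__(self):
        self.log = []    # stand-in for an append-only durable log
        self.store = {}  # the "real" data structure

    def put(self, key, value):
        self.log.append(("put", key, value))  # 1. log first (made durable)
        self.store[key] = value               # 2. then apply

    def recover(self):
        """Rebuild state purely from the log after a crash."""
        rebuilt = {}
        for op, key, value in self.log:
            if op == "put":
                rebuilt[key] = value
        return rebuilt

wal = TinyWAL()
wal.put("a", 1)
wal.put("a", 2)
wal.store.clear()                # simulate losing in-memory state
assert wal.recover() == {"a": 2}
```

A nice side effect of this pattern on flash: appending to a log turns scattered random writes into sequential ones, which is exactly the sort of access pattern that helps keep tail latency predictable.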