I suggest taking the report with a grain of salt.
They do the standard AMD comparison:
8x AMD MI300X (192GB, 750W) GPUs
8x H100 SXM5 (80GB, 700W) GPUs
The fair comparison would be against 8x H100 NVL (188GB, <800W) GPUs.
Price tells a story. If AMD's performance were on par with Nvidia's, they would not sell their cards for 1/4 of the price.
Transistor counts (MTr):
------------------
H100 SXM5: 80,000
MI300X: 153,000
H100 NVL: 160,000
The H100 SXM5 has 52% of the transistors the MI300X has and half the RAM, yet the MI300X achieves *ONLY* 33% higher throughput than the H100. The MI300X was launched 6 months ago, the H100 20 months ago. AMD has work to do.
I haven't done a head-to-head, and I suppose it depends on whether tensor parallelism actually scales linearly or not, but my understanding is that since the NVLs are just PCIe/NVLink-paired H100s, you're not really getting much if any benefit on something like vLLM.
I think the more interesting critique might be the slightly odd choice of Mixtral 8x7B vs., say, a more standard Llama 2/3 70B (or just test multiple models, including some big ones like 8x22B or DBRX).
Also, while I don't have a problem w/ vLLM, as TensorRT gets easier to set up, it might become a factor in comparisons (since they punted on FP8/AMP in these tests). Inferless published a shootout a couple months ago comparing a few different inference engines: https://www.inferless.com/learn/exploring-llms-speed-benchma...
Price/perf does tell a story, but I think it's one that's mostly about Nvidia's platform dominance and profit margins more than intrinsic hardware advantages. On the spec sheet MI300X has a memory bandwidth and even raw FLOPS advantage but so far it has lacked proper software optimization/support and wide availability (has anyone besides hyperscalers and select partners been able to get them?)
I don't think it should be ignored, especially when the power consumption is similar.
If so, OK, it's fair to compare 1 MI300X with 1 H100 NVL, but then price (and TCO) should be added to some of the metrics in the conclusion. Also, the NVL is a 2x PCIe 5.0 quad-slot card, so not the same thing..
I am not sure about system compatibility, and if and how you can stack 8 of those in one system (like you can with the non-NVL and the MI300X), so it's a bit of a different (and more niche) beast.
What were your thoughts on Zen (1) vs Intel's offerings then? AMD offered more bang for the buck then too.
Fun weekend project for anybody.
Also, stuff like this makes it hard to take the results seriously:
* To make an accurate comparison between the systems with different settings of tensor parallelism, we extrapolate throughput for the MI300X by 2.
* All inference frameworks are configured to use FP16 compute paths. Enabling FP8 compute is left for future work.
They did everything they could to make sure AMD is faster.
What would be a suitable input length in your opinion?
And why isn't this a good one? Are real-life queries shorter? Or longer?
If I count one word as a token, then in my case most of the queries are less than 128 words.
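One word per token undercounts a bit, though. A common rough heuristic (this is an approximation for English prose, not a real tokenizer; real tokenizers like tiktoken or SentencePiece will differ, especially on code or rare words) is about 0.75 words per token:

```python
# Rough heuristic, NOT a real tokenizer: English prose averages roughly
# 0.75 words per token, so tokens ~= words / 0.75.
def estimate_tokens(text: str) -> int:
    words = len(text.split())
    return round(words / 0.75)

# A 100-word query is closer to ~133 tokens than to 100.
print(estimate_tokens(" ".join(["word"] * 100)))  # 133
```

By that rule of thumb, "less than 128 words" lands around 170 tokens, which is still on the short end of the input lengths people are asking the benchmark to cover.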
It's not just the query (if you're running a chatbot, which many of us are not). It's the entire context window. It's not uncommon to have a system prompt that is > 512 tokens alone.
I would like to see benchmarks for 512, 1024, 4096 and 8192 token inputs.
If I understood that correctly, context length is something like session storage or short-term memory. If it's too small, the AI starts to forget what it's talking about.
While I would enjoy a US tech salary, I'm not sure we want a world where all manufacturing is set aside to focus on the attention economy.
Nvidia value deserves to be much higher than any company on the DAX (maybe all of them together, as it currently is) - but how much of that current value is real rather than an AI speculation bubble?
Nah, then I'll get my very good wagie pennies here and have plenty of jobs available, plus good health insurance and whatnot.
But there's a long list of German companies not on the DAX
(though Germany's DAX really deserves to be worth less than Nvidia)
Nvidia's problem will sort itself out naturally in the coming months/years.
Jensen isn't stupid. He's making accelerators for anything so that they'll be ready to catch the next bubble that depends on crazy compute power that can't be done efficiently on CPUs. They're so far the only semi company beating Moore's law by a large margin due to their clever scaling tech while everyone else is like "hey look our new product is 15% more efficient and 15% more IPC than the one we launched 3 years ago".
They may be overvalued now but they definitely won't crash back to their "just gaming GPUs" days.
Also curious how many companies are dropping that much money on those kinds of accelerators just to run 8x 7B-param models in parallel... You're also talking about being able to train a 14B model on a single accelerator. I'd be curious to see how "full-accelerator train and inference" workloads would look, i.e., training a 14B-param model, then inference throughput on a 4x14B workload.
AMD (and almost every other inference claim-maker so far... Intel and Apple specifically) have consistently cherry-picked the benchmarks they claim a win over and ignored the remainder, which all show Nvidia in the lead, and they've used mid-gen comparison models, as many commenters here pointed out in this article.
For single-system (8x accelerator) LLMs, the MI300X has very competitive inference TCO vs. the H100.
Also:
AMD Instinct MI300X Offers The Best Price To Performance on GPT-4 According To Microsoft, Red Team On-Track For 100x Perf/Watt By 2027
https://wccftech.com/amd-instinct-mi300x-best-price-performa...
And with a growing but certainly less mature product (especially the software), it requires suitable pricing and allocation strategies.
1. https://www.techspot.com/news/102056-nvidia-allegedly-punish...
AMD is successfully attacking the inference sector, increasing its advantage with the MI325 and aiming at training from 2025 with the MI350 (plus Infinity Fabric and the other interconnect types arriving for the various topologies), which will probably have an advantage over Blackwell, then fall behind against Rubin, and come back ahead with the MI400.
At least, that's how it seems, as long as ROCm continues to improve.
Personally I am happy to see some competition in the sector and especially on open source software
Boo hoo, a GTX 670 that cost you $399 in 2012 now costs $599 - grow up, do the inflation calculation, and realize you're being a child. Gamers get the best deal on bulk silicon on the planet, R&D subsidized by enterprise, fantastic blue-sky research that takes years for competitors to (not even) match, and it's still never enough. "Gamers" have justified every single cliche and stereotype over the last 5 years, absolutely inveterate manbabies.
(Hardware Unboxed put out a video today with the headline+caption combo "are gamers entitled"/"are GeForce gpus gross", and that's what passes for reasoned discourse among the most popular channels. They've been trading segments back and forth with GN that are just absolute "how bad is nvidia" "real bad, but what do you guys think???" tier shit, lmao.)
https://i.imgur.com/98x0F1H.png
this stuff is real shit, nvidia has been leaning on partners to maintain their segmentation, micromanaging shipment release to maintain price levels (cartel behavior), punishing customers and suppliers with “you know what will happen if you cross us”, literally putting it in writing with GPP (big mistake), playing fuck fuck games with not letting the drivers be run in a datacenter, etc. You see how that’s a little different than a gpu going from an inflation-adjusted $570 to $599 over 10 years?
(And what’s worse, the competition can’t even keep that much; they’re falling off even harder now that Moore’s law has really kicked the bucket and they have to do architectural work every gen just to make progress, instead of getting free shrinks etc… let alone having to develop software! /gasp)
In entirely unrelated news… gigabyte suddenly has a 4070 ti super with a blower cooler. Oh, and it’s single-slot with end-fire power connector. All three forbidden features at once - very subtle, extremely law-abiding.
https://videocardz.com/newz/gigabyte-unveils-geforce-rtx-407...
and literally gamers can’t help but think this whole ftc case is all about themselves anyway…
Large orders for those accelerators are placed months ahead.
Meanwhile, the MI300X instances on Microsoft are fully booked...
https://techcommunity.microsoft.com/t5/azure-high-performanc...
"Scalable AI infrastructure running the capable OpenAI models. These VMs, and the software that powers them, were purpose-built for our own Azure AI services production workloads. We have already optimized the most capable natural language model in the world, GPT-4 Turbo, for these VMs. ND MI300X v5 VMs offer leading cost performance for popular OpenAI and open-source models."
According to the article: """ AMD Configuration: Tensor parallelism set to 1 (tp=1), since we can fit the entire model Mixtral 8x7B in a single MI300X’s 192GB of VRAM.
NVIDIA Configuration: Tensor parallelism set to 2 (tp=2), which is required to fit Mixtral 8x7B in two H100’s 80GB VRAM. """
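Those tp settings follow directly from VRAM capacity. A back-of-the-envelope sketch (the ~46.7B total parameter count for Mixtral 8x7B is approximate, and this deliberately counts only FP16 weights, ignoring KV cache, activations, and runtime overhead):

```python
def min_tp(params_billion: float, bytes_per_param: int, vram_gb: float) -> int:
    """Smallest power-of-two tensor-parallel degree whose pooled VRAM
    holds the model weights alone (ignores KV cache and overhead)."""
    weights_gb = params_billion * bytes_per_param  # 1B params * 2 B ~= 2 GB at FP16
    tp = 1
    while tp * vram_gb < weights_gb:
        tp *= 2
    return tp

MIXTRAL_8X7B_PARAMS_B = 46.7  # approximate total parameter count

print(min_tp(MIXTRAL_8X7B_PARAMS_B, 2, 192))  # MI300X (192GB): 1
print(min_tp(MIXTRAL_8X7B_PARAMS_B, 2, 80))   # H100 (80GB):    2
```

About 93GB of FP16 weights fits comfortably in one 192GB MI300X but needs two 80GB H100s, which is exactly the asymmetry the benchmark then "extrapolates" around.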
Everybody thinks it’s CUDA that makes Nvidia the dominant player. It’s not - almost 40% of their revenue this year comes from mega corporations that use their own custom stack to interact with GPUs. It’s only a matter of time before competition catches up and gives us cheaper GPUs.
lol completely made up.
are you conflating CUDA the platform with the C/C++ like language that people write into files that end with .cu? because while some people are indeed not writing .cu files, absolutely no one is skipping the rest of the "stack" (nvcc/ptx/sass/runtime/driver/etc).
source: i work at one of these "mega corps". hell if you don't believe me go look at how many CUDA kernels pytorch has https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/n....
> Everybody thinks it’s CUDA that makes Nvidia the dominant player.
it 100% does
That's just a question of negotiating with TSMC or their few competitors.
(Also, didn't TSMC start production at some factories in the US and/or EU?)
I mean, Nvidia uses TSMC; so does AMD.
But now that there’s a larger incentive to produce GPUs, their moat will eventually fall.
TSMC runs at 100% capacity for top tier processes - their bottleneck is more foundries. These take time to build. So the question becomes - how long can Nvidia remain dominant? It could be quarters or it could be years before any real competitor convinces large customers to switch over.
Microsoft and Google are producing their own AI hardware too - nobody wants to depend solely on Nvidia, but they’re currently forced to if they want to keep up.
Nvidia relies on TSMC for manufacturing. Samsung is building competing manufacturing infrastructure, which is also a good thing, so Taiwan is not a single point of failure.
95% would be nice too
https://www.reddit.com/r/AMD_MI300/comments/1dgimxt/benchmar...
> MI300X Accelerator: 192GB VRAM, 5.3 TB/s, ~1300 TFLOPS for FP16
> Hardware: Baremetal node with 8 H100 SXM5 accelerators with NVLink, 160 CPU cores, and 1.2 TB of DDR5 RAM.
> H100 SXM5 Accelerator: 80GB VRAM, 3.35 TB/s, ~986 TFLOPS for FP16
I really wonder about the pricing. In theory the MI300X is supposed to be cheaper, but whether that is really the case in practice remains to be seen.
So, probably around the same price?
The tests look promising, though!
The weird thing on RunPod is the virtual CPUs: you can't run the MI300X in virtual machines yet. It is a missing feature that AMD is working on.
https://www.amd.com/en/newsroom/press-releases/2024-5-21-amd...
1. They're only comparing against vLLM, which isn't SOTA for latency-focused inference. For example, their vLLM benchmark on 2 GPUs sees 102 tokens/s for BS=1; gpt-fast gets around 190 tok/s. https://github.com/pytorch-labs/gpt-fast
2. As others have pointed out, they're comparing H100 running with TP=2 vs. 2 AMD GPUs running independently.
Specifically,
> To make an accurate comparison between the systems with different settings of tensor parallelism, we extrapolate throughput for the MI300X by 2.
This is uhh.... very misleading, for a number of reasons. For one, at BS=1, what does running with 2 GPUs even mean? Do they mean that they're getting the results for one AMD GPUs at BS=1 and then... doubling that? Isn't that just... running at BS=2?
3. It's very strange to me that their throughput nearly doubles going from BS=1 to BS=2. MoE models have an interesting property that low amounts of batching doesn't actually significantly improve their throughput, and so on their Nvidia vllm benchmark they just go from 102 => 105 tokens/s throughput when going from BS=1 to BS=2. But on AMD GPUs they go from 142 to 280? That doesn't make any sense to me.
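That MoE property has a simple back-of-the-envelope explanation. Assuming uniform top-2-of-8 routing (a simplification; real routing is learned, not uniform), the expected number of distinct experts a decode step must stream from VRAM goes from 2.0 at BS=1 to 3.5 at BS=2, so you pay roughly 1.75x the weight traffic, the decode-time bottleneck, for 2x the tokens, which caps the throughput gain well below 2x:

```python
from math import comb

def expected_distinct_experts(batch: int, experts: int = 8, top_k: int = 2) -> float:
    """Expected distinct experts activated per layer per decode step,
    assuming each token independently picks a uniform top_k-of-experts subset."""
    # P(a given expert is missed by one token) = C(experts-1, top_k) / C(experts, top_k)
    p_missed = comb(experts - 1, top_k) / comb(experts, top_k)
    return experts * (1 - p_missed ** batch)

print(expected_distinct_experts(1))  # 2.0
print(expected_distinct_experts(2))  # 3.5
```

Under this toy model a near-2x jump from BS=1 to BS=2 is indeed surprising, which matches the ~3% gain seen on the Nvidia side.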
https://www.reddit.com/r/AMD_MI300/comments/1dgimxt/benchmar...
That info is conspicuously absent from the article.
Maybe the benchmark should be performance per $... though I suspect power consumption will eclipse the cost of purchasing the chips from NVDA or AMD (and chip costs will vary over time and with discounts). EDIT: I was wrong on eclipsing; I'm still looking for a more durable benchmark (performance per billion transistors?), given it's suspected NVDA's chips are over-priced due to demand outstripping supply for now, and AMD's are under-priced to get a foothold in this market.
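A toy version of that "performance per billion transistors" idea, plugging in the transistor counts and the ~33% throughput delta cited elsewhere in this thread (all numbers approximate, and the relative throughput comes from this one Mixtral benchmark only):

```python
# Toy metric: relative throughput per billion transistors (BTr).
# Transistor counts (MTr) and the 1.33x throughput figure are taken from
# other comments in this thread; treat them as rough inputs, not gospel.
chips = {
    "H100 SXM5": {"mtr": 80_000,  "rel_throughput": 1.00},
    "MI300X":    {"mtr": 153_000, "rel_throughput": 1.33},
}
for name, c in chips.items():
    per_btr = c["rel_throughput"] / (c["mtr"] / 1000.0)
    print(f"{name}: {per_btr:.4f} relative throughput per BTr")
```

By this measure the H100 comes out ahead (about 0.0125 vs. 0.0087), echoing the transistor-count comparison earlier in the thread.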
Making AMD work effortlessly with pytorch et al should make the switch transparent.
Also, the price difference is not quantified.
Additionally, CUDA is a known and tangible software stack - can I try out this "MK1 Flywheel" on my local (AMD) hardware?
For consumer grade inference, there's already many options available.
They also used Flywheel for AMD while not bothering to turn on Flywheel for Nvidia, which is crazy since Flywheel improves Nvidia performance by 70%. https://mk1.ai/blog/flywheel-launch
In this context, the 33% performance lead by AMD looks terrible; factor that in and the MI300X straight up looks slower.
This is a new-AMD vs. last-generation-Nvidia benchmark.
https://www.theregister.com/2024/03/21/nvidia_dgx_gb200_nvk7...
The MI300X launched 3 months earlier, at the end of December.
The H100 launched in March 2023.
(Otherwise it's apples and oranges)