I expect a new wave of "your task, but on superior hardware" services to crop up with these chips!
v5es are particularly interesting given the millions that will land and the large pod sizes; they're well suited for million-token context windows.
* Notwithstanding the Coral boards
We found CUDA-to-SYCL conversion surprisingly good https://www.intel.com/content/www/us/en/developer/articles/t...
Isn't that the price of a single H100?
Genesis Cloud started integration and testing of Gaudi2 quite a while ago. I fully agree with the take of the article.
I can't promise per-hour rental, but for longer terms they are available! (Should you be interested, you can find contact details on the website.)
Now working ones is a different story.
Just curious because IME that's the point where the fun problems surface :)
NVIDIA is still the best for research given the ecosystem, but once models are standardised, as with transformers/LLaMA and likely multimodal diffusion transformers, it becomes about scale, availability, and cost per FLOP.
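On the "cost per FLOP" point, the comparison boils down to price divided by sustained throughput. A minimal sketch, with all prices and throughput numbers being hypothetical placeholders rather than real vendor figures:

```python
# Toy cost-per-FLOP comparison. All numbers below are hypothetical
# placeholders, not real vendor pricing or benchmark figures.
accelerators = {
    "chip_a": {"price_usd": 30_000, "tflops": 1000},
    "chip_b": {"price_usd": 10_000, "tflops": 500},
}

def usd_per_tflop(spec):
    """Purchase price in dollars per TFLOP of sustained throughput."""
    return spec["price_usd"] / spec["tflops"]

for name, spec in accelerators.items():
    print(f"{name}: ${usd_per_tflop(spec):.2f} per TFLOP")
```

In practice you'd use rental cost per hour and measured (not peak) throughput, but the shape of the comparison is the same.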
To those commenting about "no moat": remember CUDA is a huge part of it. It's actually HW+SW, and both took a decade to mature, together.
Gaudi2 was actually announced 2 years ago and is 7nm, like the A100 80GB it was meant to be competitive with. Gaudi3, later this year, is probably going to be the inflection point as it ramps.
The cost is like 1/3
https://www.intel.com/content/www/us/en/newsroom/news/vision...
- Intel acquired Habana in 2019
- Habana launched Gaudi2 in 2022
- only in H2 2023 did Habana enable FP8, which delivered around a 100% improvement in time-to-train
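For context, a "100% improvement in time-to-train" reads as throughput doubling, i.e. the same run finishing in about half the wall-clock time. A quick arithmetic sketch:

```python
# A fractional "improvement" in throughput shortens time-to-train as:
#   new_time = old_time / (1 + improvement)
# so improvement = 1.0 (i.e. 100%, a 2x speedup) halves the run time.
def improved_time_to_train(old_hours, improvement):
    """improvement is fractional: 1.0 means a 100% (2x) speedup."""
    return old_hours / (1.0 + improvement)

print(improved_time_to_train(100.0, 1.0))  # a 100h run drops to 50.0h
```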
On the rest I believe you, but markets don't move based on a single individual's or company's data points.
2024: Nvidia's B100 TSMC 3nm (?)
2024: Intel Gaudi3 TSMC 5nm (*)
2023: AMD MI300X TSMC 5nm/6nm
2022: Nvidia H100 TSMC 4N
2020: Nvidia A100 TSMC 7nm
(*): performance-critical chiplets, at least.

Considering it's the latter, and that PyTorch takes care of providing optimized backends for various hardware, how big of a moat is CUDA then, really?
In other words, it goes something like this:
Application
Pytorch (and similar)
cuDNN (and similar)
CUDA (and similar)
NVidia GPU
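The layering above is why backends are swappable in principle: the framework exposes device-agnostic ops and routes each call to whichever vendor library is registered underneath. A toy stand-in for that dispatch layer (names are illustrative only; the real PyTorch dispatcher is far richer):

```python
# Minimal stand-in for a framework's backend dispatch layer.
# All names here are illustrative, not the real PyTorch dispatcher API.
_backends = {}

def register_backend(device, matmul_impl):
    """Vendors plug in their optimized kernels keyed by device string."""
    _backends[device] = matmul_impl

def matmul(a, b, device="cpu"):
    """Framework-level op: routes to whichever backend is registered."""
    return _backends[device](a, b)

# A naive "cpu" backend; a vendor would register cuDNN/oneDNN-backed code
# here instead, and user code above this line would not change.
def cpu_matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

register_backend("cpu", cpu_matmul)
print(matmul([[1, 2]], [[3], [4]]))  # [[11]]
```

The catch, as the thread notes, is that writing the registered kernels to match cuDNN/cuBLAS quality is the hard part, not the plumbing.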
My opinion, based on what I saw those wizards do, is that reproducing the feature set and efficiency of cuDNN/cuBLAS is deeply nontrivial.

We do know that in 2025 it's supposed to be part of Intel's Falcon Shores HPC XPU. This essentially takes a whole bunch of HPC compute and sticks it all on the same silicon to maximize throughput and minimize latency. Thanks to their tile-based chip strategy they can have many different versions of the chip with different HPC focuses by swapping out different tiles. AI certainly seems to be a major one, but it will be interesting to see what products they come up with.
I think Gaudi2 was badly timed and they had to build the stack. Gaudi3 is where I think we will see mass adoption, given availability, way cheaper price/performance, and a more mature stack.
There is still weird stuff when using them but they are surprisingly solid.
[1] https://www.intel.com/content/www/us/en/developer/articles/t...
Haven't been too impressed with inference versus TensorRT-LLM, for example, though.
Gaudí is a famous name for a reason.. the flowing lines and, frankly, nonsense and silliness in the art and architecture of Gaudí stand for generations as a contrast to the relentless severity of formal classical arts (and especially a contrast to Intel electronic parts).
It has been amazing watching the groupthink at work on that stock when we just saw the same group do it on TSLA to disastrous effect. A similar no moat situation where they simply can’t imagine competitors ever existing.
* just to be clear - this is a joke
I actually put 40% of my TSLA into NVDA last year, because the demand for AI hardware is going to keep going up. I'm not saying the stock will never go down, I'm sure it will be volatile, but don't confuse short-term volatility with long-term technological transformations.
The Hopper stuff is particularly interesting