undefined | Better HN

0 pointstucnak1y ago0 comments

Which would be impressive had it _actually_ worked for ML workloads.

0 comments

huac1y ago

There's a number of scaled AMD deployments, including Lamini (https://www.lamini.ai/blog/lamini-amd-paving-the-road-to-gpu...) specifically for LLM's. There's also a number of HPC configurations, including the world's largest publicly disclosed supercomputer (Frontier) and Europe's largest supercomputer (LUMI) running on MI250x. Multiple teams have trained models on those HPC setups too.

Do you have any more evidence as to why these categorically don't work?

latchkey1y ago

> Do you have any more evidence as to why these categorically don't work?

They don't. Loud voices parroting George, with nothing to back it up.

Here are another couple good links:

https://www.evp.cloud/post/diving-deeper-insights-from-our-l...

https://www.databricks.com/blog/training-llms-scale-amd-mi25...

Hugsun1y ago

Does it not work for them? Where can I learn why?

tucnakOP1y ago

Just go have a look around Github issues in their ROCm repositories on Github. A few months back the top excuse re: AMD was that we're not supposed to use their "consumer" cards, however the datacenter stuff is kosher. Well, guess what, we have purchased their datacenter card, MI50, and it's similarly screwed. Too many bugs in the kernel, kernel crashes, hangs, and the ROCm code is buggy / incomplete. When it works, it works for a short period of time, and yes HBM memory is kind of nice, but the whole thing is not worth it. Some say MI210 and MI300 are better, but it's just wishful thinking as all the bugs are in the software, kernel driver, and firmware. I have spent too many hours troubleshooting entry-level datacenter-grade Instinct cards with no recourse from AMD whatsoever to pay 10+ thousands for MI210 a couple-year old underpowered hardware, and MI300 is just unavailable.

Not even from cloud providers which should be telling enough.

JonChesterfield1y ago

We absolutely hammered the MI50 in internal testing for ages. Was solid as far as I can tell.

Rocm is sensitive to matching kernel version to driver version to userspace version. Staying very much on the kernel version from a official release and using the corresponding driver is drastically more robust than optimistically mixing different components. In particular, rocm is released and tested as one large blob, and running that large blob on a slightly different kernel version can go very badly. Mixing things from GitHub with things from your package manager is also optimistic.

Imagine it as huge ball of code where cross version compatibility of pieces is totally untested.

1 more reply

Workaccount21y ago

It's seriously impressive how well AMD has been able to maintain their incredible software deficiency for over a decade now.

2 more replies

jmward011y ago

Yeah, this has stopped me from trying anything with them. They need to lead with their consumer cards so that developers can test/build/evaluate/gain trust locally and then their enterprise offerings need to 100% guarantee that the stuff developers worked on will work in the data center. I keep hoping to see this but every time I look it isn't there. There is way more support for apple silicon out there than ROCm and that has no path to enterprise. AMD is missing the boat.

2 more replies

cavisne1y ago

Yeah, I think AMD will really struggle with the cloud providers.

Even Nvidia GPU's are tricky to sandbox, and it sounds like the AMD cards are really easy for the tenant to break (or at least force a restart of the underlying host).

AWS does have a Gaudi instance which is interesting, but overall I don't see why Azure, AWS & Google would deploy AMD or Intel GPU's at scale vs their own chips.

They need some competitor to Nvidia to help negotiate, but if its going to be a painful software support story suited to only a few enterprise customers, why not do it with your own chip?

j / k navigate · click thread line to collapse

0 comments

huac1y ago

Do you have any more evidence as to why these categorically don't work?

latchkey1y ago

> Do you have any more evidence as to why these categorically don't work?

They don't. Loud voices parroting George, with nothing to back it up.

Here are another couple good links:

https://www.evp.cloud/post/diving-deeper-insights-from-our-l...

https://www.databricks.com/blog/training-llms-scale-amd-mi25...

Hugsun1y ago

Does it not work for them? Where can I learn why?

tucnakOP1y ago

Not even from cloud providers which should be telling enough.

JonChesterfield1y ago

We absolutely hammered the MI50 in internal testing for ages. Was solid as far as I can tell.

Imagine it as huge ball of code where cross version compatibility of pieces is totally untested.

1 more reply

Workaccount21y ago

It's seriously impressive how well AMD has been able to maintain their incredible software deficiency for over a decade now.

2 more replies

jmward011y ago

2 more replies

cavisne1y ago

Yeah, I think AMD will really struggle with the cloud providers.

Even Nvidia GPU's are tricky to sandbox, and it sounds like the AMD cards are really easy for the tenant to break (or at least force a restart of the underlying host).

AWS does have a Gaudi instance which is interesting, but overall I don't see why Azure, AWS & Google would deploy AMD or Intel GPU's at scale vs their own chips.

They need some competitor to Nvidia to help negotiate, but if its going to be a painful software support story suited to only a few enterprise customers, why not do it with your own chip?

j / k navigate · click thread line to collapse