There’s getting something to “work”, which is often challenge enough with ROCm. Then there’s getting it to work well. Then there’s getting it to work as well as Nvidia/CUDA.
Take Whisper as one example: you should be running it with ctranslate2[0], and you won’t find ROCm anywhere on its list of supported platforms.
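To make the point concrete, here’s a minimal sketch of what running Whisper on the ctranslate2 path looks like, via the faster-whisper wrapper. This assumes the `faster-whisper` package is installed; the model size and audio path are placeholders, and note the `device` argument accepts `"cuda"` or `"cpu"` but has no ROCm option.

```python
# Hedged sketch: Whisper inference through ctranslate2 via faster-whisper.
# Assumes the faster-whisper package; model size and audio path are placeholders.
try:
    from faster_whisper import WhisperModel
except ImportError:
    WhisperModel = None  # package not installed

def transcribe(audio_path, model_size="small", device="cuda"):
    """Transcribe an audio file; device is 'cuda' or 'cpu' -- no ROCm backend."""
    if WhisperModel is None:
        raise RuntimeError("faster-whisper is not installed")
    model = WhisperModel(model_size, device=device, compute_type="float16")
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text for segment in segments)
```

On an AMD card you’re left falling back to slower, less supported paths, which is exactly the gap being described.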
When you really start to look around you’ll find that ROCm is (at best) still very much in the “get it to work (sometimes)” stage. In most cases it’s still a long way from working well, and even further from being competitive with Nvidia for serious use cases and applications.
People get excited about the progress ROCm has made getting basic things to work with PyTorch, and that’s good - progress is progress. But when saving 20% on the hardware means the equivalent Nvidia product is often 5-10x as performant (at a fraction of the development time) thanks to vastly superior software support, you realize pretty quickly that Nvidia is actually the bargain.
I’m desperately rooting for Nvidia to have some actual competition, but after six years of ROCm and my own repeated failed attempts to make it add up, I’m only more skeptical that real competition in this space will come from AMD.