This serious bug has been open since May, and AMD doesn't seem to be responding as seriously as it should.
Isn't geohot infamous for stealing other people's work?
PEBKAC?
That said, ROCm only officially supports a fraction of its product line, and an odd smattering throughout at that. It's a joke compared to CUDA which will run on damn near anything. And AMD has a long, long history of dogshit drivers (at least on Windows.)
AMD just doesn't seem to give enough of a shit to invest money into securing top talent for this, and NVIDIA will continue to stomp them.
Do you mean the Sony PlayStation hacking, where they took legal action against him, or other stuff?
Shareholders of AMD should look into it and fire top executives/the CEO until morale improves.
A long time ago AMD decided to focus 100% on budget consumer graphics (including consoles), and that was the right decision at the time. However, being in a low-margin business, it seems they don't have the people (or the budget for last-minute hiring) to pump out the R&D for a generic neural-network platform without pulling people away from their consumer graphics division.
The article is unsatisfying because it doesn't explain WHY CUDA reigns supreme.
One hypothesis put forward is that the main alternative, ROCm, is just not very complete and not very fast. That's a good argument.
Another hypothesis that isn't considered: CUDA reigns supreme because NVIDIA GPUs reign supreme.
But people don't write CUDA code... they write PyTorch code?!
The problems you generally experience are:
* Inexplicably poor performance
* Poor (and sometimes incorrect) documentation
* Difficulties debugging
* Crashes and hangs

If I'm AMD, I'd spend at least $1 billion/year figuring out the software side.
I can't think of an easier way for AMD to return value to shareholders than eroding CUDA advantage.
Heck, Meta has invested something like $100B in VR so far, and VR is nowhere near the market that AI is.
I started playing around with porting some CUDA code to ROCm/HIP on a Ryzen laptop APU I had. While it was an "unsupported" configuration (which I understood), it all worked until AMD suddenly and explicitly blocked the ability to run on APUs. Currently the only way to get back to work on that project on that particular computer would be to run a closed-source patched driver from some rando on the internet. Needless to say, I lost interest.
Last I checked, there were only 7 consumer SKUs that could run AMD's current compute stack, the oldest being one generation old. Even among the enterprise hardware they only support ~2 generations back. So you can't even grab some old, cheap recycled gear on eBay to hack on their ecosystem.
Meanwhile, I can pull anything with an NVIDIA logo on it from a junkyard and it'll happily run CUDA code that I wrote for the 8800GTX 15+ years ago.
Then there is the quality of hardware, debugging tools, IDE support, supported languages (again, it isn't only PyTorch), and libraries.
I know it's still in development, but I'm curious whether someone has played around with it for the kind of needs discussed on this page.
PyTorch already does. But if you're saying "NN" and "pytorch" that already means you're outside of the audience for CUDA I'm talking about in the article. My own stuff was usually Bayesian Hierarchical Models, which at least at the time made pytorch completely useless (that was nearly a decade ago though—maybe that specific use case improved).
If you've tried to write actually new (or different enough) NNs or entirely different models, pytorch is too high-level, and sometimes even TF is too. Even aside from that, if you're a maintainer of BLAS or some specific library for sparse MM with very specific distributions that are optimized for it...
Anyway, those are the key cases, but even aside from that, if you've ever tried to do non-vanilla stuff, even with some higher-level libraries, nothing works as well as it should. You get random, inscrutable errors; those certainly exist on NVIDIA GPUs / stuff based on CUDA under the hood too, but there are way, way fewer of them. For newer, custom stuff, it's not that uncommon to hit numerical overflows or other completely breaking problems on alternative backends that just don't happen, or work fine, on the CPU or CUDA backend. Or the CUDA backend is just ridiculously faster. If you're doing something annoying, new, and complicated enough, there's no point in taking on the aggravation.
The people who write the stuff that is used in PyTorch or other libraries definitely write CUDA code (in C++ etc). And then the people who use PyTorch just build on top of that.
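To make that layering concrete, here's a toy pure-Python sketch (not real PyTorch internals; all names here are made up for illustration) of the split: framework users call a high-level op, which dispatches to whatever low-level "kernel" is registered for the active backend. In real life, the CUDA entries are hand-written CUDA C++ calling into cuBLAS/cuDNN.

```python
# Toy model of framework-to-kernel dispatch. Hypothetical names;
# this only illustrates the division of labor described above.
KERNELS = {}

def register(op, backend):
    """Register a low-level kernel for (op, backend)."""
    def wrap(fn):
        KERNELS[(op, backend)] = fn
        return fn
    return wrap

@register("matmul", "cpu")
def matmul_cpu(a, b):
    # Naive pure-Python kernel; stands in for an optimized BLAS call.
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][x] * b[x][j] for x in range(k)) for j in range(m)]
            for i in range(n)]

def matmul(a, b, backend="cpu"):
    # What a framework user calls; they never see the kernel.
    return KERNELS[("matmul", backend)](a, b)

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# -> [[19, 22], [43, 50]]
```

A GPU vendor's job in this picture is to supply the fast entries for its own backend key, and to get frameworks to route to them.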
I deliberately tried to keep it accessible so that non-technical (or just non-software) audiences can also get an intuition for why CUDA has such strong lock-in. Otherwise, the pushback I've often gotten is "just rewrite it" or "it's just software"; if it were really that simple, people wouldn't need to be yelling so much at AMD across so many comments. Basically, people who can't fathom why software technical debt can ever be a thing. Or, if it is, China has infinite money and time anyway.
A high-level analysis would say that Huawei, AMD, and Intel should all easily be able to invest enough to make this work and compete with CUDA to push their hardware platforms. The reality is that decentralized decision-making among users makes it an expensive, uncertain bet that people will actually adopt it. Both the lower-level, underlying libraries that things are built on AND the researchers who do bleeding-edge research still have a huge amount of experience in, and code built on, CUDA.
To first order, nobody writes any CUDA, and even if you do, you are probably bad at it. The language is slightly easier to use than OpenCL, but writing really performant code is still a nightmare (a pipeline of asynchronous memory copies from global to shared memory is not easy to program, but it is a requirement for full performance on tensor cores).
So no, the moat really isn't the language. It's not even the libraries, it's the integration of the libraries into third party software like pytorch, jax, etc. This is the truly massive advantage NVIDIA has, and they got it by being early and by being installed in an awful lot of machines.
At least say why people wouldn't be good at it: the documentation is poor, the GPUs are a black box, or anything in that vein. Then they can help you learn instead of preemptively dismissing it.
I second the GP: nobody in their right mind would try to compete with the performance or functionality of libraries like cuDNN or cuBLAS.
NVidia pays for an army of exceptionally skilled folks to write these high performance kernels, working hand in hand with the architects that design the hardware, and with access to various sophisticated tools and performance models beyond what is available to the general public.
It would be like trying to compete against Olympians, to use an analogy that we can all understand.
You probably won't like this, but I'm also going to suggest you take a look at the HN guidelines about assuming good faith, and around responding to the argument instead of calling names. My comment might have irked you but that's not actually a basis for deciding I'm anti intellectual, that I'm protecting my ego, and that I really just need someone to help me learn.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
To give a feel, while at Berkeley, we had an award-winning grad student working on autotuning CUDA kernels and empirically figuring out what does / doesn't work well on some GPUs. Nvidia engineers would come to him to learn about how their hardware and code works together for surprisingly basic scenarios.
It's difficult to write great CUDA code because it needs to excel in multiple specializations at the same time:
* It's not just writing fast low-level code, but knowing which algorithmic approach to use, so you or your code reviewer needs to be an expert at algorithms. Worse, those algorithms are high-level and unknown to most programmers, yet also specific to hardware models; think scenarios like NUMA-aware data-parallel algorithms for irregular computations. The math is generally non-traditional too, e.g., esoteric matrix tricks to manipulate sparsity and numerical stability.
* You ideally will write for 1 or more generations of architectures. And each architecture changes all sorts of basic constants around memory/thread/etc counts at multiple layers of the architecture. If you're good, you also add some sort of autotuning & JIT layers around that to adjust for different generations, models, and inputs.
* This stuff needs to compose. Most folks are good at algorithms, software engineering, or performance... not all three at the same time. Doing this for parallel/concurrent code is one of the hardest areas of computer science. Ex: Maintaining determinism, thinking through memory life cycles, enabling async vs sync frameworks to call it, handling multitenancy, ... . In practice, resiliency in CUDA land is ~non-existent. Overall, while there are cool projects, the Rust etc revolution hasn't happened here yet, so systems & software engineering still feels like early unix & c++ vs what we know is possible.
* AI has made it even more interesting nowadays. The types of processing on GPUs are richer now, multi+many GPU is much more of a thing, and disk IO as well. For big national lab and genAI foundation model level work, you also have to think about many racks of GPUs, not just a few nodes. While there's more tooling, the problem space is harder.
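The autotuning point in the list above can be sketched in miniature: benchmark a few candidate tuning constants and cache the winner for the hardware at hand. This is pure Python with a dummy workload standing in for a GPU kernel; the candidate values and names are illustrative only.

```python
# Minimal autotuning sketch, assuming the "kernel" cost varies with a
# tile-size constant (as it would across real GPU generations).
import time

def kernel(data, tile):
    # Dummy workload whose runtime depends on the tile size.
    total = 0
    for i in range(0, len(data), tile):
        total += sum(data[i:i + tile])
    return total

def autotune(data, candidates=(8, 32, 128)):
    """Time each candidate once and return the fastest tile size."""
    timings = {}
    for tile in candidates:
        start = time.perf_counter()
        kernel(data, tile)
        timings[tile] = time.perf_counter() - start
    return min(timings, key=timings.get)

best = autotune(list(range(10_000)))
print("selected tile size:", best)
```

Real autotuners (and the JIT layers mentioned above) do this per architecture, per problem shape, with warmup and many repetitions, then persist the results.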
This is very hard to build for. Our solution early on was figuring out how to raise the abstraction level so we didn't have to. In our case, we figured out how to write ~all our code as operations over dataframes that we compiled down to OpenCL/CUDA, and Nvidia thankfully picked that up with what became RAPIDS.AI. Maybe more familiar to the HN crowd, it's basically the precursor and GPU / high-performance / energy-efficient / low-latency version of what the duckdb folks recently began on the (easier) CPU side for columnar analytics.
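A toy, pure-Python sketch of the "operations over dataframes" idea described above: the user expresses a filter plus an aggregate over columns, and a backend (RAPIDS/cuDF on GPU, duckdb-style engines on CPU) is free to execute it however it likes. The function names here are made up, not a real API.

```python
# Columnar toy: data stored column-wise, operations expressed over
# whole columns. This is the abstraction level that lets a compiler
# target CPU or GPU without the user writing kernels.
table = {
    "region": ["us", "eu", "us", "eu"],
    "sales":  [10,    7,    3,    5],
}

def filter_eq(tbl, col, value):
    # Keep only the rows where tbl[col] == value, across all columns.
    keep = [i for i, v in enumerate(tbl[col]) if v == value]
    return {c: [vals[i] for i in keep] for c, vals in tbl.items()}

def total(tbl, col):
    # Aggregate over one column.
    return sum(tbl[col])

print(total(filter_eq(table, "region", "us"), "sales"))
# -> 13
```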
It's hard to do all that kind of optimization, so IMO it's a bad idea for most AI/ML/etc teams to do it. At this point, it takes a company at the scale of Nvidia to properly invest in optimizing this kind of stack, and software developers should use higher-level abstractions, whether pytorch, rapids, or something else. Having lived building & using these systems for 15 years, and worked with most of the companies involved, I haven't put any of my investment dollars into AMD nor Intel due to the revolving door of poor software culture.
Chip startups also have funny hubris here, where they know they need to try, but end up having hardware people run the show and fail at it. I think it's a bit different this time around because many can focus just on AI inferencing, and that doesn't need as much of what the above is about, at least for current generations.
Edit: If not obvious, much of our code that merits writing with CUDA in mind also merits reading research papers to understand the implications at these different levels. Imagine scheduling that into your agile sprint plan. How many people on your team regularly do that, and in multiple fields beyond whatever simple ICML pytorch layering remix happened last week?
That's an extreme stretch, and far from truth.
Many people write CUDA, both in industry and academia.
I used to work in the GPU industry and this sort of view is both pervasive and misguided.
GPUs are immensely complex machines. It is really hard to get them to work, let alone work with high performance.
Because of this, and in spite of the amount of time and resources spent on validation and verification, the hardware often contains flaws. It is the responsibility of the drivers to work around these flaws in various ways. When a flaw hasn't been discovered and worked around yet, you perceive it as the GPU being unstable or crashing.
There is no fast simple solution to this. You need a finely tuned corporate machine from beginning to end. Better hiring processes, better management, better design processes, better verification processes, better software development practices, better marketing and sales, better customer relations. Everything.
This is like saying combustion engines are immensely complex machines when your car suddenly loses power on the highway for no apparent reason and then when you restart the engine it works for another five minutes again. When you drive on normal roads it works flawlessly. It must be the engine, right? After all, it is the most complicated aspect!
Except in reality it is far more likely for it to be a problem in the electronics driving the fuel pump or spark plug.
AMD most likely has some sort of buffer overflow or deadlock in their GPU drivers that is causing difficult to diagnose problems. It is very unlikely that the silicon itself is broken when it works fine for playing video games and it also works fine when your GPU is one of the few officially supported by ROCm.
Pretty bad idea, especially in the midst of the AI hype.
why can't xyz company build apps/websites/products that don't have bugs??
I believe LLMs will be commoditised while the compute power will be the next big thing.
not if this moat could be leveraged into a monopoly on AI chips, to the detriment of society.
I want to see competition in this space.
Unfortunately, the market rally of nvidia stock is suggesting that most investors are expecting this monopoly to eventuate.
Therefore, it is in the interest of society to ensure that such a software moat is not established. Look what happened to the web browser when Microsoft held a monopoly on it, and look at what is happening with Chrome, the Apple App Store, etc.
Realistically, what happened is that after a few decades of development, competitors arose and took the market. In the meantime, Microsoft became rich. Who cares?
Can you talk more about this? Would love to understand.
Intel should be shoveling out 16GB Arc graphics cards for free to every graduate program in the country that can fill out a web form. In a couple of years, they'd displace NVIDIA.
AMD needs to be funding a CUDA shim that allows people to port stuff directly to their cards. And they need to NOT be segmenting the consumer and professional cards software ecosystems.
Yes, there has been progress. However, when you look at the amount of money that AMD and Intel throw at software vs how much NVIDIA throws at software, it's an instant facepalm moment.
NVIDIA is 100% vulnerable--if it weren't for the fact that their competitors are idiots.
I think Nvidia sees it too. That's why they're moving up the stack to provide the whole thing: CUDA, GPUs, interconnect chips, networking chips, racks, OS, software, models.
I think the "CUDA moat" people like OP are underselling Nvidia. They're positioning themselves as the full-stack AI provider. Forget CUDA.
- Great at legacy C++ code.
- Great at new C++ code.
- Great at embedded/high-performance/distributed code.
- Experts in linear algebra and calculus.
- Competent at machine learning and similar problems.
Now imagine that after you find ~10-50 competent senior engineers who can each segment and train 1-5 engineers, you also need to hire 10-20 managers, PMs, and directors who are smart enough to do more than "copy NVIDIA's offering from last year," and wise enough to still build a 1:1 compatibility layer.
Apple is likely seeing more traction on their metal API by virtue that it is reasonably well guaranteed to be around in ~5 years, and is common on multiple device platforms that students/devs use or customers deploy.
It gets even stranger when considering that as major GPU makers, both AMD and Intel have lots of access to such talent.
My personal experience shows CUDA to in fact be a very deep moat. In ~12 years CUDA and ~6 ROCm (since Vega) I’ve never met a professional who says otherwise, including those at top500.org AMD sites.
From what I've seen online, this take really seems to come from some kind of Linux desktop Nvidia grudge/bad experience, or just good ol' gaming/desktop team red vs green vs blue nonsense.
Many things can be said about Nvidia and all kinds of things can be debated but suggesting that Nvidia has > 90% market share simply and solely because people drink Nvidia kool-aid is a wild take.
Isn't that what HIPIFY does? https://github.com/ROCm/HIPIFY
https://rocm.docs.amd.com/projects/HIP/en/latest/user_guide/...
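Roughly, yes: hipify-perl is largely a mechanical source-to-source rename of CUDA API calls to their HIP equivalents (the real tool handles far more, including kernel-launch syntax). A toy Python illustration of that core idea, with a small hand-picked subset of the real name mapping:

```python
# Toy illustration of the HIPIFY idea: regex-based renaming of CUDA
# runtime API identifiers to their HIP counterparts. The mapping below
# is a tiny subset; real HIPIFY covers the whole API surface.
import re

CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Longest names first so cudaMemcpyHostToDevice wins over cudaMemcpy.
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(CUDA_TO_HIP, key=len, reverse=True)) + r")\b"
)

def hipify(source: str) -> str:
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)

print(hipify("cudaMalloc(&buf, n); cudaMemcpy(buf, src, n, cudaMemcpyHostToDevice);"))
# -> hipMalloc(&buf, n); hipMemcpy(buf, src, n, hipMemcpyHostToDevice);
```

Because HIP's API deliberately mirrors CUDA's one-to-one for most calls, this mostly-textual translation gets surprisingly far; the hard part is everything that isn't a simple rename.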
Many microcontroller companies have terrible software support: no free C/C++ compilers, clunky IDEs, too much reliance on 3rd party software providers, no decent code libraries...
Even if they have software support, the code is bad and bloated. Look at ST's HAL libraries, for example. Thankfully, an open source or free tool often comes to the rescue, usually through the efforts of dedicated individual programmers. But billion-dollar companies relying on such 3rd party tooling seems insane to me.
AMD recently got rid of one of the CUDA compatibility layers instead of extending it.
And they need to release high-RAM versions of their next gaming GPUs. More than anything else that will incentivize people to switch. If they're selling 36 GB while Nvidia is still selling 24 GB, people will do what it takes to move over.
This takes a ton of employees which is hard for a company with a fraction of the software employees of Nvidia. (On that note there's 1185 engineering job postings on the AMD site right now... https://careers.amd.com/careers-home/jobs?categories=Enginee...)
"They" (being AMD) didn't. The person they contracted put in a clause that allowed him to open source the work (years AFTER) AMD stopped paying him.
- Abandoning ZLUDA was maybe not the best choice
- Not accepting that software is just as important as hardware is a mistake
- Putting more VRAM into their cards would attract more people
- Fixing hardware issues (especially the restarts on every failure) should be a high priority
Chip War has a great section on how the Soviet Union tried a “just copy/steal” strategy in semiconductors and fell hopelessly behind because of it. It’s a great theoretical idea to just copy/steal and fast-follow, but semiconductors, AI, and other “harder technologies” require building human and intellectual capital that will get better with time. From there, you need to have the prior generation to keep up with ever-increasing complexity and difficulty as these things get more advanced.
I disagree with your section on Huawei and China. China isn't just trying to copy/steal AI. In terms of models, China is a bit behind in LLMs but arguably ahead in self-driving cars. China is throwing everything at semiconductor manufacturing instead, because that's where their bottleneck truly is - not CUDA. Had Huawei had access to TSMC's 5nm and 3nm, they might already be equal to Nvidia in raw GPU prowess. After all, HiSilicon's Kirin already matched/exceeded Qualcomm before the Trump ban, and their 5G chips/implementation were well ahead of anyone else. In software, it's easier for China to adopt a CUDA alternative because China is usually really good at unifying under one vision - especially when it has to.