HipKittens: Fast and furious AMD kernels

(hazyresearch.stanford.edu)

244 points · dataminer · 6mo ago · 91 comments

georgehotz 6mo ago

Full disclosure, we have a contract with AMD to get Llama 405B training on MI350X on MLPerf.

Things are turning around for AMD. If you have an AMD card, go to pytorch.org, click Linux+ROCm and install PyTorch. 3 years ago, this was hopeless. Today, most mainline things work. I ran nanochat on MI300X and it just worked. I think that's true about MI350X now too. The MI350X machine is stable.

They are clearly behind NVIDIA, nobody doubts that. And a lot of investment into software will be required to catch up: ecosystem, compiler, and driver. But 2 years ago they seemed hopeless, now they don't. Things take time. HipKittens is a great codebase to study to see where AMD's LLVM backend is still lacking; compare it to the CUDA Kittens.

For training, it's NVIDIA and Google in first. AMD in second. And nobody in third. Intel and Tenstorrent are not remotely close. Huawei examples segfaulted. Groq gave up selling chips. Cerebras isn't available anywhere. Trainium had a 5 day wait time to get one instance and I lost interest.

latchkey 6mo ago

As CEO of an AMD NeoCloud for the past 2 years, it is so nice to hear all this and also see the turnaround. It is what I bet my business on from the start and I can concur with what George is saying 100%.

The out of box experience can be a bit rough around the edges on bleeding edge stuff, but it isn't anywhere near as bad as it used to be. For example, a month ago nanochat wasn't working well and now it is. The important thing is that people now care enough to make it work.

At the end of the day, AI does need viable options. Having a monopoly on all AI hardware and software might be a good thing for shareholders, but isn't a good thing for what is looking like a fundamental technology, akin to the internet.

ivape 6mo ago

That’s interesting, I was specifically looking for AMD hardware being offered by neoclouds, they seem to be rare.

I like your bet though. There hasn't really been a difference between NVDA and AMD at the hardware level for decades. AMD has always been on par, and software is software, it will catch up.

AMD will be a stock many people will miss because the opportunity has presented itself at the height of AI bubble talk, and this will leave many in the dust. Doubling and tripling of their market cap is pretty much a foregone conclusion.

latchkey 6mo ago

You're right, it is a much smaller ecosystem, but I think that is partly intentional as a way to focus efforts and not feed into the bubble, which I feel is a smart move. These are the official partners [0]. I'm Hot Aisle.

George was very smart, $500k in the $90s. I saw it coming even earlier than him, but that's because I was already aware the hardware was good from my own experiences.

[0] https://www.amd.com/en/products/accelerators/instinct/eval-r...

LogicFailsMe 6mo ago

Will it catch up or will it forever chase Nvidia's tail? I'm betting on the latter unless another AI winter happens. And contrary to anti-generative AI social media talking points, the literature suggests the Red Queen's race is continuing apace IMO.

Nvidia remains undefeated at responding to hardware threats with hardware diving catches to this day. What scenario prevents them from yet another one of their diving catches? I'm genuinely curious as to how one could pull that off. It's like challenging Google in search: even if you deliver a better product, and some have, the next thing you know Google is doing the same thing or better with deeper pockets.

WithinReason 6mo ago

How far is Tinygrad from being able to represent/search the kind of optimisations listed in the article? i.e.:

  1. data layouts to avoid local memory bank conflicts
  2. read patterns from global memory to optimize L2 cache reuse
  3. warp specialisation

How complex is it to add these into tinygrad?
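
For readers unfamiliar with item 1, here is a minimal sketch of why data layout matters for local-memory bank conflicts. It assumes 32 banks that are each one word wide, which is an illustrative figure and not a claim about any particular AMD or NVIDIA part. `worst_conflict` is a hypothetical helper, not a tinygrad or HipKittens API.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, one word wide each


def worst_conflict(row_stride, num_threads=32):
    """Thread t reads word (t, 0) of a row-major tile, i.e. a column access.

    Returns how many threads land on the busiest bank. 1 means the access
    is conflict-free; N means the access is serialized N ways.
    """
    banks = Counter((t * row_stride) % NUM_BANKS for t in range(num_threads))
    return max(banks.values())


# A 32-word row stride puts every thread on the same bank.
unpadded = worst_conflict(row_stride=32)
# Padding each row by one word spreads the same column over all banks.
padded = worst_conflict(row_stride=33)

print("unpadded:", unpadded, "padded:", padded)
```

Padding is only one possible fix; swizzled layouts reach the same goal without spending extra local memory.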

georgehotz 6mo ago

1 and 2 are supported: 1 you need to specify, 2 will be found with BEAM. We are working on reimplementing HipKittens in tinygrad, all the stuff is there to do it. See the amd_uop_matmul example.

tinygrad doesn't support 3 yet, it's not needed on any AMD GPUs, and not needed on NVIDIA consumer. It wouldn't be hard to add, but it's important to figure out how it best fits with the existing abstractions. I think everything will eventually move to a more producer-consumer model.
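
The L2 reuse point (item 2) can be illustrated with a toy model. This is a rough sketch and not how tinygrad's BEAM search or HipKittens actually work: it treats the L2 as a small LRU cache of input panels and counts misses for two orders of visiting the output tiles of a matmul. The 8x8 tile grid, the cache capacity of 6 panels, and all function names are assumptions chosen for illustration.

```python
from collections import OrderedDict


def count_misses(tile_order, capacity):
    """Count LRU misses when output tile (i, j) reads A panel i and B panel j."""
    cache = OrderedDict()
    misses = 0
    for i, j in tile_order:
        for panel in (("A", i), ("B", j)):
            if panel in cache:
                cache.move_to_end(panel)  # mark as most recently used
            else:
                misses += 1
                cache[panel] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return misses


def row_major(m, n):
    """Visit output tiles one full row at a time."""
    return [(i, j) for i in range(m) for j in range(n)]


def grouped(m, n, group):
    """Visit tiles in bands of `group` rows, walking columns within a band."""
    order = []
    for i0 in range(0, m, group):
        for j in range(n):
            for i in range(i0, min(i0 + group, m)):
                order.append((i, j))
    return order


M = N = 8
CAPACITY = 6  # assumption: the cache holds 6 panels
misses_row_major = count_misses(row_major(M, N), CAPACITY)
misses_grouped = count_misses(grouped(M, N, group=4), CAPACITY)
print("row-major misses:", misses_row_major)
print("grouped misses:  ", misses_grouped)
```

Both orders do the same work; only the visiting order changes, which is the kind of choice a schedule search can explore.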

0-_-0 6mo ago

Good luck with the AMD contract! I imagine HipKittens came at just the right time.

fulafel 6mo ago

Does consumer hardware (non-MI) need proprietary kernel drivers for running rocm + pytorch?

kieranl 6mo ago

No. But you might need a specific version of rocm built for your gpu. These are built on https://github.com/ROCm/TheRock

Right now AI support on AMD is officially only on specific models. But they are working hard to turn this around to have broader support. And making progress.
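
Once a ROCm build of PyTorch is installed, one way to check what you actually got is sketched below. It relies on `torch.version.hip`, which is set on ROCm builds and is `None` otherwise, and on the fact that ROCm builds expose the GPU through the usual `torch.cuda` namespace. `describe_backend` is a hypothetical helper name, and the import is guarded so the snippet also runs on a machine without PyTorch.

```python
def describe_backend():
    """Return a one-line description of the installed PyTorch GPU backend."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # torch.version.hip is a version string on ROCm builds, None otherwise.
    hip_version = getattr(torch.version, "hip", None)
    backend = f"ROCm/HIP {hip_version}" if hip_version else "non-ROCm build"

    # ROCm builds reuse the torch.cuda API for AMD GPUs.
    if not torch.cuda.is_available():
        return f"{backend}, no usable GPU detected"
    return f"{backend}, device 0: {torch.cuda.get_device_name(0)}"


print(describe_backend())
```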

fulafel 6mo ago

Vulkan compute is also getting some good press as a local llm platform (at least on the linux side), will be interesting to see which crosses the line to "can ship production quality apps on this" first.

georgehotz 6mo ago

Nope! Works fine with a somewhat recent in-tree kernel. The AMD driver is actually open source, not just a wrapper into a big on-device blob like the NVIDIA one. tinygrad also has a driver that doesn't even need the kernel module, just mmapping the PCIe BAR into Python.
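
To give a feel for what "mmapping the PCIe BAR into Python" means, here is a minimal sketch. It is not tinygrad's driver. On Linux a BAR is exposed as a file such as `/sys/bus/pci/devices/0000:03:00.0/resource0` (the device address here is hypothetical, and opening it needs root), so the demo below maps an ordinary temp file as a stand-in. `read_reg32` is a made-up helper name.

```python
import mmap
import os
import struct
import tempfile


def read_reg32(path, offset, length=4096):
    """Map `length` bytes of `path` and read one little-endian 32-bit word."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, length, access=mmap.ACCESS_WRITE) as mem:
            return struct.unpack_from("<I", mem, offset)[0]
    finally:
        os.close(fd)


# Stand-in for a BAR: a 4 KiB temp file whose first "register" holds a marker.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<I", 0xDEADBEEF) + bytes(4092))
    fake_bar = f.name

value = read_reg32(fake_bar, offset=0)
os.unlink(fake_bar)
print(hex(value))
```

Against a real device the same call would point at the sysfs resource file, and reads would go to hardware registers, so offsets must come from the hardware documentation.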

buckle8017 6mo ago

> Cerebras isn't available anywhere.

That sounds like they're winning.

bratao 6mo ago

One thing I don't understand about Nvidia’s valuation is that right now a small number of algorithms have 'won,' such as Transformers. The data is very important. In the past, when custom code such as modeling and HPC code was much more common, the ecosystem was very important, and it was almost impossible to reimplement all of CUDA and its related code.

Competitors now only need to optimize for a narrow set of algorithms. If a vendor can run vLLM and Transformers efficiently, a massive market becomes available. Consequently, companies like AMD or Huawei should be able to catch up easily. What, then, is Nvidia’s moat? Is InfiniBand enough?

jillesvangurp 6mo ago

You are right to question their moat. My view on this is that there's a lot of pressure from essentially all other trillion dollar companies (MS, Google, Amazon, Apple, etc.) to not get locked into an Nvidia-only ecosystem. Each of those makes its own chips. They also use Nvidia but not exclusively. An Android or iOS phone has no Nvidia-capable chips whatsoever. Neither do most laptops. Apple's M series CPUs don't support it at all typically. And with the exception of some gaming or workstation class laptops, most Windows/Linux laptops come with either AMD or Intel GPUs. Or lately Qualcomm ARM based architectures with custom GPUs.

Nvidia's valuation and moat are centered around data center class GPUs used for training. I don't think they will effectively have that space to themselves for much longer. Google is already using their own TPUs at scale for both training and inference. They still use some Nvidia stuff but they seem to be able to keep that off the critical path for anything that needs to run at "Google scale". OpenAI just ordered a bunch of AMD hardware. A lot of AI engineers use Apple laptops that rely on the M series hardware.

In short, the CUDA moat is shrinking. It's still relevant of course and there is a lot of tooling and many frameworks that depend on it. That's why everybody still uses it. But not exclusively. And there's a lot of extremely well funded and active development to cut loose from it. AMD of course wants in. So does Intel. And so does everybody else. This HipKittens thing looks like it makes some big steps towards a more neutral software ecosystem.

wmf 6mo ago

Infiniband is being replaced with UEC (and it isn't needed for inference). For inference there is no moat and smart players are buying/renting AMD or Google TPUs.

mandelken 6mo ago

I didn't know you can buy Google TPUs now?

amypetrik8 6mo ago

What you really want to buy is a Ming Mecca chip. Original model came out around 2003, but they've been iterating. These things are bigger than AMD or Nvidia silicon, actually even much larger than a gigantic Cerebras wafer, typically 500-900 million USD in price. As you could guess, Ming Mecca is not broadly publicized, historically used for NSA crypto cracking although now adapted to AI and used for data crunching from gathered messages. More recently all those gathered messages have been used for training strategic/tactical intelligence developments to oversee and deploy resources optimally via a cluster of, at least last I heard, 18 Ming Mecca v7 chips.

mattlondon 6mo ago

You can pay to use them https://cloud.google.com/tpu

knowitnone3 6mo ago

You can buy older less capable TPUs https://www.seeedstudio.com/Coral-USB-Accelerator-p-2899.htm...

patagurbon 6mo ago

Do you have evidence for this? I don’t think Nvidia is switching to Ultra Ethernet, just adding it to the product line-up

wmf 6mo ago

Sorry, I don't mean Nvidia is adopting UEC (they probably hate it). I should have said UEC can substitute for Infiniband.

LtdJorge 6mo ago

The vast number of CUDA libraries for anything you can think of. I think that’s where they have the biggest leverage.

observationist 6mo ago

AI is going to be so ubiquitous, something principled and open is going to supersede CUDA at some point, as HTML5 did for Flash. CUDA isn't like an x86 vs ARM situation where they can use hardware dominance for decades, it's a higher level language, and being compatible with a wide range of systems benefits NVIDIA and their competitors. They're riding out their relative superiority for now, but we're going to see a standards and interoperability correction sometime soon, imo. NVIDIA will drive it, and it will gain them a few more years of dominance, but afaik nothing in their hardware IP means CUDA compatibility sacrifices performance or efficiency. They're also going to want to compete in the Chinese market, so being flexible about interoperability with their systems gains them a bit of market access that might otherwise be lost.

There's a ton of pressure on the market to decouple NVIDIA's proprietary software from literally everything important to AI, and they will either gracefully transition and control it, or it will reach a breaking point and someone else will do it for (and to) them. I'm sure they've got finance nerds and quants informing and minmaxing their strategy, so they probably know to the quarter when they'll pivot and launch their FOSS, industry leading standards narrative (or whatever the strategy is.)

bigyabai 6mo ago

> but we're going to see a standards and interoperability correction sometime soon, imo.

I thought this too, in 2015. OpenCL looked really promising, but Apple bailed and neither AMD nor Intel had the funding to keep up with Nvidia's research. It sorta floundered, even though Nvidia GPUs smugly ran OpenCL code with benchmark-leading performance.

Nvidia won the datacenter because of hardware. You could release a perfect CUDA-to-Vulkan translator tomorrow, and they still wouldn't be dethroned until better hardware replaced it. Intel is swirling the drain, Qualcomm is hedging their bets on mobile, AMD is (still) too underfunded - Apple is the only company with the design chops and TSMC inroads to be a serious threat, and they can't release a datacenter product to save their life. It's understandable why people think Nvidia is a monopoly, Team Green is pulling a full-on "Luigi wins by doing nothing" in 2025: https://knowyourmeme.com/memes/luigi-wins-by-doing-absolutel...

The market has almost no pressure to decouple from Nvidia - nobody else has mature solutions. It requires a preestablished player to make a similarly risky play, which might rule out everyone who's sitting at the table.

toasterlovin 6mo ago

> as HTML5 did for Flash

Uh, Flash died because Apple refused to support it on mobile Safari. Perhaps Flash would have died anyway, but that is the proximate cause. And Apple's competitors were falling over themselves to market Flash support as a competitive advantage vs. iPhone.

bryanlarsen 6mo ago

To rephrase the OP's point: transformers et al. are worth trillions. All the other CUDA uses are worth tens or hundreds of billions. They've totally got that locked up, but researchers are a smaller market than video games.

ivape 6mo ago

I don’t think NVDA will have anything like a real moat; it will be more like whatever the difference was between iOS and Android. The gist of it is, the big bang of AI has happened and that universe is rapidly expanding, just like it once did for smartphones. There is the Apple of AI which is NVDA, and then there is Android (AMD). Moats are irrelevant here because the universe has just started rapidly expanding for them.

Apple didn’t really “win” out against Android, and it would be a very wrong way of measuring what actually happened. Yet, Apple could have been seen as more premium during various points of that timeline. The truth of the matter was, it was never a swimming race at any point in that smartphone timeline. It was simply a flood that you could convince yourself was an orderly race.

I believe the same is happening now, and it’s in Nvidia's interest to maintain the narrative that there is a race and they are winning it. Believing something like this during the smartphone era would have been foolish.

ACCount37 6mo ago

By far the easiest way to implement that "small number of algorithms" is with universal number-grinding hardware. Which also protects you against any architectural developments. Hardware takes a damn long time to make.

mountainriver 6mo ago

Transformers aren’t really one thing, the way they are implemented is wildly different. If they weren’t, then vLLM and TRL would be easy.

ehnto 6mo ago

They also don't actually have a moat in the sense that they have patented technology keeping others out of the game. The other chip makers are coming for their lunch eventually.

ekropotin 6mo ago

It’s all about the deeply entrenched ecosystem NVIDIA has been building around CUDA for decades. It’d be super hard to replicate this hardware-software platform.

Plus strategic partnerships with cloud providers.

And InfiniBand, yes.

vagab0nd 6mo ago

If your competitor has a 5-year lead, and is working as hard as you are, or harder, then you are not gonna catch up any time soon. Also yes networking.

dwheeler 6mo ago

That's only true if future improvements are as easy to create as past ones, customers care as much about those improvements, and there are no other differentiators.

For example, many companies do well by selling a less capable but more affordable and available product.

o11c 6mo ago

The thing the "just optimize AI" crowd misses is that this isn't like optimizing a programming language implementation, where even the worst implementation is likely only 100x slower than a good implementation.

AI is millions of times slower than optimal algorithms for most things.

wewewedxfgdf 6mo ago

You'd think AMD would swing in on something like this and fund it with the money needed to succeed. I have no knowledge of it but my guess is no, AMD never misses an opportunity to miss an opportunity - when it comes to GPUs and AI.

AMDAnon 6mo ago

AMD pays the bare minimum in software to get a product out the door. The company does not even have working performance testing and regressions routinely get shipped to customers. Benchmarks the executives see are ad hoc and not meaningful.

HipKittens is an improvement but AMD does not have the ability to understand or track kernel performance so it'll be ignored.

This isn't fixable overnight. Company-wide DevOps and infrastructure is outsourced to TCS in India who have no idea what they're doing. Teams with good leadership maintain their own shadow IT teams. ROCm didn't have such a team until hyperscalers lost their shit over our visibly poor development practices.

Even if AMD did extend an offer to hire all the people in the article, it would be below-market since the company benchmarks against Qualcomm, Broadcom, and Walmart, instead of Google, Nvidia, or Meta.

We haven't had a fully funded bonus in the past 4+ years.

schainks 6mo ago

> We haven't had a fully funded bonus in the past 4+ years.

This is WILD to hear considering how well it appears AMD is executing from the outside.

AMDAnon 6mo ago

> considering how well it appears AMD is executing from the outside.

The party line is that the stock price is up because the market expects us to perform well in the future, and we won't get a bonus until we actually perform well.

BNE 6mo ago

> Teams with good leadership maintain their own shadow IT teams.

Yes, this is true. Painfully true.

JonChesterfield 6mo ago

This doesn't sound right. I definitely got yelled at over trivial performance regressions which looked like noise so people were measuring performance.

They've paid serious amounts in RSUs over the last six years. Not top of market by any stretch but firmly in the category of engineers don't care what the steak costs. Bonus might be team dependent, I remember being annoyed and nicely surprised by it in different years.

The aql profiler confuses me quite a lot but it's definitely a tool for measuring performance.

slavik81 6mo ago

I don't think anon is correct, but I can understand how they'd come to their conclusions. I certainly didn't choose AMD to maximize my pay, though it's always been a comfortable salary.

With regards to performance, there are some things tracked carefully and other things that are not tracked at all. I suspect that is why some folks think we're really good at it and others think we're terrible. There's lots of room for improvement, though. Excitement over trivial performance regressions is more a sign of immaturity than of good tracking.

AMDAnon 6mo ago

> I definitely got yelled at over trivial performance regressions which looked like noise so people were measuring performance.

It depends on team, we have some testing, and progress is being made. But it's not "working" or comprehensive as we get complaints from our big customers. We should be replicating their setup internally and not have them catch problems.

> Not top of market by any stretch but firmly in the category of engineers don't care what the steak costs.

We need to pay top of market to steal people from our competitors. We can't pay less than Nvidia and outcompete them. Paying less is a signal we're aiming for second and to copy the market leader.

observationist 6mo ago

The MBAs are in charge, and now AMD is the new Intel?

It's not only not fixable overnight, but it's not fixable at all if the leadership thinks they can coast on simply being not as bad as Intel, and Intel has a helluva lot of inertia and ability to simply sell OEM units on autopilot.

Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.

AMDAnon 6mo ago

The MBAs have always been in charge to an extent.

But the real issue is we don't want to invest in beating Nvidia on quality. Otherwise we wouldn't be doing stock buybacks and instead use the money on poaching engineers.

The mindset is that we maintain a comfortable second place by creating a shittier but cheaper product. That is how AMD has operated since 1969 as a second source to Fairchild Semiconductor and Intel. It's going to remain the strategy of the company indefinitely with Nvidia. Attempting to become better would cost too much.

> Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.

Knocking out Lisa Su would be stupid, since she has the loyalty of the whole company and is generally competent.

What they should do is bump TC by 60-70% and simultaneously lay off 50% of the engineers. Or phase in the same over a longer period of time. The company is full of people that do nothing because we've paid under market for so long. That's fine when competing against Intel, it's not acceptable when competing against Microsoft, Amazon, OpenAI, Google, and Nvidia.

Lisa Su is the only CEO in the S&P500 who can get away with mass layoffs and still have the loyalty of the rest of the employees.

FuckButtons 6mo ago

Madness. I see the accountants are in charge then.

0manrho 6mo ago

> AMD never misses an opportunity to miss an opportunity

Well said, their Instinct parts are actually, at a hardware level, very very capable pieces of kit that - ignoring software/dev ecosystem - are very competitive with NVidia.

Problem is, AMD has a terrible history of supporting its hardware (either just outright lack of support, cough Radeon VII; or constantly scrapping things and starting over and thus the ecosystem never matured) and is at a massive deficit behind the CUDA ecosystem, meaning that a lot of that hardware's potential is squandered by the lack of compatibility with CUDA and/or a lack of investment in a comparable alternative. Those factors have given NVidia the momentum it has, because most orgs/devs will look at the support/ecosystem delta and ask themselves why they'd expend the resources reinventing the CUDA wheel to leverage AMD hardware when they can just spend that money/time investing in CUDA and NVidia instead.

To their credit, AMD seems to have learned its lesson: they're actually trying to invest in ROCm and their Instinct ecosystem and seem to be sticking to their guns on it. We're starting to see people pick it up, but they're still far behind Nvidia and CUDA.

One key area that Nvidia is far ahead of AMD on in the hardware space is networking.

AMDAnon 6mo ago

> constantly scrapping things and starting over and thus the ecosystem never matured

AMD hires talented people at below-market and doesn't promote them or give raises. This causes employees to aim at resume-driven development by reinventing the wheel so they can get a job somewhere else.

It's a similar problem to Google, except at Google it's because promotions are explicitly for people that ship new products.

BNE 6mo ago

Our hardware is arguably better (spec for spec) apart from critical areas like memory bandwidth, and GPU to GPU bandwidth. You can tweak your implementations to get the same if not better performance. We do that, we see this, our customers see this.

ROCm pre-Rock suffers from the ossification in the engineering organization. The Rock seeks to completely change that, and the team driving it is amazing. Try out the pre-alpha installer. It is already better than the default installer.

There is hope.

0manrho 6mo ago

> There is hope.

Indeed. For clarity, I agree the performance is certainly there. My comment about being behind was in the context of marketshare and ecosystem maturity compared to CUDA. In fact, I'd say there's more than just hope but actual meaningful progress and commitment being made there, and I'm happy to see it.

ivape 6mo ago

I wouldn’t even look at it like they are learning their lesson. The total addressable market is 1T according to them, and they are usually very conservative with their approach and projections. They will solve the software issue because there is simply too much money in it.

elteto 6mo ago

From the performance comparison table, basically AMD could be NVIDIA right now, but they aren’t because… software?

That’s a complete institutional and leadership failure.

Ironically, building chips is the actual _hard_ part. The software and the compilers are not trivial but the iteration speed is almost infinite by comparison.

It goes to show that some companies just don’t “get” software. Not even AMD!

bryanlarsen 6mo ago

CUDA was started in 2004. AMD was basically broke until they hit a home run with Ryzen in 2017.

p_l 6mo ago

Funnily enough AMD was actually the first with GPGPU... they just floundered and managed to start 3 or more completely new software stacks for it, while Nvidia not only kept CUDA as one backward-compatible stack, but also made it work from the cheapest NVS card to high end parts.

wmobit 6mo ago

I'd go so far as to say it's the exact opposite. It's faster and easier to change the hardware than the software.

elteto 6mo ago

Counterproof: attempt to modify your graphics card. Then attempt to modify a piece of code. Which one was easier?

suprjami 6mo ago

AMD have had people contribute optimised ROCm kernels in the past. They closed the PR without merge. ROCm are not interested in this. Baffling behaviour.

wmf 6mo ago

It is now funded and working.

LtdJorge 6mo ago

First rule of AMD stock is nobody understands AMD stock. I guess it’s also the same for AMD’s software endeavors.

LtdJorge 6mo ago

Ahh, composable-kernel. The highest offender in the list of software that has produced unrecoverable OOMs in my Gentoo system (it’s actually Clang while compiling CK, which uses upwards of 2.5GB per thread).

slavik81 6mo ago

I was recently reviewing a CK package for Debian. My test build crashed due to OOM using -j32 on a 64GB workstation, so I tried with -j1 to be safe. That completed successfully after 190 hours!

I think I may need to reduce the number of architectures it's built for to successfully compile it on the official Debian buildd infrastructure, but my (unverified) understanding is that most of its reverse dependencies only need the header-only parts of the library anyway.

I'm told they're working on improving the build times via a few different methods.

LtdJorge 6mo ago

Same, -j32 with 64GB on a 3950x. I use 50% of ZRAM, but it’s still not enough most of the time, so I had to make a config called less-threads that only uses 24, with ZRAM enabled.

I also use OOMD, but I have to work on separating my systemd units better, OOMD has killed my greetd session before, and with that my entire tree of userland processes :D
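
The numbers in this subthread line up with some back-of-the-envelope arithmetic. A small sketch, taking the 2.5GB-per-compile-job figure quoted above as the only input (actual peak usage varies per file), with `max_parallel_jobs` as a hypothetical helper:

```python
def max_parallel_jobs(total_ram_gb, per_job_gb, reserve_gb=0.0):
    """Largest -j value whose worst-case memory use fits in RAM.

    reserve_gb is memory held back for the desktop and other processes.
    """
    usable = total_ram_gb - reserve_gb
    return max(1, int(usable // per_job_gb))


# 64GB machine, roughly 2.5GB per clang job as reported in the thread.
print(max_parallel_jobs(64, 2.5))                # no headroom
print(max_parallel_jobs(64, 2.5, reserve_gb=4))  # 4GB held back (assumed)
print(32 * 2.5, "GB needed for -j32")            # more than 64GB
```

By this estimate -j32 would need about 80GB at peak, which is consistent with both reported OOMs, and a job count in the mid-20s is about the limit for 64GB.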

nalllar 6mo ago

Spending >10 minutes doing template instantiation for a single kernel for a single ISA is impressive!

`device_grouped_conv2d_fwd_xdl_ngchw_gkcyx_ngkhw_f16_instance`, what are you doing to our poor friend clang?

LtdJorge 6mo ago

And they say Rust is slow!

semessier 6mo ago

Without having implemented inference, and just looking at it from a math perspective, this is basic linear algebra/BLAS. I am very much wondering what a lean, inference-optimized API covering 80% of all use cases across dtypes and sparsity would look like. Probably a far cry from what's in CUDA and probably all that's needed for practical inference.

999900000999 6mo ago

With these new developments, are there any implications for getting LLMs running well on consumer AMD chips?

For example, the following laptop which I'm thinking of picking up, has both a strong AMD CPU/IGPU and a RTX 5080. Could we see the AMD side competing with the RTX?

I know a dedicated gpu will always be faster though.

>HP OMEN MAX 16-ak0003nr 16" Gaming Laptop Computer - Shadow Black Aluminum AMD Ryzen AI 9 HX 375 (2.0GHz) Processor; NVIDIA GeForce RTX 5080 16GB GDDR7; 32GB DDR5-5600 RAM; 1TB Solid State Drive

ehnto 6mo ago

I run Qwen3 Coder 30b through Ollama on an RX 7900 XTX. It works great; I suspect some load gets passed to the 32GB system memory and Ryzen 7 CPU.

It's not quite as fast as like Sonnet 4 from an API, but it's really not that bad.

It's really great for quick questions so I don't have to google stuff, and it's probably Sonnet4 level of competency at achieving coding tasks.

No API served model has been fast enough to remove the urge to do something else while waiting for bigger tasks, so the UX is more or less the same in that regard.

Opencode + ollama + Qwen3 Coder has been a very reasonable alternative to ClaudeCode with Sonnet4.

That is amazing for something running locally.

It is possible that if you actually need AI to be doing all your coding, you're going to feel differently about the setup. But as a small assistant it's great.

christkv 6mo ago

That's great. I have been eyeing a Strix Halo and was wondering how well smaller models are doing. This is great news from the perspective of running local agents.

JonChesterfield 6mo ago

I got one of those running whisper yesterday, hopeful the bigger llms will run shortly. You'd need rocm 7 which seems to be much better than 6.4 was.

electroglyph 6mo ago

not the best model to use as a showcase, it's blistering fast on anything that isn't a toaster

ehnto 6mo ago

Great! That's what I am pointing out, it's a 30b param model that fits into an AMD card and runs great. That's what we want.

fulafel 6mo ago

You might think that a dGPU is always faster but the limited memory capacity bites you there (unless you go to datacenter dGPUs that cost tens of thousands). Look at e.g. https://www.ywian.com/blog/amd-ryzen-ai-max-plus-395-native-... or the various high end Mac results.
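
The memory-capacity point can be made concrete with rough arithmetic. This sketch only counts model weights plus a flat overhead allowance; the 4-bit quantization and the 2GB overhead are assumptions for illustration (the thread does not say which quantization was used), and real usage also depends on context length and runtime. Both function names are hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion, bits_per_weight, vram_gb, overhead_gb=2.0):
    """True if weights plus an assumed flat overhead fit in VRAM."""
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


# A 30B-parameter model at an assumed 4 bits per weight.
print(weight_memory_gb(30, 4), "GB of weights")
print("fits in 24GB:", fits(30, 4, vram_gb=24))
print("fits in 16GB:", fits(30, 4, vram_gb=16))
print("fits in 24GB at 16-bit:", fits(30, 16, vram_gb=24))
```

Under these assumptions the model fits on a 24GB card but not on a 16GB one, which is where spilling to system memory or a large unified-memory part comes in.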

999900000999 6mo ago

So I want this Thinkpad.

https://www.lenovo.com/us/en/p/laptops/thinkpad/thinkpadp/th...?

AMD Ryzen™ AI 9 HX PRO 370 Processor (2.00 GHz up to 5.10 GHz) Operating System Windows 11 Pro 64 Graphic Card Integrated AMD Radeon™ 890M Memory 64 GB DDR5-5600MT/s (SODIMM)(2 x 32 GB)

But I also seriously want to run LLMs. My hunch is a gaming laptop is the best way to do this on the go without spending $5000 for a Thinkpad with a high end graphics card.

JonChesterfield 6mo ago

Anyone know whether there are things built on https://github.com/HazyResearch/ThunderKittens?

I think this is a port of that to HIP, where generally ports of cuda things to hip are of vague professional interest, but much more so if the library is used by other things.

jiehong 6mo ago

> what is raw assembly? can't understand it? that's the point!

Raw assembly vs cooked assembly?

Also, I think this attitude wasn’t the most common on CPUs, and people used to write assembly by hand just fine (and sometimes some still do). I think we shouldn’t be afraid of assembly like that.

Compilers could write that assembly in the end, just like they do for CPUs!

yunnpp 6mo ago

Yeah, comments like these really make you question the authors' background in optimization. Never mind that AMD actually publishes ISA specs for all of their graphics IPs -- it is not their point that you don't understand it -- what's holding GPU programming back is often that the underlying assembly primitives are not exposed in the high level languages.

I also do wonder what 'raw assembly' is supposed to be. Is it like sushi? Perhaps it is left as future work in the paper for the authors to answer.

villgax 6mo ago

Totally ignored B300 for some reason

nextworddev 6mo ago

Long $amd?

j / k navigate · click thread line to collapse

91 comments

georgehotz6mo ago

Full disclosure, we have a contract with AMD to get Llama 405B training on MI350X on MLPerf.

latchkey6mo ago

ivape6mo ago

That’s interesting, I was specifically looking for AMD hardware being offered by neoclouds, they seem to be rare.

I like your bet though. The difference between NVDA and AMD has never really existed on a hardware level for decades. AMD has always been on par, and software is software, it will catch up.

latchkey6mo ago

George was very smart, $500k in the $90's. I saw it coming even earlier than him, but that's cause I was already aware the hardware was good from my own experiences.

[0] https://www.amd.com/en/products/accelerators/instinct/eval-r...

LogicFailsMe6mo ago

1 more reply

WithinReason6mo ago

How far is Tinygrad from being able to represent/search the kind of optimisations listed in the article? i.e.:

  1. data layouts to avoid local memory bank conflicts
  2. read patterns from global memory to optimize L2 cache reuse
  3. warp specialisation

How complex is it to add these into tinygrad?

georgehotz6mo ago

1 and 2 are supported, 1 you need to specify, 2 will be found with BEAM. We are working on reimplementing HipKittens in tinygrad, all the stuff is there to do it. See the amd_uop_matmul example.

0-_-06mo ago

Good luck with the AMD contract! I imagine HipKittens came at just the right time.

fulafel6mo ago

Does consumer hardware (non-MI) need proprietary kernel drivers for running rocm + pytorch?

kieranl6mo ago

No. But you might need a specific version of rocm built for your gpu. These are built on https://github.com/ROCm/TheRock

Right now AI support on AMD is officially only on specific models. But they are working hard to turn this around to have broader support. And making progress.
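A quick way to see which backend an installed PyTorch build targets is sketched below. It relies on `torch.version.hip`, which is a version string on ROCm builds and `None` otherwise; the helper name `describe_backend` is hypothetical, and whether a particular consumer GPU is actually usable still depends on the ROCm build installed.

```python
def describe_backend() -> str:
    """Report which GPU backend the installed PyTorch build targets, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    hip = getattr(torch.version, "hip", None)    # set on ROCm builds only
    cuda = getattr(torch.version, "cuda", None)  # set on CUDA builds only
    if hip:
        backend = f"ROCm/HIP {hip}"
    elif cuda:
        backend = f"CUDA {cuda}"
    else:
        backend = "CPU only"
    # ROCm builds reuse the torch.cuda namespace, so this covers AMD GPUs too.
    visible = torch.cuda.is_available()
    return f"{backend}; GPU visible: {visible}"


print(describe_backend())
```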

fulafel6mo ago

georgehotz6mo ago

buckle80176mo ago

> Cerebras isn't available anywhere.

That sounds like they're winning.

bratao6mo ago

jillesvangurp6mo ago

wmf6mo ago

Infiniband is being replaced with UEC (and it isn't needed for inference). For inference there is no moat and smart players are buying/renting AMD or Google TPUs.

mandelken6mo ago

I didn't know you can buy Google TPUs now?

amypetrik86mo ago

mattlondon6mo ago

You can pay to use them https://cloud.google.com/tpu

knowitnone36mo ago

You can buy older less capable TPUs https://www.seeedstudio.com/Coral-USB-Accelerator-p-2899.htm...

patagurbon6mo ago

Do you have evidence for this? I don’t think Nvidia is switching to Ultra Ethernet, just adding it to the product line-up

wmf6mo ago

Sorry, I don't mean Nvidia is adopting UEC (they probably hate it). I should have said UEC can substitute for Infiniband.

LtdJorge6mo ago

The vast amount of CUDA libraries for anything you can think of. I think that’s where they have the biggest leverage.

observationist6mo ago

bigyabai6mo ago

> but we're going to see a standards and interoperability correction sometime soon, imo.

toasterlovin6mo ago

> as HTML5 did for Flash

bryanlarsen6mo ago

ivape6mo ago

ACCount376mo ago

mountainriver6mo ago

Transformers aren’t really one thing, the way they are implemented is wildly different. If it wasn’t then vllm and TRL would be easy

ehnto6mo ago

They also don't actually have a moat in the sense that they have patented technology keeping others out of the game. The other chip makers are coming for their lunch eventually.

ekropotin6mo ago

It’s all about the deeply entrenched ecosystem NVIDIA has been building around CUDA for decades. It’d be super hard to replicate this hardware-software platform.

Plus strategic partnerships with cloud providers.

And InfiniBand, yes

vagab0nd6mo ago

If your competitor has a 5-year lead, and is working as hard as you are, or harder, then you are not gonna catch up any time soon. Also yes networking.

dwheeler6mo ago

That's only true if future improvements are as easy to create as past ones, customers care as much about those improvements, and there are no other differentiators.

For example, many companies do well by selling a less capable but more affordable and available product.

o11c6mo ago

AI is millions of times slower than optimal algorithms for most things.

wewewedxfgdf6mo ago

AMDAnon6mo ago

HipKittens is an improvement but AMD does not have the ability to understand or track kernel performance so it'll be ignored.

We haven't had a fully funded bonus in the past 4+ years.

schainks6mo ago

> We haven't had a fully funded bonus in the past 4+ years.

This is WILD to hear considering how well it appears AMD is executing from the outside.

AMDAnon6mo ago

> considering how well it appears AMD is executing from the outside.

The party line is that the stock price is up because the market expects us to perform well in the future, and we won't get a bonus until we actually perform well.

BNE6mo ago

> Teams with good leadership maintain their own shadow IT teams.

Yes, this is true. Painfully true.

JonChesterfield6mo ago

This doesn't sound right. I definitely got yelled at over trivial performance regressions which looked like noise so people were measuring performance.

The aql profiler confuses me quite a lot but it's definitely a tool for measuring performance.

slavik816mo ago

I don't think anon is correct, but I can understand how they'd come to their conclusions. I certainly didn't choose AMD to maximize my pay, though it's always been a comfortable salary.

AMDAnon6mo ago

> I definitely got yelled at over trivial performance regressions which looked like noise so people were measuring performance.

> Not top of market by any stretch but firmly in the category of engineers don't care what the steak costs.

We need to pay top of market to steal people from our competitors. We can't pay less than Nvidia and outcompete them. Paying less is a signal we're aiming for second and to copy the market leader.

observationist6mo ago

The MBAs are in charge, and now AMD is the new Intel?

Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.

AMDAnon6mo ago

The MBAs have always been in charge to an extent.

But the real issue is we don't want to invest in beating Nvidia on quality. Otherwise we wouldn't be doing stock buybacks and instead use the money on poaching engineers.

> Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.

Knocking out Lisa Su would be stupid, since she has the loyalty of the whole company and is generally competent.

Lisa Su is the only CEO in the S&P500 who can get away with mass layoffs and still have the loyalty of the rest of the employees.

FuckButtons6mo ago

Madness. I see the accountants are in charge then.

0manrho6mo ago

> AMD never misses an opportunity to miss an opportunity

Well said, their Instinct parts are actually, at a hardware level, very very capable pieces of kit that - ignoring software/dev ecosystem - are very competitive with NVidia.

One key area that Nvidia is far ahead of AMD on in the hardware space is networking.

AMDAnon6mo ago

> constantly scrapping things and starting over and thus the ecosystem never matured

It's a similar problem to Google, except at Google it's because promotions are explicitly for people that ship new products.

BNE6mo ago

There is hope.

0manrho6mo ago

> There is hope.

ivape6mo ago

elteto6mo ago

From the performance comparison table, basically AMD could be NVIDIA right now, but they aren’t because… software?

That’s a complete institutional and leadership failure.

Ironically, building chips is the actual _hard_ part. The software and the compilers are not trivial but the iteration speed is almost infinite by comparison.

It goes to show that some companies just don’t “get” software. Not even AMD!

bryanlarsen6mo ago

CUDA was started in 2004. AMD was basically broke until they hit a home run with Ryzen in 2017.

p_l6mo ago

wmobit6mo ago

I'd go so far as to say it's the exact opposite. It's faster and easier to change the hardware than the software.

elteto6mo ago

Counterproof: attempt to modify your graphics card. Then attempt to modify a piece of code. Which one was easier?

suprjami6mo ago

AMD have had people contribute optimised ROCm kernels in the past. They closed the PR without merge. ROCm are not interested in this. Baffling behaviour.

wmf6mo ago

It is now funded and working.

LtdJorge6mo ago

First rule of AMD stock is nobody understands AMD stock. I guess it’s also the same for AMD’s software endeavors.

LtdJorge6mo ago

slavik816mo ago

I was recently reviewing a CK package for Debian. My test build crashed due to OOM using -j32 on a 64GB workstation, so I tried with -j1 to be safe. That completed successfully after 190 hours!

I'm told they're working on improving the build times via a few different methods.
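One common workaround for memory-bound builds like this is to cap the job count by available RAM rather than by core count. The sketch below shows that arithmetic; `safe_jobs` is a hypothetical helper, and the 4 GB-per-job figure is an illustrative assumption, not a measured number for CK.

```python
import os
from typing import Optional


def safe_jobs(total_ram_gb: float, gb_per_job: float, cpus: Optional[int] = None) -> int:
    """Pick a parallel job count limited by both cores and memory."""
    cpus = cpus or os.cpu_count() or 1
    by_memory = int(total_ram_gb // gb_per_job)  # jobs that fit in RAM at peak
    return max(1, min(cpus, by_memory))          # always run at least one job


# Assumed peak of 4 GB per compile job on a 64 GB, 32-thread machine.
jobs = safe_jobs(total_ram_gb=64, gb_per_job=4, cpus=32)
print(f"-j{jobs}")
```

The per-job peak has to be measured for the heaviest translation units, since a few template-heavy files can dominate memory use.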

LtdJorge6mo ago

Same, -j32 with 64GB on a 3950x. I use 50% of ZRAM, but it’s still not enough most of the time, so I had to make a config called less-threads that only uses 24, with ZRAM enabled.

I also use OOMD, but I have to work on separating my systemd units better, OOMD has killed my greetd session before, and with that my entire tree of userland processes :D

nalllar6mo ago

Spending >10 minutes doing template instantiation for a single kernel for a single ISA is impressive!

`device_grouped_conv2d_fwd_xdl_ngchw_gkcyx_ngkhw_f16_instance`, what are you doing to our poor friend clang?

LtdJorge6mo ago

And they say Rust is slow!

semessier6mo ago

9999000009996mo ago

With these new developments, are there any implications for getting LLMs running well on consumer AMD chips?

For example, the following laptop, which I'm thinking of picking up, has both a strong AMD CPU/iGPU and an RTX 5080. Could we see the AMD side competing with the RTX?

I know a dedicated gpu will always be faster though.

>HP OMEN MAX 16-ak0003nr 16" Gaming Laptop Computer - Shadow Black Aluminum AMD Ryzen AI 9 HX 375 (2.0GHz) Processor; NVIDIA GeForce RTX 5080 16GB GDDR7; 32GB DDR5-5600 RAM; 1TB Solid State Drive

ehnto6mo ago

I run Qwen3 Coder 30b through Ollama on an RX 7900 XTX. It works great, I suspect some load gets passed to the 32 GB system memory and Ryzen 7 CPU.

It's not quite as fast as like Sonnet 4 from an API, but it's really not that bad.

It's really great for quick questions so I don't have to google stuff, and it's probably Sonnet4 level of competency at achieving coding tasks.

No API served model has been fast enough to remove the urge to do something else while waiting for bigger tasks, so the UX is more or less the same in that regard.

Opencode + ollama + Qwen3 Coder has been a very reasonable alternative to ClaudeCode with Sonnet4.

That is amazing for something running locally.

It is possible that if you actually need AI to be doing all your coding, that you're going to feel differently about the setup. But as a small assistant it's great.
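Whether a 30B-parameter model fits on a single card comes down mostly to bits per weight. The rough estimate below counts weights only; `weight_gib` is a hypothetical helper, the quantization levels are assumptions rather than what Ollama actually ships, and the KV cache and runtime buffers add to the total.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


VRAM_GIB = 24  # memory on an RX 7900 XTX

for bits in (16, 8, 4):
    size = weight_gib(30, bits)
    verdict = "fits" if size < VRAM_GIB else "does not fit"
    print(f"{bits}-bit: {size:.1f} GiB, {verdict} in {VRAM_GIB} GiB")
```

Under these assumptions only the 4-bit variant fits entirely in VRAM, and anything beyond the card's memory would spill to system RAM.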

christkv6mo ago

That's great I have been eyeing a Strix Halo and was wondering how well smaller models are doing. This is great news from the perspective of running local agents.

JonChesterfield6mo ago

I got one of those running whisper yesterday, hopeful the bigger llms will run shortly. You'd need rocm 7 which seems to be much better than 6.4 was.

electroglyph6mo ago

not the best model to use as a showcase, it's blistering fast on anything that isn't a toaster

ehnto6mo ago

Great! That's what I am pointing out, it's a 30b param model that fits into an AMD card and runs great. That's what we want.

fulafel6mo ago

9999000009996mo ago

So I want this Thinkpad.

https://www.lenovo.com/us/en/p/laptops/thinkpad/thinkpadp/th...?

AMD Ryzen™ AI 9 HX PRO 370 Processor (2.00 GHz up to 5.10 GHz); Operating System: Windows 11 Pro 64; Graphics Card: Integrated AMD Radeon™ 890M; Memory: 64 GB DDR5-5600MT/s (SODIMM) (2 x 32 GB)

But I also seriously want to run LLMs. My hunch is a gaming laptop is the best way to do this on the go without spending $5000 for a Thinkpad with a high end graphics card.

JonChesterfield6mo ago

Anyone know whether there are things built on https://github.com/HazyResearch/ThunderKittens?

I think this is a port of that to HIP, where generally ports of cuda things to hip are of vague professional interest, but much more so if the library is used by other things.

jiehong6mo ago

> what is raw assembly? can't understand it? that's the point!

Raw assembly vs cooked assembly?

Compilers could write that assembly in the end, just like they do for CPUs!
