I've just been experimenting with some local LLMs, and the differences are pretty huge:
Llama 3 8B, Raspberry Pi 5: 2-3 tokens/second (but it works!)
Llama 3 8B, RTX 4080: ~60 tokens/second
Llama 3 8B, groq.com LPU: ~1300 tokens/second
Llama 3 70B, AMD 7800X3D: 1-2 tokens/second
Llama 3 70B, groq.com LPU: ~330 tokens/second
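For anyone who wants to check numbers like these on their own hardware, here's a minimal timing sketch. It assumes llama-cpp-python and a GGUF build of the model; the file name and prompt are placeholders, not my exact setup.

    # Minimal sketch: time local generation with llama-cpp-python.
    # Model path and prompt are placeholders, not the exact setup above.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder file name
        n_gpu_layers=-1,  # offload all layers to the GPU if one is present
        n_ctx=2048,
    )

    start = time.time()
    out = llm("Explain how speculative decoding works.", max_tokens=256)
    elapsed = time.time() - start

    n = out["usage"]["completion_tokens"]
    print(f"{n} tokens in {elapsed:.1f}s -> {n / elapsed:.1f} tokens/s")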
There seem to be huge gaps between CPU, GPU and specialized inference ASICs. I'm guessing that right now there aren't many genius-level architecture breakthroughs, and that it's more about how much memory and silicon real estate you're willing to dedicate to AI inference.
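Back-of-the-envelope, decode speed is mostly a memory-bandwidth story: each generated token has to stream essentially all of the weights, so tokens/second is capped at roughly bandwidth divided by model size. A rough sketch; the bandwidth figures are approximate spec-sheet values and the quantization levels are guesses about the runs above:

    # Upper bound on decode speed: tokens/s <= memory bandwidth / bytes of weights
    # streamed per token. Bandwidths are approximate spec values, sizes assume the
    # quantization noted in each label (both are assumptions, not measurements).
    def ceiling_tokens_per_s(bandwidth_gb_s, weights_gb):
        return bandwidth_gb_s / weights_gb

    runs = [
        ("Raspberry Pi 5, 8B @ 4-bit",   17.0,  4.5),
        ("7800X3D (DDR5), 70B @ 4-bit",  83.0, 40.0),
        ("RTX 4080, 8B @ 8-bit",        717.0,  8.5),
    ]
    for name, bw, size in runs:
        print(f"{name}: <= {ceiling_tokens_per_s(bw, size):.0f} tokens/s")

On those assumptions the ceilings come out around 4, 2 and 84 tokens/second, which is in the same ballpark as the measured 2-3, 1-2 and ~60.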
I think Groq doesn't use quantization, so the gap between your hardware and Groq would be even wider.
How much RAM is required for this result? It's quite impressive that it even works as well as it does.
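For a rough feel, the footprint can be estimated from parameter count and quantization; the 4-bit assumption and the overhead constant below are guesses, not measurements:

    # Very rough RAM estimate: weights (params * bits/8) plus a loose allowance
    # for KV cache and runtime overhead. All numbers here are assumptions.
    def approx_ram_gb(params_b, bits_per_weight, overhead_gb=1.5):
        return params_b * bits_per_weight / 8 + overhead_gb

    print(f"Llama 3 8B  @ 4-bit: ~{approx_ram_gb(8, 4):.1f} GB")   # fits an 8 GB Pi 5
    print(f"Llama 3 70B @ 4-bit: ~{approx_ram_gb(70, 4):.1f} GB")  # needs ~40 GB of system RAM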
So really, they lose nothing. They've already booked sales for everything there is to sell, so they might as well turn their attention now to the people who might be customers two years from now and make them feel like the wait will be worth it.
Even for marketing claims that’s pretty wild.
Still lots of trajectory left in the plain scale-up plan, it seems.
How much lower can these go, though? 2-bit? 1.58-bit? 1-bit? It seems these massive gains have a very hard floor, which AMD and Nvidia will use to raise their stock prices before it all comes to a sudden end.
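The floor is easy to put a number on: even at 1 bit per weight, a 70B model is still several gigabytes of weights, so there's maybe one more 4x to be had from quantization alone. A rough sketch that ignores scale factors, codebooks and activations:

    # Weight footprint of a 70B-parameter model at various bit widths.
    # Pure arithmetic; real low-bit formats add scale/codebook overhead.
    params = 70e9
    for bits in (16, 8, 4, 2, 1.58, 1):
        gb = params * bits / 8 / 1e9
        print(f"{bits:>5} bits/weight -> ~{gb:.0f} GB of weights")

At 1 bit the 70B weights still come to roughly 9 GB, so the curve flattens quickly.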
(I miss the old PC era, where the world at large benefited in tandem from new things happening, or fell behind by not adapting.)
Except we call it "cloudframe" now.
I think their "consumer GPU" did so bad recently that AMD could just as well, you know, simply liquidate the "consumer GPU" division and stop pretending.
I'm in the "consumer GPU" market myself; what AMD GPU do I buy today? -- Radeon Pro VII, launched in 2020 and the best AMD consumer GPU I can find today.
It's such a divide. I could optimize my software for GPUs as powerful as the MI300 line... but why do that, given that I'll probably never even see one such GPU in my lifetime?
And they announced a workstation version with 48GB: https://www.phoronix.com/news/AMD-Radeon-PRO-W7900-Dual-Slot
Paper launches aren't anything new. They've always been a thing, especially in hardware.
Why not? Because they’re sold out to hyperscalers?
They're $15k; who exactly is disappointed they won't be able to buy one?
Just a few weeks ago I spoke to someone who shelled out $10k to run LLMs locally. I've seen more expensive builds as well.
HPC centers and research clusters.
8x AMD MI300X (192GB, 750W) GPU
8x H100 (80GB, 700W) GPU
What would be the result against 8x H100 NVL (188GB, <800W) GPUs?
AMD still has to prove itself in this.