Better HN

0 points · DCKing · 5d ago · 0 comments

If two things hold up, namely 1) this is actually a 200-300B parameter model and 2) it is actually competitive with frontier OpenAI and Anthropic models (and not just benchmaxing), the implications are pretty big. It would mean you could run "frontier level" performance in one box at home.

300B models at least fit in a single maxed out Mac Studio or a small stack of DGX Sparks or AMD Strix Halo boxes.
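A rough way to check the "fits in a maxed out Mac Studio" claim is weight size = parameters × bits per weight ÷ 8. A minimal sketch, assuming "maxed out" means 512 GB of unified memory and counting only the weights (no KV cache, activations, or OS overhead):

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
# Ignores KV cache, activations, runtime overhead and OS reserve,
# so real headroom is smaller than these numbers suggest.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"300B @ {bits:>2}-bit: {weight_gb(300, bits):6.0f} GB")
```

At 16-bit the weights alone (600 GB) don't fit in 512 GB; at 8-bit (300 GB) they fit, and at 4-bit (150 GB) they fit even in a 256 GB machine.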

For comparison, DeepSeek V4 Flash is all the rage now for small efficient models. It's very good for its size but far from the performance of the latest GPT Pro and Opus models. The vanilla variant has 284B parameters. It fits on both 256GB and 512GB Mac Studios and hits about 20-30 tokens/second.
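The 20-30 tokens/second figure is easier to interpret with a bandwidth-bound estimate: during generation each token has to stream the active weights from memory at least once, so tokens/s is at most bandwidth ÷ bytes of active weights. A rough sketch; the 819 GB/s bandwidth (Apple's quoted M3 Ultra figure) and the 20B active-parameter count are assumptions for illustration, not published specs of this model:

```python
# Upper bound on decode speed when generation is memory-bandwidth bound:
# every new token has to stream the *active* weights from memory once.
# Real throughput is lower (KV cache reads, kernel overhead, routing).

def max_decode_tps(bandwidth_gb_s: float, active_params_b: float, bits_per_weight: float) -> float:
    bytes_per_token_gb = active_params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / bytes_per_token_gb

BANDWIDTH = 819.0  # GB/s, assumed M3 Ultra Mac Studio bandwidth

# Dense case: all 284B parameters are read for every token.
print(f"dense 284B @ 4-bit: {max_decode_tps(BANDWIDTH, 284, 4):.1f} tok/s")
# Hypothetical mixture-of-experts case: only ~20B parameters active per token.
print(f"MoE, 20B active @ 4-bit: {max_decode_tps(BANDWIDTH, 20, 4):.1f} tok/s")
```

A dense 284B model at 4-bit would top out near 6 tokens/second on that bandwidth, so speeds in the 20-30 range are consistent with a mixture-of-experts design that reads only a fraction of its weights per token, as earlier DeepSeek models such as V3 did.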

The implication of all this is that you could have a (somewhat sluggish) Opus in a small box at home, at least once competing models and the hardware to run them are available (high-end Mac Studios have been discontinued).

Something tells me this means Google's performance numbers here are inflated.


WarmWash · 5d ago

Opus is estimated to be around 4T parameters, and GPT-5.5 around 9T [1]. While 3.5 at least qualifies as being in the same neighborhood (which is stunning, if these numbers are all true), it may be that closing that last ~10% gap takes 50x more parameters.

[1] https://arxiv.org/pdf/2604.24827

easygenes · 4d ago

Their methods are only calibrated on open models (of course), and they admit very broad confidence bounds. Comparing their estimates for the same models at different reasoning levels also shows that there are major confounders. I would err toward the absolute low end of their estimates for frontier models (e.g. 3T for GPT-5.5, 1.5-2T for Opus 4.5+).

stymaar · 5d ago

> the implications are pretty big. It would mean you could run "frontier level" performance in one box at home.

That wouldn't surprise me at all, actually. Models like Qwen3.6-35B are comparable to frontier-level models from a year ago, and I wouldn't be surprised if we had self-hostable open-weight models matching Opus 4.7 within a year. Assuming Google is a year ahead of the Chinese labs isn't far-fetched, given how many more resources it has than its Chinese competitors.

DCKing (OP) · 5d ago

I think there was a leap around Opus 4/4.1 that self-hostable models haven't quite matched yet. Perhaps the full Kimi K2.6 and DeepSeek V4 Pro reach Opus 4.1 levels (it's hard to compare anyway; benchmarks are largely a game nowadays), but both are also north of 1000B parameters and therefore really impractical to run at home for the foreseeable future.

It's not yet obvious to me that you can achieve the breakthrough performance of say Opus 4.1/4.5 in a number of parameters you can swing at home.

stymaar · 5d ago

> It's not yet obvious to me that you can achieve the breakthrough performance of say Opus 4.1/4.5 in a number of parameters you can swing at home.

People used to believe the same about GPT-4, and I'm not convinced this is going to be different this time.

You do need a very big model if you want something that remembers random trivia about everything, but I'm not convinced this is needed to do meaningful work.

tarruda · 5d ago

> 300B models at least fit in a single maxed out Mac Studio or a small stack of DGX Sparks or AMD Strix Halo boxes.

I run a 2.54 BPW GGUF of the 397B Qwen 3.5 on a 128 GB Mac Studio at 20 tokens/second for generation and 200 tokens/second for prompt processing. I'm not suggesting it matches the full BF16 model, but I ran some benchmarks locally and the results were pretty good:

- MMLU: 87.96%

- GPQA Diamond: 86.36%

- IFEval: 91.13%

- GSM8K: 92.57%

So I think we have been at the "frontier capabilities at home" for a few months now.
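The 128 GB figure works because of the aggressive quantization. A minimal sketch of the arithmetic, counting only the weights (KV cache and the OS need room too) and treating the Mac's memory as binary gigabytes (GiB), which is how RAM is sized:

```python
# Does a 397B model at 2.54 bits per weight fit in 128 GB of unified memory?
# Weight size only; KV cache, context buffers and the OS also need room.

GIB = 2**30

def quantized_weight_bytes(params: float, bpw: float) -> float:
    """Average bits per weight times parameter count, converted to bytes."""
    return params * bpw / 8

weights = quantized_weight_bytes(397e9, 2.54)  # ~126.0e9 bytes
ram = 128 * GIB  # RAM is sized in binary units: 128 GiB is ~137.4e9 bytes

print(f"weights: {weights / 1e9:.1f} GB ({weights / GIB:.1f} GiB)")
print(f"headroom: {(ram - weights) / GIB:.1f} GiB")
```

This prints about 126.0 GB (117.4 GiB) of weights and roughly 10.6 GiB of headroom, which gets tight once a long context's KV cache is added.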

LarsDu88 · 4d ago

TurboQuant. They can fit more in less now.

DCKing (OP) · 4d ago

TurboQuant is a runtime optimization for a model's KV cache and doesn't allow for reduction in model size.
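The distinction matters for memory planning: the KV cache grows with context length and is separate from the weights, so compressing it frees memory for longer contexts but leaves the model file the same size. A rough sketch of the usual KV cache size formula for multi-head or grouped-query attention; the layer count, KV-head count, head dimension, context length, and the ~3-bit setting are all hypothetical values, not TurboQuant's or any specific model's:

```python
# KV cache size for standard multi-head / grouped-query attention:
#   2 (keys and values) x layers x kv_heads x head_dim x tokens x bytes per element
# All model dimensions below are hypothetical, chosen only for illustration.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, tokens: int, bits: float) -> float:
    elements = 2 * layers * kv_heads * head_dim * tokens
    return elements * bits / 8 / 1e9

cfg = dict(layers=60, kv_heads=8, head_dim=128, tokens=128_000)
fp16 = kv_cache_gb(**cfg, bits=16)
q3 = kv_cache_gb(**cfg, bits=3)  # hypothetical ~3-bit KV cache quantizer
print(f"16-bit KV cache: {fp16:.1f} GB, ~3-bit: {q3:.1f} GB")
```

With these made-up dimensions the cache shrinks from about 31.5 GB to about 5.9 GB at 128K tokens, while the weights stay exactly the same size.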

verdverm · 5d ago

Since I started using Qwen-3.6 35B A3B, I believe these smaller models will have more than enough frontier-like capability within a year or two, at least for coding. They don't need to memorize facts in their weights, which likely has very interesting implications that I'm not going to speculatively decode.
