
trouve_search4h ago

OK, I'm 100% rooting for both Mistral and task focused small models.

But Mistral has fallen really far behind since 2025Q3. It seems they can't get good reasoning models working at even medium context sizes, which is necessary to be at the table right now.

Gemma4 and Qwen3.6 are currently best in the small size; Mistral's "small" model has ~4x the parameter count at 120B and isn't even competing with models a quarter its size.

A year ago, with Mistral Small 3.1, they were keeping up, but they've since fallen into irrelevancy.

If Mistral seriously wants to play the on-prem and small task-specific model game, a decent proxy would be to build models that get the r/localLlama crowd excited.


ar03h ago

I agree. I am a paying Le Chat Pro user, really rooting for a European alternative. But the quality difference between Mistral and the frontier labs is growing too big to ignore. It’s worrying to me that they didn’t talk much about new models at the conference, because that is really where their focus should be IMHO.

I am wondering what is holding them back, though: Money? Compute? Skills? Training data? My fear is that you only get really good models by training on very dubious data (outputs from the frontier models etc.) and that Mistral is too European and too enterprisey to take those risks.

mattnewton2h ago

My theory with no insider information: it’s a little of all of the above, but mostly money. To some extent, you can dig yourself out of a data hole with RL and a lot of compute. And you can buy a lot of compute and some data with a lot of money. Big labs have been operating in this regime for a while and it’s one of the drivers behind their costs beyond just scaling the weights and doing the actual training. Mistral just doesn’t have access to this level of compute or the money to try and muscle their way in.

MichaelZuo1h ago

Don’t they supposedly have a huge amount of EU support?

Or at least there’s been a lot of noise about that.


greyskull3h ago

> task focused small models

This is tangential, and forgive my ignorance here, but is there an inherent reason why there aren't smaller, focused models from the frontier model providers?

I'm thinking something like a software-specific subset of Opus that is the default for use in Claude Code. Smaller, cheaper to deploy and consume, maybe faster.

pavpanchekha3h ago

OpenAI used to make Codex-specific models, but they stopped. What I've gathered from interviews and similar is that training two models isn't worth the (small) lift from having a coding-specific model. You're pre-training on everything anyway, and coding RL is reasonably useful for general-purpose models too.

greyskull2h ago

Interesting. I'd have guessed there would be meaningful opex benefits to serving smaller models.

baq3h ago

Agreed, the next price increase from frontier labs (and the inevitable limits decrease in subscription tiers) will have people thinking real hard about their model providers, and that's when Mistral should be ready. However, given their recent performance, I realistically don't have my hopes up.

djvdq3h ago

Also, the new Medium 3.5 is far more expensive than previous Mistral models, and much more expensive than e.g. DeepSeek.

amunozo2h ago

DeepSeek is both cheaper and better than Mistral.

gregorygoc2h ago

Because they distill


rhdunn1h ago

Yeah. I run LLM models locally and for me 22B-32B is the largest I'm willing to invest in trying out.

Even though Mistral 4 has 6B active parameters per token (allowing the 3-3.5GB of per-token parameters to be loaded on a 4090), the ~240GB download + storage is pushing the limits of being able to try this out locally, especially if you are downloading and evaluating multiple models.

It also makes it harder for other people to make downstream finetunes, as happened with the older Mistral/Magistral models.
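The size figures in this comment can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only that: it assumes the 120B total parameter count quoted upthread, 16-bit weights for the full download, and 4-bit weights for the active parameters. `weights_gb` is a hypothetical helper, and it counts weights only (no KV cache, activations, or file-format overhead).

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumptions, not published specs: 120B total is the figure quoted
# upthread; 6B active parameters per token comes from the comment above.
TOTAL_PARAMS_B = 120
ACTIVE_PARAMS_B = 6

download_gb = weights_gb(TOTAL_PARAMS_B, 16)   # full 16-bit checkpoint
active_q4_gb = weights_gb(ACTIVE_PARAMS_B, 4)  # active weights at 4 bits

print(f"16-bit download: ~{download_gb:.0f} GB")
print(f"active weights at 4-bit: ~{active_q4_gb:.1f} GB")
```

Under these assumptions the full checkpoint comes to about 240 GB while the per-token active weights are about 3 GB, which is why a mixture-of-experts model can be cheap to run per token yet still expensive to download and store.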

coredev_2h ago

I don't agree that they are falling behind. Using both chat and cli I get what I need and it's comparable to "sota" when I compare.

lettergram3h ago

We actually found that Mistral Small 4, quantized to 4-bit, was comparable to Qwen 3.6 27B and is roughly the same size. At least in our experience on our use cases, quantizing the Mistral model worked far better than trying to quantize the Qwen family.

Fully agree with your point though, Mistral in general is far behind where I'd expect and Qwen in particular is crushing it at the smaller sizes.

Personally, I'd consider anything 20B params and above a "medium" model. Small being <20B and large >100B. I think obviously we can get to the huge 1-2T param models, but frankly the margin of accuracy improvement for the speed hit is kinda insane (1-2% for many metrics).

rhdunn1h ago

It's all relative. For local use I'd classify it by hardware (VRAM size) using FP8 or Q6 quantization:

1. tiny <2-3B -- easily runnable on lower-spec hardware

2. small 4-8B -- runnable on 8GB GPUs

3. medium 9-12B -- runnable on 12GB GPUs

4. large 13-24B -- runnable on 16GB (for the lower end models) and 24GB GPUs

5. very large 25-32B -- runnable on 32GB GPUs

6. huge >32B -- not easily runnable on consumer GPUs without compromising performance (offloading layers to the CPU/RAM), quality (heavy quantization, esp. at <= Q4), or price (investing in multi-GPU setups and/or server-grade hardware).

You could possibly split huge down further, as 70B models (e.g. llama 3) are easier to get working than >120B models and 1T models are completely intractable.
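The tiers above can be sketched as a small lookup plus a rough VRAM check. This is an illustration under stated assumptions, not a sizing tool: the upper bounds are read off the list, Q6 is taken as roughly 6.5 bits per weight, and 1 GB of headroom for KV cache and runtime buffers is a guess. `tier` and `fits_in_vram` are hypothetical names.

```python
# Upper bound of each tier in billions of parameters, read off the list above.
TIERS = [
    (3, "tiny"),
    (8, "small"),
    (12, "medium"),
    (24, "large"),
    (32, "very large"),
]


def tier(params_billion: float) -> str:
    """Return the tier name for a model with the given parameter count."""
    for upper_bound, name in TIERS:
        if params_billion <= upper_bound:
            return name
    return "huge"


def fits_in_vram(params_billion: float, vram_gb: float,
                 bits_per_weight: float = 6.5,
                 headroom_gb: float = 1.0) -> bool:
    """Rough check: weights plus assumed headroom must fit in VRAM.

    bits_per_weight=6.5 approximates Q6 and headroom_gb=1.0 stands in for
    KV cache and buffers; both are assumptions, and long contexts need more.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + headroom_gb <= vram_gb


for size_b, vram in [(8, 8), (12, 12), (24, 24), (32, 24)]:
    print(f"{size_b}B ({tier(size_b)}) on {vram}GB: {fits_in_vram(size_b, vram)}")
```

With these defaults the top of each tier fits on the GPU size named in the list, while a 32B model does not fit on a 24GB card without heavier quantization or offloading.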

sroussey56m ago

As a Mac user:

1. tiny <2-3B -- could run in a browser even, mac neo

2. small 4-8B -- last of browser options, MacBook Air base

3. medium 9-24B -- 32GB machine, air or pro notebook or mini

4. large 25-48B -- 64GB, pro notebook or mini

5. x-large 49-100B -- 128GB MacBook Pro or Studio

6. Huge > 100B -- 256/512GB Mac Studio


dyauspitr1h ago

Mistral is bad bad. For its use cases I feel like India’s Sarvam is doing better.

ctrlkctrls4m ago

channeling Rocky (extraterrestrial) there I see :)

echelon4h ago

Nobody trying to compete with Google, OpenAI, and Anthropic should be playing the small models / local models game.

Foundation model labs should be building very large reasoning models, then leaving it to the community to distill them down.

You can't scale a small model up, but you can scale a large model down.

I'm convinced the only way we'll have a seat at the table in the future and avoid total runaway takeoff is if there are very large models within 80% of the capabilities of the frontier models. Tiny RTX models do diddly squat to remain competitive.

Build open weights models for running on H200s. I'll spin them up on RunPod or Lambda.

farley133h ago

I do think there's a chance open weight models have a bit of a moment with the costs of frontier models growing on business balance sheets. It's unfortunate from my "privacy loving" PoV that it's mostly Chinese models filling the gap (the top models on OpenRouter, for instance).

I have used Mistral models out of pure ideology for web agents and the like which aren't doing a lot of heavy lifting.

theturtletalks2h ago

Antirez’s Deepseek 4 Flash implementation that can run on MacBooks also was a revelation. It runs decently on M5 Max 128GB and it’s pointing out other bottlenecks like prefill speed which will improve.

ahnick3h ago

I thought distillation meant small models don't have to compete with the big models and can always eventually achieve close parity, but it's just a matter of time to do the distillation? (i.e. how much lag do you want to live with) Am I oversimplifying?

gertlabs2h ago

There is likely a theoretical limit to how much intelligence you can pack into a model of a given size (especially when stretching that over a large input context size).

Our evals are pretty complex so we only recently started testing ~30B class models, which are now becoming quite smart (on par with the frontier from 1 year ago). Mistral is far behind, but I'm rooting for them.

Data at https://gertlabs.com/rankings
