Better HN

0 points | gertlabs | 5d ago | 0 comments

We've been really impressed with the performance of ~30B parameter class models and how close they are to the frontier from ~6-12 months ago, which raises the question: are the frontier labs really serving 10T parameter models? Seems unlikely.

If these Gemini 3.5 numbers are accurate, then I'd wager GPT 5.5 and Opus 4.7 are a lot smaller than people have speculated, too. It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.

Gemini 3.5 Flash is really smart in one-shot coding reasoning, btw. Near the frontier. But it doesn't do so well in long horizon agentic tasks with arbitrary tool availability. This is a common theme with Google models, and the opposite of what we see with Chinese models (start dumb, iterate consistently toward a smart solution).

Data at https://gertlabs.com/rankings

nl 5d ago

Elon says Opus is 5T (and I would expect he'd know)

> It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.

They have plenty of data. They use very large amounts of verifiable synthetic data (lots of it in coding and math) to cover the gap.

Also the frontier labs are paying people to do tasks, tracking the trajectories and training on that. Most of the optimization is in RL based on these trajectories.

stymaar 5d ago

> Elon says Opus is 5T (and I would expect he'd know)

Even if he knew, why would anyone expect Elon not to lie about anything?

> They have plenty of data.

I don't think data is the problem either, but compute is: if you want to train your 5T-param model the way modern small models are being trained (with a thousand times more training tokens than params), that's an enormous training run.
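A rough back-of-envelope sketch of how large that run would be. It assumes the common ~6 * N * D FLOPs rule of thumb for training a dense transformer (N parameters, D training tokens); that rule, and the choice of a 30B model as the comparison point, are assumptions added for illustration. For a sparse or Mixture-of-Experts model, the relevant N would be the active parameters per token, which this sketch does not model.

```python
# Back-of-envelope training compute at a fixed tokens-per-parameter ratio.
# Assumption: training FLOPs ~= 6 * N * D for a dense transformer,
# where N is the parameter count and D is the number of training tokens.

def training_tokens(params: float, tokens_per_param: float) -> float:
    """Number of training tokens implied by a tokens-per-parameter ratio."""
    return params * tokens_per_param


def training_flops(params: float, tokens_per_param: float) -> float:
    """Approximate training FLOPs under the 6 * N * D rule of thumb."""
    return 6.0 * params * training_tokens(params, tokens_per_param)


TOKENS_PER_PARAM = 1_000  # "a thousand times more training tokens than params"

big = 5e12     # hypothetical 5T-parameter model
small = 30e9   # ~30B-parameter class model, for comparison

big_flops = training_flops(big, TOKENS_PER_PARAM)
small_flops = training_flops(small, TOKENS_PER_PARAM)

print(f"5T model tokens: {training_tokens(big, TOKENS_PER_PARAM):.1e}")
print(f"5T model FLOPs:  {big_flops:.1e}")
print(f"30B model FLOPs: {small_flops:.1e}")
print(f"ratio:           {big_flops / small_flops:,.0f}x")
```

Because tokens scale with parameters at a fixed ratio, compute grows with the square of model size under this approximation, so the 5T run comes out tens of thousands of times larger than the 30B run rather than a few hundred times.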

nl 4d ago

> if you want to train your 5T-param model the way modern small models are being trained (with a thousand times more training tokens than params), that's an enormous training run.

Yes it is. Spending $100M on training runs is common, and $1B might be in scope for some of the large models.

Sonnet 3.5 cost "a few 10s of millions of dollars" back in 2024: https://simonwillison.net/2025/Jan/29/on-deepseek-and-export...

nl 5d ago

I mean, in general I'm pretty doubtful about things he says, but in this case he was comparing with Grok, and it sort of makes sense in context: https://x.com/elonmusk/status/2042123561666855235

gertlabs (OP) 5d ago

This is what we do at gertlabs.com - the foundation labs are actually starving for better data. Having quality data is not the same as having a lot of data. Human curated data / RLHF cannot scale to a 5T model and synthetic data pipelines are very much a work in progress in the industry.

Some interesting notes:

- Training a small model with large model output resulted in LESS improvement than distilling a less smart model onto the same small architecture [0]. We are starting to hit intelligence density limits in small models (<30B models may be nearing saturation now)

- good RL environments incidentally also make for good benchmarking

[0] https://arxiv.org/html/2502.12143v1

merb 5d ago

Wouldn’t it be good to start investigating a micro-model architecture? For example, a first model checks the context and routes to the Java-optimized model, etc. It would also make it simpler to load/unload models in memory.

So, extremely small models that are each only good at a certain task, such as a programming language. A small model at the front that is extremely good at classifying tasks, and then a more complex model that can bring these micro models back together.
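A minimal sketch of the routing idea, purely for illustration. The specialist "models" here are stand-in functions, and the front classifier is a toy keyword matcher; every name (`classify`, `route`, `SPECIALISTS`, `KEYWORDS`) is hypothetical, and a real system would use a trained classifier and load or unload actual models.

```python
from typing import Callable, Dict, Tuple

# Hypothetical specialist "micro models": plain functions standing in for
# small task-specific models that would be loaded on demand.
def java_model(prompt: str) -> str:
    return f"[java-specialist] {prompt}"


def python_model(prompt: str) -> str:
    return f"[python-specialist] {prompt}"


def general_model(prompt: str) -> str:
    return f"[generalist] {prompt}"


SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "java": java_model,
    "python": python_model,
}

# Toy stand-in for the small front model that classifies the task.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "java": ("java", "jvm", "public static void"),
    "python": ("python", "pip ", "def "),
}


def classify(prompt: str) -> str:
    """Return the label with the most keyword hits, or 'general' if none hit."""
    text = prompt.lower()
    scores = {label: sum(kw in text for kw in kws) for label, kws in KEYWORDS.items()}
    best = max(scores, key=lambda label: scores[label])
    return best if scores[best] > 0 else "general"


def route(prompt: str) -> str:
    """Send the prompt to the matching specialist, falling back to the generalist."""
    model = SPECIALISTS.get(classify(prompt), general_model)
    return model(prompt)


print(route("Fix this NullPointerException in my Java service"))
print(route("Write a haiku about autumn"))
```

This routes whole requests between separate models. Mixture-of-Experts, as it is usually implemented, instead routes per token between expert sub-layers inside a single jointly trained network, so the two ideas are related but not the same.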

lukeundtrug 5d ago

My guess is that we underestimate how much non-Java data and context in general is needed to create a good Java coding model. It could be true that a good Java model would be 80-90% of the size of a comparable overall coding model.

Obviously, I have no idea but I guess it’s not as simple as “just train only on Java code and reduce size to 1/10th”.

puilp0502 5d ago

I think you're describing Mixture-of-Experts.

KronisLV 5d ago

> they don't have the data to optimize a model of that size.

So where does humanity cap out? The statement more or less implies that there's a ceiling on our ability to train models, which might be below what LLMs are capable of (not AGI, but, for example, how good coding agents might ever become).

maipen 5d ago

I’m not sure if synthetic data is enough.

xAI paying Cursor to train models with their data tells us that having an agent tool like Claude Code is important for quality data acquisition. That’s why they recently shipped Grok Build.

I think we will see insane SOTA models from xAI in the next few months.

easygenes 5d ago

We know from NVIDIA's public Vera Rubin inference engine marketing materials that the frontier lab models are ~1-2T total.

Mythos is an exception that's larger.

opsnooperfax 5d ago

Wouldn’t that be an exciting plot twist? That the release cadence of the big labs doesn’t actually reflect any meaningful improvements, or bigger models, but it’s a marketing ploy to start ratcheting up prices for good ARR numbers prior to the big IPO where the celebrity executives bail out of the stalling plane.

beacon294 5d ago

I agree with this sentiment but the reasoned anecdotes do not agree. I imagine the flagship models have modalities/usages that we hn-ers don't imagine easily.

MisterPea 5d ago

I exclusively use gemini models and this has been my experience.

I mitigate it by creating dense planning docs for everything and executing iteratively.

Lots of time wasted on procedure, unfortunately.

Glohrischi 5d ago

It was estimated that Mythos is 10T.

And serving is not training. For distilling you need to train the big models to have something to be distilled.
