
gertlabs 1mo ago

Objective, detailed benchmark results at https://gertlabs.com

Early takeaways: from this release, DeepSeek V4 Flash is the model to pay attention to here. It's cheap, effective, and REALLY fast.

The Pro model is slow, not much better in coding reasoning so far when it works, and honestly too unreliable and rate limited to be of much use, currently. Hopefully that improves as new providers host the model. Flash is working fine, and is currently performing competitively with recent releases, but only on agentic workflows. Check back in 24 hours for full combined scoring with tool use and long context for both models.

Many of the frontier Chinese AI labs have released near-frontier models that are just a little bit behind Opus 4.6 in terms of speed, tool use ability, or long context handling. Open weights are winning the AI race, led by China. Crazy couple weeks of releases.

Mimo V2.5 Pro by Xiaomi (not open weights) is actually the best performer of the latest string of Chinese releases in our combined, comprehensive benchmarks, despite getting less attention. Kimi K2.6 is the most interesting open weights release, still. DeepSeek is not the leader in the space anymore.

An interesting pattern with the latest string of Chinese releases is the much bigger agentic boost (models are not as smart out of the box, but their ability to iterate in a loop with tools makes up most of the difference). DeepSeek V4 Flash exemplifies this: it is not a smart model on the first try, but it makes up for it over the course of a session.


Squarex 1mo ago

I would say all benchmarks are inherently subjective. How is yours better? It seems to produce slightly strange results: Opus 4.6 being worse than 4.5, for example, or Chinese models being rated too high. Kimi, DeepSeek, and GLM are all great in the open source world, but I don't believe they are ahead of SOTA models from Anthropic, OpenAI, or Google.

gertlabs (OP) 1mo ago

No, some benchmarks are definitely objective, but most can be easily gamed. For example, most of the benchmarks on the model cards: they have measurable answers that don't rely on a human judge (a human made the question, but the answers are measuring some uncontroversial knowledge or capability). But because there is a single, correct answer, and those answers leak (or are randomly discovered and optimized for in training), they lose value over time, and regardless, they have a ceiling on the intelligence they can measure.

Others are purely subjective, like LMArena, which really only measures the personality and style preferences of the masses at this point, because frontier LLM technical answers are too hard for the average person to judge.

Then there are some interesting one-off benchmarks, but they lack enough rigor, breadth, and samples to draw larger conclusions from.

So we designed our benchmark with 3 goals: objective measurements (individual submissions not dependent on a human or LLM judge), no known correct answer (so simulations can scale to much higher levels of intelligence), and enough variety over important aspects of intelligence. We do this by running multiple models in cooperative/competitive environments with very complex action spaces and objective scoring, where model performance is relative and affected by the actions of other participants.
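The thread does not say how relative results from these multi-model environments are turned into a ranking. As a minimal sketch only, assuming pairwise outcomes and swapping in the well-known Elo update as an illustration (the function names, the K-factor of 32, and the starting rating of 1500 are all assumptions, not the gertlabs method), relative scoring could look like this:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    The K-factor k is an assumed tuning constant.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical example: two models start level and model_a wins one match.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

The point of the sketch is only that no fixed answer key is involved: a model's rating moves according to how it does against the other participants, so the scale has no built-in ceiling.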

And yeah, there are some interesting results when you have a more objective benchmark. It should raise eyebrows when every single sub-release of every company's model is better across the board than its predecessor -- that isn't reality.

Squarex 1mo ago

The word "objective" just seems too authoritative to me.

tw1984 1mo ago

I agree that benchmarks are inherently subjective.

But the fact that you cite your belief as your main argument is funny - you don't even have any inherently subjective numbers to justify what you believe, you only have "I don't believe".

Squarex 1mo ago

Sure, I mixed two things together. I don't think this benchmark is bad; I just did not like that it is presented as the ultimate objective truth. The other thing I mentioned is that it delivers different results from other benchmarks, so the "believe" stems from other benchmarks.

segmondy 1mo ago

you are arguing with your belief instead of an objective truth. benchmark is more objective, if you don't agree with it, come up with a better one. but what you believe doesn't matter.

Squarex 1mo ago

It was not a confrontational take. But all benchmarks are designed by humans, and we are not that great at measuring intelligence, so it is somewhat subjective. I was just arguing with the word "objective", not with the results per se.

swiftcoder 1mo ago

If the benchmark has a correct answer, the benchmark itself is an objective measure (but of what?). The "of what" may well be subjective.

orbital-decay 1mo ago

Only if the benchmark is private and done properly on relevant tasks, which is rarely the case. I can guarantee that you have a ton of blind spots if you look at it through the lens of a ranking ladder in some generic tasks.

dandaka 1mo ago

Interesting that you rate Claude Opus 4.6 lower than 4.5 and 4.7, while community consensus puts it on top.

nostrebored 1mo ago

I think most hardcore people I know are still sticking with 4.5 for coding workflows

kamranjon 1mo ago

I'm particularly interested in it being REALLY fast - do you have any rough tok/s numbers for the Flash model? I'm excited for unsloth to drop some quants that I can try to run locally, but really curious how it's been performing speed-wise. In general I actually over-index on speed over intelligence. I'd rather a model make mistakes quickly and correct in a follow-up than take forever to get a slightly better initial result.

gertlabs (OP) 1mo ago

Take a look at the Time column in https://gertlabs.com/?mode=oneshot_coding -- this is the total time to complete a solution for a reasonably complex problem end-to-end (you would have to divide the avg submission size by that time to estimate tok/s). It's fast in the sense that most of the smart, recent Chinese releases are quite slow, especially the DeepSeek Pro variant. Opus 4.7 is also quite fast.
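For anyone who wants a rough tok/s figure from the Time column, here is a minimal sketch of that arithmetic: tokens in an average submission divided by the end-to-end time. The function name and the numbers are hypothetical and not taken from the leaderboard. Because the time is end-to-end, it also covers queueing, network latency, and any reasoning tokens that do not appear in the submission, so the result is best read as a lower bound on raw decode speed.

```python
def estimate_tokens_per_second(avg_submission_tokens: float, total_seconds: float) -> float:
    """Rough throughput: tokens in the submission divided by wall-clock time."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return avg_submission_tokens / total_seconds


# Hypothetical numbers for illustration only: a 6000-token solution in 120 seconds.
rate = estimate_tokens_per_second(avg_submission_tokens=6000, total_seconds=120)
print(f"{rate:.1f} tok/s")  # 50.0 tok/s
```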

If pure speed is most important for your use case, GPT-5.3 Chat is the fastest model we've tested and it's still reasonably smart. Not meant for agentic tool usage / long context, though.

So it might be more useful for business applications or non-engineering usage where you don't need exceptional intelligence, but it's useful to get fast, cheap responses.

Lord_Zero 1mo ago

Why no mention of GPT-5.5?

gertlabs (OP) 1mo ago

Waiting on public API release. Once it drops, results will be up within 24 hours.

gertlabs (OP) 1mo ago

Results are up. GPT 5.5 is a beast.

wahnfrieden 1mo ago

Have you considered running models like GPT 5.5 inside their agent harness (Codex)?

