0 points Aboutplants 20d ago

It seems that all frontier models are roughly even at this point. One may be slightly better for certain things, but in general I think we are approaching a real level playing field in terms of ability.

observationist 20d ago

Benchmarks don't capture a lot: relative response times, vibes, which unmeasured capabilities are jagged and which are smooth, etc. I find there's a lot of difference between models. There are things Grok is better at than ChatGPT where the benchmarks say the opposite, and vice versa. There's also the UI and tools at hand: ChatGPT image gen is just straight up better, but Grok Imagine does better videos, and is faster.

Gemini and Claude also have their strengths. Apparently Claude handles real-world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.

I don't think the linear scoring on some of the things being measured is quite applicable in the ways that they're being used, either - a 1% increase for a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.
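One way to read the point about linear scoring (an interpretation, not necessarily what the commenter meant) is to look at the error rate rather than the score. The sketch below uses a hypothetical helper, `relative_error_reduction`, and made-up benchmark scores to show that the same one-point gain means very different things depending on where it starts.

```python
def relative_error_reduction(score_before, score_after):
    """Fraction of remaining errors eliminated when a benchmark score
    (in percent, 0-100) rises from score_before to score_after."""
    err_before = 100 - score_before
    err_after = 100 - score_after
    return (err_before - err_after) / err_before

# Illustrative numbers only, not taken from any real benchmark.
print(relative_error_reduction(98, 99))  # 0.5  -> half of the remaining errors are gone
print(relative_error_reduction(50, 51))  # 0.02 -> only 2% of the remaining errors are gone
```

Under this reading, a 1% gain near the top of a saturated benchmark can correspond to a far larger jump in reliability than the headline number suggests.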

baq 20d ago

Gemini 3.1 slaps all other models at catching subtle concurrency bugs and at SQL and JS security hardening when reviewing. (Obviously haven’t tested GPT 5.4 yet.)

It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 Pro.

observationist 20d ago

I have a few standard problems I throw at AI to see if they can solve them cleanly, like visualizing a neural network, then sorting each neuron in each layer by synaptic weights, largest to smallest, correctly reordering any previous and subsequent connected neurons such that the network function remains exactly the same. You should end up with the last layer ordered largest to smallest, and prior layers shuffled accordingly, and I still haven't had a model one-shot it. I spent an hour poking and prodding codex a few weeks back and got it done, but it conceptually seems like it should be a one-shot problem.
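For readers who want to see the core of that test problem, here is a minimal sketch of the function-preserving reordering, leaving out the visualization. It makes several assumptions that are not in the comment: a plain fully connected network with ReLU hidden layers, a sort key equal to the sum of absolute incoming weights, and an output layer left in its original order so the outputs keep their meaning. The names `forward` and `sort_hidden_neurons` are made up for illustration.

```python
def forward(layers, x):
    """Run a small MLP. layers is a list of (W, b) pairs, where W[i][j] is
    the weight from input j to neuron i and b[i] is that neuron's bias."""
    for idx, (W, b) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
        # ReLU on hidden layers, identity on the final layer.
        x = z if idx == len(layers) - 1 else [max(v, 0.0) for v in z]
    return x


def sort_hidden_neurons(layers):
    """Return a copy of the network with each hidden layer's neurons sorted,
    largest first, by the sum of absolute incoming weights (assumed key)."""
    layers = [([row[:] for row in W], b[:]) for W, b in layers]
    for k in range(len(layers) - 1):  # the output layer keeps its order
        W, b = layers[k]
        order = sorted(range(len(W)),
                       key=lambda i: sum(abs(w) for w in W[i]),
                       reverse=True)
        # Reorder this layer's neurons: rows of W and entries of b.
        layers[k] = ([W[i] for i in order], [b[i] for i in order])
        # Apply the same permutation to the next layer's input columns,
        # so every weight still connects the same pair of neurons.
        W_next, b_next = layers[k + 1]
        layers[k + 1] = ([[row[i] for i in order] for row in W_next], b_next)
    return layers


# Toy network with made-up weights: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
net = [
    ([[1.0, -2.0], [3.0, 0.5], [-0.5, 0.25]], [0.0, 1.0, -1.0]),
    ([[2.0, -1.0, 0.5], [0.5, 4.0, -3.0]], [0.5, -0.5]),
    ([[1.0, -1.0]], [0.25]),
]
sorted_net = sort_hidden_neurons(net)
x = [1.5, -0.5]
print(forward(net, x), forward(sorted_net, x))  # the two outputs should match
```

The key step is that permuting a layer's rows and biases must be paired with the same permutation of the next layer's columns. In general the outputs match only up to floating-point rounding, because the order of summation changes.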

adonese 20d ago

Which subscription do you use to access it? Via Google AI Pro and Gemini CLI I always get timeouts due to the model being under heavy usage. The chat interface is there and I do have 3.1 Pro as well, but I'm wondering if the chat is the only way of accessing it.

bigyabai 20d ago

> If this rate of progress is steady, though, this year is gonna be crazy.

Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.

observationist 20d ago

If you look at the difference in quality between GPT-2 and GPT-3, it feels like a big step, but the difference between 5.2 and 5.4 is bigger; it's just harder to notice because both are already similarly capable and competent. I don't think it's an S-curve; we're not plateauing. Million-token context windows and cached prompts open up a huge space for hacking on model behaviors and customization, without finetuning. Research is proceeding at light speed, and we might see the first continual/online learning models in the near future. That could definitively push models past the point of human-level generality, but at the very least will help us discover what the next missing piece is for AGI.

basch 20d ago

>ChatGPT image gen is just straight up better

Yet so much slower than Gemini / Nano Banana that it's almost unusable for anything iterative.

thewebguyd 20d ago

Kind of reinforces that a model is not a moat. Products, not models, are what's going to determine who gets to stay in business or not.

gregpred 20d ago

Memory (model usage over time) is the moat.

energy123 20d ago

Narrative violation: revenue run rates are increasing exponentially with about 50% gross margins.

kseniamorph 20d ago

makes sense, but i'd separate two things: models converging in ability vs hitting a fundamental ceiling. what we're probably seeing is the current training recipe plateauing — bigger model, more tokens, same optimizer. that would explain the convergence. but that's not necessarily the architecture being maxed out. would be interesting to see what happens when genuinely new approaches get to frontier scale.

druskacik 20d ago

That has been true for some time now, definitely since the Claude 3 release two years ago.
