WhitneyLand 6mo ago

>>benchmarks are meaningless

No, they’re not. Maybe you mean to say they don’t tell the whole story or that they have limitations, which has always been the case.

>>my fairly basic python benchmark

I suspect your definition of “basic” may not be the consensus. GPT-5 Thinking is a strong model for basic coding, and it’d be interesting to see a simple Python task it reliably fails at.

NaomiLehman 6mo ago

they are not meaningless, but when you work a lot with LLMs and know them VERY well, a few varied, complex prompts tell you all you need to know about things like EQ, sycophancy, and creative writing.

I like to compare them in ChatHub, using the same prompts.

Gemini still calls me "the architect" in half of the prompts. It's very cringe.

mpalmer 6mo ago

    Gemini still calls me "the architect" in half of the prompts. It's very cringe.

Can't say I've ever seen this in my own chats. Maybe it's something about your writing style?

NaomiLehman 6mo ago

it absolutely does. and human employees don't call me "the architect." that's the point.

gregw2 6mo ago

I wonder if under the covers it uses your word choices to infer your Myers-Briggs personality type and you are INTJ so it calls you "The Architect"?? Crazy thought but conceivable...

sothatsit 6mo ago

Getting a “vibe check” for a model is very different from getting an actual robust idea of how it works and what it can or can’t do.

This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only to change their minds later once they’d had more time to use it and learn its strengths and weaknesses. It takes time to really get to grips with a new model; a few prompt comparisons, where luck and prompt selection play a big role, are not enough.

beepbooptheory 6mo ago

I get that one can perhaps have an intuition about these things, but doesn't this seem like a somewhat flawed attitude to have, all things considered? That is, saying something to the effect of "well, I know it's not too sycophantic, no measurement needed, I have some special prompts of my own and it passed with flying colors!" just sounds a little suspect on first pass, even if it's not like totally unbelievable, I guess.
