GPT-5 non-thinking is labeled 52.8% accuracy, yet o3's bar is drawn much shorter despite being labeled 69.1%. And 4o's bar is identical to o3's, but it's labeled 30.8%...
Even the small presentations we gave to execs or the board were checked for errors so many times that nothing could possibly slip through.
They talk about using this to help families facing a cancer diagnosis -- literal life or death! -- and we're supposed to trust a machine that can't even spot a few simple typos? Ha.
The lack of human proofreading says more about their values than their capabilities. They don't want oversight -- especially not from human professionals.
> good plot for my presentation?
and it didn't pick up on the issue. Part of its response was:
> Clear metric: Y-axis (“Accuracy (%), pass @1”) and numeric labels make the performance gaps explicit.
I think visual reasoning is still pretty far from text-only reasoning.
Even from the way the presenters talk, you can sort of see that OAI prioritizes speed above most other things. A naive observer might think they test things a million different ways before releasing, but actually, they don't.
If we draw up a 2x2 of Danger (High/Low) versus Publicity (High/Low), it seems to me that OpenAI sure has a lot of hits in the Low-Danger, High-Publicity quadrant, but probably also a good number in the High-Danger, Low-Publicity quadrant -- extrapolating purely from the sheer capability of these models and the fact that researchers like Pliny can still crack through them.
1 - The error is so blatantly large
2 - There is a graph without error right next to it
3 - The errors are not in the system card or on the presentation page
1. Many teams had to put their content into a shared Google Slides deck or similar
2. They used placeholders to prevent leaks
2.a. Some teams put their content just-in-time
3. The person running the presentation started the presentation view once video etc. was set up, just before launching the stream
4. Other teams corrected their content
5. Because the presentation view had already been started, only the content from the teams in 2.a showed up correct.
Now we wait to see.
{"data":{"error":"Imgur is temporarily over capacity. Please try again later."},"success":false,"status":403}
Or rate limited. But the scale is also really off... I don't think anything here is proportionally correct, even within the same grouping.
Thanks for the laugh. I needed it.
Look at the image just above "Instruction following and agentic tool use"
Completely bonkers stuff.
Screenshot of the blog plot: https://imgur.com/a/HAxIIdC
Edit: Nevermind, just now the first one is SWE-bench and 2nd is aider.
https://x.com/sama/status/1953513280594751495 "wow a mega chart screwup from us earlier--wen GPT-6?! correct on the blog though."
It's like those idiotic ads at the end of news articles. They're not going after you, the smart, discerning logician; they're going after the kind of people who don't see a problem. There are a lot of not-smart people, and their money is just as good as yours but easier to get.