From GPT-4 to GPT-5: Measuring progress through MedHELM [pdf] (opens in new tab)

(fertrevino.com)

127 pointsfertrevino9mo ago96 comments

I recently worked on running a thorough healthcare eval on GPT-5. The results show a (slight) regression in GPT-5 performance compared to GPT-4 era models.

I found this to be an interesting finding. Here are the detailed results: https://www.fertrevino.com/docs/gpt5_medhelm.pdf

96 comments

aresant9mo ago

Feels like a mixed bag vs regression?

eg - GPT-5 beats GPT-4 on factual recall + reasoning (HeadQA, Medbullets, MedCalc).

But then slips on structured queries (EHRSQL), fairness (RaceBias), evidence QA (PubMedQA).

Hallucination resistance better but only modestly.

Latency seems uneven (maybe more testing?) faster on long tasks, slower on short ones.

TrainedMonkey9mo ago

GPT-5 feels like cost engineering. The model is incrementally better, but they are optimizing for least amount of compute. I am guessing investors love that.

narrator9mo ago

I agree. I have found GPT-5 significantly worse on medical queries. It feels like it skips important details and is much worse than o3, IMHO. I have heard good things about GPT-5 Pro, but that's not cheap.

I wonder if part of the degraded performance is where they think you're going into a dangerous area and they get more and more vague, for example like they demoed on launch day with the fireworks example. It gets very vague when talking about non-abusable prescription drugs for example. I wonder if that sort of nerfing gradient is affecting medical queries.

After seeing some painfully bad results, I'm currently using Grok4 for medical queries with a lot of success.

fertrevinoOP9mo ago

Interesting, it seems the anecdotal experience agrees with the benchmark results.

rbinv9mo ago

Afaik, there is currently no "GPT-5 Pro". Did you mean o3-pro or o1-pro (via API)?

Currently, GPT-5 sits at $10/1M output tokens, o3-pro at $80, and o1-pro at a whopping $600: https://platform.openai.com/docs/pricing

Of course this is not indicative of actual performance or quality per $ spent, but according to my own testing, their performance does seem to scale in line with their cost.

2 more replies

RestartKernel9mo ago

I wonder how that math works out. GPT-5 keeps triggering a thinking flow even for relatively simple queries, so each token must be a magnitude cheaper to make this worth the trade-off in performance.

JimDabell9mo ago

I’ve found that it’s super likely to get stuck repeating the exact same incorrect response over and over. It used to happen occasionally with older models, but it happens frequently now.

Things like:

Me: Is this thing you claim documented? Where in the documentation does it say this?

GPT: Here’s a long-winded assertion that what I said before was correct, plus a link to an unofficial source that doesn’t back me up.

Me: That’s not official documentation and it doesn’t say what you claim. Find me the official word on the matter.

GPT: Exact same response, word-for-word.

Me: You are repeating yourself. Do not repeat what you said before. Here’s the official documentation: [link]. Find me the part where it says this. Do not consider any other source.

GPT: Exact same response, word-for-word.

Me: Here are some random words to test if you are listening to me: foo, bar, baz.

GPT: Exact same response, word-for-word.

It’s so repetitive I wonder if it’s an engineering fault, because it’s weird that the model would be so consistent in its responses regardless of the input. Once it gets stuck, it doesn’t matter what I enter, it just keeps saying the same thing over and over.

namibj9mo ago

Go back and edit a prompt of yours in the conversation instead of continuing with garbage in the context.

1 more reply

slashdev9mo ago

If one conversation goes in a bad direction, it's often best to just start over. The bad context often poisons the existing session.

TrainedMonkey9mo ago

That sounds like query caching... which would also align with cost engineering angle.

UltraSane9mo ago

Since the routing is opaque they can dynamically route queries to cheaper models when demand is high.

yieldcrv9mo ago

Yeah look at their open source models and how you get such high parameters in such low vram

Its impressive but a regression for now, in direct comparison to just high parameter model

woeirua9mo ago

Definitely seems like GPT5 is a very incremental improvement. Not what you’d expect if AGI were imminent.

p1esk9mo ago

What would you expect?

fertrevinoOP9mo ago

Mixed results indeed. While it leads the benchmark in two question types, it falls short in others which results in the overall slight regression.

xnx9mo ago

Have you looked at comparing to Google's foundation models or specialty medical models like MedGemma (https://developers.google.com/health-ai-developer-foundation...)?

fertrevinoOP9mo ago

That would be an interesting extension. MedGemma isn't part of the original benchmark either [1]. Since Gemini 2.0 Flash is on 6th place, expectations are for MedGemma to achieve higher than that :)

[1]https://crfm.stanford.edu/helm/medhelm/latest/#/leaderboard

hypoxia9mo ago

Did you try it with high reasoning effort?

ares6239mo ago

Sorry, not directed at you specifically. But every time I see questions like this I can’t help but rephrase in my head:

“Did you try running it over and over until you got the results you wanted?”

dcre9mo ago

This is not a good analogy because reasoning models are not choosing the best from a set of attempts based on knowledge of the correct answer. It really is more like what it sounds like: “did you think about it longer until you ruled out various doubts and became more confident?” Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!

aprilthird20219mo ago

> Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!

One thing it's hard to wrap my head around is that we are giving more and more trust to something we don't understand with the assumption (often unchecked) that it just works. Basically your refrain is used to justify all sorts of odd setup of AIs, agents, etc.

1 more reply

brendoelfrendo9mo ago

Bad news: it doesn't seem to work as well as you might think: https://arxiv.org/pdf/2508.01191

As one might expect, because the AI isn't actually thinking, it's just spending more tokens on the problem. This sometimes leads to the desired outcome but the phenomenon is very brittle and disappears when the AI is pushed outside the bounds of its training.

To quote their discussion, "CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces."

5 more replies

SequoiaHope9mo ago

What you describe is a person selecting the best results, but if you can get better results one shot with that option enabled, it’s worth testing and reporting results.

ares6239mo ago

I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"

3 more replies

chairmansteve9mo ago

Or...

"Did you try a room full of chimpanzees with typewriters?"

ancorevard9mo ago

so since reasoning_effort is not discussed anywhere, I assume you used the default which is "medium"?

energy1239mo ago

Also, were tool calls allowed? The point of reasoning models is to delete the facts so finite capacity goes towards the dense reasoning engine rather than recall, with the facts sitting elsewhere.

0xDEAFBEAD9mo ago

So which of these benchmarks are most relevant for an ordinary user who wants to talk to AI about their health issues?

I'm guessing HeadQA, Medbullets, MedHallu, and perhaps PubMedQA? (Seems to me that "unsupported speculation" could be a good thing for a patient who has yet to receive a diagnosis...)

Maybe in practice it's better to look at RAG benchmarks, since a lot of AI tools will search online for information before giving you an answer anyways? (Memorization of info would matter less in that scenario)

username1359mo ago

I wonder what changed with the models that created regression?

teaearlgraycold9mo ago

Not sure but with each release it feels like they’re just wiping the dirt around and not actually cleaning.

degamad9mo ago

Obligxkcd: https://xkcd.com/1838/

fertrevinoOP9mo ago

loved the cartoon :)

oezi9mo ago

There is some speculation that GPT-5 uses a router to decide which expert model to deploy (e.g. to mini vs o/thinking models). So the router might decide that the query can be solved by a cheaper model and this model gives worse results.

1 more reply

causality09mo ago

I've definitely seen some unexpected behavior from gpt5. For example, it will tell me my query is banned and then give me a full answer anyway.

andai9mo ago

Did this use reasoning or not? GPT-5 with Minimal reasoning does roughly the same as 4o on benchmarks.

credit_guy9mo ago

Here's my experience: for some coding tasks where GPT 4.1, Claude Sonnet 4, Gemini 2.5 Pro were just spinning for hours and hours and getting nowhere, GPT 5 just did the job without a fuss. So, I switched immediately to GPT 5, and never looked back. Or at least I never looked back until I found out that my company has some Copilot limits for premium models and I blew through the limit. So now I keep my context small, use GPT 5 mini when possible, and when it's not working I move to the full GPT 5. Strangely, it feels like GPT 5 mini can corrupt the full GPT 5, so sometimes I need to go back to Sonnet 4 to get unstuck. To each their own, but I consider GPT 5 a fairly bit move forward in the space of coding assistants.

agos9mo ago

any thread on HN about AI (there's constantly at least one in homepage nowadays) goes like this:

"in my experience [x model] one shots everything and [y model] stumbles and fumbles like a drunkard", for _any_ combination of X and Y.

I get the idea of sharing what's working and what's not, but at this point it's clear that there are more factors to using these with success and it's hard to replicate other people's successful workflows.

benlc9mo ago

Interestingly I'm experiencing the opposite as you. Was mostly using Claude Sonnet 4 and GPT 4.1 through copilot for a few months and was overall fairly satisfied with it. First task I threw at GPT 5, it excelled in a fraction of the time Sonnet 4 normally takes, but after a few iterations, it all went downhill. GPT 5 almost systematically does things I didn't ask it to do. After failing to solve an issue for almost an hour, I switched back to Claude which fixed it in the first try. YMMV

AndyNemmity9mo ago

Yeah, GPT 5 got into death loops faster than any other LLM, and I stopped using it for anything more than UI prototypes.

czk9mo ago

its possible to use gpt-5-high on the plus plan with codex-cli, its a whole different beast! i dont think theres any other way for plus users to leverage gpt-5 with high reasoning.

codex -m gpt-5 model_reasoning_effort="high"

CuriouslyC9mo ago

GPT-5 is like an autistic savant

mattwad9mo ago

i thought cursor was getting really bad, then i found out i was on a gpt 5 trial. gonna stick with claude :)

kumarvvr9mo ago

I have an issue with the words "understanding", "reasoning", etc when talking about LLMs.

Are they really understanding, or putting out a stream of probabilities?

munchler9mo ago

Does it matter from a practical point of view? It's either true understanding or it's something else that's similar enough to share the same name.

axdsk9mo ago

The polygraph is a good example.

The "lie detector" is used to misguide people, the polygraph is used to measure autonomic arousal.

I think these misnomers can cause real issues like thinking the LLM is "reasoning".

dexterlagan9mo ago

Agreed, but in the case of the lie detector, it seems it's a matter of interpretation. In the case of LLMs, what is it? Is it a matter of saying "It's a next-word calculator that uses stats, matrices and vectors to predict output" instead of "Reasoning simulation made using a neural network"? Is there a better name? I'd say it's "A static neural network that outputs a stream of words after having consumed textual input, and that can be used to simulate, with a high level of accuracy, the internal monologue of a person who would be thinking about and reasoning on the input". Whatever it is, it's not reasoning, but it's not a parrot either.

sema4hacker9mo ago

The latter. When "understand", "reason", "think", "feel", "believe", and any of a long list of similar words are in any title, it immediately makes me think the author already drank the kool aid.

manveerc9mo ago

In the context of coding agents, they do simulate “reasoning” when you feed them the output and it is able to correct itself.

qwertytyyuu9mo ago

I agree with “feel” and “believe” but what words would you suggest instead of “understand” and “reason’?

sema4hacker9mo ago

None. Don't anthropomorphize at all. Note that "understanding" has now been removed from the HN title but not the linked pdf.

1 more reply

vexna9mo ago

kool aid or not -- "reasoning" is already part of the LLM verbiage (e.g `reasoning` models having `reasoningBudget`). The meaning might not be 1:1 to human reasoning, but when the LLM shows its "reasoning" it does look _appear_ like a train of thought. If I had to give what it's doing a name (like I'm naming a function), I'd be hard pressed to not go with something like `reason`/`think`.

insin9mo ago

    prefillContext()

hodgehog119mo ago

What does understanding mean? Is there a sensible model for it? If not, we can only judge in the same way that we judge humans: by conducting examinations and determining whether the correct conclusions were reached.

Probabilities have nothing to do with it; by any appropriate definition, there exist statistical models that exhibit "understanding" and "reasoning".

Workaccount29mo ago

https://ai.vixra.org/pdf/2506.0065v1.pdf

Lays out pretty well what our current knowledge on understanding is

jmpeax9mo ago

Do you yourself really understand, or are you just depolarizing neurons that have reached their threshold?

octomind9mo ago

It can be simultaneously true that human understanding is just a firing of neurons but that the architecture and function of those neural structures is vastly different than what an LLM is doing internally such that they are not really the same. Encourage you to read Apple’s recent paper on thinking models; I think it’s pretty clear that the way LLMs encode the world is drastically inferior to what the human brain does. I also believe that could be fixed with the right technical improvements, but it just isn’t the case today.

dmead9mo ago

He doesn't know the answer to that and neither do you.

lucisferre9mo ago

[flagged]

dang9mo ago

Can you please not post like this to HN? It's against the site rules (https://news.ycombinator.com/newsguidelines.html).

The idea is: if you have a substantive point, make it thoughtfully; if not, please don't comment until you do.

1 more reply

dang9mo ago

OK, we've removed all understanding from the title above.

fragmede9mo ago

Care to provide reasoning as to why?

dang9mo ago

The article's title was longer than 80 chars, which is HN's limit. There's more than one way to truncate it.

The previous truncation ("From GPT-4 to GPT-5: Measuring Progress in Medical Language Understanding") was baity in the sense that the word 'understanding' was provoking objections and taking us down a generic tangent about whether LLMs really understand anything or not. Since that wasn't about the specific work (and since generic tangents are basically always less interesting*), it was a good idea to find an alternate truncation.

So I took out the bit that was snagging people ("understanding") and instead swapped in "MedHELM". Whatever that is, it's clearly something in the medical domain and has no sharp edge of offtopicness. Seemed fine, and it stopped the generic tangent from spreading further.

* https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

1 more reply

woeirua9mo ago

Interesting topic, but I'm not opening a PDF from some random website. Post a summary of the paper or the key findings here first.

BoredPositron9mo ago

It's hacker news. You can handle a PDF.

jeffbee9mo ago

I approve of this level of paranoia, but I would just like to know why PDFs are dangerous (reasonable) but HTML is not (inconsistent).

HeatrayEnjoyer9mo ago

PDFs can run almost anything and have an attack surface the size of Greece's coast.

1 more reply

j / k navigate · click thread line to collapse

96 comments

aresant9mo ago

Feels like a mixed bag vs regression?

eg - GPT-5 beats GPT-4 on factual recall + reasoning (HeadQA, Medbullets, MedCalc).

But then slips on structured queries (EHRSQL), fairness (RaceBias), evidence QA (PubMedQA).

Hallucination resistance better but only modestly.

Latency seems uneven (maybe more testing?) faster on long tasks, slower on short ones.

TrainedMonkey9mo ago

GPT-5 feels like cost engineering. The model is incrementally better, but they are optimizing for least amount of compute. I am guessing investors love that.

narrator9mo ago

After seeing some painfully bad results, I'm currently using Grok4 for medical queries with a lot of success.

fertrevinoOP9mo ago

Interesting, it seems the anecdotal experience agrees with the benchmark results.

rbinv9mo ago

Afaik, there is currently no "GPT-5 Pro". Did you mean o3-pro or o1-pro (via API)?

Currently, GPT-5 sits at $10/1M output tokens, o3-pro at $80, and o1-pro at a whopping $600: https://platform.openai.com/docs/pricing

Of course this is not indicative of actual performance or quality per $ spent, but according to my own testing, their performance does seem to scale in line with their cost.

2 more replies

RestartKernel9mo ago

I wonder how that math works out. GPT-5 keeps triggering a thinking flow even for relatively simple queries, so each token must be a magnitude cheaper to make this worth the trade-off in performance.

JimDabell9mo ago

I’ve found that it’s super likely to get stuck repeating the exact same incorrect response over and over. It used to happen occasionally with older models, but it happens frequently now.

Things like:

Me: Is this thing you claim documented? Where in the documentation does it say this?

GPT: Here’s a long-winded assertion that what I said before was correct, plus a link to an unofficial source that doesn’t back me up.

Me: That’s not official documentation and it doesn’t say what you claim. Find me the official word on the matter.

GPT: Exact same response, word-for-word.

Me: You are repeating yourself. Do not repeat what you said before. Here’s the official documentation: [link]. Find me the part where it says this. Do not consider any other source.

GPT: Exact same response, word-for-word.

Me: Here are some random words to test if you are listening to me: foo, bar, baz.

GPT: Exact same response, word-for-word.

namibj9mo ago

Go back and edit a prompt of yours in the conversation instead of continuing with garbage in the context.

1 more reply

slashdev9mo ago

If one conversation goes in a bad direction, it's often best to just start over. The bad context often poisons the existing session.

TrainedMonkey9mo ago

That sounds like query caching... which would also align with cost engineering angle.

UltraSane9mo ago

Since the routing is opaque they can dynamically route queries to cheaper models when demand is high.

yieldcrv9mo ago

Yeah look at their open source models and how you get such high parameters in such low vram

Its impressive but a regression for now, in direct comparison to just high parameter model

woeirua9mo ago

Definitely seems like GPT5 is a very incremental improvement. Not what you’d expect if AGI were imminent.

p1esk9mo ago

What would you expect?

fertrevinoOP9mo ago

Mixed results indeed. While it leads the benchmark in two question types, it falls short in others which results in the overall slight regression.

xnx9mo ago

Have you looked at comparing to Google's foundation models or specialty medical models like MedGemma (https://developers.google.com/health-ai-developer-foundation...)?

fertrevinoOP9mo ago

That would be an interesting extension. MedGemma isn't part of the original benchmark either [1]. Since Gemini 2.0 Flash is on 6th place, expectations are for MedGemma to achieve higher than that :)

[1]https://crfm.stanford.edu/helm/medhelm/latest/#/leaderboard

hypoxia9mo ago

Did you try it with high reasoning effort?

ares6239mo ago

Sorry, not directed at you specifically. But every time I see questions like this I can’t help but rephrase in my head:

“Did you try running it over and over until you got the results you wanted?”

dcre9mo ago

aprilthird20219mo ago

1 more reply

brendoelfrendo9mo ago

Bad news: it doesn't seem to work as well as you might think: https://arxiv.org/pdf/2508.01191

5 more replies

SequoiaHope9mo ago

What you describe is a person selecting the best results, but if you can get better results one shot with that option enabled, it’s worth testing and reporting results.

ares6239mo ago

I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"

3 more replies

chairmansteve9mo ago

Or...

"Did you try a room full of chimpanzees with typewriters?"

ancorevard9mo ago

so since reasoning_effort is not discussed anywhere, I assume you used the default which is "medium"?

energy1239mo ago

Also, were tool calls allowed? The point of reasoning models is to delete the facts so finite capacity goes towards the dense reasoning engine rather than recall, with the facts sitting elsewhere.

0xDEAFBEAD9mo ago

So which of these benchmarks are most relevant for an ordinary user who wants to talk to AI about their health issues?

I'm guessing HeadQA, Medbullets, MedHallu, and perhaps PubMedQA? (Seems to me that "unsupported speculation" could be a good thing for a patient who has yet to receive a diagnosis...)

username1359mo ago

I wonder what changed with the models that created regression?

teaearlgraycold9mo ago

Not sure but with each release it feels like they’re just wiping the dirt around and not actually cleaning.

degamad9mo ago

Obligxkcd: https://xkcd.com/1838/

fertrevinoOP9mo ago

loved the cartoon :)

oezi9mo ago

1 more reply

causality09mo ago

I've definitely seen some unexpected behavior from gpt5. For example, it will tell me my query is banned and then give me a full answer anyway.

andai9mo ago

Did this use reasoning or not? GPT-5 with Minimal reasoning does roughly the same as 4o on benchmarks.

credit_guy9mo ago

agos9mo ago

any thread on HN about AI (there's constantly at least one in homepage nowadays) goes like this:

"in my experience [x model] one shots everything and [y model] stumbles and fumbles like a drunkard", for _any_ combination of X and Y.

benlc9mo ago

AndyNemmity9mo ago

Yeah, GPT 5 got into death loops faster than any other LLM, and I stopped using it for anything more than UI prototypes.

czk9mo ago

its possible to use gpt-5-high on the plus plan with codex-cli, its a whole different beast! i dont think theres any other way for plus users to leverage gpt-5 with high reasoning.

codex -m gpt-5 model_reasoning_effort="high"

CuriouslyC9mo ago

GPT-5 is like an autistic savant

mattwad9mo ago

i thought cursor was getting really bad, then i found out i was on a gpt 5 trial. gonna stick with claude :)

kumarvvr9mo ago

I have an issue with the words "understanding", "reasoning", etc when talking about LLMs.

Are they really understanding, or putting out a stream of probabilities?

munchler9mo ago

Does it matter from a practical point of view? It's either true understanding or it's something else that's similar enough to share the same name.

axdsk9mo ago

The polygraph is a good example.

The "lie detector" is used to misguide people, the polygraph is used to measure autonomic arousal.

I think these misnomers can cause real issues like thinking the LLM is "reasoning".

dexterlagan9mo ago

sema4hacker9mo ago

The latter. When "understand", "reason", "think", "feel", "believe", and any of a long list of similar words are in any title, it immediately makes me think the author already drank the kool aid.

manveerc9mo ago

In the context of coding agents, they do simulate “reasoning” when you feed them the output and it is able to correct itself.

qwertytyyuu9mo ago

I agree with “feel” and “believe” but what words would you suggest instead of “understand” and “reason’?

sema4hacker9mo ago

None. Don't anthropomorphize at all. Note that "understanding" has now been removed from the HN title but not the linked pdf.

1 more reply

vexna9mo ago

insin9mo ago

    prefillContext()

hodgehog119mo ago

Probabilities have nothing to do with it; by any appropriate definition, there exist statistical models that exhibit "understanding" and "reasoning".

Workaccount29mo ago

https://ai.vixra.org/pdf/2506.0065v1.pdf

Lays out pretty well what our current knowledge on understanding is

jmpeax9mo ago

Do you yourself really understand, or are you just depolarizing neurons that have reached their threshold?

octomind9mo ago

dmead9mo ago

He doesn't know the answer to that and neither do you.

lucisferre9mo ago

[flagged]

dang9mo ago

Can you please not post like this to HN? It's against the site rules (https://news.ycombinator.com/newsguidelines.html).

The idea is: if you have a substantive point, make it thoughtfully; if not, please don't comment until you do.

1 more reply

dang9mo ago

OK, we've removed all understanding from the title above.

fragmede9mo ago

Care to provide reasoning as to why?

dang9mo ago

The article's title was longer than 80 chars, which is HN's limit. There's more than one way to truncate it.

* https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

1 more reply

woeirua9mo ago

Interesting topic, but I'm not opening a PDF from some random website. Post a summary of the paper or the key findings here first.

BoredPositron9mo ago

It's hacker news. You can handle a PDF.

jeffbee9mo ago

I approve of this level of paranoia, but I would just like to know why PDFs are dangerous (reasonable) but HTML is not (inconsistent).

HeatrayEnjoyer9mo ago

PDFs can run almost anything and have an attack surface the size of Greece's coast.

1 more reply

j / k navigate · click thread line to collapse