I found this interesting. Here are the detailed results: https://www.fertrevino.com/docs/gpt5_medhelm.pdf
E.g., GPT-5 beats GPT-4 on factual recall and reasoning (HeadQA, Medbullets, MedCalc).
But it then slips on structured queries (EHRSQL), fairness (RaceBias), and evidence QA (PubMedQA).
Hallucination resistance is better, but only modestly.
Latency seems uneven (maybe it needs more testing?): faster on long tasks, slower on short ones.
I wonder if part of the degraded performance comes from the model deciding you're heading into a dangerous area and getting more and more vague, like in the fireworks example they demoed on launch day. It gets very vague when talking about non-abusable prescription drugs, for example. I wonder if that sort of nerfing gradient is affecting medical queries.
After seeing some painfully bad results, I'm currently using Grok 4 for medical queries with a lot of success.
Currently, GPT-5 sits at $10/1M output tokens, o3-pro at $80, and o1-pro at a whopping $600: https://platform.openai.com/docs/pricing
Of course this is not indicative of actual performance or quality per $ spent, but according to my own testing, their performance does seem to scale in line with their cost.
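To make the price gap concrete, here's a rough sketch that turns the per-1M-output-token prices quoted above into a per-response cost. The 2,000-token response length and the function name are my own assumptions for illustration, and it ignores input tokens entirely.

```python
# Output-token prices quoted in the comment above, in USD per 1M output tokens.
PRICE_PER_MILLION_OUTPUT = {
    "gpt-5": 10.0,
    "o3-pro": 80.0,
    "o1-pro": 600.0,
}


def output_cost(model: str, output_tokens: int) -> float:
    """Return the output-token cost in USD for a single response."""
    return PRICE_PER_MILLION_OUTPUT[model] * output_tokens / 1_000_000


if __name__ == "__main__":
    tokens = 2_000  # hypothetical response length, not a measured value
    for model in PRICE_PER_MILLION_OUTPUT:
        print(f"{model}: ${output_cost(model, tokens):.4f} per response")
```

At those prices the same response costs 60x more on o1-pro than on GPT-5, so "scales in line with cost" is a high bar to clear.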
Things like:
Me: Is this thing you claim documented? Where in the documentation does it say this?
GPT: Here’s a long-winded assertion that what I said before was correct, plus a link to an unofficial source that doesn’t back me up.
Me: That’s not official documentation and it doesn’t say what you claim. Find me the official word on the matter.
GPT: Exact same response, word-for-word.
Me: You are repeating yourself. Do not repeat what you said before. Here’s the official documentation: [link]. Find me the part where it says this. Do not consider any other source.
GPT: Exact same response, word-for-word.
Me: Here are some random words to test if you are listening to me: foo, bar, baz.
GPT: Exact same response, word-for-word.
It’s so repetitive I wonder if it’s an engineering fault, because it’s weird that the model would be so consistent in its responses regardless of the input. Once it gets stuck, it doesn’t matter what I enter, it just keeps saying the same thing over and over.
It's impressive, but a regression for now in direct comparison to just a high-parameter model.
[1] https://crfm.stanford.edu/helm/medhelm/latest/#/leaderboard
“Did you try running it over and over until you got the results you wanted?”
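For what it's worth, the arithmetic behind that jab is simple. A minimal sketch, assuming each run succeeds independently with the same probability (a simplification; real reruns are not necessarily independent), with a function name made up for illustration:

```python
def chance_of_at_least_one_success(p: float, n: int) -> float:
    """Probability that at least one of n independent runs succeeds,
    given each run succeeds with probability p."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-run success rate of 20%.
    for runs in (1, 5, 10):
        print(runs, round(chance_of_at_least_one_success(0.2, runs), 4))
```

So a setup that only works one time in five will still "work" about nine times out of ten if you allow yourself ten tries and report the best one.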
One thing that's hard to wrap my head around is that we are giving more and more trust to something we don't understand, with the assumption (often unchecked) that it just works. Basically your refrain is used to justify all sorts of odd setups of AIs, agents, etc.
As one might expect, because the AI isn't actually thinking, it's just spending more tokens on the problem. This sometimes leads to the desired outcome but the phenomenon is very brittle and disappears when the AI is pushed outside the bounds of its training.
To quote their discussion, "CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces."
"Did you try a room full of chimpanzees with typewriters?"
I'm guessing HeadQA, Medbullets, MedHallu, and perhaps PubMedQA? (Seems to me that "unsupported speculation" could be a good thing for a patient who has yet to receive a diagnosis...)
Maybe in practice it's better to look at RAG benchmarks, since a lot of AI tools will search online for information before giving you an answer anyways? (Memorization of info would matter less in that scenario)
"in my experience [x model] one shots everything and [y model] stumbles and fumbles like a drunkard", for _any_ combination of X and Y.
I get the idea of sharing what's working and what's not, but at this point it's clear that there are more factors to using these with success and it's hard to replicate other people's successful workflows.
codex -m gpt-5 -c model_reasoning_effort="high"
Are they really understanding, or putting out a stream of probabilities?
The "lie detector" is used to misguide people; the polygraph is used to measure autonomic arousal.
I think these misnomers can cause real issues like thinking the LLM is "reasoning".
Probabilities have nothing to do with it; by any appropriate definition, there exist statistical models that exhibit "understanding" and "reasoning".
Lays out pretty well what our current knowledge on understanding is
The idea is: if you have a substantive point, make it thoughtfully; if not, please don't comment until you do.
The previous truncation ("From GPT-4 to GPT-5: Measuring Progress in Medical Language Understanding") was baity in the sense that the word 'understanding' was provoking objections and taking us down a generic tangent about whether LLMs really understand anything or not. Since that wasn't about the specific work (and since generic tangents are basically always less interesting*), it was a good idea to find an alternate truncation.
So I took out the bit that was snagging people ("understanding") and instead swapped in "MedHELM". Whatever that is, it's clearly something in the medical domain and has no sharp edge of offtopicness. Seemed fine, and it stopped the generic tangent from spreading further.
* https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...