I found this interesting. Here are the detailed results: https://www.fertrevino.com/docs/gpt5_medhelm.pdf
e.g., GPT-5 beats GPT-4 on factual recall + reasoning (HeadQA, Medbullets, MedCalc).
But it slips on structured queries (EHRSQL), fairness (RaceBias), and evidence QA (PubMedQA).
Hallucination resistance is better, but only modestly.
Latency seems uneven (maybe needs more testing?): faster on long tasks, slower on short ones.
I wonder if part of the degraded performance comes from the model getting more and more vague when it thinks you're heading into a dangerous area, like the fireworks example they demoed on launch day. It gets very vague when talking about non-abusable prescription drugs, for example. I wonder if that sort of nerfing gradient is affecting medical queries.
After seeing some painfully bad results, I'm currently using Grok 4 for medical queries with a lot of success.
Things like:
Me: Is this thing you claim documented? Where in the documentation does it say this?
GPT: Here’s a long-winded assertion that what I said before was correct, plus a link to an unofficial source that doesn’t back me up.
Me: That’s not official documentation and it doesn’t say what you claim. Find me the official word on the matter.
GPT: Exact same response, word-for-word.
Me: You are repeating yourself. Do not repeat what you said before. Here’s the official documentation: [link]. Find me the part where it says this. Do not consider any other source.
GPT: Exact same response, word-for-word.
Me: Here are some random words to test if you are listening to me: foo, bar, baz.
GPT: Exact same response, word-for-word.
It’s so repetitive I wonder if it’s an engineering fault, because it’s weird that the model would be so consistent in its responses regardless of the input. Once it gets stuck, it doesn’t matter what I enter, it just keeps saying the same thing over and over.
It's impressive, but for now a regression in direct comparison to a plain high-parameter model.
[1] https://crfm.stanford.edu/helm/medhelm/latest/#/leaderboard
“Did you try running it over and over until you got the results you wanted?”
"Did you try a room full of chimpanzees with typewriters?"
I'm guessing HeadQA, Medbullets, MedHallu, and perhaps PubMedQA? (Seems to me that "unsupported speculation" could be a good thing for a patient who has yet to receive a diagnosis...)
Maybe in practice it's better to look at RAG benchmarks, since a lot of AI tools will search online for information before giving you an answer anyway? (Memorization of info would matter less in that scenario.)
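For what it's worth, the retrieve-then-answer loop those tools use is easy to sketch. Here's a minimal toy version where crude keyword-overlap scoring stands in for a real web search or vector index, and the corpus and query are made up for illustration:

```python
# Toy retrieve-then-answer (RAG) sketch. A real tool would use web search or
# a vector index for the retrieve step, but the overall shape is the same.
# The corpus, query, and prompt wording here are all made up.

def retrieve(query, corpus, k=2):
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, corpus):
    """Stuff the top-ranked documents into the prompt, so the model answers
    from the retrieved sources rather than from memorized training data."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return (
        "Answer using only the sources below.\n"
        f"Sources:\n{context}\n"
        f"Question: {query}"
    )

corpus = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "The Eiffel Tower is in Paris.",
    "Insulin therapy is used when oral agents fail in type 2 diabetes.",
]
print(build_prompt("first-line treatment for type 2 diabetes", corpus))
```

The point of the sketch: once retrieval supplies the facts, the model's job shifts from recall to reading comprehension, which is why memorization-heavy benchmarks may understate how these tools perform in practice.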
"in my experience [x model] one shots everything and [y model] stumbles and fumbles like a drunkard", for _any_ combination of X and Y.
I get the idea of sharing what's working and what's not, but at this point it's clear that there are more factors involved in using these successfully, and it's hard to replicate other people's workflows.
codex -m gpt-5 -c model_reasoning_effort="high"
Are they really understanding, or putting out a stream of probabilities?
The "lie detector" is used to misguide people, the polygraph is used to measure autonomic arousal.
I think these misnomers can cause real issues like thinking the LLM is "reasoning".
Probabilities have nothing to do with it; by any appropriate definition, there exist statistical models that exhibit "understanding" and "reasoning".
It lays out pretty well what our current knowledge of understanding is.