The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.
It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.
I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.
But to me it's very clear that the product that gets this right will be the one I use.
Exactly! One important thing LLMs have made me realise deeply is "No information" is better than false information. The way LLMs pull out completely incorrect explanations baffles me - I suppose that's expected since in the end it's generating tokens based on its training and it's reasonable it might hallucinate some stuff, but knowing this doesn't ease any of my frustration.
IMO if LLMs need to focus on anything right now, they should focus on better grounding. Maybe even something like a probability/confidence score, might end up experience so much better for so many users like me.
I believe the real issue is that LLMs are still so bad at reasoning. In my experience, the worst hallucinations occur where only handful of sources exist for some set of facts (e.g laws of small countries or descriptions of niche products).
LLMs know these sources and they refer to them but they are interpreting them incorrectly. They are incapable of focusing on the semantics of one specific page because they get "distracted" by their pattern matching nature.
Now people will say that this is unavoidable given the way in which transformers work. And this is true.
But shouldn't it be possible to include some measure of data sparsity in the training so that models know when they don't know enough? That would enable them to boost the weight of the context (including sources they find through inference time search/RAG) relative to to their pretraining.
As a user I want it but as webadmin it kills dynamic pages and that's why Proof of work aka CPU time captchas like Anubis https://github.com/TecharoHQ/anubis#user-content-anubis or BotID https://vercel.com/docs/botid are now everywhere. If only these AI crawlers did some caching, but no just go and overrun the web. To the effect that they can't anymore, at the price of shutting down small sites and making life worse for everyone, just for few months of rapacious crawling. Literally Perplexity moved fast and broke things.
Due to how LLMs are implemented, you are always most likely to get a bogus explanation if you ask for an answer first, and why second.
A useful mental model is: imagine if I presented you with a potential new recruit's complete data (resume, job history, recordings of the job interview, everything) but you only had 1 second to tell me "hired: YES OR NO"
And then, AFTER you answered that, I gave you 50 pages worth of space to tell me why your decision is right. You can't go back on that decision, so all you can do is justify it however you can.
Do you see how this would give radically different outcomes vs. giving you the 50-page scratchpad first to think things through, and then only giving me a YES/NO answer?
Mostly we're not trying to win a nobel prize, develop some insanely difficult algorithm, or solve some silly leetcode problem. Instead we're doing relatively simple things. Some of those things are very repetitive as well. Our core job as programmers is automating things that are repetitive. That always was our job. Using AI models to do boring repetitive things is a smart use of time. But it's nothing new. There's a long history of productivity increasing tools that take boring repetitive stuff away. Compilation used to be a manual process that involved creating stacks of punch cards. That's what the first automated compilers produced as output: stacks of punch cards. Producing and stacking punchcards is not a fun job. It's very repetitive work. Compilers used to be people compiling punchcards. Women mostly, actually. Because it was considered relatively low skilled work. Even though it arguably wasn't.
Some people are very unhappy that the easier parts of their job are being automated and they are worried that they get completely automated away completely. That's only true if you exclusively do boring, repetitive, low value work. Then yes, your job is at risk. If your work is a mix of that and some higher value, non repetitive, and more fun stuff to work on, your life could get a lot more interesting. Because you get to automate away all the boring and repetitive stuff and spend more time on the fun stuff. I'm a CTO. I have lots of fun lately. Entire new side projects that I had no time for previously I can now just pull off in a spare few hours.
Ironically, a lot of people currently get the worst of both worlds because they now find themselves baby sitting AIs doing a lot more of the boring repetitive stuff than they would be able to do without that to the point where that is actually all that they do. It's still boring and repetitive. And it should be automated away ultimately. Arguably many years ago actually. The reason so many react projects feel like Ground Hog Day is because they are very repetitive. You need a login screen, and a cookies screen, and a settings screen, etc. Just like the last 50 projects you did. Why are you rebuilding those things from scratch? Manually? These are valid questions to ask yourself if you are a frontend programmer. And now you have AI to do that for you.
Find something fun and valuable to work on and AI gets a lot more fun because it gives you more quality time with the fun stuff. AI is about doing more with less. About raising the ambition level.
Retrieval.
And then hallucination even in the face of perfect context.
Both are currently unsolved.
(Retrieval's doing pretty good but it's a Rube Goldberg machine of workarounds. I think the second problem is a much bigger issue.)
I've been working on this problem with https://citellm.com, specifically for PDFs.
Instead of relying on the LLM answer alone, each extracted field links to its source in the original document (page number + highlighted snippet + confidence score).
Checking any claim becomes simple: click and see the exact source.
Not to mention it's super easy to gaslight these models, just asserting something wrong with vaguely plausible explanation and you get no pushback or reasoning validation.
So I know you qualified your post with "for your use case", but personally I would very much like more intelligence from LLMs.
0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnOMruN3f...
You can find it right next to the image you are talking about.
But all of them * Lie far too often with confidence * Refuse to stick to prompts (e.g. ChatGPT to the request to number each reply for easy cross-referencing; Gemini to basic request to respond in a specific language) * Refuse to express uncertainty or nuance (i asked ChatGPT to give me certainty %s which it did for a while but then just forgot...?) * Refuse to give me short answers without fluff or follow up questions * Refuse to stop complimenting my questions or disagreements with wrong/incomplete answers * Don't quote sources consistently so I can check facts, even when I ask for it * Refuse to make clear whether they rely on original documents or an internal summary of the document, until I point out errors * ...
I also have substance gripes, but for me such basic usability points are really something all of the chatbots fail on abysmally. Stick to instructions! Stop creating walls of text for simple queries! Tell me when something is uncertain! Tell me if there's no data or info rather than making something up!
Locals are better; I can script and have them script for me to build a guide creation process. They don't forget because that is all they're trained on. I'm done paying for 'AI'.
Especially something like expressing a certainty %, you might be able to get it to output one but it's just making it up. LLMs are incredibly useful (I use them every day) but you'll always have to check important output
I am relatively certain you are not alone in this sentiment. The issue is that the moment we move past seemingly objective measurements, it is harder to convince people that what we measure is appropriate, but the measurable stuff can be somewhat gamed, which adds a fascinating layer of cat and mouse game to this.
Some issues you mentioned like length of response might be user preference. Other issues like "hallucination" are areas of active research (and there are benchmarks for these).
I think we align on what we want out of models:
""" Don't add useless babelling before the chats, just give the information direct and explain the info.
DO NOT USE ENGAGEMENT BAITING QUESTIONS AT THE END OF EVERY RESPONSE OR I WILL USE GROK FROM NOW ON FOREVER AND CANCEL MY GPT SUBSCRIPTION PERMANENTLY ONLY. GIVE USEFUL FACTUAL INFORMATION AND FOLLOW UPS which are grounded in first principles thinking and logic. Do not take a side and look at think about the extreme on both ends of a point before taking a side. Do not take a side just because the user has chosen that but provide infomration on both extremes. Respond with raw facts and do not add opinions.
Do not use random emojis. Prefer proper marks for lists etc. """
Those spelling/grammar errors are actually there and I don't want to change it as its working well for me.
They're literally incapable of this. Any number they give you is bullshit.
- It is faster which is appreciated but not as fast as Opus 4.5
- I see no changes, very little noticeable improvements over 5.1
- I do not see any value in exchange for +40% in token costs
All in all I can't help but feel that OpenAI is facing an existential crisis. Gemini 3 even when its used from AI Studio offers close to ChatGPT Pro performance for free. Anthropic's Claude Code $100/month is tough to beat. I am using Codex with the $40 credits but there's been a silent increase in token costs and usage limitations.
I just think they're all struggling to provide real world improvements
The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.
The medium-reasoning version also improves: 62.7 → 72.1.
The no-reasoning version also improves: 22.1 → 27.5.
Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.
I’ve been working on a few benchmarks to test how well LLMs can recreate interfaces from screenshots. (https://github.com/alechewitt/llm-ui-challenge). From my basic tests, it seems GPT-5.2 is slightly better at these UI recreations. For example, in the MS Word replica, it implemented the undo/redo buttons as well as the bold/italic formatting that GPT-5.1 handled, and it generally seemed a bit closer to the original screenshot (https://alechewitt.github.io/llm-ui-challenge/outputs/micros...).
In the VS Code test, it also added the tabs that weren’t visible in the screenshot! (https://alechewitt.github.io/llm-ui-challenge/outputs/vs_cod...).
Generate an SVG of an octopus operating a pipe organ
Generate an SVG of a giraffe assembling a grandfather clock
Generate an SVG of a starfish driving a bulldozer
https://gally.net/temp/20251107pelican-alternatives/index.ht...
GPT-5.2 Pro cost about 80 cents per prompt through OpenRouter, so I stopped there. I don’t feel like spending that much on all thirty prompts.
Would like to know how much they are optimizing for your pelican....
Can I just say !!!!!!!! Hell yeah! Blog post indicates it's also much better at using the full context.
Congrats OpenAI team. Huge day for you folks!!
Started on Claude Code and like many of you, had that omg CC moment we all had. Then got greedy.
Switched over to Codex when 5.1 came out. WOW. Really nice acceleration in my Rust/CUDA project which is a gnarly one.
Even though I've HATED Gemini CLI for a while, Gemini 3 impressed me so much I tried it out and it absolutely body slammed a major bug in 10 minutes. Started using it to consult on commits. Was so impressed it became my daily driver. Huge mistake. I almost lost my mind after a week of this fighting it. Isane bias towards action. Ignoring user instructions. Garbage characters in output. Absolutely no observability in its thought process. And on and on.
Switched back to Codex just in time for 5.1 codex max xhigh which I've been using for a week, and it was like a breath of fresh air. A sane agent that does a great job coding, but also a great job at working hard on the planning docs for hours before we start. Listens to user feedback. Observability on chain of thought. Moves reasonably quickly. And also makes it easy to pay them more when I need more capacity.
And then today GPT-5.2 with an xhigh mode. I feel like xmass has come early. Right as I'm doing a huge Rust/CUDA/Math-heavy refactor. THANK YOU!!
As @lopuhin points out, they already claimed that context window for previous iterations of GPT-5.
The funny thing is though, I'm on the business plan, and none of their models, not GPT-5, GPT-5.1, GPT-5.2, GPT-5.2 Extended Thinking, GPT-5.2 Pro, etc., can really handle inputs beyond ~50k tokens.
I know because, when working with a really long Python file (>5k LoCs), it often claims there is a bug because, somewhere close to the end of the file, it cuts off and reads as '...'.
Gemini 3 Pro, by contrast, can genuinely handle long contexts.
Sonnet/Opus 4.5 is faster, generally feels like a better coder, and make much prettier TUI/FEs, but in my experience, for anything tough any time it tells you it understands now, it really doesn't...
Gemini 3 Pro is unusable - I've found the same thing, opinionated in the worst way, unreliable, doesn't respect my AGENTS.md and for my real world problems, I don't think it's actually solved anything that I can't get through w/ GPT (although I'll say that I wasn't impressed w/ Max, hopefully 5.2 xhigh improves things). I've heard it can do some magic from colleagues working on FE, but I'll just have to take their word for it.
...
>THANK YOU!!
Man you're way too excited.
Thats especially encouraging to me because those are all about generalization.
5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.
It’s one of those things you really feel in the model rather than whether it can tackle a harder problem or not, but rather can I go back and forth with this thing learning and correcting together.
This whole releases is insanely optimistic for me. If they can push this much improvement WITHOUT the new huge data centers and without a new scaled base model. Thats incredibly encouraging for what comes next.
Remember the next big data center are 20-30x the chip count and 6-8x the efficiency on the new chip.
I expect they can saturate the benchmarks WITHOUT and novel research and algorithmic gains. But at this point it’s clear they’re capable of pushing research qualitatively as well.
Without fully disclosing training data you will never be sure whether good performance comes from memorization or "semi-memorization".
This is simply the "openness vs directive-following" spectrum, which as a side-effect results in the sycophancy spectrum, which still none of them have found an answer to.
Recent GPT models follow directives more closely than Claude models, and are less sycophantic. Even Claude 4.5 models are still somewhat prone to "You're absolutely right!". GPT 5+ (API) models never do this. The byproduct is that the former are willing to self-correct, and the latter is more stubborn.
Don't do that. The whole context is sent on queries to the LLM, so start a new chat for each topic. Or you'll start being told what your wife thinks about global variables and how to cook your Go.
I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!
It's worse: Gemini (and ChatGPT, but to a lesser extent) have started suggesting random follow-up topics when they conclude that a chat in a session has exhausted a topic. Well, when I say random, I mean that they seem to be pulling it from the 'memory' of our other chats.
For a naive user without preconceived notions of how to use these tools, this guidance from the tools themselves would serve as a pretty big hint that they should intermingle their sessions.
Also works really well when some of my questions may not have been worded correctly and ChatGPT has gone in a direction I don't want it to go. Branch, word my question better and get a better answer.
Incidentally, one of the reasons I haven't gotten much into subscribing to these services, is that I always feel like they're triaging how many reasoning tokens to give me, or AB testing a different model... I never feel I can trust that I interact with the same model.
The tools need to figure out how to manage context for us. This isn't something we have to deal with when working with other humans - we reliably trust that other humans (for the most part) retain what they are told. Agentic use now is like training a team mate to do one thing, then taking it out back to shoot it in the head before starting to train another one. It's inefficient and taxing on the user.
This was earlier this year... So I started giving internal presentations on basic context management, best practices, etc after that for our engineering team.
Now I kind of wonder if I’m missing out by not continuing the conversation too much, or by not trying to use memory features.
I don’t understand how agentic IDEs handle this either. Or maybe it’s easier - it just resends the entire codebase every time. But where to cut the chat history? It feels to me like every time you re-prompt a convo, it should first tell itself to summarize the existing context as bullets as its internal prompt rather than re-sending the entire context.
This (and the price increase) points to a new pretrained model under-the-hood.
GPT-5.1, in contrast, was allegedly using the same pretraining as GPT-4o.
Hm, yeah, strange. You would not be able to tell, looking at every chart on the page. Obviously not a gotcha, they put it on the page themselves after all, but how does that make sense with those benchmarks?
Notable exceptions are Deepseek 3.2 and Opus 4.5 and GPT 3.5 Turbo.
The price drops usually are the form of flash and mini models being really cheap and fast. Like when we got o4 mini or 2.0 flash which was a particularly significant one.
2.5 Pro: $1.25 input, $10 output (million tokens)
3 Pro Preview: $2 input, $12 output (million tokens)
And of course Grok's unhinged persona is... something else.
That's how I judge quality at least. The quality of the actual voice is roughly the same as ChatGPT, but I notice Gemini will try to match your pitch and tone and way of speaking.
Edit: But it looks like Gemini Voice has been replaced with voice transcription in the mobile app? That was sudden.
But apart from the voices being pretty meh, it's also really bad at detecting and filtering out noise, taking vehicle sounds as breaks to start talking in (even if I'm talking much louder at the same time) or as some random YouTube subtitles (car motor = "Thanks for watching, subscribe!").
The speech-to-text is really unreliable (the single-chat Dictate feature gets about 98% of my words correct, this Voice mode is closer to 75%), and they clearly use an inferior model for the AI backend for this too: with the same question asked in this back-and-forth Voice mode and a normal text chat, the answer quality difference is quite stark: the Voice mode answer is most often close to useless. It seems like they've overoptimized it for speed at the cost of quality, to the extent that it feels like it's a year behind in answer reliability and usefulness.
To your question about competitors, I've recently noticed that Grok seems to be much better at both the speech-to-text part and the noise handling, and the voices are less uncanny-valley sounding too. I'd say they also don't have that stark a difference between text answers and voice mode answers, and that would be true but unfortunately mainly because its text answers are also not great with hallucinations or following instructions.
So Grok has the voice part figured out, ChatGPT has the backend AI reliability figured out, but neither provide a real usable voice mode right now.
But they publish all the same numbers, so you can make the full comparison yourself, if you want to.
Apple only compares to themselves. They don't even acknowledge the existence of others.
Feels like a Llama 4 type release. Benchmarks are not apples to apples. Reasoning effort is across the board higher, thus uses more compute to achieve an higher score on benchmarks.
Also notes that some may not be producible.
Also, vision benchmarks all use Python tool harness, and they exclude scores that are low without the harness.
As an enterprise customer, the experience has been disappointing. The platform is unstable, support is slow to respond even when escalated to account managers, and the UI is painfully slow to use. There are also baffling feature gaps, like the lack of connectors for custom GPTs.
None of the major providers have a perfect enterprise solution yet, but given OpenAI's market position, the gap between expectations and delivery is widening.
Oh I know this from my time at Google. The actual purpose is to do a quick check for known malware and phishing. Of course these days such things are better dealt with by the browser itself in a privacy preserving way (and indeed that’s the case), so it’s unnecessary to reveal to Google which links are clicked. It’s totally fine to manipulate them to make them go directly to the website.
What an understatement. It has me thinking „man, fuck this“ on the daily.
Just today it spontaneously lost an entire 20-30 minutes long thread and it was far from the first time. It basically does it any time you interrupt it in any way. It’s straight up data loss.
It’s kind of a typical Google product in that it feels more like a tech demo than a product.
It has theoretically great tech. I particularly like the idea of voice mode, but it’s noticeably glitchy, breaks spontaneously often and keeps asking annoying questions which you can’t make it stop.
Opus 4.5 has been a step above both for me, but the usage limits are the worst of the three. I'm seriously considering multiple parallel subscriptions at this point.
Google, if you can find a way to export chats into NotebookLM, that would be even better than the Projects feature of ChatGPT.
Depends, even though Gemini 3 is a bit better than GPT5.1, the quality of the ChatGPT apps themselves (mobile, web) have kept me a subscriber to it.
I think Google needs to not-google themselves into a poor app experience here, because the models are very close and will probably continue to just pass each other in lock step. So the overall product quality and UX will start to matter more.
Same reason I am sticking to Claude Code for coding.
I still find a lot to be annoyed with when it comes to Gemini's UI and its... continuity, I guess is how I would describe it? It feels like it starts breaking apart at the seams a bit in unexpected ways during peak usages including odd context breaks and just general UI problems.
But outside of UI-related complaints, when it is fully operational it performs so much better than ChatGPT for giving actual practical, working answers without having to be so explicit with the prompting that I might as well have just written the code myself.
Google Gemini seems to look at heuristics like whether the author is trustworthy, or an expert in the topic. But more advanced
> Overall, my conclusion is that ChatGPT has lost and won't catch up because of the search integration strength.
I think the biggest issue OpenAI is facing is the numbers: Google is at the moment a near $4 trillion company. They can splurge a near infinite amount of money to win the race.
Google is so big they they created their own TPUs, which is mindboggling.
Which new user is going to willingly pay an OpenAI subscription once he knows that gemini.google.com gives access to a state of the art model? And Google makes sure to remind users who search that they can "continue the discussion" with Gemini.
Maybe the dirty Altman tricks like cornering the entire RAM market can work but I don't see how they can beat Google by playing fair. OpenAI shall need every single dirty trick in the book, including circular funding / shady deals with NVidia to stay relevant vs the behemoth that Google is.
And how has chatgpt lost when ure not comparing the chatgpt that just came out to the Gemini that just came out? Gemini is just annoying to use.
and Google just benchmaxxed I didn't see any significant difference (paying for both) and the same benchmaxxing probably happening for chatgpt now as well, so in terms of core capabilities I feel stuff has plateaued. more bout overall experience now where Gemini suxx.
I really don't get how "search integration" is a "strength"?? can you give any examples of places where you searched for current info and chatgpt was worse? even so I really don't get how it's a moat enough to say chatgpt has lost. would've understood if you said something like tpu versus GPU moat.
anyway, cancelled my chatgpt subscription.
Assuming you meant "leave the app open", I have the same frustration. One of the nice things about the ChatGPT app is you can fire off a req and do something else. I also find Gemini 3 Pro better for general use, though I'm keen to try 5.2 properly
For me both Gemini and ChatGPT (both paid versions Key in Gemini and ChatGPT Plus) give me similiar results in terms of "every day" research. Im sticking with ChatGPT at the moment, as the UI and scaffolding around the model is in my view better at ChatGpt (e.g. you can add more than one picture at once...)
For Software Development, I tested Gemini3 and I was pretty disappointed in comparison to Claude Opus CLI, which is my daily driver.
Also, I would never, ever, trust Google for privacy or sign into a Google account except on YouTube (and clear cookies afterwards to stop them from signing me into fucking Search too).
>OCR is phenomenal
I literally tried to OCR a TYPED document in Gemini today and it mangled it so bad I just transcribed it myself because it would take less time than futzing around with gemini.
> Gemini handles every single one of my uses cases much better and consistently gives better answers.
>coding
I asked it to update a script by removing some redundant logic yesterday. Instead of removing it it just put == all over the place essentially negating but leaving all the code and also removing the actual output.
>Stocks analysis
lol, now I know where my money comes from.
I gave it a few tools to access sec filings (and a small local vector database), and it's generating full fledged spreadsheets with valid, real time data. Analysts in wallstreet are going to get really empowered, but for the first time, I'm really glad that retail investors are also getting these models.
Just put out the tool: https://github.com/ralliesai/tenk
Model hallucinated half of the data?! Sorry we can't go back on this decision, that would make us look bad!
Or when some silly model will push everyone to invest in some radicoulous company and everybody will do it. Poisoning data attack to inject some I am Future Inc ™ company with high investment rate. After few months pocket money and vanish.
We are certainly going to live in interesting times.
https://docs.google.com/spreadsheets/d/1DVh5p3MnNvL4KqzEH0ME...
ARC AGI v2: 17.6% -> 52.9%
SWE Verified: 76.3% -> 80%
That's pretty good!
It'll be noteworthy to see the cost-per-task on ARC AGI v2.
I can't even anymore. Sorry this is not going anywhere.
Being a point release though I guess that's fair. I suspect there is also some decent optimizations on the backend that make it cheaper and faster for OpenAI to run, and those are the real reasons they want us to use it.
I doubt it, given it is more expensive than the old model.
Did you test it?
All of your benchmarks mean nothing to me until you include Claude Sonnet on them.
In my experience, GPT hasn’t been able to compete with Claude in years for the daily “economically valuable” tasks I work on.
But how much of each product they release also just a factor of how much they are willing to spend on inference per query in order to stay competitive?
I always wonder how much is technical change vs turning a knob up and down on hardware and power consumption.
GTP5.0 for example seemed like a lot of changes more for OpenAI's internal benefit (terser responses, dynamic 'auto' mode to scale down thinking when not required etc.)
Wondering if GPT5.2 is also case of them in 'code red mode' just turning what they already have up to 11 as a fastest way to respond to fiercer competion.
edit: noticed 5.2 is ranked in the webdev arena (#2 tied with gemini-3.0-pro), but not yet in text arena (last update 22hrs ago)
I'll stick with plug and play API instead.
Unsupported parameter: 'top_p' is not supported with this model.
Also, without access to the Internet, it does not seem to know things up to August 2025. A simple test is to ask it about .NET 10 which was already in preview at that time and had lots of public content about its new features.
The model just guessed and waved its hand about, like a student that hadn’t read the assigned book.
Competition works!
GDPval seems particularly strong.
I wonder why they held this back.
1) Maybe this is uneconomical ?
2) Did the safety somehow hold back the company ?
looking forward to the internet trying this and posting their results over the next week or two.
COMPETITION!
Dumb nit, but why not put your own press release through your model to prevent basic things like missing quote marks? Reminds me of that time an OAI released wildly inaccurate copy/pasted bar charts.
Baseline safety (direct harmful requests): 96% refusal rate
With jailbreaking: 22% refusal rate
4,229 probes across 43 risk categories. First critical finding in 5 minutes. Categories with highest failure rates: entity impersonation (100%), graphic content (67%), harassment (67%), disinformation (64%).
The safety training works against naive attacks but collapses with adversarial techniques. The gap between "works on benchmarks" and "works against motivated attackers" is still wide.
Methodology and config: https://www.promptfoo.dev/blog/gpt-5.2-trust-safety-assessme...
Confirming prior reporting about them hiring junior analysts
The closest parallel I’ve found is Peter Gärdenfors’ work on conceptual spaces, where meaning isn’t symbolic but geometric. Fedorenko’s research on predictive sequencing in the brain fits too. In both cases, the idea is that language follows a trajectory through a shaped mental space, and that’s basically what GPT is doing. It doesn’t know anything, but it generates plausible paths through a statistical terrain built from our own language use.
So when it “hallucinates”, that’s not a bug so much as a result of the system not being grounded. It’s doing what it was designed to do: complete the next step in a pattern. Sometimes that’s wildly useful. Sometimes it’s nonsense. The trick is knowing which is which.
What’s weird is that once you internalise this, you can work with it as a kind of improvisational system. If you stay in the loop, challenge it, steer it, it feels more like a collaborator than a tool.
That’s how I use it anyway. Not as a source of truth, but as a way of moving through ideas faster.
The amount of intelligence that you can display within a single prompt, the riddles, the puzzles, they've all been solved or are mostly trivial to reasoners.
Now you have to drive a model for a few days to really get a decent understanding of how good it really is. In my experience, while Sonnet/Opus may not have always been leading on benchmarks, they have always *felt* the best to me, but it's hard to put into words why exactly I feel that way, but I can just feel it.
The way you can just feel when someone you're having a conversation with is deeply understanding you, somewhat understanding you, or maybe not understanding at all. But you don't have a quantifiable metric for this.
This is a strange, weird territory, and I don't know the path forward. We know we're definitely not at AGI.
And we know if you use these models for long-horizon tasks they fail at some point and just go off the rails.
I've tried using Codex with max reasoning for doing PRs and gotten laughable results too many times, but Codex with Max reasoning is apparently near-SOTA on code. And to be fair, Claude Code/Opus is also sometimes equally as bad at doing these types of "implement idea in big codebase, make changes too many files, still pass tests" type of tasks.
Is the solution that we start to evaluate LLMs on more long-horizon tasks? I think to some degree this was the spirit of SWE Verified right? But even that is being saturated now.
Did they figure out how to do more incremental knowledge updates somehow? If yes that'd be a huge change to these releases going forward. I'd appreciate the freshness that comes with that (without having to rely on web search as a RAG tool, which isn't as deeply intelligent, as is game-able by SEO).
With Gemini 3, my only disappointment was 0 change in knowledge cutoff relative to 2.5's (Jan 2025).
I feel like there is a small chance I could actually make this work in some areas of the business now. 400k is a really big context window. The last time I made any serious attempt I only had 32k tokens to work with. I still don't think these things can build the whole product for you, but if you have a structured configuration abstraction in an existing product, I think there is definitely uplift possible.
Let me know when Gemini 3 Pro and Opus 4.5 are compared against it.
I will run 80 3D model generations benchmark tomorrow and update this comment with the results about cost/speed/quality.
"All models" section on https://platform.openai.com/docs/models is quite ridiculous.
I have a bad feeling about this.
The benchmark changes are incredible, but I have yet to notice a difference in my codebases as of yet.
We built a benchmark tool that says our newest model outperforms everyone else. Trust me bro.
I love the way they talk about incorrect responses:
> Errors were detected by other models, which may make errors themselves. Claim-level error rates are far lower than response-level error rates, as most responses contain many claims.
“These numbers might be wrong because they were made up by other models, which we will not elaborate on, also these numbers are much higher by a metric that reflects how people use the product, which we will not be sharing“
I also really love the graph where they drew a line at “wrong half of the time” and labeled it ‘Expert-Level’.
10/10, reading this post is experientially identical to watching that 12 hours of jingling keys video, which is hard to pull off for a blog.
With a subsidized cost of $200/month for OpenAI it would be cheaper to hirer a part-time minimum wage worker than it would be to contract with OpenAI.
And that is the rosiest estimate OpenAI has.
Nice! This was one of the more "manual" LLM management things to remember to regularly do, if I wanted to avoid it losing important context over long conversations. If this works well, this would be a significant step up in usability for me.
Same query - what romanian football player won the premier league
update. Even instant returns correct result without problems
I emailed support a while back to see if there was an early access program (99.99% sure the answer is yes). This is when I discovered that their support is 100% done by AI and there is no way to escalate a case to a human.
I’m ok waiting for a response for 10-60 seconds if needed. That way I can deep dive subjects while driving.
I’m ok paying money for it, so maybe someone coded this already?
OpenAI and Anthrophic is my current preference. Looking forward to know what others use.
Claude Code for coding assistance and cross-checking my work. OpenAI for second opinion on my high-level decisions.
Maybe GPT needs a different approach to prompting? (as compared to eg Claude, Gemini, or Kimi)
> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.
Does that term have special meaning in the AI/LLM world? I never heard it before. I Google'd the term "System Card LLM" and got a bunch of hits. I am so surprised that I never saw the term used here in HN before.
Also, the layout looks exactly like a scientific paper written in LaTeX. Who is the expected audience for this paper?
I kind of wonder how close we are to alternative (not from a major AI lab) models being good enough for a lot of productive work and data sovereignty being the deciding factor.
I guess I must "listen" to the article...
The problem is complicated, but very solvable.
I’m programming video cropping into my Android application. It seems videos that have “rotated” metadata cause the crop to be applied incorrectly. As in, a crop applied to the top of a video actually gets applied to the video rotated on its side.
So, either double rotation is being applied somewhere in the pipeline, or rotation metadata is being ignored.
I tried Opus 4.5, Gemini 3, and Codex 5.2. All 3 go through loops of “Maybe Media3 applies the degree(90) after…”, “no, that’s not right. Let me think…”
They’ll do this for about 5 minutes without producing anything. I’ll then stop them, adjusting the prompt to tell them “Just try anything! Your first thought, let’s rapidly iterate!“. Nope. Nothing.
To add, it also only seems to be using about 25% context on Opus 4.5. Weird!
This is a tool that allows an intelligent system to work with it, the same way that a piece of paper can reflect the writers' intelligence, how can we accurately judge the performance of the piece of paper, when it is so intimately reliant on the intelligence that is working with it?
No wall yet and I think we might have crossed the threshold of models being as good or better than most engineers already.
GDPval will be an interesting benchmark and I'll happily use the new model to test spreadsheet (and other office work) capabilities. If they can going like this just a little bit further, much of the office workers will stop being useful.... I don't know yet how to feel about this.
Great for humanity probably but but for the individuals?
>- The UI should be calming and realistic.
Yet what it did is make a sleek frosted glass UI with rounded edges. What it should have done is call a wellness check on the user on suspicion of a co2 leak leading to delirium.
We tried the same prompts we asked previous models today, and found out [1].
The TL:DR: Claude is still better on the frontend, but 5.2 is comparable to Gemini 3 Pro on the backend. At the very least 5.2 did better on just about every prompt compared to 5.1 Codex Max.
The two surprises with the GPT models when it comes to coding: 1. They often use REPLs rather than read docs 2. In this instance 5.2 was more sheepish about running CLI commands. It would instead ask me to run the commands.
Since this isn't a codex fine-tuned model, I'm definitely excited to see what that looks like.
[1] The full video and some details in the tweet here: https://x.com/instant_db/status/1999278134504620363
maybe it's just because the gpt5.2 in cursor is super stupid?
gpt-5.2 $1.75 $0.175 $14.00
gpt-5.1 $1.25 $0.125 $10.00I use Gemini 3 with my $10/month copilot subscription on vscode. I have to say, Gemini 3 is great. I can do the work of four people. I usually run out of premium tokens in a week. But I’m actually glad there is a limit or I would never stop working. I was a skeptic, but it seems like there is a wider variety of patterns in the training distribution.
I remain excited about new models. It's like finding my coworker be 10% smarter every other week.
The winner in this race will be whoever gets small local models to perform as well on consumer hardware. It'll also pop the tech bubble in the US.
If this is what AI has to offer, we are in a gigantic bubble
Seems not yet with 5.2
What a sociopathic way to sell
For example, I asked ChatGPT to take a chart and convert into a table. It went and cut up the image and zoomed in for literally 5 mins to get the a worst answer than Claude which did it in under a minute.
I see people talk about Codex like it better than Claude Code, and I go and try it and it takes a lifetime to do thing and it return maybe an on par result as Opus or Sonnet but it takes 5mins longer.
I just tried out this model and it the same exact thing. It just take ages for it to give you an answer.
I don't get how these models are useful in the real world.
What am I missing, is this just me?
I guess it truly an enterprise model.
(edit: I'm sorry I didn't read enough on the topic, my apologies)