Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output), and Opus has a pricing penalty for its beta >200K context window.
I am skeptical that the 1M context window will provide material gains, since current Codex/Opus show weaknesses once their context windows are mostly full, but we'll see.
Per updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supersedes GPT-5.3-Codex, which is an interesting move.
> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
Taken from https://developers.openai.com/api/docs/models/gpt-5.4
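If I read that right, crossing 272K input re-prices the whole session. A rough back-of-the-envelope sketch (standard GPT-5.4 rates from the pricing page; the token counts below are made up):

```typescript
// Sketch of the >272K long-context surcharge as quoted above.
// Base rates are standard GPT-5.4 pricing; session sizes are illustrative.
const IN = 2.50 / 1_000_000;   // $ per input token
const OUT = 15.00 / 1_000_000; // $ per output token

function sessionCost(inputTokens: number, outputTokens: number): number {
  const long = inputTokens > 272_000;              // threshold from the docs
  return inputTokens  * IN  * (long ? 2.0 : 1.0)   // 2x input once over the threshold
       + outputTokens * OUT * (long ? 1.5 : 1.0);  // 1.5x output, applied to the full session
}

console.log(sessionCost(200_000, 10_000)); // ≈ 0.65 (under the threshold, normal rates)
console.log(sessionCost(400_000, 10_000)); // ≈ 2.23 (whole session billed at 2x / 1.5x)
```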
For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.
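Roughly, in your Codex config (typically ~/.codex/config.toml), the override would look something like this; the numbers below are just an example, not a recommendation:

```toml
# Sketch: opt in to the experimental 1M context window for Codex.
# The two keys are the ones named above; the file location and the
# exact values here are illustrative, so tune them to taste.
model_context_window = 1000000           # raise the context ceiling
model_auto_compact_token_limit = 900000  # trigger auto-compaction shortly before hitting it
```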
Curious to hear if people have use cases where they find 1M works much better!
(I work at OpenAI.)
For example, on Artificial Analysis, the GPT-5.x models' cost to run the evals ranges from half that of Claude Opus (at medium and high) to significantly more than Opus (at extra high reasoning). So on their cost graphs, GPT has a considerable distribution, and Opus sits right in the middle of it.
The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.
According to the evals, at least, GPT at extra high matches Opus in intelligence while costing more.
Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!
For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example: I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Pro took 3x longer and cost 3x more!)
https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) for tokens under 272K.
It is nice that we get 70-72K more tokens before the price goes up (also, what does it cost beyond 272K tokens??)
For example their latest model `grok-4-1-fast-reasoning`:
- Context window: 2M
- Rate limits: 4M tokens per minute, 480 requests per minute
- Pricing: $0.20/M input, $0.50/M output
Grok is not as good at coding as Claude, for example, but for researching stuff it is incredible. They have a model for coding now, but I haven't tried that one out yet.
Like, if you really don’t want to spend any effort trimming it down, sure, use 1M.
Otherwise, 1M is an anti-pattern.
"Bob’s latest mail is actually the source of the confusion: he changed shared app/backend text to aweb/atlas. I’m correcting that with him now so we converge on the real model before any more code moves."
This was very much not true; Eve (the agent writing this, a gpt-5.4) had been thoroughly creating the confusion and telling Bob (an Opus 4.6) the wrong things. And it had just happened; it was not a matter of having forgotten or compacted context.
I have had agents chatting with each other and coordinating for a couple of months now, Codex and Claude Code. This is a first. I wonder how much I can read into it about gpt-5.4's personality.
“All the ways GPT-5.3-Codex cheated while solving my challenges, progressively more insane:
It hardcoded specific types and shapes of test inputs into the supposed solution.
It caught exceptions so tests don't fail.
It probed tests with exceptions to determine expected behavior.
It used RTTI to determine which test it's in.
It probed tests with timeouts.
It used a global reference to count solution invocations.
It updated config files to increase the allocation limit.
It updated the allocation limit from within the solution.
It updated the tests so they would stop failing.
It combined multiple of the above.
It searched reflog for a solution.
It searched remote repos.
It searched my home folder.
It nuked the testing library so tests always pass.”
It seems that, unless you keep a close eye, the most recent Codex variants are prone to achieving the goals set for them by any means necessary. Which is a bit concerning if you’re worried about things like alignment, etc.

Modeled on Sam Altman's personality :-)
It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4's doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.
If you gave the exact same markdown file to me and I posted the exact same prompts as you, would I get the same results?
That's hilarious. Does OpenAI even know this doesn't work?
I was pretty impressed with how they’ve improved user experience. If I had to guess, I’d say Anthropic has better product people who put more attention to detail in these areas.
They barely test this stuff.
https://chatgpt.com/share/69aa0321-8a9c-8011-8391-22861784e8...
EDIT: oh, but I'm logged in, fwiw
It’s hit and miss; sometimes Claude says it cannot access your site, which is not true.
We got:
- GPT-5.1
- GPT-5.2 Thinking
- GPT-5.3 (codex)
- GPT-5.3 Instant
- GPT-5.4 Thinking
- GPT-5.4 Pro
Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.
The good news here is the support for 1M context window, finally it has caught up to Gemini.
Variability, different pressures and fast progress. What's your concrete idea for how to solve this, without the power of hindsight?
For example, with the codex model: Say you realize at some point in the past that this could be a thing, a model specifically post-trained for coding, which makes coding better, but not other things. What are they supposed to do? Not release it, to satisfy a cleaner naming scheme?
And if, at a later point, they realize they don't need that distinction anymore, that the techniques that went into the separate coding model are somehow obsolete, what option do you have other than dropping the name again?
As someone else pointed out, the previous problems were around a very silly naming pattern. This seems about as descriptive as you can get, given what you have.
Yeah having Auto selected is really destroying my cognitive load...
> Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.
Most people have it on auto select, I'm assuming, so this is a non-issue. They keep older models active likely because some people prefer certain models until they try the new one, or because they can't completely switch all the compute to the new models in an instant.

Was this ever explicitly confirmed by OpenAI? I've only ever seen it in the form of a rumor.
Absolute snakes - if it's more profitable to manipulate you with outputs or steal your work, they will. Every cent and byte of data they're given will be used to support authoritarianism.
Also, Anthropic/Gemini/even Kimi models are pretty good, for what it's worth. I used to use ChatGPT, and I still sometimes accidentally open it, but I use Gemini/Claude nowadays and personally find them to be better anyway.
I am honestly unclear on the reasoning of people who flock from OpenAI to Anthropic, and doubly so of those who are not US citizens.
i just HATE talking to it like a chatbot
idk what they did but i feel like every response has been the same "structure" since gpt 5 came out
feels like a true robot
1. Fast mode ain't that fast
2. Large context * Fast * Higher Model Base Price = 8x increase over gpt-5.3-codex
3. I burnt 33% of my 5h limit (ChatGPT Business Subscription) with a prompt that took 2 minutes to complete.
How do you arrive at that number? I find it hard to make sense of this ad hoc, given that the total token cost is not very interesting; it's token efficiency we care about.
2. No, if you are on a subscription it's the same: at $20, Codex 5.4 xhigh provides way more than $20 Opus thinking (that one really can burn 33% with one request; try comparing them on the same tasks). Also, 8x...??? If you need 1M tokens for a special task, don't hit /fast, and vice versa; the higher price doesn't apply on subscription either.
3. False. I'm on Pro, so 10x the base, always on /fast (no 1M), and often with 2 parallel instances working. I can hardly use 2% (= 20% of the 5h limit) in 1h of work (about 15-20 req/hour). Claude is way worse on that, imo.
They show an example of 5.4 clicking around in Gmail to send an email.
I still think this is the wrong interface to be interacting with the internet. Why not use Gmail APIs? No need to do any screenshot interpretation or coordinate-based clicking.
Screenshots, on the other hand, are documentation, API, and discovery all in one. And you’d be surprised how little context/tokens screenshots consume compared to all the back-and-forth verbose JSON payloads of APIs.
Of course APIs and CLIs also exist, but they don't necessarily have feature parity, so more development would be needed. Maybe that's the future though since code generation is so good - use AI to build scaffolding for agent interaction into every product.
If an API is exposed you can just have the LLM write something against that.
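Sending an email, for instance, is a handful of lines against the Gmail API; a rough sketch with the googleapis Node client (auth setup omitted, and the function/parameter names here are just illustrative):

```typescript
import { google } from "googleapis";
import type { OAuth2Client } from "google-auth-library";

// Minimal "send an email" against the Gmail API -- the kind of thing the model
// could write and run directly instead of clicking around a rendered inbox.
async function sendEmail(auth: OAuth2Client, to: string, subject: string, body: string) {
  const gmail = google.gmail({ version: "v1", auth });
  // Gmail expects the raw RFC 2822 message, base64url-encoded.
  const raw = Buffer.from([`To: ${to}`, `Subject: ${subject}`, "", body].join("\r\n"))
    .toString("base64url");
  await gmail.users.messages.send({ userId: "me", requestBody: { raw } });
}
```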
But people are intimidated by the complexity of writing web crawlers because management has been so traumatized by the cost of making GUI applications that they couldn’t believe how cheap it is to write crawlers and scrapers…. Until LLMs came along, and changed the perceived economics and created a permission structure. [1]
AI is a threat to the “enshittification economy” because it lets us route around it.
[1] that high cost of GUI development is one reason why scrapers are cheap… there is a good chance that the scraper you wrote 8 years ago still works because (a) they can’t afford to change their site and (b) if they could afford to change their site, changing anything substantial about it is likely to unrecoverably tank their Google rankings, so they won’t. AI might change the mechanics of that now that your Google traffic is likely to go to zero no matter what you do.
- Optimizations are secondary to convenience
- some sites try to block programmatic use
- UI use can be recorded and audited by a non-technical person
OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4. The version numbers jump across different model lines, with Codex at 5.3 and what they now call Instant also at 5.3.
Anthropic are really the only ones who managed to get this under control: Three models, priced at three different levels. New models are immediately available everywhere.
Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero assurance that the model won't get discontinued within weeks.
What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tool that doesn't quite do what you want.
Not quite the same, but it did remind me of it.
I don't know, this feels unnecessarily nitpicky to me
It isn't hard to understand that 5.4 > 5.2 > 5.1. It's not hard to understand that the dash-variants have unique properties that you want to look up before selecting.
Especially for a target audience of software engineers, skipping a version number is a common occurrence and never questioned.
Why are you using the same model after a month? Every month a better model comes out. They are all accessible via the same API. You can pay per-token. This is the first time in, like, all of technology history, that a useful paid service is so interoperable between providers that switching is as easy as changing a URL.
It's really nice to see Google get back to its roots by launching things only to "beta" and then leaving them there for years. Gmail was "beta" for at least five years, I think.
I guess that's true, but geared towards API users.
Personally, since "Pro Mode" became available, I've been on the plan that enables that, and it's one price point and I get access to everything, including enough usage for codex that someone who spends a lot of time programming, never manage to hit any usage limits although I've gotten close once to the new (temporary) Spark limits.
I’m pretty glad I’m out of the OpenAI ecosystem in all seriousness. It is genuinely a mess. This marketing page is also just literally all over the place and could probably be about 20% of its size.
Also, their pricing, based on 5m/1h cache hits, cache read hits, additional charges for US inference (but only for Opus 4.6, I guess), and optional features such as more context and faster speed for some random multiplier, is also complex and actually quite similar to OpenAI's pricing scheme.
To me it looks like everybody has similar problems and solutions for the same kinds of problems and they just try their best to offer different products and services to their customers.
naming things
cache invalidation
off by one errors
It's very similar to "Battle Brothers", and the fact that RPG games require art assets, AI for enemy moves, and a host of other logical systems makes it all the more impressive.
> we’re also releasing an experimental Codex skill called “Playwright (Interactive)”. This allows Codex to visually debug web and Electron apps; it can even be used to test an app it’s building, as it’s building it.
However, I think what actually happened is that a skilled engineer made that game using codex. They could have made 100s of prompts after carefully reviewing all source code over hours or days.
The tycoon game is impressive for being made in a single prompt. They include the prompt for this one. They call it "lightly specified", but it's a pretty dense todo list for how to create assets, add many features from RollerCoaster Tycoon, and verify it works. I think it can probably pull a lot of inspiration from pretraining since RCT is an incredibly storied game.
The bridge flyover is hilariously bad. The bridge model ... has so many things wrong with it, the camera path clips into the ground and bridge, and the water and ground are z fighting. It's basically a C homework assignment that a student made in blender. It's impressive that it was able to achieve anything on such a visual task, but the bar is still on the floor. A game designer etc. looking for a prototype might actually prefer to greybox rather than have AI spend an hour making the worst bridge model ever.
>Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking),
>Note that there is not a model named GPT‑5.3 Thinking
They held out for eight months without a confusing numbering scheme :)
GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).
GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).
GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).
I tried several use cases:
- Code Explanation: Did far better than Opus; it considered and judged the decisions in a previous spec that I made, all valid points, so I am impressed. TBF, if I spawned another Opus as a reviewer I might have gotten similar results.
- Workflow Running: Really similar to Opus again; no objections, it followed and read Skills/Tools as it should (although mine are optimized for Claude).
- Coding: I gave it a straightforward task to wrap API calls into an SDK, and to my surprise it did an 'identical' job to Opus, literally the same code. I don't know what the odds of that are, but again a very good solution, and it adhered to our rules for implementing such code.
Overall I am impressed and excited to see a rival to Opus and all of this is literally pushing everyone to get better and better models which is always good for us.
I'd believe it on those specific tasks. Near-universal adoption in software still hasn't moved DORA metrics. The model gets better every release. The output doesn't follow. I just had a closer look at those productivity metrics this week: https://philippdubach.com/posts/93-of-developers-use-ai-codi...
Given that the organization that ran the study [1] has a terrifying exponential as their landing page, I think they'd prefer that its results are interpreted as a snapshot of something moving rather than a constant.
[1] - https://metr.org/
"Change Lead Time" I would expect to have sped up although I can tell stories for why AI-assisted coding would have an indeterminate effect here too. Right now at a lot of orgs, the bottle neck is the review process because AI is so good at producing complete draft PRs quickly. Because reviews are scarce (not just reviews but also manual testing passes are scarce) this creates an incentive ironically to group changes into larger batches. So the definition of what a "change" is has grown too.
In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?
I switch between both but codex has also been slightly better in terms of quality for me personally at least.
I really thought the weirdly worded and unnecessary "announcement" linking to the actual info, along with the word "card", were the result of vibe slop.
Recent SWE-bench Verified scores I’m watching:
Claude 4.5 Opus (high reasoning): 76.8
Gemini 3 Flash (high reasoning): 75.8
MiniMax M2.5 (high reasoning): 75.8
Claude Opus 4.6: 75.6
GPT-5.2 Codex: 72.8
Source: https://www.swebench.com/index.html
By the way, in my experience the agent part of Codex CLI has improved a lot and has become comparable to Claude Code. That is good news for OpenAI.
Ultimately, the people actually interested in the performance of these models already don't trust self-reported comparisons and wait for third-party analysis anyway
gpt-5.4
Input: $2.50 /M tokens
Cached: $0.25 /M tokens
Output: $15 /M tokens
---
gpt-5.4-pro
Input: $30 /M tokens
Output: $180 /M tokens
Wtf
This was definitely missing before, and a frustrating difference when switching between ChatGPT and Codex. Great addition.
This is on the edge of what the frontier models can do. For 5.4, the result is better than 5.3-Codex and Opus 4.6. (Edit: nowhere near the RPG game from their blog post, which was presumably much more specced out and used better engineering setup).
I also tested it with a non-trivial task I had to do on an existing legacy codebase, and it breezed through a task that Claude Code with Opus 4.6 was struggling with.
I don't know when Anthropic will fire back with their own update, but until then I'll spend a bit more time with Codex CLI and GPT 5.4.
If you last used 5.2, try 5.4 on High.
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.
Is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes, or is there more to it?
`$skill-installer playwright-interactive` in Codex! The model writes normal JS Playwright code in a Node REPL.
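i.e. the loop is just ordinary Playwright calls plus screenshots it can look at; roughly something like this (the URL and selector are made up for illustration):

```typescript
import { chromium } from "playwright";

// Poke at the app under development the way a playtester would,
// then grab a screenshot the model can inspect.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("http://localhost:3000");           // the app it's building
await page.click("text=New Ride");                   // made-up control, for illustration
await page.screenshot({ path: "after-click.png" });  // visual feedback for the next step
await browser.close();
```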
Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.
Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.
It's the one you have access to with the top ~$200 subscription, and it's available through the API for a MUCH higher price ($30/$180 per 1M tokens for 5.4 Pro vs $2.50/$15 for 5.4), but the performance improvement is marginal.
Not sure what it is exactly, I assume it's probably the non-quantized version of the model or something like that.
Presumably this is where it'll evolve to: the product is just the brand with a pricing tier, and you always get {latest} within that, whatever that means (you don't have to care). They could even shuffle models around internally using some sort of auto-like mode for simpler questions. Again, why should I care as long as average output is not subjectively worse?
Just as I don't want to select resources for my SaaS software to use, or have that explicitly linked to pricing, I don't want to care what my OpenAI model or Anthropic model is today. I just want to pay and for it to hopefully keep getting better, but at a minimum not get worse.
https://www.svgviewer.dev/s/gAa69yQd
Not the best pelican compared to Gemini 3.1 Pro, but I am sure it does remarkably better with coding or Excel, given those are part of its measured benchmarks.
A couple months later:
"We are deprecating the older model."
When two agents coordinate, they’re mostly relying on compressed summaries of each other’s outputs. If one introduces a wrong assumption, the other often treats it as ground truth and builds on top of it. I’ve seen similar behavior in multi-agent coding loops where the model invents a causal explanation just to reconcile inconsistent state.
The point is that multi-agent setups need a stronger shared source of truth (repo diffs, state snapshots, etc.). Otherwise small context errors snowball fast.
I wonder if 5.4 will be much if any different at all.
GPT-5.4: 75.1%
GPT-5.3-Codex: 77.3%
This becomes less and less clear to me, because the more interesting work will be the agent going off for 30+ minutes on high / extra high (it's mostly one of the two), and that's a long time to wait and an infeasible amount of code to A/B.
In terms of writing and research even Gemini, with a good prompt, is close to useable. That's likely not a differentiator.
GPT is not even close to Claude in terms of responding to BS.
I imagine they added a feature or two, and the router will continue to give people 70B parameter-like responses when they dont ask for math or coding questions.
Interesting, the "Health" category seems to report worse performance compared to 5.2.
Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.
And so far it has succeeded
I absolutely could come up with the details and implementation by myself, but that would certainly take a lot of back and forth, probably a month or two.
I’m an api user of Claude code, burning through 2k a month. I just this evening planned the whole thing with its help and actually had to stop it from implementing it already. Will do that tomorrow. Probably in one hour or two, with better code than I could ever write alone myself.
Having that level of intelligence at that price is just bollocks. I’m running out of problems to solve. It’s been six months.
Nothing infuriates me more than an LLM tool randomly deciding to create docx or xlsx files for no apparent reason. They have to use a random library to create these files, and they constantly screw up API calls and get completely distracted by the sheer size of the scripts they have to write to output a simple document. These files have terrible accessibility (all paper-like formats do) and end up with way too much formatting. Markdown was chosen as the lingua franca of LLMs for a reason; trying to force it into a totally unsuitable format isn't going to work.
You can have it not use bullet points. I turned this on, thinking it would be more concise and not so... listy. However, it just uses the same format, without the bullets. I was confused about why it was writing 5-word sentences separated by line breaks. Then I realized it was just making lists, without the bullets.
Great job OpenAI!
I hate these blog posts sometimes. Surely there's got to be some tradeoff. Or have we finally arrived at the world's first "free lunch"? Otherwise why not make /fast always active with no mention and no way to turn it off?
And considering the stance on OpenAI among the majority of users here, compared to the number of upvotes, are HN likes bot-farmed?
Also, the timing of this release (and of 5.3 and 5.2) relative to the other releases feels more like a bug fix than something "new".