That being said, I'm starting to doubt the leaderboards as an accurate representation of model ability. While I do think Gemini is a good model, having used both Gemini and Claude Opus 4 extensively in the last couple of weeks I think Opus is in another league entirely. I've been dealing with a number of gnarly TypeScript issues, and after a bit Gemini would spin in circles or actually (I've never seen this before!) give up and say it can't do it. Opus solved the same problems with no sweat. I know that that's a fairly isolated anecdote and not necessarily fully indicative of overall performance, but my experience with Gemini is that it would really want to kludge on code in order to make things work, where I found Opus would tend to find cleaner approaches to the problem. Additionally, Opus just seemed to have a greater imagination? Or perhaps it has been tailored to work better in agentic scenarios? I saw it do things like dump the DOM and inspect it for issues after a particular interaction by writing a one-off playwright script, which I found particularly remarkable. My experience with Gemini is that it tries to solve bugs by reading the code really really hard, which is naturally more limited.
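That dump-and-inspect pattern is easy to reproduce by hand. A hypothetical sketch (in the real workflow the dump would come from Playwright's page.content() after performing the interaction; here it's hard-coded and scanned for duplicate element ids with only the stdlib parser):

```python
# Hypothetical sketch: scan a DOM dump for a common issue (duplicate ids).
# In practice the dump would come from Playwright's page.content() after
# the suspect interaction; the sample markup below is purely illustrative.
from collections import Counter
from html.parser import HTMLParser

class IdCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.ids.append(value)

dom_dump = '<div id="app"><span id="count">1</span><span id="count">2</span></div>'
collector = IdCollector()
collector.feed(dom_dump)
duplicates = [i for i, n in Counter(collector.ids).items() if n > 1]
print(duplicates)  # ['count']
```

The same one-off-script idea generalizes to any check you can phrase over the serialized DOM.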
Again, I think Gemini is a great model, I'm very impressed with what Google has put out, and until 4.0 came out I would have said it was the best.
1. o3 - it's just really damn good at nuance, getting to the core of the goal, and writing the closest thing to quality production-level code. The only negatives are its cutoff window and cost, especially with its love of tools. That's not usually a big deal for the Rails projects I work on, but sometimes it is.
2. Opus 4 via Claude Code - also really good and is my daily driver because o3 is so expensive. I will often have Opus 4 come up with the plan and first pass and then let o3 critique and make a list of feedback to make it really good.
3. Gemini 2.5 Pro - haven't tested this latest release but this was my prior #2 before last week. Now I'd say it's tied or slightly better than Sonnet 4. Depends on the situation.
4. Sonnet 4 via Claude Code - it's not bad but needs a lot of coaching and oversight to produce really good code. It will definitely produce a lot of code if you just let it go do its thing, but it won't be the quality, concise, and thoughtful code you want without more specific prompting and revisions.
I'm also extremely picky and a bit OCD with code quality and organization in projects down to little details with naming, reusability, etc. I accept only 33% of suggested code based on my Cursor stats from last month. I will often revert and go back to refine the prompt before accepting and going down a less than optimal path.
Like just today, it made a list of toys for my toddler that fit her developmental stage and play style. Would have taken me 1-2 hrs of browsing multiple websites otherwise
However, o3 resides in the ChatGPT app, which is still superior to the other chat apps in many ways, particularly the internet search implementation works very well.
If I'm working on a complex problem and want to go back and forth on software architecture, I like having o3 research prior art and have a back and forth on trade-offs.
If o3 was faster and cheaper I'd use it a lot more.
I'm curious what your workflows are!
The same with o3 and Sonnet (I haven't tested 4.0 enough yet to have an opinion).
I feel that we need better parallel evaluation support, where you could evaluate all the top models and decide which one provided the best solution.
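A minimal sketch of what that could look like, assuming a hypothetical query_model() wrapper around each provider's API (model names here are illustrative):

```python
# Hypothetical sketch: fan the same prompt out to several models at once
# and collect the answers side by side. query_model() is a stand-in; a real
# version would call each provider's SDK or HTTP API.
from concurrent.futures import ThreadPoolExecutor

MODELS = ["o3", "opus-4", "gemini-2.5-pro", "sonnet-4"]  # illustrative names

def query_model(name, prompt):
    # Placeholder: replace with the actual API call for each provider.
    return f"[{name}] answer to: {prompt}"

def evaluate_in_parallel(prompt):
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {name: pool.submit(query_model, name, prompt) for name in MODELS}
        return {name: f.result() for name, f in futures.items()}

results = evaluate_in_parallel("Fix this TypeScript error")
for name, answer in results.items():
    print(name, "->", answer)
```

From there, picking the best answer could be manual review or an LLM-as-judge pass over the collected results.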
Goodhart's law applies here just like everywhere else. Much more so given how much money these companies are dumping into making these models.
No way, is there any way to see the dialog or recreate this scenario!?
> Given the persistence of the error despite multiple attempts to refine the type definitions, I'm unable to fix this specific TypeScript error without a more profound change to the type structure or potentially a workaround that might compromise type safety or accuracy elsewhere. The current type definitions are already quite complex.
The two prior paragraphs, in case you're curious:
> I suspect the issue might be a fundamental limitation or bug in how TypeScript is resolving these highly recursive and conditional types when they are deeply nested. The type system might be "giving up" or defaulting to a less specific type ({ __raw: T }) prematurely.
> Since the runtime logic seems to be correctly hydrating the nested objects (as the builder.build method recursively calls hydrateHelper), the problem is confined to the type system's ability to represent this.
I found, as you can see in the first of the prior two paragraphs, that Gemini often wanted to claim that the issue was on TypeScript's side for some of these more complex issues. As proven by Opus, this simply wasn't the case.
idk what's the hype about gemini, it's really not that good imho
I do not understand how those machines work.
I get that with most of the better models I've tried, although I'd probably personally favor OpenAI's models overall. I think a good system prompt is probably the best way there, rather than relying in some "innate" "clean code" behavior of specific models. This is a snippet of what I use today for coding guidelines: https://gist.github.com/victorb/1fe62fe7b80a64fc5b446f82d313...
> That being said it occasionally does something absolutely stupid. Like completely dumb
That's a bit tougher, but you have to carefully read through exactly what you said, and try to figure out what might have led it down the wrong path, or what you could have said in the first place for it to avoid that. Try to work it into your system prompt, then slowly build up your system prompt so every one-shot gets closer and closer to being perfect on the first try.
With Sonnet, at least I don't run out of usage before I actually get it to understand my problem scope.
It's going to be interesting to see how easily they can raise more money. Their valuation is already in the $300B range. How much larger can it get, given their relatively paltry revenue at the moment and the rising costs of hardware and electricity?
If the next generation of LLMs needs new data sources, then Facebook and Google seem well positioned there; OpenAI, on the other hand, seems like it's going to lose the race for proprietary data sets because, unlike those other two, it doesn't have another business that generates such data.
When they were the leader in both research and in user facing applications they certainly deserved their lofty valuation.
What is new money coming into OpenAI getting now?
At even a $300B valuation, a typical Wall Street analyst would want to value them at 2x sales, which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.
Or at an extremely lofty P/E ratio of, say, 100, that would be $3B in annual earnings, which analysts would have to expect to double each year for the next ten-ish years, a la AMZN in the 2000s, to justify this valuation.
They seem to have boxed themselves into a corner where it will be painful to go public, assuming they can ever figure out the nonprofit/profit issue their company has.
Congrats to Google here, they have done great work and look like they'll be one of the biggest winners of the AI race.
"chatgpt" is a verb. People have no idea what claude or gemini are, and they will not be interested in it, unless something absolutely fantastic happens. Being a little better will do absolutely nothing to convince normal people to change product (the little moat that ChatGPT has simply by virtue of chat history is probably enough from a convenience standpoint, add memories and no super obvious path to export/import either and you are done here).
All that OpenAI would have to do to eventually be worth their valuation is to optimize and not become offensively bad to their, what, 500 million active users. And, if we assume the current paradigm everyone is working with is here to stay, why would they? Instead of leading (as they have done so far, for the most part) they can at any point simply do what others have resorted to successfully and copy with a slight delay. People won't care.
I already see lots of normal people share screenshots of the AI Overview responses.
One well-placed ad campaign could easily change all that. Doesn't hurt that Google can bundle Gemini into Android.
I can switch tomorrow to use gemini or grok or any other llm, and I have, with zero switching cost.
That means one stumble on the next foundational model and their market share drops in half in like 2 months.
Now the same is true for the other llms as well.
For example, I had occasion to chat with a relative who's still in high school recently, and was curious what the situation was in their classrooms re: AI.
tl;dr: LLM use is basically universal, but ChatGPT is not the favored tool. The favored tools are LLMs/apps specifically marketed as study/homework aids.
It seems like the market is fine with seeking out specific LLMs for specific kinds of tasks, as opposed to some omni-LLM one-stop shop that does everything. The market has already, and rapidly, moved beyond ChatGPT.
Not to mention I am willing to bet that Gemini has radically more usage than OpenAI's models simply by virtue of being plugged into Google Search. There are distribution effects, I just don't think OpenAI has the strongest position!
I think OpenAI has some first-mover advantage, I just don't think it's anywhere near as durable (nor as large) as you're making it out to be.
Oops I think you may have flipped the numerator and the denominator there, if I’m understanding you. Valuation of 300B , if 2x sales, would imply 150B sales.
Probably your point still stands.
Although it does feel likely that at minimum, they are neck and neck with Google and others.
What? Apple has a revenue of 400B and a market cap of 3T
Edit: I am dumb, ignore the second half of my post.
I agree that Google is well-positioned, but the mindshare/product advantage OpenAI has gives them a stupendous amount of leeway
The only way for OpenAI to really get ahead on solid ground is to discover some sort of absolute game changer (new architecture, new algorithm) and manage to keep it bottled away.
they haven't been number one for quite some time and still people can't stop presenting them as the leaders
Even Google doesn't have $600B revenue. Sorry, it sounds like numbers pulled from someone's rear.
Lmfao where did you get this from? Microsoft has less than half of that revenue, and is valued > 10x than OpenAI.
Revenue is not the metric by which these companies are valued...
OAI, on the other hand, must spend a lot of additional money for every single new user, both free and paid. Adding a million new OAI users tomorrow would mean a gigantic red hole in the profits (adding to the existing negative). OAI has no, or almost no, benefits of scale, unlike other industries.
I have no knowledge about corporate valuations, but I strongly suspect that OAI's valuation needs to account for this issue.
In Canada, a third of the dates we see are British, and another third are American, so it’s really confusing. Thankfully y-m-d is now a legal format and seems to be gaining ground.
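One practical property of y-m-d worth noting: zero-padded ISO-style dates sort chronologically as plain strings, no locale-aware parsing needed. A small illustration:

```python
# ISO-style y-m-d strings compare chronologically as plain text,
# so lexicographic sort equals chronological sort.
dates = ["2025-06-06", "2025-05-06", "2024-12-31"]
print(sorted(dates))  # ['2024-12-31', '2025-05-06', '2025-06-06']
```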
06-06 is unambiguously after 05-06 regardless of date format.
They are clearly trolling OpenAI's 4o and o4 models.
Sure, I'm a lazy bum: I call the variable "json" instead of "jsonStringForX", but it's contextual (within a closure or function). I appreciate the feedback, but it makes reviewing the changes difficult (too much noise).
For code like this, it keeps changing processing_class=tokenizer to tokenizer=tokenizer, even though the parameter was renamed and even after adding the all-caps comment.
# Set up the SFTTrainer
print("Setting up SFTTrainer...")
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    args=sft_config,
    processing_class=tokenizer,  # DO NOT CHANGE. THIS IS NOW THE CORRECT PROPERTY NAME
)
print("SFTTrainer ready.")
I haven't tried with this latest version, but the 05-06 pro still did it wrong. It is worth it sometimes, but usually I use it to explore ideas and then have o1-pro spit out a perfect solution ready to diff, test, and merge.
"# Added this function" "# Changed this to fix the issue"
No, I know, I was there! This is what commit messages for, not comments that are only relevant in one PR.
# Removed iterMod variable here because it is no longer needed.
It's like it spent too much time hanging out with an engineer who doesn't trust version control and prefers to just comment everything out. Still enjoying Gemini 2.5 Pro more than Claude Sonnet these days, though, purely on vibes.
I've not tested this thoroughly; it's just my anecdotal experience over like a dozen attempts.
It's something I read a little while ago in a larger article, but I can't remember which article it was.
Something like, "Forbidden character list: [—, –]" or "Do NOT use the characters '—' or '–' in any of your output"
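When the prompt alone doesn't stick, a post-processing scrub is a reliable backstop. A minimal sketch (mapping both dashes to a plain hyphen is my assumption; swap in whatever replacement you prefer):

```python
# Minimal post-processing backstop: replace em/en dashes after generation
# instead of trusting the model to obey the forbidden-character list.
FORBIDDEN = {"\u2014": "-", "\u2013": "-"}  # em dash, en dash

def scrub(text):
    for bad, replacement in FORBIDDEN.items():
        text = text.replace(bad, replacement)
    return text

print(scrub("pros \u2014 and cons \u2013 of each"))  # pros - and cons - of each
```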
I'm thinking of cancelling my ChatGPT subscription because I keep hitting rate limits.
Meanwhile I have yet to hit any rate limit with Gemini/AI Studio.
Also note that AI Studio via the default free-tier API access doesn't seem to fall within "commercial use" in Google's terms of service, which would mean your prompts can be reviewed by humans and used for training. All info AFAIK.
This is not true for the Gemini 2.5 Pro Preview model, at least. Although this model API is not available on the Free Tier [1], you can still use it on AI Studio.
Seconded.
Either way, Google's transparency with this is very poor - I saw the limits from a VP's tweet
I haven't used Claude, but Gemini has always returned better answers to general questions relative to ChatGPT or Copilot. My impression, which could be wrong, is that Gemini is better in situations that are a substitute for search. How do I do this on the command line, tell me about this product, etc. all give better results, sometimes much better, on Gemini.
But everyone is using them for different things and it doesn't always generalize. Maybe Claude was great at typescript or ruby or something else I don't do. But for some of us, it definitely was not astroturf for Gemini. My whole team was talking about how much better it was.
What are your usecases? Really not my experience, Claude disappoints in Data Science and complex ETL requests in python. O3 on the other hand really is phenomenal.
I can't speak to it now - have mostly been using Claude Code w/ Opus 4 recently.
[1]https://nitter.net/OfficialLoganK/status/1930657743251349854...
Still actually falling behind the official scores for o3 high. https://aider.chat/docs/leaderboards/
Not sure if OpenAI has updated o3, but it looks like "pure" o3 (high) has a score of 79.6% in the linked table, while the "o3 (high) + gpt-4.1" combo has the highest score of 82.7%.
The previous Gemini 2.5 Pro Preview 05-06 (yeah, not the current 06-05!) was at 76.9%.
That looks like a pretty nice bump!
But either way, these Aider benchmarks seem to be the most useful/trustworthy benchmarks currently, and really the only ones I'm paying attention to.
This table seems to indicate it's markedly worse?
https://blog.google/products/gemini/gemini-2-5-pro-latest-pr...
- "Something went wrong error" after too many prompts in a day. This was an undocumented rate limit because it never occurs earlier in the day and will immediately disappear if you subscribe for and use a new paid account, but it won't disappear if you make a new free account, and the error going away is strictly tied to how long you wait. Users complained about this for over a year. Of course they lied about the real reasons for this error, and it was never fixed until a few days ago when they rug pulled paying users by introducing actual documented tight rate limits.
- "You've been signed out" error if the model has exceeded its output token budget (or runtime duration) for a single inference, so you can't do things like what Anthropic recommends where you coax the model to think longer.
- I have less definitive evidence for this but I would not be surprised if they programmatically nerf the reasoning effort parameter for multiturn conversations. I have no other explanation for why the chain of thought fails to generate for small context multiturn chats but will consistently generate for ultra long context singleturn chats.
After that I moved to OpenAI; Gemini models just seem unreliable in that regard.
Isn’t this what you can do with system instructions?
Are you talking about Sonnet 4 which never came to Windsurf because Anthropic does not want to support OpenAI?
However, in my personal experience Sonnet 3.x has still been king so far. Will be interesting to watch this unfold. At this point, it's still looking grim for Windsurf.
With the Claude Max development, non-vibing users seem to be going to Claude Code. This makes me think that maybe Cursor should have taken an exit, cause Claude Code is gonna eat everyone's lunch?
I've been preferring to use Copilot agent mode with Sonnet 4, but it asks you to intervene a lot.
Direct chat and copy pasting code? Seems clunky.
Or manually switching models in Cursor? Although that's extra cost and not required for a lot of tasks where Cursor tab is faster and good enough, so you need to opt in on demand.
Cline + OpenRouter in VS Code?
Something else?