That being said, I'm starting to doubt the leaderboards as an accurate representation of model ability. While I do think Gemini is a good model, having used both Gemini and Claude Opus 4 extensively in the last couple of weeks I think Opus is in another league entirely. I've been dealing with a number of gnarly TypeScript issues, and after a bit Gemini would spin in circles or actually (I've never seen this before!) give up and say it can't do it. Opus solved the same problems with no sweat. I know that that's a fairly isolated anecdote and not necessarily fully indicative of overall performance, but my experience with Gemini is that it would really want to kludge on code in order to make things work, where I found Opus would tend to find cleaner approaches to the problem. Additionally, Opus just seemed to have a greater imagination? Or perhaps it has been tailored to work better in agentic scenarios? I saw it do things like dump the DOM and inspect it for issues after a particular interaction by writing a one-off playwright script, which I found particularly remarkable. My experience with Gemini is that it tries to solve bugs by reading the code really really hard, which is naturally more limited.
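That dump-and-inspect pattern is easy to reproduce by hand. A hypothetical sketch (in the real workflow the dump would come from Playwright's page.content() after performing the interaction; here it's hard-coded and scanned for duplicate element ids with only the stdlib parser):

```python
# Hypothetical sketch: scan a DOM dump for a common issue (duplicate ids).
# In practice the dump would come from Playwright's page.content() after
# the suspect interaction; the sample markup below is purely illustrative.
from collections import Counter
from html.parser import HTMLParser

class IdCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.ids.append(value)

dom_dump = '<div id="app"><span id="count">1</span><span id="count">2</span></div>'
collector = IdCollector()
collector.feed(dom_dump)
duplicates = [i for i, n in Counter(collector.ids).items() if n > 1]
print(duplicates)  # ['count']
```

The same one-off-script idea generalizes to any check you can phrase over the serialized DOM.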
Again, I think Gemini is a great model, I'm very impressed with what Google has put out, and until 4.0 came out I would have said it was the best.
1. o3 - it's just really damn good at nuance, getting to the core of the goal, and writing the closest thing to quality production-level code. The only negatives are its cutoff window and cost, especially with its love of tools. That's not usually a big deal for the Rails projects I work on, but sometimes it is.
2. Opus 4 via Claude Code - also really good and is my daily driver because o3 is so expensive. I will often have Opus 4 come up with the plan and first pass and then let o3 critique and make a list of feedback to make it really good.
3. Gemini 2.5 Pro - haven't tested this latest release but this was my prior #2 before last week. Now I'd say it's tied or slightly better than Sonnet 4. Depends on the situation.
4. Sonnet 4 via Claude Code - it's not bad but needs a lot of coaching and oversight to produce really good code. It will definitely produce a lot of code if you just let it go do its thing, but it won't be the quality, concise, and thoughtful code you want without more specific prompting and revisions.
I'm also extremely picky and a bit OCD with code quality and organization in projects down to little details with naming, reusability, etc. I accept only 33% of suggested code based on my Cursor stats from last month. I will often revert and go back to refine the prompt before accepting and going down a less than optimal path.
Like just today, it made a list of toys for my toddler that fit her developmental stage and play style. Would have taken me 1-2 hrs of browsing multiple websites otherwise
However, o3 resides in the ChatGPT app, which is still superior to the other chat apps in many ways, particularly the internet search implementation works very well.
If I'm working on a complex problem and want to go back and forth on software architecture, I like having o3 research prior art and have a back and forth on trade-offs.
If o3 was faster and cheaper I'd use it a lot more.
I'm curious what your workflows are!
The same with o3 and Sonnet (I haven't tested 4.0 enough yet to have an opinion).
I feel that we need better parallel evaluation support, where you could evaluate all the top models and decide which one provided the best solution.
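A minimal sketch of what that could look like, assuming a hypothetical query_model() wrapper around each provider's API (model names here are illustrative):

```python
# Hypothetical sketch: fan the same prompt out to several models at once
# and collect the answers side by side. query_model() is a stand-in; a real
# version would call each provider's SDK or HTTP API.
from concurrent.futures import ThreadPoolExecutor

MODELS = ["o3", "opus-4", "gemini-2.5-pro", "sonnet-4"]  # illustrative names

def query_model(name, prompt):
    # Placeholder: replace with the actual API call for each provider.
    return f"[{name}] answer to: {prompt}"

def evaluate_in_parallel(prompt):
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {name: pool.submit(query_model, name, prompt) for name in MODELS}
        return {name: f.result() for name, f in futures.items()}

results = evaluate_in_parallel("Fix this TypeScript error")
for name, answer in results.items():
    print(name, "->", answer)
```

From there, picking the best answer could be manual review or an LLM-as-judge pass over the collected results.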
Goodhart's law applies here just like everywhere else. Much more so given how much money these companies are dumping into making these models.
No way, is there any way to see the dialog or recreate this scenario!?
> Given the persistence of the error despite multiple attempts to refine the type definitions, I'm unable to fix this specific TypeScript error without a more profound change to the type structure or potentially a workaround that might compromise type safety or accuracy elsewhere. The current type definitions are already quite complex.
The two prior paragraphs, in case you're curious:
> I suspect the issue might be a fundamental limitation or bug in how TypeScript is resolving these highly recursive and conditional types when they are deeply nested. The type system might be "giving up" or defaulting to a less specific type ({ __raw: T }) prematurely.
> Since the runtime logic seems to be correctly hydrating the nested objects (as the builder.build method recursively calls hydrateHelper), the problem is confined to the type system's ability to represent this.
I found, as you can see in the first of the prior two paragraphs, that Gemini often wanted to claim that the issue was on TypeScript's side for some of these more complex issues. As proven by Opus, this simply wasn't the case.
idk what's the hype about gemini, it's really not that good imho
I do not understand how those machines work.
I get that with most of the better models I've tried, although I'd probably personally favor OpenAI's models overall. I think a good system prompt is probably the best way there, rather than relying in some "innate" "clean code" behavior of specific models. This is a snippet of what I use today for coding guidelines: https://gist.github.com/victorb/1fe62fe7b80a64fc5b446f82d313...
> That being said it occasionally does something absolutely stupid. Like completely dumb
That's a bit tougher, but you have to carefully read through exactly what you said, and try to figure out what might have led it down the wrong path, or what you could have said in the first place for it to avoid that. Try to work it into your system prompt, then slowly build up your system prompt so every one-shot gets closer and closer to being perfect on the first try.
With Sonnet, at least I don't run out of usage before I actually get it to understand my problem scope.
It's going to be interesting to see how easily they can raise more money. Their valuation is already in the $300B range. How much larger can it get, given their relatively paltry revenue at the moment and the rising costs of hardware and electricity?
If the next generation of LLMs needs new data sources, then Facebook and Google seem well positioned there; OpenAI, on the other hand, seems like it's going to lose the race for proprietary data sets because, unlike those other two, it doesn't have another business that generates such data.
When they were the leader in both research and in user facing applications they certainly deserved their lofty valuation.
What is new money coming into OpenAI getting now?
At even a $300B valuation, a typical Wall Street analyst would want to value them at 2x sales, which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.
Or at an extremely lofty P/E ratio of, say, 100, that would be $3B in annual earnings, which analysts would have to expect to double each year for the next ten-ish years, a la AMZN in the 2000s, to justify this valuation.
They seem to have boxed themselves into a corner where it will be painful to go public, assuming they can ever figure out the nonprofit/profit issue their company has.
Congrats to Google here, they have done great work and look like they'll be one of the biggest winners of the AI race.
"chatgpt" is a verb. People have no idea what claude or gemini are, and they will not be interested in it, unless something absolutely fantastic happens. Being a little better will do absolutely nothing to convince normal people to change product (the little moat that ChatGPT has simply by virtue of chat history is probably enough from a convenience standpoint, add memories and no super obvious path to export/import either and you are done here).
All that OpenAI would have to do to eventually be worth their valuation is to optimize and not become offensively bad to their, what, 500 million active users. And, if we assume the current paradigm everyone is working with is here to stay, why would they? Instead of leading (as they have done so far, for the most part) they can at any point simply do what others have resorted to successfully and copy with a slight delay. People won't care.
I already see lots of normal people share screenshots of the AI Overview responses.
One well-placed ad campaign could easily change all that. Doesn't hurt that Google can bundle Gemini into Android.
I can switch tomorrow to use gemini or grok or any other llm, and I have, with zero switching cost.
That means one stumble on the next foundational model and their market share drops in half in like 2 months.
Now the same is true for the other llms as well.
For example, I had occasion to chat with a relative who's still in high school recently, and was curious what the situation was in their classrooms re: AI.
tl;dr: LLM use is basically universal, but ChatGPT is not the favored tool. The favored tools are LLMs/apps specifically marketed as study/homework aids.
It seems like the market is fine with seeking out specific LLMs for specific kinds of tasks, as opposed to some omni-LLM one-stop shop that does everything. The market has already, and rapidly, moved beyond ChatGPT.
Not to mention I am willing to bet that Gemini has radically more usage than OpenAI's models simply by virtue of being plugged into Google Search. There are distribution effects, I just don't think OpenAI has the strongest position!
I think OpenAI has some first-mover advantage, I just don't think it's anywhere near as durable (nor as large) as you're making it out to be.
Oops I think you may have flipped the numerator and the denominator there, if I’m understanding you. Valuation of 300B , if 2x sales, would imply 150B sales.
Probably your point still stands.
Although it does feel likely that at minimum, they are neck and neck with Google and others.
What? Apple has a revenue of 400B and a market cap of 3T
Edit: I am dumb, ignore the second half of my post.
I agree that Google is well-positioned, but the mindshare/product advantage OpenAI has gives them a stupendous amount of leeway
The only way for OpenAI to really get ahead on solid ground is to discover some sort of absolute game changer (new architecture, new algorithm) and manage to keep it bottled away.
they haven't been number one for quite some time and still people can't stop presenting them as the leaders
Even Google doesn't have $600B revenue. Sorry, it sounds like numbers pulled from someone's rear.
Lmfao where did you get this from? Microsoft has less than half of that revenue, and is valued > 10x than OpenAI.
Revenue is not the metric by which these companies are valued...
OAI, on the other hand, must spend a lot of additional money for every single new user, both free and paid. Adding a million new OAI users tomorrow would mean a gigantic red hole in the profits (adding to the existing negative). OAI has no, or almost no, benefits of scale, unlike other industries.
I have no knowledge about corporate valuations, but I strongly suspect that OAI's valuation needs to account for this issue.
In Canada, a third of the dates we see are British, and another third are American, so it’s really confusing. Thankfully y-m-d is now a legal format and seems to be gaining ground.
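One practical property of y-m-d worth noting: zero-padded ISO-style dates sort chronologically as plain strings, no locale-aware parsing needed. A small illustration:

```python
# ISO-style y-m-d strings compare chronologically as plain text,
# so lexicographic sort equals chronological sort.
dates = ["2025-06-06", "2025-05-06", "2024-12-31"]
print(sorted(dates))  # ['2024-12-31', '2025-05-06', '2025-06-06']
```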
06-06 is unambiguously after 05-06 regardless of date format.
They are clearly trolling OpenAI's 4o and o4 models.
Sure, I'm a lazy bum: I call the variable "json" instead of "jsonStringForX", but it's contextual (within a closure or function). I appreciate the feedback, but it makes reviewing the changes difficult (too much noise).
For code like this, it keeps changing processing_class=tokenizer to tokenizer=tokenizer, even though the parameter was renamed and even after adding the all-caps comment.
# Set up the SFTTrainer
print("Setting up SFTTrainer...")
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    args=sft_config,
    processing_class=tokenizer,  # DO NOT CHANGE. THIS IS NOW THE CORRECT PROPERTY NAME
)
print("SFTTrainer ready.")
I haven't tried with this latest version, but the 05-06 pro still did it wrong. It is worth it sometimes, but usually I use it to explore ideas and then have o1-pro spit out a perfect solution ready to diff, test, and merge.
"# Added this function" "# Changed this to fix the issue"
No, I know, I was there! This is what commit messages for, not comments that are only relevant in one PR.
# Removed iterMod variable here because it is no longer needed.
It's like it spent too much time hanging out with an engineer who doesn't trust version control and prefers to just comment everything out. Still enjoying Gemini 2.5 Pro more than Claude Sonnet these days, though, purely on vibes.
I've not tested this thoroughly; it's just my anecdotal experience over like a dozen attempts.
It's something I read a little while ago in a larger article, but I can't remember which article it was.
Something like, "Forbidden character list: [—, –]" or "Do NOT use the characters '—' or '–' in any of your output"
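When the prompt alone doesn't stick, a post-processing scrub is a reliable backstop. A minimal sketch (mapping both dashes to a plain hyphen is my assumption; swap in whatever replacement you prefer):

```python
# Minimal post-processing backstop: replace em/en dashes after generation
# instead of trusting the model to obey the forbidden-character list.
FORBIDDEN = {"\u2014": "-", "\u2013": "-"}  # em dash, en dash

def scrub(text):
    for bad, replacement in FORBIDDEN.items():
        text = text.replace(bad, replacement)
    return text

print(scrub("pros \u2014 and cons \u2013 of each"))  # pros - and cons - of each
```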
I'm thinking of cancelling my ChatGPT subscription because I keep hitting rate limits.
Meanwhile I have yet to hit any rate limit with Gemini/AI Studio.
Also note that AI Studio via the default free-tier API access doesn't seem to fall within "commercial use" in Google's terms of service, which would mean your prompts can be reviewed by humans and used for training. All info AFAIK.
This is not true for the Gemini 2.5 Pro Preview model, at least. Although this model API is not available on the Free Tier [1], you can still use it on AI Studio.
Seconded.
Either way, Google's transparency with this is very poor - I saw the limits from a VP's tweet
I haven't used Claude, but Gemini has always returned better answers to general questions relative to ChatGPT or Copilot. My impression, which could be wrong, is that Gemini is better in situations that are a substitute for search. How do I do this on the command line, tell me about this product, etc. all give better results, sometimes much better, on Gemini.
But everyone is using them for different things and it doesn't always generalize. Maybe Claude was great at typescript or ruby or something else I don't do. But for some of us, it definitely was not astroturf for Gemini. My whole team was talking about how much better it was.
What are your usecases? Really not my experience, Claude disappoints in Data Science and complex ETL requests in python. O3 on the other hand really is phenomenal.
I can't speak to it now - have mostly been using Claude Code w/ Opus 4 recently.
[1]https://nitter.net/OfficialLoganK/status/1930657743251349854...
Still actually falling behind the official scores for o3 high. https://aider.chat/docs/leaderboards/
Not sure if OpenAI has updated o3, but it looks like "pure" o3 (high) has a score of 79.6% in the linked table, while the "o3 (high) + gpt-4.1" combo has the highest score of 82.7%.
The previous Gemini 2.5 Pro Preview 05-06 (yeah, not the current 06-05!) was at 76.9%.
That looks like a pretty nice bump!
But either way, these Aider benchmarks seem to be the most useful/trustworthy benchmarks currently, and really the only ones I'm paying attention to.
This table seems to indicate it's markedly worse?
https://blog.google/products/gemini/gemini-2-5-pro-latest-pr...
- "Something went wrong error" after too many prompts in a day. This was an undocumented rate limit because it never occurs earlier in the day and will immediately disappear if you subscribe for and use a new paid account, but it won't disappear if you make a new free account, and the error going away is strictly tied to how long you wait. Users complained about this for over a year. Of course they lied about the real reasons for this error, and it was never fixed until a few days ago when they rug pulled paying users by introducing actual documented tight rate limits.
- "You've been signed out" error if the model has exceeded its output token budget (or runtime duration) for a single inference, so you can't do things like what Anthropic recommends where you coax the model to think longer.
- I have less definitive evidence for this but I would not be surprised if they programmatically nerf the reasoning effort parameter for multiturn conversations. I have no other explanation for why the chain of thought fails to generate for small context multiturn chats but will consistently generate for ultra long context singleturn chats.
After that I moved to OpenAI; Gemini models just seem unreliable in that regard.
Isn’t this what you can do with system instructions?
Are you talking about Sonnet 4 which never came to Windsurf because Anthropic does not want to support OpenAI?
However, in my personal experience Sonnet 3.x has still been king so far. Will be interesting to watch this unfold. At this point, it's still looking grim for Windsurf.
With the Claude Max development, non-vibing users seem to be going to Claude Code. This makes me think that maybe Cursor should have taken an exit, cause Claude Code is gonna eat everyone's lunch?
I've been preferring to use Copilot agent mode with Sonnet 4, but it asks you to intervene a lot.
Direct chat and copy pasting code? Seems clunky.
Or manually switching models in Cursor? Although that's extra cost and not required for a lot of tasks where Cursor tab is faster and good enough, so you need to opt in on demand.
Cline + OpenRouter in VS Code?
Something else?