Gemini 3.5 Flash: frontier intelligence with action

(blog.google)

167 points | meetpateltech | 6d ago | 87 comments

Per million input/output tokens:

Gemini 2.5 flash: $0.30/$2.50

Gemini 3.0 flash preview: $0.50/$3.00

Gemini 3.5 flash: $1.50/$9.00

Interesting pricing direction. I don't think we have ever seen a 3x price increase for the immediate next same-sized model (and lol @ 3 only ever getting a preview).

3.5 flash costs similar to Gemini 2.5 pro which was $1.25/$10
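
The multiples implied by these list prices can be checked in a few lines. This is a minimal sketch that uses only the per-million-token prices quoted in this comment; the dictionary layout and the helper name `ratio` are illustrative choices.

```python
# List prices quoted above, in USD per million tokens: (input, output).
prices = {
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 3.0 Flash preview": (0.50, 3.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
}


def ratio(new, old):
    """Return (input multiple, output multiple) of `new` relative to `old`."""
    (new_in, new_out), (old_in, old_out) = prices[new], prices[old]
    return new_in / old_in, new_out / old_out


for old in ("Gemini 3.0 Flash preview", "Gemini 2.5 Flash", "Gemini 2.5 Pro"):
    in_x, out_x = ratio("Gemini 3.5 Flash", old)
    print(f"3.5 Flash vs {old}: {in_x:.1f}x input, {out_x:.1f}x output")
```

Against 3.0 Flash preview both multiples come out at 3x; against 2.5 Flash the input price is 5x and the output price 3.6x.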

__jl__6d ago

This understates the cost increase. 3.5 Flash also uses more tokens. artificialanalysis.ai shows these differences in the cost to run the whole eval, which I think is more realistic pricing:

Gemini 2.5 flash (27 score): $172 (1.0x)

Gemini 2.5 pro (35 score): $649 (3.8x)

Gemini 3.0 Flash (46 score): $278 (1.6x)

Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)

This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
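
The multiples above follow directly from the quoted eval costs. A small sketch to reproduce them, assuming nothing beyond the four dollar figures listed in this comment (the variable names are illustrative):

```python
# Cost in USD to run the full eval, as quoted above from artificialanalysis.ai.
eval_cost = {
    "Gemini 2.5 Flash": 172,
    "Gemini 2.5 Pro": 649,
    "Gemini 3.0 Flash": 278,
    "Gemini 3.5 Flash": 1552,
}

baseline = eval_cost["Gemini 2.5 Flash"]
for model, cost in eval_cost.items():
    # Multiple relative to the cheapest model in the list.
    print(f"{model}: ${cost:,} ({cost / baseline:.1f}x)")

target = eval_cost["Gemini 3.5 Flash"]
vs_pro = target / eval_cost["Gemini 2.5 Pro"]
vs_prev_flash = target / eval_cost["Gemini 3.0 Flash"]
print(f"3.5 Flash vs 2.5 Pro: {vs_pro:.1f}x")
print(f"3.5 Flash vs 3.0 Flash: {vs_prev_flash:.1f}x")
```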

bnug5d ago

At these pricing levels, corporations who use the models will need to ensure employees are using them efficiently. I know, where I work, we don't really think about the cost to the company when using copilot chat, but sounds like it could start adding up really fast, especially for poorly defined questions that have to be revised multiple times.

joshmlewis5d ago

It's interesting they use output tokens as an eval because all tokens are not made equal. Even from model to model (like Opus 4.6 to Opus 4.7) the tokenizer can be different and it's no longer an apples to apples comparison. No one really talks about this but it directly affects stats like usage limits. Certainly comparing models between providers on an apples to apples comparison token wise is not a good test.

xdertz5d ago

the era of subsidised ai is ending

driverdan5d ago

API calls have never been subsidized, only subscriptions.

kzrdude5d ago

AI is getting really useful, might be why

ahknight5d ago

Sonnet-level performance at Haiku prices. They know what they have and who the audience is they want.

ashirviskas5d ago

Gemini 2.0 Flash: $19

ahknight5d ago

... and you get what you pay for. Or less.

doginasuit6d ago

They probably never intended to keep serving cheap models. This is a natural way to introduce the squeeze, now that they have people who built services on their API. It makes a lot of sense to have an abstraction layer where the provider doesn't matter. If you are working in Kotlin, Koog is excellent.
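
As a rough illustration of what such an abstraction layer can look like, here is a minimal Python sketch. Everything in it is hypothetical: `ChatProvider`, `EchoProvider`, `FailingProvider` and `complete_with_fallback` are made-up names, and the stand-in classes do not call any real vendor API (this is not how Koog is implemented).

```python
from typing import Protocol, Sequence


class ChatProvider(Protocol):
    """Anything that can turn a prompt into a completion."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Hypothetical stand-in for a real vendor client."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class FailingProvider:
    """Hypothetical provider that is down or has been priced out."""

    def complete(self, prompt: str) -> str:
        raise RuntimeError("provider unavailable")


def complete_with_fallback(providers: Sequence[ChatProvider], prompt: str) -> str:
    """Try each provider in order and return the first successful answer."""
    last_error = None
    for provider in providers:
        try:
            return provider.complete(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


answer = complete_with_fallback([FailingProvider(), EchoProvider("backup")], "hello")
print(answer)
```

Application code depends only on `ChatProvider`, so swapping or reordering vendors is a change to the provider list rather than to every call site.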

opsnooperfax6d ago

I think the big 3 are cartelizing and starting to ratchet up costs. GPT5.5 is not easily distinguishable from 5.1. I wouldn't be shocked if we hit the ceiling and everyone is quietly positioning for the exit.

tskj5d ago

I don't understand why everyone thinks there is a ceiling below human-level intelligence, when we have an existence proof that human-level intelligence is possible.

lanthissa6d ago

switching models is insanely cheap compared to token cost on anything significant, this is a take so cynical it misses the reality

Clueed6d ago

in any corporate or half compliance-relevant setting switching isn't trivial. new DPA, subprocessor notifications, TIA, procurement review, security questionnaires, plus re-running your evals because prompts don't transfer 1:1. token cost is just one of the line items.

hnarn6d ago

> now that they have people who built services on their API

People really can’t wait to be the next Zynga

rudedogg6d ago

If Google is actually getting cheaper inference than everyone else with their TPUs, this smells like trouble to me. Maybe serving LLMs at a profit is proving difficult.

Or maybe they think because their benchmarks are good they can ramp up the prices. Seems like they don’t have the market share to justify a move like that yet to me.

tempaccount4206d ago

This is not priced at inference cost.

My guess: it's the price at which they make more money than if they rent the TPUs to other companies.

The Gemini team has had trouble securing enough TPUs for their users' needs. They struggle with load and their rate limits are really bad. Maybe at a higher price, they have a better chance at getting more TPUs assigned?

gpm6d ago

The price at which they could rent out the TPUs, i.e. the market rate, is the inference cost.

Just because you are vertically integrated doesn't mean you get to discount one business unit's products to the other. Doing so ignores the opportunity cost you pay and is just bad accounting.

spyckie26d ago

It's probably that in 1 or 2 years local (free) models will completely take the place of cheap models so cheap models need to move up the quality chain.

You have free local models for most tasks, $20 subscriptions for near-frontier intelligence, and API per token costs for frontier intelligence.

Flash seems to be targeting the near-frontier category.

TurdF3rguson6d ago

That might work if it wasn't for FOMO. Are you ok with only $20 of frontier usage a month?

booty6d ago

Prevailing wisdom is that serving LLMs at a profit is achievable... it's when you factor in the cost of training them that prices get astronomical real fast.

Open-source model inference providers (who do not have to bear the cost of training) seem able to do it at much lower prices.

https://www.together.ai/pricing

https://fireworks.ai/pricing#serverless-pricing (scroll down to headline models)

Of course, it's possible that they are burning through investor cash as well, and apples-to-apples comparisons are not possible because AFAIK Google does not mention the size/paramcount for 3.5 Flash.

But if the prevailing wisdom is true, I think it's actually encouraging. It suggests that OpenAI and Anthropic could perhaps, if they need to, achieve profitability if they slow down model development and focus on tooling etc. instead. If true that's probably good news for everybody w.r.t. preventing a bursting of this economic bubble.

...my opinions here are of course, conjecture built on top of conjecture....

eklitzke6d ago

Most of the training cost is not in the final training run, it's in all of the R&D (including salaries, equity, etc.) that it takes to get to the final training run. The actual cost of all of the TPUs (or GPUs), power, networking, storage, etc. for the final training run is significant, but it's even more expensive to have this huge R&D team doing frontier model development and using a lot of those same resources during development.

I think you're right that releasing models at a slower cadence would bring down costs to some degree, but it's not clear how much. All of these companies could significantly reduce their opex but at the risk of falling behind in terms of being at the frontier.

HDBaseT6d ago

Not to discredit you, because you are 100% correct, but a tangential note about together.ai: they seem fairly unreliable, with constant outages or higher-than-normal latency.

BoorishBears6d ago

This is trouble if you're not Google/OpenAI/Anthropic: they're all shifting towards pricing for the economic value of the knowledge work they're aiding.

The economic value increases non-linearly as models get more intelligent: being 10% more capable unlocks way more than 10% in downstream value.

That's trouble because the non-linear component means at some point their margins will stop being primarily defined by the cost of compute, and start being dominated by how intelligent the model is.

At that point you can expect compute prices to skyrocket and free capacity to plummet, so even if you have a model that's "good enough", you can't afford to deploy it at scale.

(and in terms of timing, I think they're all well under the curve for pricing by economic value. Everyone is talking about Uber spending millions on tokens, but how much payroll did they pay while devs scrolled their phones and waited for CC to do their job?)

tskj5d ago

Thank you, this is obviously where we're heading. People who think in terms of "will it ever be profitable to sell tokens" are thinking in the wrong framework entirely. The correct framework is "will it be profitable to sell knowledge work", and the answer will clearly be "yes".

IncreasePosts6d ago

Maybe the margins are just very large for Google because they predict so much demand for 3.5?

GodelNumbering6d ago

This combined with locally runnable models getting pretty good recently (e.g. Qwen 3.6) tells me that it's time to seriously consider local dev setup again

hei-lima6d ago

We need another "Deepseek moment" or else it will become impossible for the regular dude to use AI. It will become something that only big companies can afford.

SwellJoe6d ago

We're having DeepSeek moments every couple of weeks.

Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.

And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.

The winners in the AI war will be the companies that figure out how to run them efficiently, not the ones that eke out a couple percent better performance on a benchmark while spending ten times as much on inference (though the capability has to be there, I think we're seeing that capability alone isn't a strong moat...there's enough competent competition to ensure there's always at least a few options even at the very frontier of capability).

Zambyte6d ago

> It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

You can lower that to at least 24GB. I've been running Qwen 3.5 and 3.6 with codex on a 7900 XTX and the long horizon tasks it can handle successfully have been blowing my mind. I would seriously choose running my current local setup over (the SOTA models + ecosystem) of a year ago just based on how productive I can be.

trollbridge6d ago

We have Qwen 3.6-35b (6) on a 5090 (32GB) and it's blowing me away. Works fine for most (not all) code generation tasks. One developer here has been extremely stubborn about adopting AI; he's finally adopted it, albeit only when it's coming from a local model like this.

DeepSeek V4 Pro likewise is insanely good for the price. I simply point it at large codebases, go get a cup of coffee or browse Hacker News, and then it's done useful work. This was simply not possible with other models without hitting budget problems.

squidbeak6d ago

Deepseek had another moment a few weeks ago. V4 isn't far behind the US frontier, and so far its flash variant seems a very reliable coder and costs a pittance.

ai_fry_ur_brain6d ago

Deepseek V4 (not flash) tripled in price too by the way (from Deepseek). Get used to this pattern.

This is what you get for relying on the generosity of billionaires. Keep offshoring your thinking ability to a machine and let me know how competitive you are. Hint: you won't be. There's nothing special about being able to use an LLM.

xbmcuser6d ago

What we need is a DeepSeek moment in hardware, i.e. China reaching parity on node size. That is the only way the latest computers, let alone the latest AI, will be available to us in the future; otherwise the profit margins will push most production to AI.

throwa3562626d ago

To be honest, China not having access to the latest hardware is exactly what has driven LLM technology forward the last 2 years.

blackoil6d ago

Open Source ASML EUV. But that would wipe trillions off US stocks, so 401(k)s may not like that.

stared6d ago

We have a "DeepSeek moment", https://github.com/antirez/ds4 (see https://news.ycombinator.com/item?id=48142108).

Or if you prefer smaller ones, Qwen3.6-35B-A3B, https://huggingface.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF

Bombthecat5d ago

Can you run a coal power plant in your backyard? Or a giant solar power farm?

Of course not

And you don't need to

segmondy6d ago

You can use lots of open weight models today.

hei-lima6d ago

That's one solution to the problem. But it still needs some good computational capabilities. Either we optimize the hell out of those models, or we wait for the hardware to become good enough for them.

Gigachad6d ago

The real problem is the hardware to run them is still very expensive.

pianopatrick6d ago

Maybe we can figure out better ways to use the models that can run on cheap hardware.

GeorgeOldfield6d ago

gemini isn't even that good. just tested 3.5 on usual complex prompts to opus/chat 5.5. meh

k8sToGo6d ago

Are you really comparing flash to opus? Shouldn't you be comparing pro?

bachmeier6d ago

Who would have guessed that something costing roughly a third as much wouldn't do as well at certain tasks.

kmac_6d ago

Well, the first impression is that Gemini still goes off the instruction rails easier than other models, but I noticed that it tends to go back to the initial goal without holding a hand, which is a real improvement. It's really interesting that these models behave so differently.

jstummbillig5d ago

> Interesting pricing direction.

Is it? More capability, more demand, higher price. Seems relatively uninteresting. The naming structure complicates it: 3.5 Flash is less comparable to 3.0 Flash than it is to 3.0 Pro.

More generally, $/token + naming scheme comparisons are just confusing: I am not looking for a wordy idiot and I doubt most people are (at least not with what I would consider worthwhile business ambitions). In fact wordy idiots are fairly costly, because we have to consider the large amounts of cheap garbage that they are producing, and if you price your own time somewhat competitively then fairly quickly that's the bigger lever.

Even if we don't consider the last part: How do we price the better model, that can one shot a task without having to go back and forth and spending more tokens or having to fix more bugs later? It is definitely worth something and I think it's quite undervalued right now. What seems to be missing is a better measurement of capability per token. I don't know what that could look like. Maybe something like how we try and measure inflation, some basket of tasks (which then ends up being part of the training data so idk).

fnordsensei6d ago

3.5 flash is listed as stable rather than preview, or am I misreading?

https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...

GodelNumbering6d ago

ah I mistakenly wrote preview

dr_dshiv6d ago

3.1 flash lite — $0.25/$1.50 — plus insanely fast.

3.1 flash lite isn’t quite as good as 3 flash preview (which is the most incredible cheap model… I really love it) — but 3.1 is half the price and the insane speed opens up different use cases.

For comparison, Opus models are $5/$25

SwellJoe6d ago

Opus 4.7 is smarter than even Gemini 3.1 Pro on nearly every metric, though. You're comparing apples to oranges. Gemini 3.1 Flash is somewhere in the neighborhood between current Haiku and Sonnet, I think? Still a better value than the Anthropic models, I guess, which are quite pricey.

Since Gemini 3.5 Flash is raising the price to $1.50/$9.00, it's priced between Haiku and Sonnet. If it outperforms Sonnet, it remains a good value, I guess. Though DeepSeek V4 Flash is much cheaper than all of them, and seemingly competitive.

dr_dshiv5d ago

Definitely apples to oranges, sorry I wasn’t clear. I only included opus pricing for comparison—it is vastly superior. But even 3.1 flash lite is really useful.

Of course, if I manage to reach my limits every week on my Claude $200 sub, opus 4.7 is probably priced closer to flash!

WarmWash6d ago

>Opus 4.7 is smarter than even Gemini 3.1 Pro on nearly every metric,

Outside of coding, claude models are pretty meh. GPT and Gemini are the workhorses of science/math/finance.

OakNinja6d ago

To be fair, Gemini 3.1 flash _lite_ supports structured output (guaranteed json), it’s super fast, runs circles around 2.5 flash and costs $0.25/$1.50.

I use it _a lot_ and it’s very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.

That said, I think we’ll see the lite/flash models and the pro models will diverge more price wise. The pro models will become more and more expensive.

drob5185d ago

I think that’s true on divergence. Basically, the only moat is living at the frontier, and even that is only temporary. At some point, the frontier advances such that 99% of tasks can use something short of a frontier model and only a very few tasks actually demand frontier performance.

WhitneyLand6d ago

Their rationale might be that its size and intelligence are growing relative to the market.

Fwiw it’s beating Claude Sonnet in most benchmarking (benchmaxxing?), yet they’ve priced it almost half off on a per token basis.

Question is are you going to persuade anyone with this argument?

Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

SyneRyder6d ago

> Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

A few weeks ago, Steve Yegge claimed he'd heard that Google employees are banned from using Claude & Codex.

https://x.com/Steve_Yegge/status/2046260541912707471

A number of Googlers replied to say that was totally false, including Demis Hassabis, but they were all on the DeepMind team.

https://x.com/demishassabis/status/2043867486320222333

This person here claims they left Google because of the ban, and because the ban applied outside of Google work as well:

https://x.com/mihaimaruseac/status/2046272726881693960

myko6d ago

> and because the ban applied outside of Google work as well

I think false (or hasn't filtered to everyone lol)

davedx5d ago

I use Gemini for heavy web scraping-adjacent API work. Web grounding has been super useful for the project.

I will definitely not be updating to this new model, and I think once 2.5 Flash is deprecated I'll have to re-architect so Gemini is only used for web grounding requests. This is an insane price increase.

dbbk6d ago

I don't think they're really comparable. Seems they created the Flash-Lite tier to take the spot of the old Flash models.

GodelNumbering6d ago

No, 2.5 had both flash and flash lite.

mlmonkey6d ago

It is Google, after all ....

photonair6d ago

In general, Gemini Flash is still relatively cheap compared to the "mini" versions from the other big 2. However, I agree that newer versions seem to come with multi-x price increases (similar to the new ChatGPT), and we certainly need competition from the open source models to keep these guys in check on pricing.

harrouet5d ago

If you look at the benchmark, the model is not particularly good at coding, and as you point out it costs 3x the price of the previous flash models. So what is the market for it?

I think that they might have reached the latency sweet spot where voice applications become more natural. Natural speech is <100 tokens per second (after STT), so $9 for a million tokens buys you roughly 3 hours of speech. That's totally competitive compared to human costs.
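
The "roughly 3 hours" figure can be reproduced as follows. This is a back-of-the-envelope sketch that takes the comment's 100 tokens per second as an upper bound on speech rate and the $9.00 output price quoted elsewhere in the thread; it ignores input tokens and any thinking tokens, which is a simplifying assumption.

```python
# USD per million output tokens for Gemini 3.5 Flash, as quoted in the thread.
PRICE_PER_MILLION_OUTPUT_TOKENS = 9.00
# Upper bound on speech rate assumed in the comment above.
TOKENS_PER_SECOND = 100

seconds = 1_000_000 / TOKENS_PER_SECOND
hours = seconds / 3600
cost_per_hour = PRICE_PER_MILLION_OUTPUT_TOKENS / hours

print(f"{hours:.2f} hours per million tokens, ${cost_per_hour:.2f} per hour of speech")
# prints "2.78 hours per million tokens, $3.24 per hour of speech"
```

Since real speech is slower than 100 tokens per second, a million tokens would cover more than 2.78 hours, so the per-hour cost here is a ceiling under these assumptions.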

LetsGetTechnicl6d ago

Gen AI is unprofitable, especially at the insanely cheap rates they've been offering to get people in the door. So expect more increases in the future.

roadside_picnic6d ago

These companies are unprofitable (as all companies at this stage and ambition should be) but I increasingly don't see any justification for the idea that it is fundamentally unprofitable.

Inference alone is certainly profitable. I'm running models at home that are comparable to performance of paid models a year or so ago for free. Even for much larger models the costs around inference serving are clearly manageable.

Training is where the costs are, but I'm increasingly convinced those too could have costs dramatically reduced if necessary. Chinese companies like Moonshot.ai are doing fantastic work training frontier models for a fraction of the cost we're seeing from Anthropic/OpenAI.

This isn't like Uber or Doordash where the economics fundamentally don't make sense (referring to the early days of these services where rates were very cheap).

It's a compelling story that "current AI is unsustainable", but it doesn't pan out in practice for a multitude of reasons (not the least of which is that we can always fall back to what models did last year for basically free).

overrun116d ago

Arguably nothing even has to change with training for this to be sustainable. Dario has claimed that Anthropic is profitable on a per training run basis. They aren't profitable because they choose to keep investing in increasingly large training runs.

ReliantGuyZ6d ago

And if you can run those strong models at home for free, why would hosting them be a successful business for any of these providers?

Profitable maybe, in terms of having low costs, but why pay Google or whoever when you can do it yourself for cheaper/"free"?

LetsGetTechnicl6d ago

If it's profitable, why haven't they reported any profits? People like Ed Zitron have done the math and it just doesn't add up. I mean he just published this piece today: https://www.wheresyoured.at/ai-is-too-expensive/

booty6d ago

Yeah, at this point I think the worst-case scenario for OpenAI/Anthropic/etc is to slow down frontier model development and focus on tooling and services, as opposed to imploding completely and bursting the economic bubble. I hope?

GaggiX6d ago

If you don't need SOTA or near SOTA there are plenty of dirt cheap models, just look at Gemma 4 31B on Openrouter.

Gigachad6d ago

For all of the use cases being hyped you really do, and you actually need something much better than the SOTA models to do what we are being told can be done.

The small models are useful for small things like summarizing text or search but not much else.

scrollop5d ago

You mean Kimi or qwen

npn6d ago

It is insanely profitable though, if you cut out r&d cost, plus the marketing and loss leaders. Don't let them gaslight you.

Even Anthropic, which does not own any hardware, still has a big margin providing Claude models.

LetsGetTechnicl6d ago

Then why haven't they reported any profits using GAAP (generally accepted accounting principles)? They all use ARR which is easily gamed.

timmytokyo6d ago

Everything is insanely profitable if you ignore the costs.

ilia-a6d ago

Yeah, it is a massive jump in price, hardly a "Flash" model anymore... I wonder if they'll release flash lite or something with a bit more affordable price point.

OakNinja6d ago

There’s already a flash lite tier since 2.5. Latest is 3.1 currently.

llm_nerd6d ago

It might be temporary pricing given that 3.5 Flash is actually superior to the existing 3.1 Pro in almost all regards, so they're in a bit of a lurch as 3.1 Pro really doesn't make sense given that 3.5 Pro has been delayed a bit.

bjoli5d ago

I let it loose on an F# codebase that I know was pretty optimized but with a few low hanging fruit changes that would have a big impact.

3.1 Pro did NOT find them. 3.5 flash did. Plus one I hadn't thought of that may or may not work (which it also pointed out).

I'm pretty impressed.

irthomasthomas6d ago

And they are using this to power search answers?

CooCooCaCha6d ago

I bet the API pricing helps pay for search users

malloryerik6d ago

To me this is almost like a tone-deaf naming change.

Empty Slot (new Pro as Mythos competitor?)

Old Pro -> now Flash

Old Flash -> now Flash Lite

Old Flash Lite -> now Gemma (and not served by Google)

I say "almost" because the situation is more fluid and unstable than a normal naming change. If Apple were to do this with laptops, maybe it'd be like, Air gets better and pricier and becomes Pro-level model, Neo same way becomes Air-level model, etc. But Apple's too design oriented to do something like that. Google, well...

This change has made me decide to move to a multi-provider situation like through OpenRouter for consumer-facing LLM api in a service I'm building. I just can't trust Google to not constantly rearrange everything under our feet. Doesn't mean I won't use Gemini, but it clearly means I need to have others in the mix ready to go. In fact I used to use lots of Flash Lite, which is now Gemma territory, and I can't get that served by Google anymore and don't want to run my own hardware.

But in any case, I'd compare this "Flash" model with previous "Pro" on all metrics. It's kinda like if in clothes a Small suddenly became what was a Large, or at Starbucks a Grande became the new de facto Venti. And only for the new! drinks.

And if we think this way, it's possible that prices are actually falling?

deaux5d ago

> Old Flash Lite -> now Gemma (and not served by Google)

> which is now Gemma territory, and I can't get that served by Google anymore

Gemma is served by Google. They're serving Gemma 4 26B A4B at $0.15/$0.60.

https://console.cloud.google.com/agent-platform/publishers/g...

https://cloud.google.com/gemini-enterprise-agent-platform/ge...

malloryerik5d ago

Ah, thanks!

baq5d ago

Demis is on record saying they need small models on edge devices and if it’s on the edge the weights may as well be public officially.

ashirviskas5d ago

don't forget Gemini 2.0 flash at $0.10/$0.40

verdverm6d ago

At the same time, it is supposedly Gemini 3.1 Pro level at 3/4 the price

and far cheaper than comparable models, Gemini Pro is cheaper than Claude Sonnet (Anthropic still gets to charge a brand premium)

SwellJoe6d ago

That's a lot. DeepSeek v4 Flash is just over a tenth the price, and DeepSeek v4 Pro is roughly the same price (currently heavily discounted, but will be $1.74).

I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices it has to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.

copperx6d ago

They have said AI will be priced like a utility, meaning $100-300 per month or so.

dzhiurgis6d ago

I use Gemini models in Junie daily. When I need accuracy I switch to Gemini 3.1 Pro Preview (why is it still in preview?), but it burns thru credits leaving me topping up $5 every day. 3.1 Flash lite is just not accurate enough. 3 Flash is the sweet spot, just as Jetbrains suggests it is.

Maybe I'll look at Opus again, but it was just slower, much more expensive and, worst of all, wasn't listening to your instructions.

throwa3562626d ago

Gemini 2.5 flash was the best Gemini model.

Not the most intelligent but perfect balance of cheap, fast and not-too-dumb.

npn5d ago

The 09-2025 preview was awesome.

m3kw96d ago

just subscribe to the plan, cheaper

SXX6d ago

  > Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG

3.5 Flash: Thinking Medium - 7516 tokens

https://gistpreview.github.io/?5c9858fd2057e678b55d563d9bff0...

3.5 Flash: Thinking High - 7280 tokens

https://gistpreview.github.io/?1cab3d70064349d08cf5952cdc165...

3.1 Pro - 28,258 tokens

https://gistpreview.github.io/?6bf3da2f80487608b9525bce53018...

3.1 took 3 minutes of thinking to generate, but it's the only one that got animated movement.

SXX6d ago

Gemini 3.1 Flash Lite Thinking High - 2,526 tokens:

https://gistpreview.github.io/?3496285c5dac5ba10ebbc0b201a1a...

Gemini 2.5 Pro - 5,325 tokens:

https://gistpreview.github.io/?cc5e0fefeaaffecd228c16c95e736...

Gemini 2.5 Flash - 7,556 tokens:

https://gistpreview.github.io/?263d6058fe526a62b8f270f0620ec...

Gemma 4 31B IT - 3,261 tokens via AI Studio:

https://gistpreview.github.io/?858a42b96af864859a3b89508619d...

Gemma 4 26B A4B IT - 4,034 tokens via AI Studio:

https://gistpreview.github.io/?4adb7703897e0c6b583f9de928e4a...

SXX6d ago

Gemma 4 E4B it via Edge Gallery on pixel phone:

https://gistpreview.github.io/?da742884e5e830ce71ee4db877519...

OFC this is just for fun, but nevertheless gave me working code on first try.

segmondy6d ago

I'm surprised the "they must have trained for it" camp is not here saying that rubbish.

franze6d ago

Opus 4.7

https://claude.ai/public/artifacts/128ebe5a-add7-406a-9bce-6...

tasuki6d ago

Wow that's terrible. Any idea why?

lpa226d ago

Did you see the other ones? This is very good by comparison.

doubleorseven5d ago

My guess would be that this is just software that doesn't understand how the world works and is only trying to please? Idk, maybe I'm wrong.

stingraycharles6d ago

I think Anthropic optimizes less for visuals. Also, it’s not that terrible.

abtinf6d ago

hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF @ Q6_K

8112 tokens @ 52.97 TPS, 0.85s TTFT

https://gistpreview.github.io/?7bdefff99aca89d1bc12405323bd4...

Full session: https://gist.github.com/abtinf/7bdefff99aca89d1bc12405323bd4...

Generated with LM Studio on a Macbook Pro M2 Max

https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6...

SXX6d ago

Well, honestly this is quite impressive compared to 3.1 Flash Lite and 2.5 Pro. Considering that 2.5 Pro is actually quite good at generating massive amounts of code one shot.

svnt6d ago

It isn’t animated at all for me?

kingstnap6d ago

It is animated but the viewer is broken for some reason (tested Chrome latest windows).

This one works:

https://www.svgviewer.dev/s/04ipQgsU

SXX6d ago

It is animated, just with no movement like in my 3.5 Flash examples. Maybe try a different browser, unless it's iOS.

vtail6d ago

Here is GPT 5.5 High thinking; I had to add a second follow up prompt "it's not animated though" as the first one was not animated.

https://gistpreview.github.io/?557f979c82701862bc26d24f10399...

vtail6d ago

Here is a GPT 5.5 Extra High with a modified instruction:

> Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG. Use the Brave Browser to verifty that the image is indeed animated and looks like a proper rowing frog; iterate until you are satisfied with it.

It was able to discover and fix an animation bug, but the result is still far from perfect: https://gistpreview.github.io/?029df86d03bfe8f87df1e4d9ed2f6...

hskalin6d ago

Why is it fixated on the front perspective? Interesting choice though, because most humans (and seems like other LLMs too) would pick a side perspective

captn3m06d ago

All three links animate for me.

NitpickLawyer6d ago

I think they mean the boat is moving. In the flash ones the paddles are animated but the boat is stationary for me.

codazoda6d ago

The boat moves in all three for me

r0fl6d ago

It’s shocking how much better 3.1 is than 3.5 flash

The benchmarks used don’t really give a full story

wslh6d ago

Can you try with a more complex story such as "three little pigs"? I tried but it created a storybook instead of the SVG animation. I am looking to partially imitate Godogen [1][2] which is really great, even for animations.

[1] https://github.com/htdt/godogen

[2] https://drive.google.com/file/d/1ozZmWcSwieZQG0muYjbj7Xjhhlz...

SXX6d ago

I think it's unreasonable to expect models to generate complex stories from a single prompt since they are trained to be concise, but I tried. This is the prompt, added on top of the story, with a request for no control buttons:

   Now think, plan how to tell this story in a cartoon, make scene outline and then generate SVG animation story for "Three Little Pigs" in self contained HTML page. Just single animation no control buttons.

Full prompt in gist comments: https://gist.github.com/ArseniyShestakov/ed9faa53604035005ca...

Actual results for models, one shot:

Gemini 3.5 Flash - Three Little Pigs - 9,050 tokens:

https://gistpreview.github.io/?ed9faa53604035005cae86c63c766...

Gemini 3.1 Pro - Three Little Pigs - 24,272 tokens:

https://gistpreview.github.io/?f506bbfd9b4459c8cd55d89605af8...

Gemini 3 Flash - Three Little Pigs - 5,350 tokens:

https://gistpreview.github.io/?f58eff069cf916031c97d560b0e35...

Gemma 4 31B IT - Three Little Pigs - 5,494 tokens:

https://gistpreview.github.io/?a3aa75abbe8fd7818b73f6fa55ee6...

Gemma 4 26B A4B IT - Three Little Pigs - 6,375 tokens:

https://gistpreview.github.io/?1e631caebeb54f9f0cd6d0e3d4d5e...

segmondy5d ago

This was generated locally with Kimi https://gistpreview.github.io/?d55f07c22d54badc8042a7c8b3785...

no-name-here6d ago

3.1 pro was pretty good among them. (iOS)

ZeWaka6d ago

Wow, Gemini 3.5 Flash surprised me there.

krupan6d ago

These are hilarious. 3.5 Flash Thinking High is the only one that is weirdly deformed (what is going on with the hat in 3.1 Pro??)

stingraycharles6d ago

3.5 Flash definitely got the synth wave vibe preference.

abi6d ago

Your links are broken FYI.

John78787816d ago

They work for me.

TacticalCoder6d ago

They do work here too.

golfer6d ago

Arena.ai:

> Gemini 3.5 Flash’s pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers.

https://x.com/arena/status/2056793180998361233

h14h6d ago

Given how widely varying the amount of tokens each model uses for a given task, "price-per-token" is essentially meaningless when doing this sort of comparison.

Artificial Analysis's "Cost to run" model (aka num_tokens_used * price_per_token) is much better, but even that is likely problematic since it's not clear whether running a bunch of benchmarks maps cleanly to real-world token use.
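The "cost to run" idea above can be sketched roughly as follows. The per-million prices are the ones quoted in this thread for Gemini 3.0 Flash and 3.5 Flash; the token counts and the names `Pricing` and `cost_to_run` are made-up placeholders for illustration, not measurements or anyone's real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


def cost_to_run(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost of one task: tokens used times price per token."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Prices as quoted in the thread.
flash_3_0 = Pricing(input_per_million=0.50, output_per_million=3.00)
flash_3_5 = Pricing(input_per_million=1.50, output_per_million=9.00)

# Hypothetical usage for the same task: (input tokens, output tokens).
# The second model is assumed to emit twice as many output tokens.
cost_a = cost_to_run(flash_3_0, 20_000, 4_000)
cost_b = cost_to_run(flash_3_5, 20_000, 8_000)

print(f"3.0 Flash: ${cost_a:.4f}")
print(f"3.5 Flash: ${cost_b:.4f}")
print(f"ratio: {cost_b / cost_a:.1f}x")
```

With these assumed token counts the per-task ratio comes out higher than the 3x list-price ratio, which is the point being made: the sticker price alone hides differences in token use.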

ohlookcake5d ago

That graph seems odd. It looks like Gemini 3.5 Flash is not actually on the convex hull, and they forced the 'frontier' to bend inwards to include it

OsrsNeedsf2P6d ago

Beats 3.1 Pro for price per token, but artificial analysis is showing it's dumber per token and costs more overall

golfer6d ago

Arena.ai is saying "Gemini 3.5 Flash’s pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers."

https://x.com/arena/status/2056793180998361233

nicce6d ago

Not sure what to think about this. There isn't even GPT 5.5

sauwan6d ago

Yeah, bummer. I was very excited for this release, but this killed it.

droidjj6d ago

The pricing is an issue.

himata41136d ago

Engineers at Google have publicly stated that the models are too big and are far from their potential. Glad they're being proven right with every release.

They continue to focus on smaller models while openai and anthropic are increasing compute requirements for their SOTA models.

stri8ed6d ago

Given the cost increase associated with this model, and previous model releases, I think the size is trending upwards, not down.

himata41136d ago

The speed says otherwise. I think they're increasing costs since they want to start seeing ROI.

JanSt6d ago

Those are (mostly) new, faster TPUs

maipen6d ago

Don’t let that fool you. Google will have SOTA models as big as or even bigger than their competitors.

They are just refining their current models while they finish training the next generation.

They will all come out at about the same time. Anthropic, OpenAI, Google, xAI

ACCount376d ago

Anthropic has been sitting on Mythos for a while now. I guess they don't feel pressured to fuck it ship it until anyone else gets a 10T to work.

throwa3562626d ago

According to people who have access to Mythos, it is slightly worse than GPT-5.5-xhigh. At least for security tasks.

Hold on, I think this claim needs some hard data. Here you go gentlemen:

https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...

abirch6d ago

Anthropic can sell Mythos to Fortune 500 companies and bypass the average user. I'm not sure how much is hype but I see things like this https://blog.cloudflare.com/cyber-frontier-models/

Sevii6d ago

It's doubtful they have the compute to make mythos publicly available even after the SpaceX datacenter deal. And why sell it publicly if people are still willing to pay for Opus 4.7?

outside12346d ago

I suspect that Mythos doesn't have a business model that works

Jabbles6d ago

> Engineers at Google have publicly stated that the models are too big and are far from their potential

Can you link to a source?

himata41136d ago

I wish I could, it was one of those youtube podcast type interviews with one of the engineers, there was a lot more shared, but that line stuck with me the most.

Dinux6d ago

Source please, because I don't believe that for one second

howdareme6d ago

Google’s Pro models are almost certainly bigger than OpenAI’s lol

fikama6d ago

Why would that be? I am curious why do you think that.

mnicky6d ago

E.g. because they are behind on research and so must compensate with size to achieve similar level of intelligence. At least this is what I heard.

For intelligence/size only OpenAI and Anthropic are the frontier. Google has more compute so it can compensate for that with size of the models...

ActorNightly6d ago

Because TPUs are more efficient, and it's cheaper for them to field them in higher quantity since they own the chip.

ActorNightly6d ago

I mean, yes and no.

Nobody really knows the answer to which one is more optimal

* Large model trained on a large amount of data across multiple domains, that doesn't need any extra content to answer questions.

* Smaller model that is smart enough to go fetch extra relevant content, and then operate on essentially "reformatting" the context into an answer.

merb6d ago

Still no new processor version for Document AI https://docs.cloud.google.com/document-ai/docs/release-notes that is so weird. (Custom extractor)

It’s not possible to uptrain on preview releases and it did not get that much love for a while.

s3p6d ago

Yikes. I think the concept of a 'flash' model is changing, no? Google used to market this as its lower-intelligence, faster, cheaper option. I appreciate that they are delivering on both of those, but personally I would appreciate if they could create an incremental knowledge improvement while holding price steady. Fortune 500 companies have to make their money I guess.

2001zhaozhao6d ago

I think flash just means "fast" now

kilpikaarna6d ago

Real smart. I’ve come to associate ”Flash” with ”useless make-shit-up”, and always look for Thinking/Pro when I see it set. Now, suddenly, there is only Flash?

likium6d ago

My guess is Gemini Pro coming later will be 2x more, bringing it comparable to Opus’s pricing.

toraway6d ago

That would be Flash Lite now, and I'm also interested in the cheaper end of things so kinda disappointed they didn't release 3.5 Flash Lite at the same time...

asar6d ago

$1.5/m input tokens $9/m output tokens

6x the price of 3.1 flash lite

Aunche6d ago

"Flash-Lite" is a different product from "Flash", which is more expensive. They couldn't be more confusing with their naming though, especially since they have 3.1 Pro and not 3.1 Flash non-lite.

WarmWash6d ago

I haven't used 3.5 at all yet, but previous Gemini (and Gemma models) are by far the most token light per task than any other model.

Cost per task is a more productive measure, but obviously a more difficult one to benchmark.

iwhalen6d ago

I wonder why they didn't discuss price in the post?

Compare to the GPT-5.5 announcement: https://openai.com/index/introducing-gpt-5-5/

himata41136d ago

I don't think input/output pricing matters, 90% of the cost is cache. $0.15 is pretty good, but still very expensive.

wolttam6d ago

It depends on the use-case. yes, 90% of cost is cache in agentic coding scenarios (actually 95% in my experience). But not when the model reasons for 200k+ tokens before answering a complex problem.

himata41136d ago

gemini models solve a problem in 80% less tokens so that's something to think about.

simonw6d ago

Gemini caching is confusing though:

  $0.15 / million tokens
  $1.00 / 1,000,000 tokens per hour (storage price)

I much prefer the OpenAI/DeepSeek way of pricing caching where you don't have to think about storage price at all - you pay for cached tokens if you reuse the same prefix within a (loosely defined) time period.

simonw6d ago

As far as I can tell Gemini caching DOES work like OpenAI - see implicit caching here: https://ai.google.dev/gemini-api/docs/caching

I confirmed this by running a bunch of prompts through Gemini 3.5 Flash without doing anything special to configure caching and noting that it comes back with a "cachedContentTokenCount" on many of the responses.
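A rough sketch of that check: given a response body already parsed into a dict, report what share of the prompt was served from cache. Only `cachedContentTokenCount` is named in the comment above; the surrounding `usageMetadata` / `promptTokenCount` shape and the helper name `cached_fraction` are assumptions for illustration, and the sample payloads are invented.

```python
def cached_fraction(response: dict) -> float:
    """Share of prompt tokens reported as cached (0.0 when nothing was cached)."""
    usage = response.get("usageMetadata", {})          # assumed container name
    prompt_tokens = usage.get("promptTokenCount", 0)   # assumed field name
    cached_tokens = usage.get("cachedContentTokenCount", 0)  # missing on a miss
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens / prompt_tokens


# Invented sample payloads, shaped like the assumption above.
sample_hit = {
    "usageMetadata": {
        "promptTokenCount": 12_000,
        "cachedContentTokenCount": 9_000,
        "candidatesTokenCount": 350,
    }
}
sample_miss = {
    "usageMetadata": {"promptTokenCount": 12_000, "candidatesTokenCount": 350}
}

print(cached_fraction(sample_hit))   # share of the prompt served from cache
print(cached_fraction(sample_miss))  # no cached tokens reported
```

Logging this number per request is one way to see whether implicit caching is actually kicking in for a given workload.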

The "storage price" quoted is for an optional Gemini feature that most people don't care about: https://ai.google.dev/gemini-api/docs/caching#explicit-cachi...

__jl__6d ago

In our experience, caching is not very reliable with google. We always get random cache misses that don't happen with other providers. We find OpenAI, Anthropic and Fireworks (which we use a lot) all have higher cache hit rates. So it's not only about the costs of cached token but also what kind of cached hit rate you get.
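To make the hit-rate point concrete, here is a minimal sketch of the blended input price as a function of cache hit rate. It uses the $1.50 input and $0.15 cached prices mentioned in this thread; the hit rates and the function name `effective_input_price` are illustrative assumptions, and storage fees for explicit caching are ignored.

```python
def effective_input_price(input_price: float, cached_price: float, hit_rate: float) -> float:
    """Blended USD price per 1M input tokens for a given cache hit rate (0..1)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price + (1.0 - hit_rate) * input_price


INPUT_PRICE = 1.50   # USD per 1M uncached input tokens (from the thread)
CACHED_PRICE = 0.15  # USD per 1M cached input tokens (from the thread)

# Hypothetical hit rates, to show how sensitive the blended price is.
for hit_rate in (0.50, 0.90, 0.95):
    price = effective_input_price(INPUT_PRICE, CACHED_PRICE, hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${price:.4f} per 1M input tokens")
```

Under these numbers, dropping from a 95% to a 90% hit rate raises the blended input price by roughly 30%, so a provider with flakier caching can cost more even at the same list prices.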

svachalek6d ago

In my experience Google is the most flaky in general, which is surprising considering the rock solid history of their search and other products. Just more likely not to respond at all, to give a response out of left field, to handle the same error in 12 different ways randomly (a rainbow of HTTP status codes and error messages), etc etc.

minimaxir6d ago

10% of input pricing is standard especially compared to competition.

himata41136d ago

yah, which means that the input cost is the only value that should be paid attention to at the end + the cache discount (x10). If google would start offering x20 discount it would make it twice as cheap while input and output stayed the same.

John78787816d ago

[deleted]

stri8ed6d ago

Output cost is 3x from Gemini 3 flash.

npn6d ago

The price is crazy.

And I guess Gemini 3.5 pro will have the pricing increment, too. 12 x 5 = 60?

It seems like google does want us to use Chinese models.

brianwawok6d ago

What exactly are you doing with this that you can’t generate $1.50 of value per million tokens?

npn5d ago

I sell service. Imagine my users have to pay 4x more for marginal increment just 'cause.

They are more willing to wait though, so Chinese models are pretty attractive right now.

bel86d ago

Generate 5x more value for the same amount of money.

s3p6d ago

Wrong question.

Right question: What exactly is Google's plan for the long term pricing of these models, and are we all going to be priced out in a year?

noelsusman6d ago

The Artificial Analysis benchmark results are pretty underwhelming. Roughly the same "intelligence" as MiMo-V2.5-Pro for over 3x the cost. We'll have to see how that translates to actual usage but it's not a great sign.

hydra-f6d ago

That really depends on whether they have similar parameter counts, doesn't it? Unless you know that, the comparison is just strange

halJordan6d ago

Bad look to tell people they're not allowed to compare things just because we need to respect Google's privacy

hydra-f6d ago

I didn't take the price into consideration when writing that. I meant to point out that even if they have similar scores, the Flash model might be smaller than MiMo or Kimi, which would by itself be a win

That said, haste makes waste as the price point completely invalidates that

noelsusman6d ago

I don't know why a user should care at all about parameter counts. All that matters is performance and cost.

aliljet6d ago

Is there a good benchmark tracking hallucinations? The models are all incredibly good now, even the open ones, and my hope is that the rate of hallucinations is something that's falling off in concert with larger and larger context lengths.

WarmWash6d ago

People complain about them incessantly, but I can almost never get people to actually post receipts. Every provider allows sharing chats, and anyone can share a prompt that reliably produces hallucinations.

More often than not, people are using images in responses that go awry. Which is fair, the models are sold as multi-modal, but image analysis is still at gpt-4.0 text-analysis levels.

Also knowledge cutoff issues, where people forget the models exist months to a year or more in the past.

hibikir6d ago

I see constant hallucination in Claude Code when using specific tooling. It thinks it knows the AWS CLI, for instance, but in 4.6 and 4.7 it keeps trying to use flags that don't exist. When asked about it, it says that yes, the flag doesn't exist in that command but exists in a different command (which it does), and yet it still attempts to use it without extra info.

Claude also believes it knows how AWS' KMS works, quite confidently, while getting things wrong. I have a separate "this is how KMS replication actually works" file just to deal with its misconceptions.

For Gemini, I typically use it to query information from large corpora, but it often web searches and hallucinates instead of reading the actual corpus. On a book series, it will hallucinate chapters and events which, while reasonable and plausible, do not exist. "Go look at the files and see if your reference is correct" shows that it's not correct, and it's a mandatory step. That doesn't prevent hallucination, though; it only makes sure you catch it after the fact, just like a method in a class that doesn't exist gets found out by the compiler. The LLM still hallucinated it.
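The "found out by the compiler" step has a rough analogue in dynamic languages: check that a suggested name really exists before trusting it. A minimal sketch, using Python's standard `json` module as the stand-in library; the helper name `api_exists` is hypothetical.

```python
import importlib


def api_exists(module_name: str, attr_path: str) -> bool:
    """Return True if module_name can be imported and attr_path resolves on it."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


print(api_exists("json", "loads"))               # a real function
print(api_exists("json", "parse"))               # plausible-sounding, not in this module
print(api_exists("json", "JSONDecoder.decode"))  # a real method on a real class
```

Like the compile step, this only catches names that don't exist. It says nothing about a real function being used with the wrong semantics.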

asdfasgasdgasdg6d ago

https://gemini.google.com/share/9cd8ca68025a

I was trying to understand a game I've been playing, The Last Spell. I asked it for a tier list of omens -- which ones the community considers most important. At least a few of the names it posts are hallucinated ("omen of the sun" does not exist, and the omens that give extra gold are "savings," "fortune," and "great wealth").

Obviously not a critical use case but issues like this do keep me on my toes regarding whether the thing is working at all. I should ask 3.5 flash to do the same job. (I did try and it once again hallucinated the omen names and some of the effects.)

hamdingers6d ago

I can reliably produce hallucinations with this genre of prompt: "write a script that does <simple task> with <well known but not too-well-known API>." Even the frontier models will hallucinate the perfect API endpoint that does exactly what I want, regardless of if it exists.

The fix is easy enough though, a line in my global AGENTS.md instructing agents to search/ask for documentation before working on API integrations.

sapneshnaik6d ago

Yeah. Better to have more details in your prompt than fewer. For example, I use this:

```
Build a Nango sync that stores Figma projects.
Integration ID: figma
Connection ID for dry run: my-figma-connection
Frequency: every hour
Metadata: team_id
Records: Project with id, name, last_modified
API reference: https://www.figma.com/developers/api#projects-endpoints
```

Note: You do need a Nango account and the Nango Skill installed before it could work.

Corence6d ago

https://gemini.google.com/share/3717c8505d6b

Two of the three strip titles are hallucinated and two of the three strips are bad examples. Haley is mute in strip 403 and does nothing. Strip 578 is the start of the arc that shows the behavior Gemini is talking about, but has things going wrong so it's not a good example either.

Claude picks a good strip but also hallucinates the strip title: https://claude.ai/share/56be379d-c3da-443e-b60f-2d33c374eba8

brooksc6d ago

I asked gemini 3.1 Pro to search for the linkedin URLs for a list of peers. It generated a plausible list of links -- but they were all hallucinated. On a follow up it confirmed it couldn't actually search, but didn't tell me that without prompting.

rjh296d ago

"People complain about them incessantly, but I can almost never get people to actually post receipts."

...my chats are all pretty long and involve personal conversations, or I've deleted them. It's a lot to ask for someone to post receipts. The number of complaints is enough data.

No matter how big the model is there will be edge cases where it has no data or is out of date. In these cases it just makes stuff up. You can detect it yourself by looking for words like usually or often when it states facts, e.g. "the mall often has a Starbucks." I asked it about a Genshin Impact character released in June 2025 and it consistently interpreted the name (Aino) as my player character because Aino wasn't in its data.

Honestly I'm surprised you haven't encountered it if you're using it more than casually. Pro is much better but not perfect.

ls6126d ago

Claude has gotten good in the past month or two at recognizing when it might need to search the web for updated info rather than saying that it has no idea what I'm talking about or making stuff up.

krupan6d ago

Are the knowledge cut off issues well known? I don't remember seeing them prominently displayed.

Also, prompts that reliably produce hallucinations is kind of a hard ask. It's inconsistent. One day the LLM I work with quotes verbatim from the PCIe spec and it's super helpful. The next day it gives me wrong information and when I ask it what section of the spec that information comes from it just makes up a section number

saberience6d ago

I see hallucinations ALL the time. It's only obvious when you're prompting about a subject you know well.

And when I say all the time, I mean it, and this is for Opus 4.7 Adaptive.

I often have to say, please do searches and cite sources, as if it doesn't it will confidently give me wrong or outdated information.

If you're often asking questions about a topic that's not in your specialist knowledge you won't notice them.

droidjj6d ago

Hallucination is also much better controlled in the context of agentic coding because outputs can be validated by running the code (or linters/LSP). I almost never notice hallucinations when I’m coding with AI, but when using AI for legal work (my real job) it hallucinates constantly and perniciously because the hallucinations are subtle—e.g., making up a crucial fact about a real case.

NothingAboutAny5d ago

For coding, the worst I've seen recently is Gemini using or suggesting library methods that don't exist in C#, which it catches when it builds the project (something I've told it to do to catch these).

But for research it makes shit up all the time. I asked GPT5.5 to make me a build for Rogue Trader, and not only did it use out-of-date info, it made up a bunch of skills that were NEVER in the game. I attribute that to there not being enough online information in the wikis or wherever. I wish it would just say "I don't know" instead of hallucinating, but I know that's not how the tech works.

vitorgrs6d ago

Just ask any real question about stuff. LLM is not about code only...

throawayonthe6d ago

well there is https://artificialanalysis.ai/evaluations/omniscience

goldenarm6d ago

It's a gibberish input detection benchmark, and does not measure output hallucinations.

vlmutolo6d ago

> While OpenAI originally pioneered Codex (which went on to power GitHub Copilot), Google’s direct answer for dedicated, native code completion and natural-language-to-code generation is CodeGemma.

https://g.co/gemini/share/33e7a589a161

deaux5d ago

Nothing about this is a hallucination. The Codex that it talks about is real, existed, and did go on to power the original Copilot. You neither specified that you meant a different Codex, nor did it make anything up. The CodeGemma isn't made up either, as its referenced working link shows.

Sevii6d ago

I haven't been bothered by hallucinations in premier models since early last year. Still see it in smaller local models though.

aliljet6d ago

I'm really running into this deep at the edges of content creation. Take, for example, a need to generate some kind of legal work. The cost of painstakingly checking and rechecking each case cited is reducing the value of these frontier models immensely.

Coding, however, is solved like magic. Easier to add tests, to be fair.

yieldcrv6d ago

if last year's models were the ones people got familiar with in late 2022, hallucinations would be an underrepresented rumor, there would be no articles about it because it's so rare. overconfident lawyers wouldn't have messed up dockets in court with fake case law. in other domains that move faster, sources would be only partially outdated, with agentic search and mcp servers filling in the gaps

AI psychosis would be the problem people talk about more, not just outright agreement but subtle ways of making you feel confident in your ideas. "yes, buy that domain name buy these other ones for defensibility"

(the domain name is dumb and completely unmarketable)

jampekka6d ago

The models still hallucinate badly when called via APIs, especially if web search is not enabled. Gemini hallucinates quite frequently even with the app and search enabled. More recent (e.g. ChatGPT 5.x and Deepseek v4) prompts/harnesses search very aggressively, which does greatly mitigate hallucinations.

schneehertz5d ago

Victim of LLM hallucinations, poor guy

majso6d ago

maybe something like this? https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

krupan6d ago

It really depends what you are asking it. If the answer is in the training data, then the odds of it lying to you are much lower than if you are asking it for something it has never seen before.

FergusArgyll6d ago

As long as the model uses web search, they almost never hallucinate anymore. The fast models (haiku, gpt-instant, flash) still sometimes have the problem where they don't search before answering so they can hallucinate

goldenarm6d ago

I've seen ChatGPT and Gemini hallucinate even from web search; it's better but not sufficient

golfer6d ago

Here's the benchmark scoreboard they published:

https://storage.googleapis.com/gweb-uniblog-publish-prod/ori...

mixtureoftakes6d ago

benchmarks look REALLY good, the price hike is big but it also beats sonnet 4.6 in every discipline?

andrewstuart6d ago

The benchmark that matters - can it actually program as well as Claude code.

If not then I’m not using it.

Cancelled my account 3 months ago, only Claude code level capability would bring me back.

cmrdporcupine6d ago

I spent 10 minutes with it in their new "agy" CLI tool and immediately found it is nowhere close to GPT 5.5 high in codex. It was sloppy and made poor assumptions in its analysis. It would have produced a mess if I let it go ahead with its plan. And it was just like previous versions of Gemini with poor tool use (e.g. "I wrote a file with the plan..." but file was never written.)

For reference, this is a Rust codebase, deep "systems" stuff (database, compiler, virtual machine / language runtime)

They're still months behind OpenAI and Anthropic on coding.

Mind you I also find Claude Code careless and unreliable these days, too. (But it's good at tool use at least).

I do use Gemini for "lifestyle" AI usage (web research etc) tho.

eis6d ago

3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite. $1551 for 3.5 Flash [0] vs $892 for 3.1 Pro [1]. That's 74% more cost while ranking lower. It's 2.5x as fast but I don't think the bang for the buck is there anymore like it was with 3.0 Flash. I'm a bit bummed out to be honest.

I did not expect such a huge (3x) price increase from 3.0 Flash and I bet many people will not just blindly upgrade as the value proposition is widely different.

One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.

[0] https://artificialanalysis.ai/models/gemini-3-5-flash

[1] https://artificialanalysis.ai/models/gemini-3-1-pro-preview

hedora6d ago

Ouch. That's going in completely the wrong direction.

How many people complain that we have too much low quality AI output for humans to read, let alone evaluate vs. how many people are complaining that they want higher quality, more trustworthy output?

ekojs6d ago

Seems like the only good thing about 3.5 Flash is its speed. Not cost-competitive or benchmark-leading by any means.

pingou6d ago

How do they calculate that?

3.1 has 57M output tokens from Intelligence Index, 3.5 Flash has 73M, so not a lot more, and 3.5 is a bit cheaper, I don't get how 3.5 can be 74% more expensive.
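A back-of-the-envelope version of that question, as a sketch. The output token counts (57M and 73M) and eval totals ($892 and $1551) are the figures quoted in this thread; the per-million output prices ($12 for 3.1 Pro, $9 for 3.5 Flash) are also taken from other comments here and are assumptions rather than verified list prices.

```python
def output_cost(output_tokens_millions: float, price_per_million: float) -> float:
    """USD cost of the output tokens alone."""
    return output_tokens_millions * price_per_million


models = {
    # name: (output tokens in millions, USD per 1M output tokens, reported eval total in USD)
    "Gemini 3.1 Pro": (57, 12.00, 892),
    "Gemini 3.5 Flash": (73, 9.00, 1551),
}

for name, (tokens_m, price, reported_total) in models.items():
    out = output_cost(tokens_m, price)
    remainder = reported_total - out
    print(f"{name}: output ~${out:.0f}, remainder not explained by output ~${remainder:.0f}")
```

Under these assumptions the output-token cost is about the same for both models, so the gap in the reported totals would have to come from something else (input tokens, caching behaviour, or how the eval is run). This sketch cannot say which.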

knollimar6d ago

Only speculation but cache maybe?

ls_stats6d ago

>3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite

That's everything I needed to know.

mijoharas6d ago

That's what I came here to check. Last model release they only put it into preview[0] at first.

Does that mean this model is production ready?

[0] https://news.ycombinator.com/item?id=47076484

hubraumhugo6d ago

Just updated my HN Wrapped project with it and it does well on my totally unscientific LLM humor benchmark: https://hn-wrapped.kadoa.com

amarant6d ago

Lol, nice project! I liked the xkcd-style comic the most!

I'm only gonna cry a little bit about the all-too-accurate roasts. Some of that stuff cut deep!

harias5d ago

The xkcd comic is a really cool idea. I enjoyed seeing my wrapped, thanks!

simianwords6d ago

No one talking about how this flash Beats Pro? Imagine what 3.5 pro looks like?

Also concerned about Gemini models being benchmaxxed generally

NitpickLawyer6d ago

> concerned about Gemini models being benchmaxxed generally

I would say they are the least benchmaxxed out of all the top labs, for coding. They've always been behind opus/gpt-xhigh for agentic stuff (mostly because of poor tool use), but in raw coding tasks and ability to take a paper/blog/idea and implement it, they've been punching above their benchmarks ever since 2.5. I would still take 2.5 over all the "chinese model beats opus" if I could run that locally, tbh.

computerex6d ago

I have never had good experience with any Google models in coding. Particularly for coding hard stuff, there is a night and day difference between Opus/Gemini in my experience.

alexdns6d ago

It's Gemini 3.5 Flash

nerdalytics6d ago

Yeah, Google chose a misleading title for the blog post.

jader2016d ago

> Today, we’re introducing Gemini 3.5, our latest family of models combining frontier intelligence with action. This represents a major leap forward in building more capable, intelligent agents. We’re kicking off the series by releasing 3.5 Flash.

nerdalytics5d ago

paragraph vs title

nightski6d ago

AI being a product is not the future. It's more like an operating system that deserves to be open and free (aka Linux). Unless that happens we are in for a very dystopian future. I wish I had the intelligence, resources and/or connections to try and make that happen.

lugu6d ago

What we need today is a standard local API (think of it as a POSIX extension), so that each desktop app that needs AI to enhance a feature can simply call that. This way, those apps will need to handle the case where AI is not available. This will empower users.
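A toy sketch of what "handle the case where AI is not available" could look like from an app's side. Everything here is hypothetical: `LocalAI`, `get_local_ai`, and `summarize` are invented names for illustration, not an existing OS or POSIX interface.

```python
from typing import Optional, Protocol


class LocalAI(Protocol):
    """Hypothetical system-provided text completion interface."""

    def complete(self, prompt: str) -> str: ...


def get_local_ai() -> Optional[LocalAI]:
    """Stand-in for a system lookup; returns None when no provider is installed."""
    return None


def summarize(text: str, ai: Optional[LocalAI]) -> str:
    """Use the local model when present; otherwise fall back to a plain heuristic."""
    if ai is None:
        # Non-AI fallback: keep only the first sentence.
        return text.split(".")[0].strip() + "."
    return ai.complete(f"Summarize in one sentence: {text}")


class FakeAI:
    """Tiny fake provider so the AI path can be exercised without a model."""

    def complete(self, prompt: str) -> str:
        return "(model summary)"


note = "The build failed on CI. Logs point at a missing dependency."
print(summarize(note, get_local_ai()))  # fallback path
print(summarize(note, FakeAI()))        # provider path
```

The design point is that the feature degrades instead of disappearing: the app works without a provider and gets better when one is present.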

charcircuit6d ago

All major operating systems Windows, macOS, iOS, and Android have local APIs for using AI.

hedora6d ago

Why would I use those instead of just grabbing a model from hugging face? Are they as good as qwen 30B?

HardCodedBias6d ago

Oh boy.

GDM is making (or has been backed into a corner into making) the bet that high throughput, low latency, low capability models are the path forward.

That probably works for vibe coded apps by non-practitioners.

I suspect that practitioners/professionals will wait longer for better results.

brokencode6d ago

Where do you see that it’s low capability?

And Google is trying to make something affordable enough for a mass market, ad-supported audience.

They aren’t hyper focused on enterprise like Anthropic is. And that’s okay. There’s room for different players in different markets.

hedora6d ago

Price up (cost up?), benchmarks down. Latency down.

So, who is this for? People that want more ads and worse output, but want it faster? Sounds pretty awful to me.

llmslave6d ago

Conspiracy theory:

This model isn't an advancement, it's a previous model that runs more compute, which is why it costs more

npn6d ago

Nah, it costs what you are willing to pay.

bakugo6d ago

Triple the price of the last Flash model ($3 -> $9 per 1M output). Quickly approaching Sonnet prices.

Feels like the AI pricing noose is tightening sooner rather than later.

cesarvarela6d ago

Add Flash to the title, please.

meetpateltechOP6d ago

edited it.

warthog6d ago

GPT-5.5 still seems to perform better than this on the benchmarks

Plus the vibe of the Gemini models is so weird, particularly when it comes to tool calling

At this point I kinda need them to shock me to make the switch

benbencodes6d ago

Pricing is now live on ai.google.dev/pricing:

Gemini 3.5 Flash: $0.75 input / $4.50 output per 1M tokens, 1M context window. The output price explicitly "includes thinking tokens" — which is why it's higher than a typical flash-class model.

For comparison within the Gemini lineup:

- Gemini 2.5 Flash: $0.30 / $2.50

- Gemini 3.1 Flash-Lite: $0.25 / $1.50

- Gemini 3.1 Pro Preview: $2.00 / $12.00

So 3.5 Flash is ~2.5x more expensive input vs 2.5 Flash. The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization.

lyjackal6d ago

You’re quoting the batch pricing. On demand is 1.5 per input and 9 per M output. This is effectively comparable cost to Gemini 2.5 Pro in a flash tier model

conorh6d ago

I think you have your pricing wrong there, Gemini 3.5 flash is $1.50 input and $9 output.

mchusma6d ago

Okay, it's kind of somewhere between haiku and sonnet level pricing, at somewhere between sonnet and opus level performance. It's a great option. I was hoping to see opus class intelligence at haiku level pricing out of google, and this is close to that!

mchusma6d ago

Never mind, after looking at more benchmarks, seems closer to sonnet level intelligence at slightly lower cost. Speed is great for latency sensitive applications, but if this was 1/2 the cost it would have been priced to win.

If this is the big model release out of google, it's a disappointment.

ls_stats6d ago

You are seeing batch inference, standard inference is $1.5/$9. I was excited until I saw that price.

jpau6d ago

Standard pricing is showing for me as $1.50 / $9.

(I suspect you're viewing the "flex" pricing).

Tiberium6d ago

Please delete/edit your AI-written and factually wrong post.

MallocVoidstar6d ago

In addition to people pointing out your LLM got the pricing wrong,

> The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization

Every Gemini model starting with 2.5 has been a reasoning model.

GodelNumbering6d ago

Per million input/output tokens:

Gemini 2.5 flash: $0.30/$2.50

Gemini 3.0 flash preview: $0.50/$3.00

Gemini 3.5 flash: $1.50/$9.00

Interesting pricing direction. I don't think we have ever seen a 3x price increase for the immediate next same-sized model (and lol @ 3 only ever getting a preview).

3.5 flash costs similar to Gemini 2.5 pro which was $1.25/$10

__jl__6d ago

This understates the cost increase. 3.5 Flash also uses more tokens. artificialanalysis.ai shows these differences in the cost to run the whole eval, which I think is more realistic pricing:

Gemini 2.5 flash (27 score): $172 (1.0x)

Gemini 2.5 pro (35 score): $649 (3.8x)

Gemini 3.0 Flash (46 score): $278 (1.6x)

Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)

This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
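The multipliers quoted above can be re-derived from the eval costs in the same comment. A minimal sketch, assuming the dollar figures are copied correctly from artificialanalysis.ai; the dictionary and the helper name `multiple` are illustrative only:

```python
# Cost in USD to run the whole eval, as quoted in the comment above.
eval_cost = {
    "Gemini 2.5 Flash": 172,
    "Gemini 2.5 Pro": 649,
    "Gemini 3.0 Flash": 278,
    "Gemini 3.5 Flash": 1552,
}


def multiple(model: str, baseline: str) -> float:
    """How many times more expensive `model` is than `baseline`."""
    return eval_cost[model] / eval_cost[baseline]


for name in eval_cost:
    print(f"{name}: {multiple(name, 'Gemini 2.5 Flash'):.1f}x vs 2.5 Flash")

print(f"3.5 Flash vs 2.5 Pro:   {multiple('Gemini 3.5 Flash', 'Gemini 2.5 Pro'):.1f}x")
print(f"3.5 Flash vs 3.0 Flash: {multiple('Gemini 3.5 Flash', 'Gemini 3.0 Flash'):.1f}x")
```

Rounded to one decimal, the ratios match the figures in the comment (9.0x, 3.8x, 1.6x, 2.4x and 5.6x).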

bnug5d ago

joshmlewis5d ago

xdertz5d ago

the era of subsidised ai is ending

driverdan5d ago

API calls have never been subsidized, only subscriptions.

kzrdude5d ago

AI is getting really useful, might be why

ahknight5d ago

Sonnet-level performance at Haiku prices. They know what they have and who the audience is they want.

ashirviskas5d ago

Gemini 2.0 Flash: $19

ahknight5d ago

... and you get what you pay for. Or less.

doginasuit6d ago

opsnooperfax6d ago

tskj5d ago

I don't understand why everyone thinks there is a ceiling below human-level intelligence, when we have an existence proof that human-level intelligence is possible.

lanthissa6d ago

switching models is insanely cheap compared to token cost on anything significant, this is a take so cynical it misses the reality

Clueed6d ago

hnarn6d ago

> now that they have people who built services on their API

People really can’t wait to be the next Zynga

rudedogg6d ago

If Google is actually getting cheaper inference than everyone else with their TPUs, this smells like trouble to me. Maybe serving LLMs at a profit is proving difficult.

Or maybe they think because their benchmarks are good they can ramp up the prices. Seems like they don’t have the market share to justify a move like that yet to me.

tempaccount4206d ago

This is not priced at inference cost.

My guess: it's the price at which they make more money than if they rent the TPUs to other companies.

gpm6d ago

The price at which they could rent out the TPUs, i.e. the market rate, is the inference cost.

Just because you are vertically integrated doesn't mean you get to discount one business unit's products to another. Doing so ignores the opportunity cost you pay and is just bad accounting.

spyckie26d ago

It's probably that in 1 or 2 years local (free) models will completely take the place of cheap models so cheap models need to move up the quality chain.

You have free local models for most tasks, $20 subscriptions for near-frontier intelligence, and API per token costs for frontier intelligence.

Flash seems to be targeting the near-frontier category.

TurdF3rguson6d ago

That might work if it wasn't for FOMO. Are you ok with only $20 of frontier usage a month?

booty6d ago

Prevailing wisdom is that serving LLMs at a profit is achievable... it's when you factor in the cost of training them that prices get astronomical real fast.

Open-source model inference providers (who do not have to bear the cost of training) seem able to do it at much lower prices.

https://www.together.ai/pricing

https://fireworks.ai/pricing#serverless-pricing (scroll down to headline models)

...my opinions here are of course, conjecture built on top of conjecture....

eklitzke6d ago

HDBaseT6d ago

Not to discredit you, because you are 100% correct, but a tangential note about together.ai: they seem fairly unreliable, with constant outages or higher-than-normal latency.

BoorishBears6d ago

This is trouble if you're not Google/OpenAI/Anthropic: they're all shifting towards pricing for the economic value of the knowledge work they're aiding.

The economic value increases non-linearly as models get more intelligent: being 10% more capable unlocks way more than 10% in downstream value.

That's trouble because the non-linear component means at some point their margins will stop being primarily defined by the cost of compute, and start being dominated by how intelligent the model is.

At that point you can expect compute prices to skyrocket and free capacity to plummet, so even if you have a model that's "good enough", you can't afford to deploy it at scale.

tskj5d ago

IncreasePosts6d ago

Maybe the margins are just very large for Google because they predict so much demand for 3.5?

GodelNumbering6d ago

This combined with locally runnable models getting pretty good recently (e.g. Qwen 3.6) tells me that it's time to seriously consider local dev setup again

hei-lima6d ago

We need another "Deepseek moment" or else it will become impossible for the regular dude to use AI. It will become something that only big companies can afford.

SwellJoe6d ago

We're having DeepSeek moments every couple of weeks.

Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.

And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.

Zambyte6d ago

> It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

trollbridge6d ago

squidbeak6d ago

Deepseek had another moment a few weeks ago. V4 isn't far behind the US frontier, and so far its flash variant seems a very reliable coder and costs a pittance.

ai_fry_ur_brain6d ago

Deepseek V4 (not flash) tripled in price too, by the way (from Deepseek). Get used to this pattern.

xbmcuser6d ago

throwa3562626d ago

To be honest, China not having access to the latest hardware is exactly what has driven LLM technology forward the last 2 years.

blackoil6d ago

Open Source ASML EUV. But will wipe off trillions from US stocks so 401k may not like that.

stared6d ago

We have a "DeepSeek moment", https://github.com/antirez/ds4 (see https://news.ycombinator.com/item?id=48142108).

Or if you prefer smaller ones, Qwen3.6-35B-A3B, https://huggingface.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF

Bombthecat5d ago

Can you run a coal power plant in your backyard? Or a giant solar power farm?

Of course not

And you don't need to

segmondy6d ago

You can use lots of open weight models today.

hei-lima6d ago

Gigachad6d ago

The real problem is the hardware to run them is still very expensive.

pianopatrick6d ago

Maybe we can figure out better ways to use the models that can run on cheap hardware.

GeorgeOldfield6d ago

gemini isn't even that good. just tested 3.5 on usual complex prompts to opus/chat 5.5. meh

k8sToGo6d ago

Are you really comparing flash to opus? Shouldn't you be comparing pro?

bachmeier6d ago

Who would have guessed that something costing roughly a third as much wouldn't do as well at certain tasks.

kmac_6d ago

jstummbillig5d ago

> Interesting pricing direction.

Is it? More capability, more demand, higher price. Seems relatively uninteresting. The naming structure complicates it: 3.5 Flash is less comparable to 3.0 Flash than it is to 3.0 Pro.

fnordsensei6d ago

3.5 flash is listed as stable rather than preview, or am I misreading?

https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...

GodelNumbering6d ago

ah I mistakenly wrote preview

dr_dshiv6d ago

3.1 flash lite — $0.25/$1.50 — plus insanely fast.

3.1 flash lite isn’t quite as good as 3 flash preview (which is the most incredible cheap model… I really love it) — but 3.1 is half the price and the insane speed opens up different use cases.

For comparison, Opus models are $5/$25

SwellJoe6d ago

dr_dshiv5d ago

Definitely apples to oranges, sorry I wasn’t clear. I only included opus pricing for comparison—it is vastly superior. But even 3.1 flash lite is really useful.

Of course, if I manage to reach my limits every week on my Claude $200 sub, opus 4.7 is probably priced closer to flash!

WarmWash6d ago

>Opus 4.7 is smarter than even Gemini 3.1 Pro on nearly every metric,

Outside of coding, claude models are pretty meh. GPT and Gemini are the workhorses of science/math/finance.

OakNinja6d ago

To be fair, Gemini 3.1 flash _lite_ supports structured output (guaranteed json), it’s super fast, runs circles around 2.5 flash and costs $0.25/$1.50.

I use it _a lot_ and it’s very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.

That said, I think we’ll see the lite/flash models and the pro models will diverge more price wise. The pro models will become more and more expensive.

drob5185d ago

WhitneyLand6d ago

Their rationale might be that its size and intelligence are growing relative to the market.

Fwiw it’s beating Claude Sonnet in most benchmarking (benchmaxxing?), yet they’ve priced it almost half off on a per token basis.

Question is are you going to persuade anyone with this argument?

Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

SyneRyder6d ago

> Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

A few weeks ago, Steve Yegge claimed he'd heard that Google employees are banned from using Claude & Codex.

https://x.com/Steve_Yegge/status/2046260541912707471

A number of Googlers replied to say that was totally false, including Demis Hassabis, but they were all on the DeepMind team.

https://x.com/demishassabis/status/2043867486320222333

This person here claims they left Google because of the ban, and because the ban applied outside of Google work as well:

https://x.com/mihaimaruseac/status/2046272726881693960

myko6d ago

> and because the ban applied outside of Google work as well

I think false (or hasn't filtered to everyone lol)

davedx5d ago

I use Gemini for heavy web scraping-adjacent API work. Web grounding has been super useful for the project.

dbbk6d ago

I don't think they're really comparable. Seems they created the Flash-Lite tier to take the spot of the old Flash models.

GodelNumbering6d ago

No, 2.5 had both flash and flash lite.

mlmonkey6d ago

It is Google, after all ....

photonair6d ago

harrouet5d ago

If you look at the benchmark, the model is not particularly good at coding, and as you point out it costs 3x the price of the previous flash models. So what is the market for it?

LetsGetTechnicl6d ago

Gen AI is unprofitable, especially at the insanely cheap rates they've been offering to get people in the door. So expect more increases in the future.

roadside_picnic6d ago

These companies are unprofitable (as all companies at this stage and ambition should be) but I increasingly don't see any justification for the idea that it is fundamentally unprofitable.

This isn't like Uber or Doordash where the economics fundamentally don't make sense (referring to the early days of these services where rates were very cheap).

overrun116d ago

ReliantGuyZ6d ago

And if you can run those strong models at home for free, why would hosting them be a successful business for any of these providers?

Profitable maybe, in terms of having low costs, but why pay Google or whoever when you can do it yourself for cheaper/"free"?

LetsGetTechnicl6d ago

booty6d ago

GaggiX6d ago

If you don't need SOTA or near SOTA there are plenty of dirt cheap models, just look at Gemma 4 31B on Openrouter.

Gigachad6d ago

For all of the use cases being hyped you really do, and you actually need something much better than the SOTA models to do what we are being told can be done.

The small models are useful for small things like summarizing text or search but not much else.

scrollop5d ago

You mean Kimi or qwen

npn6d ago

It is insanely profitable though, if you cut out r&d cost, plus the marketing and loss leaders. Don't let them gaslight you.

Even anthropic who does not own any hardware still have a big margin providing claude models.

LetsGetTechnicl6d ago

Then why haven't they reported any profits using GAAP (generally accepted accounting principles)? They all use ARR which is easily gamed.

timmytokyo6d ago

Everything is insanely profitable if you ignore the costs.

ilia-a6d ago

Yeah, it is a massive jump in price, hardly a "Flash" model anymore... I wonder if they'll release flash lite or something with a bit more affordable price point.

OakNinja6d ago

There’s already a flash lite tier since 2.5. Latest is 3.1 currently.

llm_nerd6d ago

bjoli5d ago

I let it loose on a f# codebase that I know was pretty optimized but with a few low hanging fruit changes that would have a big impact.

3.1 Pro did NOT find them. 3.5 flash did. Plus one I hadn't thought of that may or may not work (which it also pointed out).

I'm pretty impressed.

irthomasthomas6d ago

And they are using this to power search answers?

CooCooCaCha6d ago

I bet the API pricing helps pay for search users

malloryerik6d ago

To me this is almost like a tone-deaf naming change.

Empty Slot (new Pro as Mythos competitor?)

Old Pro -> now Flash

Old Flash -> now Flash Lite

Old Flash Lite -> now Gemma (and not served by Google)

And if we think this way, it's possible that prices are actually falling?

deaux5d ago

> Old Flash Lite -> now Gemma (and not served by Google)

> which is now Gemma territory, and I can't get that served by Google anymore

Gemma is served by Google. They're serving Gemma 4 26B A4B at $0.15/$0.60.

https://console.cloud.google.com/agent-platform/publishers/g...

https://cloud.google.com/gemini-enterprise-agent-platform/ge...

malloryerik5d ago

Ah, thanks!

baq5d ago

Demis is on record saying they need small models on edge devices and if it’s on the edge the weights may as well be public officially.

ashirviskas5d ago

don't forget Gemini 2.0 flash at $0.10/$0.40

verdverm6d ago

At the same time, it is supposedly Gemini 3.1 Pro level at 3/4 the price

and far cheaper than comparable models, Gemini Pro is cheaper than Claude Sonnet (Anthropic still gets to charge a brand premium)

SwellJoe6d ago

That's a lot. DeepSeek v4 Flash is just over a tenth the price, and DeepSeek v4 Pro is roughly the same price (currently heavily discounted, but will be $1.74).

I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices it has to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.

copperx6d ago

They have said AI will be priced like a utility, meaning $100-300 per month or so.

dzhiurgis6d ago

Maybe I'll look at Opus again, but it was just slower, much more expensive and, worst of all, wasn't listening to your instructions.

throwa3562626d ago

Gemini 2.5 flash was the best Gemini model.

Not the most intelligent but perfect balance of cheap, fast and not-too-dumb.

npn5d ago

The 09-2025 preview was awesome.

m3kw96d ago

just subscribe to the plan, cheaper

SXX6d ago

  > Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG

3.5 Flash: Thinking Medium - 7516 tokens

https://gistpreview.github.io/?5c9858fd2057e678b55d563d9bff0...

3.5 Flash: Thinking High - 7280 tokens

https://gistpreview.github.io/?1cab3d70064349d08cf5952cdc165...

3.1 Pro - 28,258 tokens

https://gistpreview.github.io/?6bf3da2f80487608b9525bce53018...

3.1 took 3 minutes of thinking to generate, but it's the only one that got animated movement.
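For a rough sense of what each of these runs costs, token counts can be multiplied by the per-token price. A minimal sketch, assuming the counts above are billed entirely as output tokens and ignoring the (short) input prompt. The $9/M output price for 3.5 Flash comes from this thread; the $12/M figure for 3.1 Pro is an assumption inferred from other comments here ("3/4 the price", "12 x 5 = 60"), not a checked list price.

```python
def output_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of `tokens` output tokens at a given price per million tokens."""
    return tokens * usd_per_million / 1_000_000


# (token count from the comment above, output price in USD per million tokens)
runs = {
    "3.5 Flash (Thinking Medium)": (7_516, 9.00),
    "3.5 Flash (Thinking High)": (7_280, 9.00),
    "3.1 Pro": (28_258, 12.00),  # price is an assumption, see lead-in
}

costs = {name: output_cost_usd(tokens, price) for name, (tokens, price) in runs.items()}

for name, cost in costs.items():
    print(f"{name}: ${cost:.4f}")
```

Under these assumptions the 3.1 Pro run costs roughly five times as much as either 3.5 Flash run, mostly because it emitted almost four times as many tokens.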

SXX6d ago

Gemini 3.1 Flash Lite Thinking High - 2,526 tokens:

https://gistpreview.github.io/?3496285c5dac5ba10ebbc0b201a1a...

Gemini 2.5 Pro - 5,325 tokens:

https://gistpreview.github.io/?cc5e0fefeaaffecd228c16c95e736...

Gemini 2.5 Flash - 7,556 tokens:

https://gistpreview.github.io/?263d6058fe526a62b8f270f0620ec...

Gemma 4 31B IT - 3,261 tokens via AI Studio:

https://gistpreview.github.io/?858a42b96af864859a3b89508619d...

Gemma 4 26B A4B IT - 4,034 tokens via AI Studio:

https://gistpreview.github.io/?4adb7703897e0c6b583f9de928e4a...

SXX6d ago

Gemma 4 E4B it via Edge Gallery on pixel phone:

https://gistpreview.github.io/?da742884e5e830ce71ee4db877519...

OFC this is just for fun, but nevertheless gave me working code on first try.

segmondy6d ago

I'm surprised the "they must have trained for it" camp is not here saying that rubbish.

franze6d ago

Opus 4.7

https://claude.ai/public/artifacts/128ebe5a-add7-406a-9bce-6...

tasuki6d ago

Wow that's terrible. Any idea why?

lpa226d ago

Did you see the other ones? This is very good by comparison.

doubleorseven5d ago

My guess would be that this is just software that doesn't understand how the world works and is only trying to please? idk, maybe I'm wrong

stingraycharles6d ago

I think Anthropic optimizes less for visuals. Also, it’s not that terrible.

abtinf6d ago

hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF @ Q6_K

8112 tokens @ 52.97 TPS, 0.85s TTFT

https://gistpreview.github.io/?7bdefff99aca89d1bc12405323bd4...

Full session: https://gist.github.com/abtinf/7bdefff99aca89d1bc12405323bd4...

Generated with LM Studio on a Macbook Pro M2 Max

https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6...
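As a rough sanity check on the throughput figures above, total generation time can be estimated from the token count, tokens per second (TPS) and time to first token (TTFT). A minimal sketch, assuming the TPS figure is a steady rate measured after the first token; the function name is illustrative:

```python
def wall_clock_seconds(tokens: int, tokens_per_second: float, ttft_seconds: float) -> float:
    """Estimated end-to-end time: wait for the first token, then stream the rest."""
    return ttft_seconds + tokens / tokens_per_second


total = wall_clock_seconds(tokens=8112, tokens_per_second=52.97, ttft_seconds=0.85)
print(f"{total:.0f} s (~{total / 60:.1f} min)")
```

That works out to roughly two and a half minutes for this run on the local machine.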

SXX6d ago

Well, honestly this is quite impressive compared to 3.1 Flash Lite and 2.5 Pro, considering that 2.5 Pro is actually quite good at generating massive amounts of code in one shot.

svnt6d ago

It isn’t animated at all for me?

kingstnap6d ago

It is animated but the viewer is broken for some reason (tested Chrome latest windows).

This one works:

https://www.svgviewer.dev/s/04ipQgsU

SXX6d ago

It is animated, just with no movement like in my 3.5 Flash examples. Try a different browser, unless it's iOS.

vtail6d ago

Here is GPT 5.5 High thinking; I had to add a second follow up prompt "it's not animated though" as the first one was not animated.

https://gistpreview.github.io/?557f979c82701862bc26d24f10399...

vtail6d ago

Here is a GPT 5.5 Extra High with a modified instruction:

It was able to discover and fix an animation bug, but the result is still far from perfect: https://gistpreview.github.io/?029df86d03bfe8f87df1e4d9ed2f6...

hskalin6d ago

Why is it fixated on the front perspective? Interesting choice though, because most humans (and seems like other LLMs too) would pick a side perspective

captn3m06d ago

All three links animate for me.

NitpickLawyer6d ago

I think they mean the boat is moving. In the flash ones the paddles are animated but the boat is stationary for me.

codazoda6d ago

The boat moves in all three for me

r0fl6d ago

It’s shocking how much better 3.1 is than 3.5 flash

The benchmarks used don’t really give a full story

wslh6d ago

[1] https://github.com/htdt/godogen

[2] https://drive.google.com/file/d/1ozZmWcSwieZQG0muYjbj7Xjhhlz...

SXX6d ago

I think it's unreasonable to expect models to generate complex stories in a single prompt since they're trained to be concise, but I tried. This is the prompt, on top of the story, with a request for no control buttons:

   Now think, plan how to tell this story in a cartoon, make scene outline and then generate SVG animation story for "Three Little Pigs" in self contained HTML page. Just single animation no control buttons.

Full prompt in gist comments: https://gist.github.com/ArseniyShestakov/ed9faa53604035005ca...

Actual results for models, one shot:

Gemini 3.5 Flash - Three Little Pigs - 9,050 tokens:

https://gistpreview.github.io/?ed9faa53604035005cae86c63c766...

Gemini 3.1 Pro - Three Little Pigs - 24,272 tokens:

https://gistpreview.github.io/?f506bbfd9b4459c8cd55d89605af8...

Gemini 3 Flash - Three Little Pigs - 5,350 tokens:

https://gistpreview.github.io/?f58eff069cf916031c97d560b0e35...

Gemma 4 31B IT - Three Little Pigs - 5,494 tokens:

https://gistpreview.github.io/?a3aa75abbe8fd7818b73f6fa55ee6...

Gemma 4 26B A4B IT - Three Little Pigs - 6,375 tokens:

https://gistpreview.github.io/?1e631caebeb54f9f0cd6d0e3d4d5e...

segmondy5d ago

This was generated locally with Kimi https://gistpreview.github.io/?d55f07c22d54badc8042a7c8b3785...

no-name-here6d ago

3.1 pro was pretty good among them. (iOS)

ZeWaka6d ago

Wow, Gemini 3.5 Flash surprised me there.

krupan6d ago

These are hilarious. 3.5 Flash Thinking High is the only one that is weirdly deformed (what is going on with the hat in 3.1 Pro??)

stingraycharles6d ago

3.5 Flash definitely got the synth wave vibe preference.

abi6d ago

Your links are broken FYI.

John78787816d ago

They work for me.

TacticalCoder6d ago

They do work here too.

golfer6d ago

Arena.ai:

https://x.com/arena/status/2056793180998361233

h14h6d ago

Given how widely the number of tokens each model uses for a given task varies, "price-per-token" is essentially meaningless when doing this sort of comparison.

ohlookcake5d ago

That graph seems odd. It looks like Gemini 3.5 Flash is not actually on the convex hull, and they forced the 'frontier' to bend inwards to include it
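Whether a point sits on the convex hull of a cost/score plot can be checked mechanically. A minimal sketch using the monotone-chain construction for the upper hull. The data points are NOT the Arena figures from the linked graph; they are the artificialanalysis.ai eval costs and scores quoted earlier in this thread, used purely as stand-in data, so the result says nothing about the Arena chart itself.

```python
Point = tuple[float, float]  # (cost in USD, benchmark score)


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_frontier(points: list[Point]) -> list[Point]:
    """Upper convex hull, from the cheapest point up to the highest-scoring one."""
    hull: list[Point] = []
    for p in sorted(set(points)):
        # Drop the previous point while it lies on or below the line to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    best = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: best + 1]


# Stand-in data: (eval cost, score) pairs quoted earlier in the thread.
models = {
    "Gemini 2.5 Flash": (172, 27),
    "Gemini 2.5 Pro": (649, 35),
    "Gemini 3.0 Flash": (278, 46),
    "Gemini 3.5 Flash": (1552, 55),
}

frontier = upper_frontier(list(models.values()))
on_frontier = [name for name, point in models.items() if point in frontier]
print(on_frontier)
```

A model that is forced onto a drawn "frontier" despite lying below the hull would fail this check; with the stand-in data only Gemini 2.5 Pro falls off the hull.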

OsrsNeedsf2P6d ago

Beats 3.1 Pro for price per token, but artificial analysis is showing it's dumber per token and costs more overall

golfer6d ago

https://x.com/arena/status/2056793180998361233

nicce6d ago

Not sure what to think about this. There isn't even GPT 5.5

sauwan6d ago

Yeah, bummer. I was very excited for this release, but this killed it.

droidjj6d ago

The pricing is an issue.

himata41136d ago

Engineers at Google have publicly stated that the models are too big and are far from their potential. Glad they're being proven right with every release.

They continue to focus on smaller models while openai and anthropic are increasing compute requirements for their SOTA models.

stri8ed6d ago

Given the cost increase associated with this model, and previous model releases, I think the size is trending upwards, not down.

himata41136d ago

The speed says otherwise. I think they're increasing costs since they want to start seeing ROI.

JanSt6d ago

Those are (mostly) new, faster TPU

maipen6d ago

Don’t let that fool you. Google will have SOTA models as big as or even bigger than their competitors.

They are just refining their current models while they finish training the next generation.

They will all come out at about the same time. Anthropic, OpenAi, Google, xAI

ACCount376d ago

Anthropic has been sitting on Mythos for a while now. I guess they don't feel pressured to fuck it ship it until anyone else gets a 10T to work.

throwa3562626d ago

According to people who have access to Mythos, it is slightly worse than GPT-5.5-xhigh. At least for security tasks.

Hold on, I think this claim needs some hard data. Here you go gentlemen:

https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...

abirch6d ago

Anthropic can sell Mythos to Fortune 500 companies and bypass the average user. I'm not sure how much is hype but I see things like this https://blog.cloudflare.com/cyber-frontier-models/

Sevii6d ago

It's doubtful they have the compute to make mythos publicly available even after the SpaceX datacenter deal. And why sell it publicly if people are still willing to pay for Opus 4.7?

outside12346d ago

I suspect that Mythos doesn't have a business model that works

Jabbles6d ago

> Engineers at google have publicly stated that the models are too big and are far from their potential

Can you link to a source?

himata41136d ago

I wish I could, it was one of those youtube podcast type interviews with one of the engineers, there was a lot more shared, but that line stuck with me the most.

Dinux6d ago

Source please, because I don't believe that for one second

howdareme6d ago

Google’s pro models are almost certainly bigger than Openai’s lol

fikama6d ago

Why would that be? I am curious why do you think that.

mnicky6d ago

E.g. because they are behind on research and so must compensate with size to achieve similar level of intelligence. At least this is what I heard.

For intelligence/size only OpenAI and Anthropic are the frontier. Google has more compute so it can compensate for that with size of the models...

ActorNightly6d ago

Because TPUs are more efficient, and it's cheaper for them to field them in higher quantity since they own the chip.

ActorNightly6d ago

I mean, yes and no.

Nobody really knows the answer to which one is more optimal

* Large model trained on a large amount of data across multiple domains, that doesn't need any extra content to answer questions.

* Smaller model that is smart enough to go fetch extra relevant content, and then operate on essentially "reformatting" the context into an answer.

merb6d ago

Still no new processor version for Document AI https://docs.cloud.google.com/document-ai/docs/release-notes that is so weird. (Custom extractor)

It’s not possible to uptrain on preview releases and it did not get that much love for a while.

s3p6d ago

2001zhaozhao6d ago

I think flash just means "fast" now

kilpikaarna6d ago

Real smart. I’ve come to associate ”Flash” with ”useless make-shit-up”, and always look for Thinking/Pro when I see it set. Now, suddenly, there is only Flash?

likium6d ago

My guess is Gemini Pro coming later will be 2x more, bringing it comparable to Opus’s pricing.

toraway6d ago

That would be Flash Lite now, and I'm also interested in the cheaper end of things so kinda disappointed they didn't release 3.5 Flash Lite at the same time...

asar6d ago

$1.5/m input tokens $9/m output tokens

6x the price of 3.1 flash lite

Aunche6d ago

"Flash-Lite" is a different product from "Flash", which is more expensive. They couldn't be more confusing with their naming though, especially since they have 3.1 Pro and not 3.1 Flash non-lite.

WarmWash6d ago

I haven't used 3.5 at all yet, but previous Gemini (and Gemma) models are by far more token-light per task than any other model.

Cost per task is a more productive measure, but obviously a more difficult one to benchmark.

iwhalen6d ago

I wonder why they didn't discuss price in the post?

Compare to the GPT-5.5 announcement: https://openai.com/index/introducing-gpt-5-5/

himata41136d ago

I don't think input/output pricing matters, 90% of the cost is cache. $0.15 is pretty good, but still very expensive.

wolttam6d ago

It depends on the use-case. yes, 90% of cost is cache in agentic coding scenarios (actually 95% in my experience). But not when the model reasons for 200k+ tokens before answering a complex problem.
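The two scenarios above can be compared with a simple cost breakdown. A minimal sketch using the per-million prices mentioned in this thread ($1.50 input, $0.15 cached input, $9.00 output). The token counts are hypothetical, chosen only to illustrate the shape of each workload, and it assumes reasoning tokens are billed at the output rate.

```python
from dataclasses import dataclass

# USD per million tokens, as quoted in this thread.
PRICE_PER_M = {"fresh_input": 1.50, "cached_input": 0.15, "output": 9.00}


@dataclass
class Usage:
    fresh_input: int   # input tokens billed at the full rate
    cached_input: int  # input tokens served from cache
    output: int        # output tokens (assumed to include reasoning tokens)


def cost_breakdown(usage: Usage) -> dict[str, float]:
    """Cost in USD per token category."""
    return {kind: getattr(usage, kind) * price / 1_000_000 for kind, price in PRICE_PER_M.items()}


def share(usage: Usage, kind: str) -> float:
    """Fraction of the total bill attributable to one token category."""
    parts = cost_breakdown(usage)
    return parts[kind] / sum(parts.values())


# Hypothetical agentic coding session: a long context re-read on every turn.
agentic = Usage(fresh_input=200_000, cached_input=40_000_000, output=40_000)
# Hypothetical single hard problem: short prompt, very long reasoning trace.
reasoning = Usage(fresh_input=10_000, cached_input=0, output=200_000)

print(f"agentic: cached reads are {share(agentic, 'cached_input'):.0%} of cost")
print(f"reasoning: output is {share(reasoning, 'output'):.0%} of cost")
```

With these made-up numbers, cache reads dominate the agentic bill while output dominates the reasoning-heavy one, which is the distinction the comment draws.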

himata41136d ago

gemini models solve a problem in 80% less tokens so that's something to think about.

simonw6d ago

Gemini caching is confusing though:

  $0.15 / million tokens
  $1.00 / 1,000,000 tokens per hour (storage price)

simonw6d ago

As far as I can tell Gemini caching DOES work like OpenAI - see implicit caching here: https://ai.google.dev/gemini-api/docs/caching

The "storage price" quoted is for an optional Gemini feature that most people don't care about: https://ai.google.dev/gemini-api/docs/caching#explicit-cachi...
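To see when the optional storage fee would pay for itself, the quoted prices can be turned into a break-even request rate. A minimal sketch with loud assumptions: without an explicit cache every request pays the full $1.50/M input rate (that is, implicit caching is ignored), with one it pays $0.15/M per read plus $1.00/M per hour of storage, and any one-off cost of creating the cache is left out. The function names are illustrative.

```python
INPUT_USD_PER_M = 1.50         # standard input price quoted in the thread
CACHED_USD_PER_M = 0.15        # cached read price quoted above
STORAGE_USD_PER_M_HOUR = 1.00  # explicit-cache storage price quoted above


def hourly_cost_without_cache(prefix_tokens: int, requests_per_hour: float) -> float:
    """Every request re-sends the shared prefix at the full input rate."""
    return prefix_tokens / 1_000_000 * INPUT_USD_PER_M * requests_per_hour


def hourly_cost_with_explicit_cache(prefix_tokens: int, requests_per_hour: float) -> float:
    """Each request reads the prefix at the cached rate, plus one hour of storage."""
    millions = prefix_tokens / 1_000_000
    return millions * CACHED_USD_PER_M * requests_per_hour + millions * STORAGE_USD_PER_M_HOUR


def break_even_requests_per_hour() -> float:
    """Request rate at which storage cost equals the per-read savings."""
    return STORAGE_USD_PER_M_HOUR / (INPUT_USD_PER_M - CACHED_USD_PER_M)


print(f"break-even: {break_even_requests_per_hour():.2f} requests/hour")
print(f"100k-token prefix, 10 req/h, no cache:   ${hourly_cost_without_cache(100_000, 10):.2f}")
print(f"100k-token prefix, 10 req/h, with cache: ${hourly_cost_with_explicit_cache(100_000, 10):.2f}")
```

Under these assumptions the break-even does not depend on prefix size: storage pays off once the cached prefix is reused a bit less than once per hour. If implicit caching already applies the discount, explicit storage buys less than this sketch suggests.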

__jl__6d ago

svachalek6d ago

minimaxir6d ago

10% of input pricing is standard especially compared to competition.

himata41136d ago

John78787816d ago

[deleted]

stri8ed6d ago

Output cost is 3x from Gemini 3 flash.

npn6d ago

The price is crazy.

And I guess Gemini 3.5 pro will have the pricing increment, too. 12 x 5 = 60?

It seems like google does want us to use Chinese models.

brianwawok6d ago

What exactly are you doing with this that you can’t generate $1.50 of value per million tokens?

npn5d ago

I sell service. Imagine my users have to pay 4x more for marginal increment just 'cause.

They are more willing to wait though, so Chinese models are pretty attractive right now.

bel86d ago

Generate 5x more value for the same amount of money.

s3p6d ago

Wrong question.

Right question: What exactly is Google's plan for the long term pricing of these models, and are we all going to be priced out in a year?

noelsusman6d ago

hydra-f6d ago

That really depends on whether they have similar parameter counts, doesn't it? Unless you know that, the comparison is just strange

halJordan6d ago

Bad look to tell people they're not allowed to compare things just because we need to respect Google's privacy

hydra-f6d ago

That said, haste makes waste as the price point completely invalidates that

noelsusman6d ago

I don't know why a user should care at all about parameter counts. All that matters is performance and cost.

aliljet6d ago

WarmWash6d ago

More often than not, people are using images in responses that go awry. Which is fair, the models are sold as multi-modal, but image analysis is still at gpt-4.0 text-analysis levels.

Also knowledge cutoff issues, where people forget the models exist months to a year or more in the past.

hibikir6d ago

asdfasgasdgasdg6d ago

https://gemini.google.com/share/9cd8ca68025a

hamdingers6d ago

The fix is easy enough though, a line in my global AGENTS.md instructing agents to search/ask for documentation before working on API integrations.

sapneshnaik6d ago

Yeah. Better to have more details in your prompt than fewer. For example, I use this:

```
Build a Nango sync that stores Figma projects.
Integration ID: figma
Connection ID for dry run: my-figma-connection
Frequency: every hour
Metadata: team_id
Records: Project with id, name, last_modified
API reference: https://www.figma.com/developers/api#projects-endpoints
```

Note: You do need a Nango account and the Nango Skill installed before it could work.

Corence6d ago

https://gemini.google.com/share/3717c8505d6b

Claude picks a good strip but also hallucinates the strip title: https://claude.ai/share/56be379d-c3da-443e-b60f-2d33c374eba8

brooksc6d ago

rjh296d ago

"People complain about them incessantly, but I can almost never get people to actually post receipts."

...my chats are all pretty long and involve personal conversations, or I've deleted them. It's a lot to ask for someone to post receipts. The number of complaints is enough data.

Honestly I'm surprised you haven't encountered it if you're using it more than casually. Pro is much better but not perfect.

ls6126d ago

Claude has gotten good in the past month or two at recognizing when it might need to search the web for updated info rather than saying that it has no idea what I'm talking about or making stuff up.

krupan6d ago

Are the knowledge cut off issues well known? I don't remember seeing them prominently displayed.

saberience6d ago

I see hallucinations ALL the time. It's only obvious when you're prompting about a subject you know well.

And when I say all the time, I mean it, and this is for Opus 4.7 Adaptive.

I often have to say, please do searches and cite sources, as if it doesn't it will confidently give me wrong or outdated information.

If you're often asking questions about a topic that's not in your specialist knowledge you won't notice them.

droidjj6d ago

NothingAboutAny5d ago

For coding, the worst I've seen recently is Gemini using or suggesting library methods that don't exist in C#, which it catches when it builds the project (something I've told it to do to catch these).

vitorgrs6d ago

Just ask any real question about stuff. LLM is not about code only...

throawayonthe6d ago

well there is https://artificialanalysis.ai/evaluations/omniscience

goldenarm6d ago

It's a gibberish input detection benchmark, and does not measure output hallucinations.

vlmutolo6d ago

> While OpenAI originally pioneered Codex (which went on to power GitHub Copilot), Google’s direct answer for dedicated, native code completion and natural-language-to-code generation is CodeGemma.

https://g.co/gemini/share/33e7a589a161

deaux5d ago

Sevii6d ago

I haven't been bothered by hallucinations in premier models since early last year. Still see it in smaller local models though.

aliljet6d ago

Coding, however, is solved like magic. Easier to add tests, to be fair.

yieldcrv6d ago

(the domain name is dumb and completely unmarketable)

jampekka6d ago

schneehertz5d ago

Victim of LLM hallucinations, poor guy

majso6d ago

maybe something like this? https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

krupan6d ago

It really depends what you are asking it. If the answer is in the training data, then the odds of it lying to you are much lower than if you are asking it for something it has never seen before.

FergusArgyll6d ago

goldenarm6d ago

I've seen ChatGPT and Gemini hallucinate even from web search. It's better, but not sufficient.

golfer6d ago

Here's the benchmark scoreboard they published:

https://storage.googleapis.com/gweb-uniblog-publish-prod/ori...

mixtureoftakes6d ago

benchmarks look REALLY good, the price hike is big but it also beats sonnet 4.6 in every discipline?

andrewstuart6d ago

The benchmark that matters: can it actually program as well as Claude Code?

If not then I’m not using it.

Cancelled my account 3 months ago; only Claude Code-level capability would bring me back.

cmrdporcupine6d ago

For reference, this is a Rust codebase, deep "systems" stuff (database, compiler, virtual machine / language runtime)

They're still months behind OpenAI and Anthropic on coding.

Mind you I also find Claude Code careless and unreliable these days, too. (But it's good at tool use at least).

I do use Gemini for "lifestyle" AI usage (web research etc) tho.

eis6d ago

I did not expect such a huge (3x) price increase over 3.0 Flash, and I bet many people will not blindly upgrade, since the value proposition is very different.

One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.

[0] https://artificialanalysis.ai/models/gemini-3-5-flash

[1] https://artificialanalysis.ai/models/gemini-3-1-pro-preview

hedora6d ago

Ouch. That's going in completely the wrong direction.

How many people complain that we have too much low quality AI output for humans to read, let alone evaluate vs. how many people are complaining that they want higher quality, more trustworthy output?

ekojs6d ago

Seems like the only good thing about 3.5 Flash is its speed. Not cost-competitive or benchmark-leading by any means.

pingou6d ago

How do they calculate that?

3.1 Pro used 57M output tokens on the Intelligence Index and 3.5 Flash used 73M, so not a lot more, and 3.5 is a bit cheaper per token. I don't get how 3.5 can be 74% more expensive.
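
A rough sketch of the arithmetic behind this question, in Python. It assumes the per-token prices quoted elsewhere in this thread ($2.00/$12.00 for Gemini 3.1 Pro Preview, $1.50/$9.00 for Gemini 3.5 Flash) and the 57M and 73M output-token figures above. Input-token counts and caching are not given in the thread, so they are left out; the model keys and function name are hypothetical.

```python
# Prices in USD per 1M tokens, as quoted elsewhere in this thread
# (assumed here, not independently verified).
PRICE_PER_M = {
    "gemini-3.1-pro-preview": {"input": 2.00, "output": 12.00},
    "gemini-3.5-flash": {"input": 1.50, "output": 9.00},
}

# Output tokens (in millions) reported for the Intelligence Index run,
# taken from the comment above.
OUTPUT_TOKENS_M = {
    "gemini-3.1-pro-preview": 57,
    "gemini-3.5-flash": 73,
}


def output_cost(model: str) -> float:
    """USD cost of the output tokens alone for one model."""
    return OUTPUT_TOKENS_M[model] * PRICE_PER_M[model]["output"]


pro = output_cost("gemini-3.1-pro-preview")
flash = output_cost("gemini-3.5-flash")
ratio = flash / pro  # below 1.0 means Flash is cheaper on output alone

print(f"3.1 Pro output cost:   ${pro:,.2f}")
print(f"3.5 Flash output cost: ${flash:,.2f}")
print(f"Flash / Pro ratio:     {ratio:.2f}")
```

On these assumed figures, output tokens alone come to $684 for 3.1 Pro and $657 for 3.5 Flash, so a 74% gap would have to come from something this sketch leaves out, such as input tokens, caching, or how the index counts tokens.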

knollimar6d ago

Only speculation but cache maybe?

ls_stats6d ago

> 3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite

That's everything I needed to know.

mijoharas6d ago

That's what I came here to check. Last model release they only put it into preview[0] at first.

Does that mean this model is production ready?

[0] https://news.ycombinator.com/item?id=47076484

hubraumhugo6d ago

Just updated my HN Wrapped project with it and it does well on my totally unscientific LLM humor benchmark: https://hn-wrapped.kadoa.com

amarant6d ago

Lol, nice project! I liked the xkcd-style comic the most!

I'm only gonna cry a little bit about the all-too-accurate roasts. Some of that stuff cut deep!

harias5d ago

The xkcd comic is a really cool idea. I enjoyed seeing my wrapped, thanks!

simianwords6d ago

No one is talking about how this Flash beats Pro? Imagine what 3.5 Pro will look like.

Also concerned about Gemini models being benchmaxxed generally

NitpickLawyer6d ago

> concerned about Gemini models being benchmaxxed generally

computerex6d ago

I have never had good experience with any Google models in coding. Particularly for coding hard stuff, there is a night and day difference between Opus/Gemini in my experience.

alexdns6d ago

It's Gemini 3.5 Flash

nerdalytics6d ago

Yeah, Google chose a misleading title for the blog post.

jader2016d ago

nerdalytics5d ago

paragraph vs title

nightski6d ago

lugu6d ago

charcircuit6d ago

All major operating systems (Windows, macOS, iOS, and Android) have local APIs for using AI.

hedora6d ago

Why would I use those instead of just grabbing a model from hugging face? Are they as good as qwen 30B?

HardCodedBias6d ago

Oh boy.

GDM is making (or has been backed into a corner into making) the bet that high throughput, low latency, low capability models are the path forward.

That probably works for vibe coded apps by non-practitioners.

I suspect that practitioners/professionals will wait longer for better results.

brokencode6d ago

Where do you see that it’s low capability?

And Google is trying to make something affordable enough for a mass market, ad-supported audience.

They aren’t hyper focused on enterprise like Anthropic is. And that’s okay. There’s room for different players in different markets.

hedora6d ago

Price up (cost up?), benchmarks down. Latency down.

So, who is this for? People that want more ads and worse output, but want it faster? Sounds pretty awful to me.

llmslave6d ago

Conspiracy theory:

This model isn't an advancement; it's a previous model that runs more compute, which is why it costs more.

npn6d ago

Nah, it costs what you are willing to pay.

bakugo6d ago

Triple the price of the last Flash model ($3 -> $9 per 1M output). Quickly approaching Sonnet prices.

Feels like the AI pricing noose is tightening sooner rather than later.

cesarvarela6d ago

Add Flash to the title, please.

meetpateltechOP6d ago

edited it.

warthog6d ago

GPT-5.5 still seems to perform better than this on the benchmarks.

Plus, the vibe of the Gemini models is so weird, particularly when it comes to tool calling.

At this point I kinda need them to shock me to make the switch

benbencodes6d ago

Pricing is now live on ai.google.dev/pricing:

Gemini 3.5 Flash: $0.75 input / $4.50 output per 1M tokens, 1M context window. The output price explicitly "includes thinking tokens" — which is why it's higher than a typical flash-class model.

For comparison within the Gemini lineup:

- Gemini 2.5 Flash: $0.30 / $2.50
- Gemini 3.1 Flash-Lite: $0.25 / $1.50
- Gemini 3.1 Pro Preview: $2.00 / $12.00

So 3.5 Flash is ~2.5x more expensive input vs 2.5 Flash. The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization.

lyjackal6d ago

You're quoting the batch pricing. On-demand is $1.50 per 1M input tokens and $9 per 1M output tokens. That is effectively comparable in cost to Gemini 2.5 Pro, in a Flash-tier model.
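
To make the comparison concrete, here is a small Python sketch of the price ratios, using only the per-1M-token figures quoted in this thread (none of them independently verified). The dictionary, the `BATCH_3_5_FLASH` name, and the `price_ratio` helper are hypothetical, and the $0.75/$4.50 row is treated as the discounted tier described in this comment.

```python
# (input, output) prices in USD per 1M tokens, as quoted in this thread.
ON_DEMAND = {
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Gemini 3.0 Flash preview": (0.50, 3.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
}

# The $0.75 / $4.50 figures quoted upthread, assumed to be the discounted tier.
BATCH_3_5_FLASH = (0.75, 4.50)


def price_ratio(new, old):
    """Return (input ratio, output ratio) of new prices over old prices."""
    return (round(new[0] / old[0], 2), round(new[1] / old[1], 2))


flash_35 = ON_DEMAND["Gemini 3.5 Flash"]

vs_25_flash = price_ratio(flash_35, ON_DEMAND["Gemini 2.5 Flash"])
vs_30_flash = price_ratio(flash_35, ON_DEMAND["Gemini 3.0 Flash preview"])
vs_25_pro = price_ratio(flash_35, ON_DEMAND["Gemini 2.5 Pro"])
batch_discount = price_ratio(BATCH_3_5_FLASH, flash_35)

print("3.5 Flash vs 2.5 Flash:", vs_25_flash)
print("3.5 Flash vs 3.0 Flash preview:", vs_30_flash)
print("3.5 Flash vs 2.5 Pro:", vs_25_pro)
print("Discounted tier vs on-demand:", batch_discount)
```

With these inputs, on-demand 3.5 Flash is 5x the input price and 3.6x the output price of 2.5 Flash, 3x on both counts versus the 3.0 Flash preview, and 1.2x input and 0.9x output versus 2.5 Pro. The $0.75/$4.50 figures are exactly half the on-demand prices.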

conorh6d ago

I think you have your pricing wrong there, Gemini 3.5 flash is $1.50 input and $9 output.

mchusma6d ago

If this is the big model release out of Google, it's a disappointment.

ls_stats6d ago

You are seeing batch inference, standard inference is $1.5/$9. I was excited until I saw that price.

jpau6d ago

Standard pricing is showing for me as $1.50 / $9.

(I suspect you're viewing the "flex" pricing).

Tiberium6d ago

Please delete/edit your AI-written and factually wrong post.

MallocVoidstar6d ago

In addition to people pointing out your LLM got the pricing wrong,

> The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization

Every Gemini model starting with 2.5 has been a reasoning model.
