simonw 1mo ago

The cost per token served has been falling steadily over the past few years across basically all of the providers. OpenAI dropped the price they charged for o3 to 1/5th of what it was in June last year thanks to "engineers optimizing inferencing", and plenty of other providers have found cost savings too.

Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers

Where did you hear that? It doesn't match my mental model of how this has played out.

cootsnuck 1mo ago

I have not seen any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

> Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

That does not mean the frontier labs are pricing their APIs to cover their costs yet.

It can both be true that it has gotten cheaper for them to provide inference and that they still are subsidizing inference costs.

In fact, I'd argue that's way more likely, given that this has been precisely the go-to strategy for highly competitive startups for a while now. Price low to pump adoption and dominate the market, worry about raising prices for financial sustainability later, burn through investor money until then.

What no one outside of these frontier labs knows right now is how big the gap is between current pricing and eventual pricing.

chis 1mo ago

It's quite clear that these companies do make money on each marginal token. They've said this directly and analysts agree [1]. It's less clear that the margins are high enough to pay off the up-front cost of training each model.

[1] https://epochai.substack.com/p/can-ai-companies-become-profi...

m101 1mo ago

It’s not clear at all because model training upfront costs and how you depreciate them are big unknowns, even for deprecated models. See my last comment for a bit more detail.

magicalist 1mo ago

> They've said this directly and analysts agree [1]

Chasing down a few sources in that article leads to articles like this one at the root of the claims [1], which is based entirely on information "according to a person with knowledge of the company’s financials". That doesn't exactly fill me with confidence.

[1] https://www.theinformation.com/articles/openai-getting-effic...

9cb14c1ec0 1mo ago

It's also true that their inference costs are being heavily subsidized. For example, if you calculate Oracle's debt into OpenAI's revenue, they would be incredibly far underwater on inference.

emp17344 1mo ago

Sure, but if they stop training new models, the current models will be useless in a few years as our knowledge base evolves. They need to continually train new models to have a useful product.

NitpickLawyer 1mo ago

> they still are subsidizing inference costs.

They are for sure subsidising costs on all-you-can-prompt packages ($20, $100, $200 per month). They do that mostly for data gathering, and to a lesser degree for user retention.

> evidence at all that Anthropic or OpenAI is able to make money on inference yet.

You can infer that from what 3rd party inference providers are charging. The largest open models atm are dsv3 (~650B params) and kimi2.5 (1.2T params). They are being served at 2-2.5-3$ /Mtok. That's sonnet / gpt-mini / gemini3-flash price range. You can make some educated guesses that they get some leeway for model size at the 10-15$ /Mtok prices for their top tier models. So if they are inside some sane model sizes, they are likely making money off of token based APIs.

int_19h 1mo ago

> They are being served at 2-2.5-3$ /Mtok. That's sonnet / gpt-mini / gemini3-flash price range.

The interesting number is usually input tokens, not output, because there's much more of the former in any long-running session (like say coding agents) since all outputs become inputs for the next iteration, and you also have tool calls adding a lot of additional input tokens etc.

It doesn't change your conclusion much though. Kimi K2.5 has almost the same input token pricing as Gemini 3 Flash.
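A rough sketch of why input tokens dominate in a long agent session, assuming no prompt caching and that the whole context is re-sent on every turn. The function name and all token counts are made up for illustration; real sessions and billing rules will differ.

```python
def session_token_totals(turns, system_tokens, output_per_turn, tool_tokens_per_turn):
    """Return (total_input, total_output) billed over a session without prompt caching."""
    context = system_tokens      # tokens sent as input on the next turn
    total_input = 0
    total_output = 0
    for _ in range(turns):
        total_input += context               # the full context is billed as input again
        total_output += output_per_turn
        # The model's output and any tool results become input for the next turn.
        context += output_per_turn + tool_tokens_per_turn
    return total_input, total_output


# Hypothetical coding-agent session: 50 turns, 2k system prompt,
# 500 output tokens and 1.5k tool-result tokens per turn.
total_in, total_out = session_token_totals(
    turns=50, system_tokens=2_000, output_per_turn=500, tool_tokens_per_turn=1_500
)
print(total_in, total_out, round(total_in / total_out))
```

With these made-up numbers, billed input outnumbers output by roughly 100 to 1, which is why the input price matters more for agent-style workloads.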

slopusila 1mo ago

most of those subscriptions go unused. I barely use 10% of mine

so my unused tokens compensate for the few heavy users
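A toy version of that cross-subsidy argument, with every number invented for illustration (the fee, the cost of serving a fully used quota, and the usage mix are all assumptions, not provider data):

```python
def subscription_margin(monthly_fee, full_quota_cost, usage_fractions):
    """Revenue minus serving cost for a pool of flat-fee subscribers.

    usage_fractions holds, per subscriber, the share of the quota actually used.
    """
    revenue = monthly_fee * len(usage_fractions)
    cost = sum(fraction * full_quota_cost for fraction in usage_fractions)
    return revenue - cost


# Hypothetical pool: nine light users at 10% of quota, one heavy user at 100%.
# Assume a $20 plan whose full quota would cost the provider $100 to serve.
pool = [0.1] * 9 + [1.0]
margin = subscription_margin(monthly_fee=20.0, full_quota_cost=100.0, usage_fractions=pool)
print(margin)
```

Under these assumptions the pool is slightly profitable overall even though the heavy user alone costs five times what they pay. Shift the mix toward heavy users and the margin goes negative.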

mrandish 1mo ago

> I have not seen any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

Anthropic planning an IPO this year is a broad meta-indicator that internally they believe they'll be able to reach break-even sometime next year on delivering a competitive model. Of course, their belief could turn out to be wrong but it doesn't make much sense to do an IPO if you don't think you're close. Assuming you have a choice with other options to raise private capital (which still seems true), it would be better to defer an IPO until you expect quarterly numbers to reach break-even or at least close to it.

Despite the willingness of private investment to fund hugely negative AI spend, the recently growing twitchiness of public markets around AI ecosystem stocks indicates they're already worried prices have exceeded near-term value. It doesn't seem like they're in a mood to fund oceans of dotcom-like red ink for long.

defmacr0 1mo ago

> Despite the willingness of private investment to fund hugely negative AI spend

VC firms, even ones the size of Softbank, also literally just don't have enough capital to fund the planned next-generation gigawatt-scale data centers.

WarmWash 1mo ago

IPO'ing is often what you do to give your golden investors an exit hatch to dump their shares on the notoriously idiotic and hype driven public.

barrkel 1mo ago

> evidence at all that Anthropic or OpenAI is able to make money on inference yet.

The evidence is in third party inference costs for open source models.

nubg 1mo ago

> "engineers optimizing inferencing"

are we sure this is not a fancy way of saying quantization?

bityard 1mo ago

When MP3 became popular, people were amazed that you could compress audio to 1/10th its size with minor quality loss. A few decades later, we have audio compression that is much better and higher-quality than MP3, and it took a lot more effort than "MP3 but at a lower bitrate."

The same is happening in AI research now.

oblio 1mo ago

> A few decades later, we have audio compression that is much better and higher-quality than MP3

Just curious, which formats, and how do they compare storage-wise?

Also, are you sure it's not just moving the goalposts to CPU usage? Frequently more powerful compression algorithms can't be used because they use lots of processing power, so frequently the biggest gains over 20 years are just... hardware advancements.

esafak 1mo ago

Someone made a quality tracker: https://marginlab.ai/trackers/claude-code/

embedding-shape 1mo ago

Or distilled models, or just slightly smaller models but same architecture. Lots of options, all of them conveniently fitting inside "optimizing inferencing".

simonw (OP) 1mo ago

The o3 optimizations were not quantization, they confirmed this at the time.

jmalicki 1mo ago

A ton of GPU kernels are hugely inefficient. Not saying the numbers are realistic, but look at the gains of hundreds of times in the Anthropic performance take-home exam that floated around on here.

And if you've worked with PyTorch models a lot, having custom fused kernels can be huge. For instance, look at the kind of gains to be had when FlashAttention came out.

This isn't just quantization, it's actually just better optimization.

Even when it comes to quantization, Blackwell has far better quantization primitives and new floating point types that support row or layer-wise scaling that can quantize with far less quality reduction.

There is also a ton of work in the past year on sub-quadratic attention for new models that gets rid of a huge bottleneck, but like quantization can be a tradeoff, and a lot of progress has been made there on moving the Pareto frontier as well.

It's almost like when you're spending hundreds of billions on capex for GPUs, you can afford to hire engineers to make them perform better without just nerfing the models with more quantization.
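To make the point about finer-grained scaling concrete, here is a minimal pure-Python sketch comparing one shared scale for a whole weight matrix against one scale per row. It uses plain symmetric int8 rounding as a stand-in; it is not Blackwell's actual number formats, and the matrix, helper names, and magnitudes are all invented for illustration.

```python
import random


def scale_for(values):
    """Symmetric int8 scale: map the largest magnitude onto 127."""
    largest = max(abs(v) for v in values)
    return largest / 127 if largest > 0 else 1.0


def quantize_dequantize(values, scale):
    """Round to signed 8-bit integers with the given scale, then map back to floats."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_abs_error(matrix, per_row):
    """Mean absolute round-trip error using either per-row or per-tensor scales."""
    flat = [v for row in matrix for v in row]
    tensor_scale = scale_for(flat)
    total = 0.0
    for row in matrix:
        scale = scale_for(row) if per_row else tensor_scale
        restored = quantize_dequantize(row, scale)
        total += sum(abs(a - b) for a, b in zip(row, restored))
    return total / len(flat)


random.seed(0)
# Rows with very different magnitudes, which is where a single shared scale hurts.
matrix = [[random.gauss(0, sigma) for _ in range(256)] for sigma in (0.01, 0.1, 1.0, 10.0)]
per_tensor_error = mean_abs_error(matrix, per_row=False)
per_row_error = mean_abs_error(matrix, per_row=True)
print(per_tensor_error, per_row_error)
```

With a single scale, the small-magnitude rows mostly round to zero; giving each row its own scale keeps them representable at the same bit width, so the round-trip error drops.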

Der_Einzige 1mo ago

"This isn't X, it's Y" with extra steps.

topaz0 1mo ago

But a) that's the cost to the user -- we don't know how much loss they're taking on those and b) the number of tokens to serve a similar prompt has been going up, so that the total cost to serve a prompt has been going up in general. Any cost analysis that doesn't mention these is hugely misleading.
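Point (b) is just multiplication, sketched here with invented numbers (the prices and token counts are placeholders, not real provider figures):

```python
def cost_per_prompt(price_per_mtok, tokens_per_prompt):
    """Cost of serving one prompt, given a price per million tokens."""
    return price_per_mtok * tokens_per_prompt / 1_000_000


# Hypothetical: the per-token price falls 5x, but a similar prompt now
# consumes 10x the tokens (longer reasoning, more tool calls).
before = cost_per_prompt(price_per_mtok=10.0, tokens_per_prompt=2_000)
after = cost_per_prompt(price_per_mtok=2.0, tokens_per_prompt=20_000)
print(before, after)
```

In this made-up case the per-prompt cost doubles even though the per-token price fell to a fifth, so a per-token price series alone does not settle the question.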

replwoacause 1mo ago

My experience trying to use Opus 4.5 on the Pro plan has been terrible. It blows up my usage very very fast. I avoid it altogether now. Yes, I know they warn about this, but it's comically fast how quickly it happens.

sumitkumar 1mo ago

It seems it is true for gemini because they have a humongous sparse model but it isn't so true for the max performance opus-4.5/6 and gpt-5.2/3.
