That blows up the whole “industrial complex” being developed around massive data centers, proprietary models, and everything that goes with that. Complete implosion.
Apple has sat on the sidelines for much of this, seemingly because they know the end game: everyone just does this stuff locally on their phone or computer, and then it’s game over for everything going on now.
We’re seeing a massive slowing in the value of all that additional training. Folks don’t like to talk about that, but absent a completely new breakthrough the current math of LLMs has largely run its course.
We simply don’t need massive training forever and ever. We’re getting to the point that “good enough” models will solve most use cases. The demonstrated business value is also still broadly missing for AI on the level required to keep funding all this training for much longer.
I think local will always have a place, but in my humble opinion the infrastructure is still going to be used.
Maybe, for current use cases. I'd argue that anyone who thinks they can do everything a 10kW server can do on their 10W device just isn't being creative enough :)
Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?
This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
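A minimal sketch of what "only a portion of the weights are active" means, assuming a generic top-k router. The expert count, k, and logits below are made-up illustration values, not taken from the model in the demo:

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def active_expert_fraction(num_experts, k):
    """Fraction of expert weights touched for a single token."""
    return k / num_experts

# Hypothetical router output for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 1.5, 0.0, -0.5, 0.3, 0.9]
routes = top_k_route(logits, k=2)
print(routes)
print(active_expert_fraction(num_experts=8, k=2))
```

With top-2 of 8 experts, only a quarter of the expert weights need to be read for that token. Shared layers (attention, embeddings) are always active, which this sketch ignores.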
https://scienceleadership.org/thumbnail/34729/1920x1920
In case someone still didn't realize: we do live in Idiocracy
Quantizing is also a cheat code that makes the numbers lie, next up someone is going to claim running a large model when they're running a 1-bit quantization of it.
There's no misleading here, they show every detail from model to quantization to that atrocious time to first token. Stuff like this feels more like code golf than anyone claiming the mainstream phone user is going to even download 100GB of model weights.
https://www.reddit.com/r/EmulationOnAndroid/comments/1m269k0...
https://onexplayerstore.com/products/onexplayer-super-x?vari...
https://www.notebookcheck.net/Xiaomi-launches-new-mobile-wat...
Apple fans never cease to amaze me.
EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
Apple has always seen RAM as an economic advantage for their platform: Make the development effort to ensure that the OS and apps work well with minimal memory and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM, Pro/Max come with 12GB.
The problem is that AI (ML/LLM training and inference) is an area where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment as AI data centers consume as many memory chips as possible.)
Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.
So, it's going to be interesting to see whether they accept this reality and we start seeing future iPhones with 16GB, 32GB or more as standard in order to make AI performant, and whether they give up on adding AI to the billions of iPhones with minimal RAM already out there.
As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open, and having the page refresh when swapping between them because of aggressive memory management.
To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.
But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.
Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.
If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as.
"On device" inference (for large LLMs, I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and you've got a power cable attached to the wall. For a phone maybe you would want a very small model (something around 3B) for Siri-like capabilities.
On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.
Try this for yourself. Load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule. It's just not tenable.
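A rough back-of-envelope check of those figures, assuming the 1 joule per token quoted above. The roughly 10 Wh battery and the 180-token response length are assumptions picked for illustration, not measurements:

```python
JOULES_PER_WH = 3600.0

def battery_fraction_used(tokens, joules_per_token, battery_wh):
    """Share of a full battery consumed by generating `tokens` tokens."""
    return (tokens * joules_per_token) / (battery_wh * JOULES_PER_WH)

# Assumed values: 180-token reply, 1 J/token, 10 Wh battery.
fraction = battery_fraction_used(tokens=180, joules_per_token=1.0, battery_wh=10.0)
print(f"{fraction:.2%} of the battery per response")
```

Under those assumptions a single short response lands at about 0.5% of the battery, which lines up with the per-query estimate; a larger model or longer reply scales the cost linearly.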
That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.
Apple recently stated on an earnings call they signed contracts with RAM vendors before prices got out of control, so they should be good for a while. Nvidia also uses TSMC for their chips, which may affect A series and M series chip production.
Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.
Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.
It'll probably be a little harder to keep their developers RAM disciplined (if that's even still true) for typical concerns. But model swap will be a big deal. The same exit vs voice issues will exist for apple customers but the margin logic seems to remain.
Why do you say they can't do this?
It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.
If the alternative is paying a subscription and/or being fed ads, people will try the local private ones first.
I expect OpenAI, Anthropic, and other companies will attempt to do the same, but the OS manufacturers will have a step up.
Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)
Instead, take advantage of Termux's power, namely the fact that you can install things like Openclaw or Gemini-cli. Google AI Plus or Pro plans are actually really good value, considering they bundle it with storage.
https://www.mobile-hacker.com/2025/07/09/how-to-install-gemi...
There is also Termux:GUI with bindings for several languages, which you can use to vibecode your own GUI app that can serve as an interface to an agent, and a Termux API which lets you interface with the phone, including USB devices.
Furthermore, Termux has the cloudflared package available, which lets you use free cloudflared SSH tunnels (as long as you have a domain name).
All put together, you can do some pretty cool things.
With all the money you will save on subscription fees you should be able to afford treatment for your psychosis!
Don't get me wrong, it's an awesome achievement, but 0.6 tokens/s at presumably fairly heavy compute (and battery), on a mobile device? There aren't too many use cases for that :)
This exists[0], but the chip in question is physically large and won't fit on a phone.
Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.
Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
Realistically you need 300+ GB/s of fast-access memory bandwidth to the accelerator, with enough memory to fully hold at least a greater-than-4-bit quant. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
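A hedged sketch of why bandwidth is the bottleneck: if every active weight has to be read once per generated token, tokens per second is capped at bandwidth divided by bytes read. The 400B dense model at 4 bits and the 3 GB/s flash figure are illustrative assumptions, and the estimate ignores KV cache traffic, compute limits, and MoE sparsity (which reduces the bytes read per token):

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: 1e9 params * bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8.0

def max_tokens_per_second(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Upper bound if every active weight is read once per generated token."""
    return bandwidth_gb_s / weights_gb(active_params_billion, bits_per_weight)

# Assumed: all 400B weights read per token, 4-bit quantization.
ram_bound = max_tokens_per_second(300.0, 400.0, 4)  # 300 GB/s memory
ssd_bound = max_tokens_per_second(3.0, 400.0, 4)    # hypothetical ~3 GB/s flash
print(ram_bound, ssd_bound)
```

The two bounds differ by the same factor as the bandwidths, which is why streaming weights from storage only becomes workable when few weights are active per token.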
The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up for that task, and I doubt anyone's really going to want to build silicon for it.
The latest M5 MacBook Pros start at 307 GB/s memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.
The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.
As such I can't agree with "The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough". We are much closer than 15/20 years to getting these on a phone.
This is a toy.
We need to build open infrastructure in the cloud capable of hosting a robust ecosystem of open weights.
And then we need to build very large scale open weights.
That's the only way we don't get owned by the hyperscalers.
At the edge isn't going to happen in a meaningful way to save us.
The fact that it's running on a phone now just sets the goalpost and gets everyone excited about it: add more RAM and GPU to the next iPhone and it's not a toy anymore. Coincidentally, phone companies also have thousands of engineers sitting around wondering what to do in their next release to convince consumers to buy ...
We're not going to get more RAM and GPU in consumer devices.
All of the supply is going into data center build-outs. As the hyperscaler gamble on the future continues, we get left with weaker (or more expensive) devices, not stronger ones.
The market makers make more money if we're left to thin clients. They're also the ones who control supply and the shapes of devices.
use the experience we gain from both to bolster the other.
a future where we are unable to locally run is kind of troubling. as is a future with no open cloud. we need both to stop some of the horrors the hyperscalers will happily inflict.
0.6 t/s, wait 30 seconds to see what these billions of calculations get us:
"That is a profound observation, and you are absolutely right ..."
Which makes it even funnier.
It makes me a little sad that Douglas Adams didn't live to see it.
This is 100% correct!
"You are absolutely right to be confused"
That was the closest AI has been to calling me "dumb meatbag".
You're absolutely right. Now, LLMs are too slow to be useful on handheld devices, and the future of LLMs is brighter than ever.
LLMs can be useful, but quite often the responses are about as painful as LinkedIn posts. Will they get better? Maybe. Will they get worse? Maybe.
Emphasis on slowly.
laughed when it slowly began to type that out
So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.
With hardware and model improvements, the future is bright.
Local LLMs are going to make people sit on their phones instead of talking to real people.
Practical LLMs on mobile devices are at least a few years away.
I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?
Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.
Mobile phones don't have separate GPUs and separate VRAM like some desktops.
This isn't a new thing and it's not unique to Apple
> I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?
There is already a smaller model in this series that fits nicely into the iPhone (with some quantization): Qwen3.5 9B.
The smaller the model, the less accurate and capable it is. That's the tradeoff.
> Mobile phones don't have separate GPUs and separate VRAM like some desktops.
That's true. The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.
iOS is tuned to this architecture which wouldn't be the case across many different Android hardware configurations.
Tl;dr a lot, model is much worse
(Source: maintaining llama.cpp / cloud based llm provider app for 2-3 years now)
> That is a profound observation, and you are absolutely right
Twenty seconds and a hot phone for that.
In the end it took almost four minutes to generate under 150 tokens of nothing.
Impressive that they got it to run, but that’s about the only thing.
You do have a lot of "MLEs" and "Data Scientists" who only know basic PyTorch and SKLearn, but that kind of fat is being trimmed industry wide now.
Domain experience remains gold, especially in a market like today's.
They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.
It’s been a lot of years, but all I can hear after reading that is … I’m making a note here, huge success
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
That said, it'd be a fun quote and I've jokingly said it as well, as I think of it more as part of 'popular' culture lol
https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile#a-...
"The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"
Your time-average power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.
Even if battery technology improves you can't have a phone running hot, so there are strong physical limits on the total power budget.
More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
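The power budget arithmetic above can be sketched as follows. The 10 Wh battery and one-day runtime come from the comment; the 700 W datacenter GPU figure is an assumption chosen only to illustrate the gap:

```python
import math

def average_power_watts(battery_wh, hours):
    """Time-averaged power draw that drains the battery in `hours`."""
    return battery_wh / hours

def orders_of_magnitude(larger, smaller):
    """How many powers of ten separate two power figures."""
    return math.log10(larger / smaller)

phone_w = average_power_watts(battery_wh=10.0, hours=24.0)
# 700 W is an assumed figure for a single datacenter GPU.
gap = orders_of_magnitude(larger=700.0, smaller=phone_w)
print(f"phone budget: {phone_w:.2f} W, gap: {gap:.1f} orders of magnitude")
```

Under these assumptions the phone's average budget comes out a bit under half a watt and the gap to a single datacenter GPU is a little over three orders of magnitude.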
It’s only paying Google $1 billion a year for access to Gemini for Siri
If they continue to increase.