iPhone 17 Pro Demonstrated Running a 400B LLM

(twitter.com)

713 points | anemll | 2mo ago | 326 comments

https://xcancel.com/anemll/status/2035901335984611412

cmiles82mo ago

To the extent that the present LLM movement reaches a steady state conclusion it’s highly likely to be open source models on your own hardware that are “good enough” for 95% of use cases.

That blows up the whole “industrial complex” being developed around massive data centers, proprietary models, and everything that goes with that. Complete implosion.

Apple has sat on the sidelines for much of this as it seems clear they know the end game is everyone just does this stuff locally on their phone or computer and then it’s game over for everything going on now.

draxil2mo ago

I assume you mean open weight models? I wish we had better open source models. It would make LLMs far less icky if we had nice clean open trained models. A breakthrough on the cost of training would be nice.

Yizahi2mo ago

We really can't have open source LLMs, because they are all based on stolen IP, or stolen IP slightly laundered under a different title.

mike_hearn2mo ago

Nemotron is genuinely open source at least at the smaller sizes. You can download the datasets.

cmiles82mo ago

Fair clarification, yes.

seism2mo ago

Check out Apertus, the publicly funded model from a research team that goes to great lengths to remove icky content.

mr_toad2mo ago

Still need massive amounts of compute for training. Nobody is going to be training 400B models on a phone any time soon.

cmiles82mo ago

Likely not.

We’re seeing a massive slowing in the value of all that additional training. Folks don’t like to talk about that, but absent a completely new breakthrough the current math of LLMs has largely run its course.

We simply don’t need massive training forever and ever. We’re getting to the point that “good enough” models will solve most use cases. The demonstrated business value is also still broadly missing for AI on the level required to keep funding all this training for much longer.

anonyfox2mo ago

I could see Apple doing just that, because they can, and then using it as another selling point for their own hardware. Their software is heavily customized to run on their own hardware and vice versa (at least on paper), so they could well get some LLM working well on their chips specifically as a good-enough local model in the next few years, and promote it as a you-don't-need-a-subscription-when-you-have-an-iPhone kind of thing. Given the advances in the LLM space in recent years, it sounds fairly realistic to arrive at something that just works locally in the mid-term.

noemit2mo ago

Even if it runs, this will run slowly, and heat up.

I think local will always have a place, but the infrastructure is going to be used in my humble opinion.

cmiles82mo ago

Today yes, but between the improved performance of smaller on device models and the hardware itself getting better this issue is short lived.

plussed_reader2mo ago

I don't want to put information into a black box of mystery that can then be used for other monetization purposes. I am still waiting for a realistic local solution.

throwaway1737382mo ago

Compute evolved from batch systems with time sharing to responsive systems in your pocket. Why wouldn’t that happen here?

gervwyk2mo ago

There was a time when mainframes were the main thing. We’ll look back and say data centers were a thing (hopefully, if we're lucky).

efsavage2mo ago

> “good enough” for 95% of use cases

Maybe, for current use cases. I'd argue that anyone who thinks they can do everything a 10kW server can do on their 10W device just isn't being creative enough :)

hiddencost2mo ago

Consumer market is small compared to headcount reduction and cutting edge science.

firstbabylonian2mo ago

> SSD streaming to GPU

Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

1: https://arxiv.org/abs/2312.11514

simonw2mo ago

Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

anemllOP2mo ago

Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5P/M generation, that makes it usable! I think one paper under the radar is "KV Prediction for Improved Time to First Token" https://arxiv.org/abs/2410.08391 which hopefully can help with prefill for Flash streaming.

superjan2mo ago

That was a very good summary. One detail the post could use is mentioning that the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).
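
To put a number on those savings, here is a minimal sketch of the arithmetic. It uses only the figures quoted in the comment (4 or 10 experts chosen out of 512 per layer); the function name is made up for illustration, and shared or non-expert weights are ignored.

```python
def active_expert_fraction(active: int, total: int) -> float:
    """Fraction of a layer's experts that are actually read for one token."""
    return active / total


# Figures from the comment above: 4 or 10 experts picked from 512 per layer.
for k in (4, 10):
    frac = active_expert_fraction(k, 512)
    print(f"{k} of 512 experts per layer = {frac:.2%} of that layer's expert weights")
```

So under these figures, each token touches on the order of 1-2% of the expert weights in a layer, which is what makes streaming from storage plausible at all.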

trebligdivad2mo ago

I guess this is all set up to show off the new high-bandwidth-flash stuff that's due out soon?

zozbot2342mo ago

A similar approach was recently featured here: https://news.ycombinator.com/item?id=47476422 Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model. (Unless you want to use Intel Optane wearout-resistant storage, but that was power hungry and thus unsuitable to a mobile device.)

Aurornis2mo ago

> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.

This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
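
A toy sketch of how top-k expert routing works in general, in plain Python. This is not the code from the demo: the experts here are hypothetical scalar functions, and renormalizing the softmax over only the selected experts is one common variant, assumed here for simplicity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, experts, router_logits, k):
    """Weighted sum of the outputs of only the selected experts."""
    out = 0.0
    for idx, w in route_top_k(router_logits, k):
        out += w * experts[idx](x)   # the other experts are never evaluated
    return out


# Hypothetical experts: expert s just multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]
print(route_top_k(logits, 2))
print(moe_layer(3.0, experts, logits, 2))
```

Only the weights of the chosen experts need to be in fast memory for a given token; everything else can stay on storage until the router asks for it.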

simonw2mo ago

Yeah, this new post is a continuation of that work.

foobiekr2mo ago

This is not entirely dissimilar to what Cerebras does with their weight streaming.

manmal2mo ago

And IIRC the Unreal Engine Matrix demo for PS5 was streaming textures directly from SSD to the engine as well?

CrzyLngPwd2mo ago

I had a dream that everyone had super intelligent AIs in their pockets, and yet all they did was doomscroll and catfish...shortly before everything was destroyed.

iLemming2mo ago

The Anthropic logo is just Kurt Vonnegut’s drawing of an asshole:

https://scienceleadership.org/thumbnail/34729/1920x1920

Just in case someone still hasn't realized - we do live in Idiocracy

https://www.youtube.com/watch?v=gGlJgU9x8tM

DiscourseFan2mo ago

I think the first thing is just a funny little literary allusion for those in the know. I mean isn’t it kind of hilarious that a company valued at $300 billion has a drawing of an asshole for its logo?

dudefeliciano2mo ago

there is a case to be made that all AI company logos are just drawings of assholes: https://velvetshark.com/ai-company-logos-that-look-like-butt...

parineum2mo ago

Their logo appears to be A\ ?

SecretDreams2mo ago

A modern Nostradamus?

CrzyLngPwd2mo ago

It was just a dream, which quickly turned into a nightmare.

wiseowise2mo ago

You know, Quasimodo predicted all of this.

t0lo2mo ago

I don't think that's a dream for much longer. Look at the fact that we selected tiktok as the most popular social media app.

andix2mo ago

My iPad Air with M2 can run local LLMs rather well. But it gets ridiculously hot within seconds and starts throttling.

HPsquared2mo ago

I wonder if anyone has made a liquid cooling system for iPads / phones. Like, a sealed thing that attaches to the back of the device and circulates cooling water directly against the back surface.

jml7c52mo ago

A more whimsical method is to put the thing in a glass of water with the cord sticking out. :-)

https://www.reddit.com/r/EmulationOnAndroid/comments/1m269k0...

whamlastxmas2mo ago

I have a small portable fan that I place under it basically any time I use it for any development work. It gets thermally throttled pretty fast otherwise. It's definitely the wrong machine for my needs but it's what I gotta work with for now.

ThatMedicIsASpy2mo ago

You can buy a liquid cooled tablet.

https://onexplayerstore.com/products/onexplayer-super-x?vari...

cercatrova2mo ago

You're in luck. Lots of phone manufacturers also implement liquid cooling inside the phone too.

https://www.notebookcheck.net/Xiaomi-launches-new-mobile-wat...

aidenn02mo ago

What about submerging it in mineral oil?

Schiendelman2mo ago

I think the vapor chamber cooling Apple's starting to use is something like that, no?

ActorNightly2mo ago

Yeah, let's add more cost and complexity with a cooling system so instead of 1 token per second we get 2 tokens per second, all for the price of one graphics card that can do 50+ tokens a second.

Apple fans never cease to amaze me.

yencabulator2mo ago

Qwen3.5-397B-A17B behaves more like a 17B parameter model. Omitting the MoE part from the headline makes it a lie and stupid hype.

Quantizing is also a cheat code that makes the numbers lie, next up someone is going to claim running a large model when they're running a 1-bit quantization of it.

BoorishBears2mo ago

It behaves more like a ~80B parameter model (geometric mean of active and total params), and has world knowledge closer to a 400B parameter model
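
The geometric-mean rule of thumb mentioned here is easy to check. A small sketch follows; note that the rule is a community heuristic for MoE "effective size", not a measured property of this model, and the function name is made up for illustration.

```python
import math


def effective_params_billion(active_b: float, total_b: float) -> float:
    """Geometric mean of active and total parameter counts (in billions)."""
    return math.sqrt(active_b * total_b)


# Qwen3.5-397B-A17B: 17B active, 397B total, as named in the thread.
print(f"{effective_params_billion(17, 397):.1f}B")
```

This lands at roughly 82B, which is where the "~80B" figure comes from.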

There's no misleading here, they show every detail from model to quantization to that atrocious time to first token. Stuff like this feels more like code golf than anyone claiming the mainstream phone user is going to even download 100GB of model weights.

yencabulator2mo ago

I think we're using different meanings of "behaves like". I meant "has tokens/sec performance comparable to".

EruditeCoder1082mo ago

This is less about “running a 400B model on a phone” and more about clever engineering around constraints. What’s actually happening is:

- In mixture-of-experts, only a small subset of weights is active per token
- Aggressive quantization
- Streaming weights from storage instead of loading everything into RAM

So the effective working set is much smaller than 400B. That said, the trade-offs are obvious: very low token throughput, high latency, and heavy reliance on storage bandwidth. It’s more of a proof-of-concept than something usable.
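
One way to picture the "stream from storage" part is a small cache that keeps only recently used experts in RAM and loads the rest on demand. The sketch below is a generic LRU cache, not the demo's implementation; `ExpertCache`, `fake_load`, and the capacity of 2 are all hypothetical and chosen only to make the behavior visible.

```python
from collections import OrderedDict


class ExpertCache:
    """Keep at most `capacity` experts in RAM; load the rest on demand."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # hypothetical: reads one expert from storage
        self.cache = OrderedDict()    # expert_id -> weights, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as most recently used
            self.hits += 1
            return self.cache[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)       # slow path: storage read
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return weights


def fake_load(expert_id):
    """Stand-in for a storage read; returns a placeholder instead of tensors."""
    return f"weights-for-expert-{expert_id}"


cache = ExpertCache(capacity=2, load_fn=fake_load)
for expert_id in [3, 7, 3, 9, 7]:
    cache.get(expert_id)
print(f"hits={cache.hits} misses={cache.misses}")
```

Every miss is a storage read, which is why throughput in this kind of setup depends so heavily on storage bandwidth and on how often the router picks the same experts again.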

adam_patarino2mo ago

I’ve seen this story making the rounds and I’m not sure why it’s gotten so much traction. Is it just a good write-up?

bkfh2mo ago

Thanks, bot.

classified2mo ago

Wouldn't a bot write better English? Or are they optimized to produce bad grammar already?

cj002mo ago

It’s 400B but it’s mixture of experts so how many are active at any time?

simonw2mo ago

Looks like it's Qwen3.5-397B-A17B so 17B active. https://github.com/Anemll/flash-moe/tree/iOS-App

thecopy2mo ago

Stupid question: can I run this on my 64GB/1TB Mac somehow easily? Or does this require custom coding? 4-bit is ~200GB

EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
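
For reference, the "~200GB at 4-bit" figure follows directly from the parameter count. A minimal sketch, assuming every weight is stored at the same bit width and ignoring quantization metadata (scales, zero points) and any layers kept at higher precision:

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# 397B total parameters, per the model name Qwen3.5-397B-A17B.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_footprint_gb(397, bits):.1f} GB")
```

At 4 bits this gives 198.5 GB, so the full model does not fit in 64GB of RAM and has to be streamed from the 1TB disk.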

stingraycharles2mo ago

One expert is 17B, but more than one expert can be active at any time. I believe it’s actually more like 80B active.

Hasslequest2mo ago

Still pretty good considering 17B is what one would run on a 16GB laptop at Q6 with reasonable headroom

anshumankmr2mo ago

Aren't most companies doing MoE at this point?

lainproliant2mo ago

This reminds me of how excited people were to get models running locally when llama.c first hit.

russellbeattie2mo ago

I have some macro opinions about Apple - not sure if I'm correct, but tell me what you think.

Apple has always seen RAM as an economic advantage for their platform: Make the development effort to ensure that the OS and apps work well with minimal memory and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM, Pro/Max come with 12GB.

The problem is that AI (ML/LLM training and inference) are areas where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment as AI data centers consume as many memory chips as possible.)

Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.

So, it's going to be interesting whether they accept this reality and we start seeing the iPhones in the future with 16GB, 32GB or more as standard in order to make AI performant. And if they give up on adding AI to the billions of iPhones with minimal RAM already out there.

As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open, and having the page refresh when swapping between them because of aggressive memory management.

To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.

But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.

Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.

mlsu2mo ago

Models on the phone is never going to make sense.

If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as.

"On device" inference (for large LLM I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and you've got a power cable attached to the wall. For a phone maybe you would want a very small model (like 3B something in that size) for Siri-like capabilities.

On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.

Try this for yourself. Load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule. It's just not tenable.
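
The battery claim can be sanity-checked with back-of-the-envelope arithmetic. In the sketch below, the 1 joule per token figure is the one quoted in the comment; the 300-token response length and the 15 Wh battery capacity are assumptions picked for illustration, not measured values.

```python
def query_energy_joules(tokens: int, joules_per_token: float) -> float:
    """Energy for one response, assuming a fixed cost per generated token."""
    return tokens * joules_per_token


def battery_fraction(energy_joules: float, battery_wh: float) -> float:
    """Share of a full battery consumed (1 Wh = 3600 J)."""
    return energy_joules / (battery_wh * 3600)


energy = query_energy_joules(tokens=300, joules_per_token=1.0)  # assumed response length
share = battery_fraction(energy, battery_wh=15.0)               # assumed battery size
print(f"{energy:.0f} J per response = {share:.2%} of the battery")
```

Under these assumptions one response costs a bit over half a percent of the battery, the same order of magnitude as the 0.5% quoted above. A larger model or a longer response scales the figure linearly.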

madwolf2mo ago

Having lived through all of mobile phone history, from non-existent when I was a child to today's smartphones, I would hesitate to use such absolute phrases as "X on the phone is never going to make sense". How many things are we doing on a phone today that we wouldn't have dreamed of 20 years ago? Local models on phones don't make sense today, but in 5 years? Who knows...

russellbeattie2mo ago

Huh, I hadn't thought of battery limitations. Good call. My initial reaction is that bigger/better batteries, hyper fast recharge times and more efficient processors might address this issue, but I need to learn more about it.

That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.

zozbot2342mo ago

RAM is just too expensive. We need to bring back non-DRAM persistent memory that doesn't have the wearout issues of NAND.

anemllOP2mo ago

multiple NAND, and apple already used it in Mac Studio. Plus better cooling

ecshafer2mo ago

In a recent episode of Dwarkesh the guest who is a semiconductor industry analyst predicted that an iPhone will increase in price by about $250 for the same stuff due to increased ram/chip costs from AI. Apple will not be able to afford to put a bunch more RAM into the phones and still sell them.

alwillis2mo ago

> In a recent episode of Dwarkesh the guest who is a semiconductor industry analyst predicted that an iPhone will increase in price by about $250 for the same stuff due to increased ram/chip costs from AI. Apple will not be able to afford to put a bunch more RAM into the phones and still sell them.

Apple recently stated on an earnings call they signed contracts with RAM vendors before prices got out of control, so they should be good for a while. Nvidia also uses TSMC for their chips, which may affect A series and M series chip production.

Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.

big_toast2mo ago

I think this is roughly true, but instead RAM will remain a discriminator even more so. If the scaling laws Apple has domain over are compute and model size, then they'll pretty easily be able to map that into their existing price tiers.

Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.

It'll probably be a little harder to keep their developers RAM-disciplined (if that's even still true) for typical concerns. But model swap will be a big deal. The same exit vs voice issues will exist for Apple customers but the margin logic seems to remain.

GTP2mo ago

> nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM

Why do you say they can't do this?

ottah2mo ago

Possibly this just isn't the generation of hardware to solve this problem in? We're like, what three or four years in at most, and only barely two in towards AI assisted development being practical. I wouldn't want to be the first mover here, and I don't know if it's a good point in history to try and solve the problem. Everything we're doing right now with AI, we will likely not be doing in five years. If I were running a company like Apple, I'd just sit on the problem until the technology stabilizes and matures.

bigyabai2mo ago

If I was running a company like Apple, I'd be working with Khronos to kill CUDA since yesterday. There are multiple trillions of dollars that could be Apple's if they sign CUDA drivers on macOS, or create a CUDA-compatible layer. Instead, Apple is spinning their wheels and promoting nothingburger technology like the NPU and MPS.

It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.

alnah2mo ago

It's a nice experiment, but I really wonder what the use case is. Privacy, yes. Local, yes. But then? Will people really use an LLM in their iPhone while they can use LLM infrastructure with bigger models for complex tasks? I mean, it really looks cool. But I don't think it's gonna be the future of local AI either. Maybe someone who can build a very specialized local model for one particular task can enjoy that. Not sure it's gonna be massively used by common mortals... But for sure, for the industry, there is maybe a direction where we could have different, very specialized models on our devices that could interoperate and then provide something useful. We'll see. Interesting though! Maybe we still need some years, or decades, before we have devices and laptops good enough to run good models.

latexr2mo ago

> Will people really use an LLM in their iPhone while they can use LLM infrastructure with bigger models for complex tasks?

If the alternative is paying a subscription and/or being fed ads, people will try the local private ones first.

Schiendelman2mo ago

This will become default. Siri (new) and Gemini will eventually run simple tasks locally and only switch to cloud compute when necessary. Apple and Google then won't have to spend as much on their datacenters.

I expect OpenAI, Anthropic, and other companies will attempt to do the same, but the OS manufacturers will have a step up.

illwrks2mo ago

I installed Termux on an old Android phone last week (running LineageOS), and then using Termux installed Ollama and a small model. It ran terribly, but it did run.

Aachen2mo ago

Somehow this reminds me of the time I downloaded, compiled, and ran a Bitcoin miner with the app called Linux Deploy on my then-new Galaxy Note (the thing called phablet that is now positively small). It ran terribly, but it did run!

Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)

illwrks2mo ago

Yes, computer in your pocket indeed! I think the Apple Neo shows just how powerful/capable the mobile chips are getting for computer use.

ActorNightly2mo ago

Don't waste time trying to run models locally.

Instead, take advantage of Termux's power, namely the fact that you can install things like Openclaw or Gemini-cli. Google AI Plus or Pro plans are actually really good value, considering they bundle it with storage.

https://www.mobile-hacker.com/2025/07/09/how-to-install-gemi...

There is also Termux:GUI with bindings for several languages, which you can use to vibecode your own GUI app that can then serve as an interface to an agent, and a Termux API which lets you interface with the phone, including USB devices.

Furthermore, Termux has the cloudflared package available, which lets you use free cloudflared SSH tunnels (as long as you have a domain name).

All put together, you can do some pretty cool things.

mkagenius2mo ago

Fwiw, my Pixel 8 runs Qwen3.5 4B at 2 tok/s via the PocketPal app. Somehow the Cactus app didn't work.

PinkMilkshake2mo ago

"That is a profound observation, and you are absolutely right..."

With all the money you will save on subscription fees you should be able to afford treatment for your psychosis!

pshc2mo ago

Even though it's quantized-to-hell Mixture of Experts, honestly, it's crazy this model can run semi-coherently on a phone.

redwood2mo ago

It will be funny if we go back to lugging around brick-size batteries with us everywhere!

wiether2mo ago

A backpack full of batteries!

https://www.youtube.com/watch?v=MI69LUXWiBc

gizajob2mo ago

Seeing as we have the power in our pockets we may as well utilise it. To…type…expert answers… very slowly.

wayeq2mo ago

might be worth it to keep Sam Altman from reading our AI generated fanfic

pokstad2mo ago

Backpack computers!

groby_b2mo ago

For small values of "running".

Don't get me wrong, it's an awesome achievement, but 0.6 tokens/s at presumably fairly heavy compute (and battery), on a mobile device? There aren't too many use cases for that :)

_air2mo ago

This is awesome! How far away are we from a model of this capability level running at 100 t/s? It's unclear to me if we'll see it from miniaturization first or from hardware gains

Tade02mo ago

Only way to have hardware reach this sort of efficiency is to embed the model in hardware.

This exists[0], but the chip in question is physically large and won't fit on a phone.

[0] https://www.anuragk.com/blog/posts/Taalas.html

tclancy2mo ago

I think you're ignoring the inevitable march of progress. Phones will get big enough to hold it soon.

intrasight2mo ago

I think for many reasons this will become the dominant paradigm for end user devices.

Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.

Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.

ottah2mo ago

That's actually pretty cool, but I'd hate to freeze a model's weights into silicon without having an incredibly specific and broad use case.

originalvichy2mo ago

On smartphones? It’s not worth it to run a model this size on a device like this. A smaller fine-tuned model for specific use cases is not only faster, but possibly more accurate when tuned to specific use cases. All those gigs of unnecessary knowledge are useless to perform tasks usually done on smartphones.

ottah2mo ago

Probably 15 to 20 years, if ever. This phone is only running this model in the technical sense of running, but not in a practical sense. Ignore the 0.4 tk/s, that's nothing. What really makes this example bullshit is the fact that there is no way the phone has enough RAM to hold any reasonable amount of context for that model. Context requirements are not insignificant, and as the context grows, the speed of the output will be even slower.

Realistically you need +300GB/s fast access memory to the accelerator, with enough memory to fully hold at least greater than 4bit quants. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
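
A rough way to see why bandwidth matters so much is the usual memory-bound estimate for decoding: each generated token has to read every active weight once. The sketch below assumes exactly that and nothing else (no compute limits, no KV cache traffic, no reuse of weights already in RAM), so it gives an optimistic upper bound. The 17B active parameters come from the model name and 300 GB/s from the comment above; 4-bit weights are assumed.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             active_params_billion: float,
                             bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


def required_bandwidth_gb_s(target_tokens_per_s: float,
                            active_params_billion: float,
                            bits_per_weight: float) -> float:
    """Bandwidth needed to sustain a target decode rate under the same model."""
    return target_tokens_per_s * active_params_billion * bits_per_weight / 8


print(f"{decode_tokens_per_second(300, 17, 4):.1f} tok/s at 300 GB/s")
print(f"{required_bandwidth_gb_s(100, 17, 4):.0f} GB/s needed for 100 tok/s")
```

Under these assumptions, 300 GB/s supports a few dozen tokens per second for the active set, while anything that has to come from much slower storage on every token drags the rate down proportionally.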

The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up for that task, and I doubt anyone's really going to want to build silicon for it.

alwillis2mo ago

> Realistically you need +300GB/s fast access memory to the accelerator, with enough memory to fully hold at least greater than 4bit quants.

The latest M5 MacBook Pros start at 307 GB/s memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.

The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.

alpineman2mo ago

Agree with the first part - but I can run GPT OSS 20b, a highly capable model, on my laptop with 32GB of RAM at speeds that for all practical intents are as fast as GPT-5.4 and good enough for 90%+ of non-technical use cases.

As such I can't agree with "The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough" - we are much closer than 15/20 years to get these on a phone

zozbot2342mo ago

KV-cache is still quite small compared to the weights. It can stay in memory for reasonable context length, or be streamed to storage as a last resort. This actually doesn't impact performance too much, since we were already limited by having to stream in the much larger weights.

smlacy2mo ago

This should be the top comment

svachalek2mo ago

A long time. But check out Apollo from Liquid AI, the LFM2 models run pretty fast on a phone and are surprisingly capable. Not as a knowledge database but to help process search results, solve math problems, stuff like that.

root_axis2mo ago

It will never be possible on a smart phone. I know that sounds cynical, but there's basically no path to making this possible from an engineering perspective.

NetMageSCW2mo ago

No one needs more than 640K!

bushbaba2mo ago

This comment will age well.

iooi2mo ago

Is 100 t/s the standard for models?

lofaszvanitt2mo ago

I miss the old days when words appeared one by one, just like images line by line in the old modem days.

system22mo ago

Innocent times. Also, not too innocent because there was no restriction on anything.

causal2mo ago

Run an incredible 400B parameters on a handheld device.

0.6 t/s, wait 30 seconds to see what these billions of calculations get us:

"That is a profound observation, and you are absolutely right ..."

intrasight2mo ago

Better than waiting 7.5 million years to have a computer tell you the answer is 42.

bartread2mo ago

Looked at a certain way it's incredible that a 40-odd year old comedy sci-fi series is so accurate about the expected quality of (at least some) AI output.

Which makes it even funnier.

It makes me a little sad that Douglas Adams didn't live to see it.

whyenot2mo ago

Should have used a better platform. So long and thanks for all the fish.

thinkingtoilet2mo ago

Maybe you should have asked a better question. :P

AnonymousPlanet2mo ago

Yes and then no one knows the prompt!

sgustard2mo ago

326 comments

cmiles82mo ago

To the extent that the present LLM movement reaches a steady state conclusion it’s highly likely to be open source models on your own hardware that are “good enough” for 95% of use cases.

That blows up the whole “industrial complex” being developed around massive data centers, proprietary models, and everything that goes with that. Complete implosion.

draxil2mo ago

Yizahi2mo ago

We really can't have open source LLM, because they are all based on the stolen IP, or stolen IP slightly laundered and under different title.

2 more replies

mike_hearn2mo ago

Nemotron is genuinely open source at least at the smaller sizes. You can download the datasets.

1 more reply

cmiles82mo ago

Fair clarification, yes.

seism2mo ago

Check out Apertus, the publicly funded model from a research team that goes to great lengths to remove icky content.

mr_toad2mo ago

Still need massive amounts of compute for training. Nobody is going to be training 400B models on a phone any time soon.

cmiles82mo ago

Likely not.

2 more replies

anonyfox2mo ago

noemit2mo ago

Even if it runs, this will run slowly, and heat up.

I think local will always have a place, but the infrastructure is going to be used in my humble opinion.

cmiles82mo ago

Today yes, but between the improved performance of smaller on device models and the hardware itself getting better this issue is short lived.

plussed_reader2mo ago

I don't want to put information into a black box of mystery that can then be used for other monetization purposes. I am still waiting for a realistic local solution.

1 more reply

throwaway1737382mo ago

Compute evolved from batch systems with time sharing to responsive systems in your pocket. Why wouldn’t that happen here?

gervwyk2mo ago

there was a time when mainframes was the main thing.. we’ll look back and say data centers was a thing.. (hopefully if we lucky)

efsavage2mo ago

> “good enough” for 95% of use cases

Maybe, for current use cases. I'd argue that anyone who thinks they can do everything a 10kW server can do on their 10W device just isn't being creative enough :)

hiddencost2mo ago

Consumer market is small compared to headcount reduction and cutting edge science.

firstbabylonian2mo ago

> SSD streaming to GPU

Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

1: https://arxiv.org/abs/2312.11514

simonw2mo ago

Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

anemllOP2mo ago

1 more reply

superjan2mo ago

That was a very good summary. One detail the post could use is mentioning that 4 or 10 experts invoked where selected from the 512 experts the model has per layer (to give an idea of the savings).

trebligdivad2mo ago

I guess this is all set up to show off the new high-bandwidth-flash stuff that's due out soon?

zozbot2342mo ago

Aurornis2mo ago

> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.

This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.

1 more reply

simonw2mo ago

Yeah, this new post is a continuation of that work.

foobiekr2mo ago

This is not entirely dissimilar to what Cerebus does with their weights streaming.

manmal2mo ago

And IIRC the Unreal Engine Matrix demo for PS5 was streaming textures directly from SSD to the engine as well?

CrzyLngPwd2mo ago

I had a dream that everyone had super intelligent AIs in their pockets, and yet all they did was doomscroll and catfish...shortly before everything was destroyed.

iLemming2mo ago

The Anthropic logo is just Kurt Vonnegut’s drawing of an asshole:

https://scienceleadership.org/thumbnail/34729/1920x1920

Just in case if someone still didn't realize - we do live in Idiocracy

https://www.youtube.com/watch?v=gGlJgU9x8tM

DiscourseFan2mo ago

dudefeliciano2mo ago

there is a case to be made that all AI company logos are just drawings of assholes: https://velvetshark.com/ai-company-logos-that-look-like-butt...

parineum2mo ago

Their logo appears to be A\ ?

SecretDreams2mo ago

A modern Nostradamus?

CrzyLngPwd2mo ago

It was just a dream, which quickly turned into a nightmare.

wiseowise2mo ago

You know, Quasimodo predicted all of this.

t0lo2mo ago

I don't think that's a dream for much longer. Look at the fact that we selected TikTok as the most popular social media app.

andix2mo ago

My iPad Air with M2 can run local LLMs rather well. But it gets ridiculously hot within seconds and starts throttling.

HPsquared2mo ago

I wonder if anyone has made a liquid cooling system for iPads / phones. Like, a sealed thing that attaches to the back of the device and circulates cooling water directly against the back surface.

jml7c52mo ago

A more whimsical method is to put the thing in a glass of water with the cord sticking out. :-)

https://www.reddit.com/r/EmulationOnAndroid/comments/1m269k0...

whamlastxmas2mo ago

ThatMedicIsASpy2mo ago

You can buy a liquid cooled tablet.

https://onexplayerstore.com/products/onexplayer-super-x?vari...

cercatrova2mo ago

You're in luck. Lots of phone manufacturers also implement liquid cooling inside the phone too.

https://www.notebookcheck.net/Xiaomi-launches-new-mobile-wat...

aidenn02mo ago

What about submerging it in mineral oil?

Schiendelman2mo ago

I think the vapor chamber cooling Apple's starting to use is something like that, no?

ActorNightly2mo ago

Yeah, let's add more cost and complexity in a cooling system so instead of 1 token per second we get 2 tokens per second, all for the price of one graphics card that can do 50+ tokens a second.

Apple fans never cease to amaze me.

yencabulator2mo ago

Qwen3.5-397B-A17B behaves more like a 17B parameter model. Omitting the MoE part from the headline makes it a lie and stupid hype.

Quantizing is also a cheat code that makes the numbers lie; next up, someone is going to claim to be running a large model when they're running a 1-bit quantization of it.

BoorishBears2mo ago

It behaves more like a ~80B parameter model (geometric mean of active and total params), and has world knowledge closer to a 400B parameter model
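
As a quick check of that rule of thumb, here is a small sketch. It assumes the 17B active / 397B total split implied by the model name Qwen3.5-397B-A17B; the geometric mean is the commenter's heuristic, not an established law, and the function name is illustrative.

```python
import math


def geometric_mean_params(active_b: float, total_b: float) -> float:
    """Geometric mean of active and total parameter counts, in billions."""
    return math.sqrt(active_b * total_b)


# 17B active, 397B total, taken from the model name.
effective_b = geometric_mean_params(17, 397)
print(f"Heuristic dense-equivalent size: ~{effective_b:.0f}B parameters")
```

This comes out slightly above 80B, consistent with the "~80B" figure quoted above.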

yencabulator2mo ago

I think we're using different meanings of "behaves like". I meant "has tokens/sec performance comparable to".

EruditeCoder1082mo ago

adam_patarino2mo ago

I’ve seen this story making the rounds and I’m not sure why it’s gotten so much traction. Is it just a good write-up?

bkfh2mo ago

Thanks, bot.

classified2mo ago

Wouldn't a bot write better English? Or are they optimized to produce bad grammar already?

cj002mo ago

It’s 400B but it’s mixture of experts so how many are active at any time?

simonw2mo ago

Looks like it's Qwen3.5-397B-A17B so 17B active. https://github.com/Anemll/flash-moe/tree/iOS-App

thecopy2mo ago

Stupid question: can I run this on my 64GB/1TB Mac somehow easily? Or does this require custom coding? 4-bit is ~200GB

EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
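
The ~200GB figure can be sanity-checked with a back-of-the-envelope sketch. It assumes exactly 4 bits per weight and ignores quantization scales and other metadata, so real files would be somewhat larger; the function name is illustrative.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


full_gb = weight_size_gb(397, 4)   # every expert, as stored on disk
active_gb = weight_size_gb(17, 4)  # weights actually used for one token

print(f"All weights at 4-bit:    ~{full_gb:.1f} GB")
print(f"Active weights at 4-bit: ~{active_gb:.1f} GB")
```

Under these assumptions the full model is far larger than 64GB of RAM but fits on a 1TB SSD, while the per-token active set is under 10GB. That gap is why the thread keeps coming back to streaming weights from storage.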

stingraycharles2mo ago

One expert is 17B, but more than one expert can be active at any time. I believe it’s actually more like 80B active.

Hasslequest2mo ago

Still pretty good considering 17B is what one would run on a 16GB laptop at Q6 with reasonable headroom

anshumankmr2mo ago

Aren't most companies doing MoE at this point?

lainproliant2mo ago

This reminds me of how excited people were to get models running locally when llama.c first hit.

russellbeattie2mo ago

I have some macro opinions about Apple - not sure if I'm correct, but tell me what you think.

But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.

Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.

mlsu2mo ago

Models on the phone are never going to make sense.

On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.

madwolf2mo ago

russellbeattie2mo ago

zozbot2342mo ago

RAM is just too expensive. We need to bring back non-DRAM persistent memory that doesn't have the wearout issues of NAND.

anemllOP2mo ago

multiple NAND, and apple already used it in Mac Studio. Plus better cooling

ecshafer2mo ago

alwillis2mo ago

Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.

big_toast2mo ago

Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.

GTP2mo ago

> nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM

Why do you say they can't do this?

ottah2mo ago

bigyabai2mo ago

alnah2mo ago

latexr2mo ago

> Will people really use an LLM in their iPhone while they can use LLM infrastructure with bigger models for complex tasks?

If the alternative is paying a subscription and/or being fed ads, people will try the local private ones first.

Schiendelman2mo ago

I expect OpenAI, Anthropic, and other companies will attempt to do the same, but the OS manufacturers will have a step up.

illwrks2mo ago

I installed Termux on an old Android phone last week (running LineageOS), and then using Termux installed Ollama and a small model. It ran terribly, but it did run.

Aachen2mo ago

illwrks2mo ago

Yes, computer in your pocket indeed! I think the Apple Neo shows just how powerful/capable the mobile chips are getting for computer use.

ActorNightly2mo ago

Don't waste time trying to run models locally.

https://www.mobile-hacker.com/2025/07/09/how-to-install-gemi...

Furthermore, Termux has the cloudflared package available, which lets you use cloudflared free SSH tunnels (as long as you have a domain name).

All put together, you can do some pretty cool things.

mkagenius2mo ago

Fwiw, my pixel 8 runs Qwen3.5 4B with 2 tok/s speed. Via pocketpal app. Somehow cactus app didn't work.

PinkMilkshake2mo ago

"That is a profound observation, and you are absolutely right..."

With all the money you will save on subscription fees you should be able to afford treatment for your psychosis!

pshc2mo ago

Even though it's a quantized-to-hell Mixture of Experts, honestly, it's crazy this model can run semi-coherently on a phone.

redwood2mo ago

It will be funny if we go back to lugging around brick-size batteries with us everywhere!

wiether2mo ago

A backpack full of batteries!

https://www.youtube.com/watch?v=MI69LUXWiBc

gizajob2mo ago

Seeing as we have the power in our pockets we may as well utilise it. To…type…expert answers… very slowly.

wayeq2mo ago

might be worth it to keep Sam Altman from reading our AI generated fanfic

pokstad2mo ago

Backpack computers!

groby_b2mo ago

For small values of "running".

Don't get me wrong, it's an awesome achievement, but 0.6 tokens/s at presumably fairly heavy compute (and battery) cost, on a mobile device? There aren't too many use cases for that :)

_air2mo ago

This is awesome! How far away are we from a model of this capability level running at 100 t/s? It's unclear to me if we'll see it from miniaturization first or from hardware gains

Tade02mo ago

Only way to have hardware reach this sort of efficiency is to embed the model in hardware.

This exists[0], but the chip in question is physically large and won't fit on a phone.

[0] https://www.anuragk.com/blog/posts/Taalas.html

tclancy2mo ago

I think you're ignoring the inevitable march of progress. Phones will get big enough to hold it soon.

intrasight2mo ago

I think for many reasons this will become the dominant paradigm for end user devices.

Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.

Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.

ottah2mo ago

That's actually pretty cool, but I'd hate to freeze a model's weights into silicon without having an incredibly specific and broad use case.

originalvichy2mo ago

ottah2mo ago

alwillis2mo ago

> Realistically you need +300GB/s fast access memory to the accelerator, with enough memory to fully hold at least greater than 4bit quants.

The latest M5 MacBook Pros start at 307 GB/s memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.
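
To see what those bandwidth figures could mean for decoding speed, here is a rough sketch. It assumes decoding is purely memory-bandwidth-bound, that every active weight is read once per token from unified memory, and that the model is the 17B-active, 4-bit configuration discussed elsewhere in the thread. It ignores KV-cache traffic, compute limits, and any SSD streaming, so treat the results as optimistic ceilings. The function name is illustrative.

```python
def max_decode_tokens_per_s(
    bandwidth_gb_s: float,
    active_params_billion: float,
    bits_per_weight: float,
) -> float:
    """Upper bound on tokens/s if each token requires reading all active weights once."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Bandwidth figures quoted in the comment above.
for bandwidth in (307, 460, 614):
    ceiling = max_decode_tokens_per_s(bandwidth, active_params_billion=17, bits_per_weight=4)
    print(f"{bandwidth} GB/s -> at most ~{ceiling:.0f} tokens/s")
```

Under these assumptions the ceilings land in the tens of tokens per second, but only if the active weights are already resident in memory rather than streamed from flash.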

alpineman2mo ago

zozbot2342mo ago

smlacy2mo ago

This should be the top comment

svachalek2mo ago

root_axis2mo ago

It will never be possible on a smart phone. I know that sounds cynical, but there's basically no path to making this possible from an engineering perspective.

NetMageSCW2mo ago

No one needs more than 640K!

bushbaba2mo ago

This comment will age well.

iooi2mo ago

Is 100 t/s the standard for models?

lofaszvanitt2mo ago

I miss the old days when words appeared one by one, just like images loading line by line in the old modem days.

system22mo ago

Innocent times. Also, not too innocent because there was no restriction on anything.

causal2mo ago

Run an incredible 400B parameters on a handheld device.

0.6 t/s, wait 30 seconds to see what these billions of calculations get us:

"That is a profound observation, and you are absolutely right ..."
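
The arithmetic behind that wait can be sketched as follows. The 0.6 t/s and 30-second figures come from the comment above and 100 t/s from an earlier comment; the 500-token answer length is a made-up example, not a measurement. Function names are illustrative.

```python
def tokens_generated(tokens_per_s: float, seconds: float) -> float:
    """Number of tokens produced in a given time at a constant decode rate."""
    return tokens_per_s * seconds


def seconds_for(tokens: float, tokens_per_s: float) -> float:
    """Time needed to produce a given number of tokens at a constant decode rate."""
    return tokens / tokens_per_s


print(f"30 s at 0.6 t/s yields ~{tokens_generated(0.6, 30):.0f} tokens")

# Hypothetical 500-token answer at the demo rate versus 100 t/s.
for rate in (0.6, 100):
    print(f"500 tokens at {rate} t/s takes ~{seconds_for(500, rate):.0f} s")
```

At the demo rate, a few hundred tokens takes on the order of a quarter of an hour, which is the gap the 100 t/s question above is getting at.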

intrasight2mo ago

Better than waiting 7.5 million years to have a computer tell you the answer is 42.

bartread2mo ago

Looked at a certain way it's incredible that a 40-odd year old comedy sci-fi series is so accurate about the expected quality of (at least some) AI output.

Which makes it even funnier.

It makes me a little sad that Douglas Adams didn't live to see it.

whyenot2mo ago

Should have used a better platform. So long and thanks for all the fish.

thinkingtoilet2mo ago

Maybe you should have asked a better question. :P

AnonymousPlanet2mo ago

Yes and then no one knows the prompt!

sgustard2mo ago