I don't think local LLMs will ever be a thing except for very specific use cases.
Servers will always have way more compute power than edge nodes. As server power increases, people will expect more and more of LLMs, and edge node compute will stay irrelevant since its relative power will stay the same.
Mobile applications are also relevant. An LLM in your car could be used for local intelligence. I'm pretty sure self-driving cars use some amount of local AI already (although obviously not LLMs, and I don't really know how much of their processing is local vs. done on a server somewhere).
If models stop advancing at a fast clip, hardware will eventually become fast and cheap enough that we stop thinking of running models locally as a nonsensical luxury, in the same way that we don't think of rendering graphics locally as a luxury even though remote rendering is possible.
Even over LTE you're looking at under 120ms coast to coast.
This doesn't seem right to me.
You take all the memory and CPU cycles of all the clients connected to a typical online service, compared to the memory and CPU in the datacenter serving it? The vast majority of compute involved in delivering that experience is on the client. And there's probably vast amounts of untapped compute available on that client - most websites only peg the client CPU by accident because they triggered an infinite loop in an ad bidding war; imagine what they could do if they actually used that compute power on purpose.
But even doing fairly trivial stuff, a typical browser tab is using hundreds of megs of memory and an appreciable percentage of the CPU of the machine it's loaded on, for the duration of the time it's being interacted with. Meanwhile, serving that content out to the browser took milliseconds, and was done at the same time as the server was handling thousands of other requests.
Edge compute scales with the number of users of your service: each of them brings along their own hardware. Server compute has to scale at your expense.
Now, LLMs bring their special needs - large models that need to be loaded into vast fast memory... there are reasons to bring the compute to the model. But it's definitely not trivially the case that there's more compute in servers than clients.
A single datacenter machine with state of the art GPUs serving LLM inference can be drawing in the tens of kilowatts, and you borrow a sizable portion for a moment when you run a prompt on the heavier models.
A phone that has to count individual watts, or a laptop that peaks at a double-digit sustained draw, isn't remotely comparable, and the gap isn't one or two hardware features.
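To put rough numbers on that gap, here's a quick sketch. All three wattages are my own illustrative picks to match the wording above (low end of "tens of kilowatts", a handful of watts for a phone, mid double digits for a laptop), not measurements of any real hardware.

```python
# Illustrative power budgets. All three values are ASSUMPTIONS chosen to match
# the wording of the comment, not measurements.
SERVER_WATTS = 10_000  # low end of "tens of kilowatts"
PHONE_WATTS = 5        # a phone that has to "count individual watts"
LAPTOP_WATTS = 50      # a laptop at a double-digit sustained draw


def power_ratio(server_watts: float, device_watts: float) -> float:
    """How many times more power the server can draw than the device."""
    return server_watts / device_watts


phone_gap = power_ratio(SERVER_WATTS, PHONE_WATTS)
laptop_gap = power_ratio(SERVER_WATTS, LAPTOP_WATTS)

print(f"server vs phone:  {phone_gap:.0f}x")
print(f"server vs laptop: {laptop_gap:.0f}x")
```

This only compares peak draw. A server is shared across many requests, so the slice you actually borrow for one prompt is smaller than the full ratio suggests.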
Even with token consumption increasing as AI abilities increase, there will be a point where AI output is good enough for most people.
Granted, people are very willing to hand over their data and often money to rent a software licence from the big players, but if they're all charging subscription fees where a local LLM costs nothing, that might cause a few sleepless nights for a few execs.
I use Read Aloud across a few browser platforms because sometimes I don't care to read an article I only have some passing interest in.
The landscape is a mess:
It's not really bandwidth-efficient to transmit, for one; local frameworks like Piper perform well in a lot of cases; there are paid APIs from the big players; at least one player has built API-powered neural TTS into their browser (presumably ad-supported or something); and yet another has already built it into their OS (though it defaults to Speak & Spell quality, god knows why). After experimenting, though, I'm not willing to pay $0.20 per page, especially when the free/private solution is good enough.
The problem is people's expectations: they want the model to be smart.
People don't have a problem with whether it's local or not; they want the model to be useful.
But cloud models will see diminishing returns, local hardware will get drastically faster, and techniques for running inference efficiently will be worked out further. At some point, local LLMs will have their day.
The same thing is happening in the software and game industries.
Because the free market forces people to raise the bar every year, the requirements of apps and games are never met; they only go up.
Humans will never be satisfied; the boundary will keep being pushed further.
That's why we have 12GB or 16GB of RAM in smartphones right now, just for the system + apps.
And now we must accommodate a local LLM too??? It will only go up; people will demand smarter and smarter models.
Today's frontier models will be deemed unusable (dumb) in 5 years.
Example: people were literally screaming in agony when Anthropic quantized their model.
> Deepseek-r1 was loaded and ran locally on the Mac Studio
> M3 Ultra chip [...] 32-core CPU, an 80-core GPU, and the 32-core Neural Engine. [...] 512GB of unified memory, [...] memory bandwidth of 819GB/s.
> Deepseek-r1 was loaded [...] 671-billion-parameter model requiring [...] a bit less than 450 gigabytes of [unified] RAM to function.
> the Mac Studio was able to churn through queries at approximately 17 to 18 tokens per second
> it was observed as requiring 160 to 180 Watts during use
Considering getting this model. Looking into the future, a Mac Studio M5 Ultra should be something special.
[0] https://appleinsider.com/articles/25/03/18/heavily-upgraded-...
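For anyone who wants to sanity-check those quoted figures, here's a back-of-envelope sketch. The throughput, wattage, RAM, and bandwidth numbers come from the quotes above; the 37B active-parameters figure is my own assumption (DeepSeek-R1 is a mixture-of-experts model and, as far as I know, activates roughly that many parameters per token), and the bandwidth ceiling is a crude estimate that ignores KV cache reads and other overhead.

```python
# Back-of-envelope figures derived from the quoted Mac Studio numbers.
PARAMS = 671e9                 # total parameters (quoted)
MODEL_BYTES = 450e9            # "a bit less than 450 gigabytes", used as an upper bound
TOKENS_PER_SEC = (17.0, 18.0)  # quoted throughput range
WATTS = (160.0, 180.0)         # quoted power draw range
BANDWIDTH = 819e9              # memory bandwidth in bytes/s (quoted)

# ASSUMPTION (not in the quoted article): roughly 37B parameters are active
# per token because the model is a mixture of experts.
ACTIVE_PARAMS = 37e9


def bits_per_param(model_bytes: float, params: float) -> float:
    """Average storage per parameter implied by the memory footprint."""
    return model_bytes * 8 / params


def joules_per_token(watts: float, tokens_per_sec: float) -> float:
    """Energy per generated token: watts are joules per second."""
    return watts / tokens_per_sec


def bandwidth_bound_tokens_per_sec(
    bandwidth: float, active_params: float, bytes_per_param: float
) -> float:
    """Rough ceiling if every active weight is read from memory once per token."""
    return bandwidth / (active_params * bytes_per_param)


bpp = bits_per_param(MODEL_BYTES, PARAMS)
best = joules_per_token(WATTS[0], TOKENS_PER_SEC[1])   # lowest draw, fastest rate
worst = joules_per_token(WATTS[1], TOKENS_PER_SEC[0])  # highest draw, slowest rate
ceiling = bandwidth_bound_tokens_per_sec(BANDWIDTH, ACTIVE_PARAMS, bpp / 8)

print(f"~{bpp:.1f} bits per parameter")
print(f"~{best:.1f} to {worst:.1f} joules per token")
print(f"~{ceiling:.0f} tokens/s rough bandwidth ceiling")
```

Under those assumptions it works out to a bit over 5 bits per parameter, roughly 9 to 11 joules per token, and a ceiling somewhere around 33 tokens/s, so the observed 17 to 18 tokens/s is in a plausible range.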
Society is already giving pushback to AI being pushed on them everywhere; see the rise of the word "clanker". We're seeing mental health issues pop up. We're all tired of AI slop content and engagement bait. Even the developers like us discussing it at the bleeding edge go round in circles with the same talking points reflexively. I don't see it as a given that there's public demand for even more AI, "if only it were more powerful on a server".
Speaking to your PC gaming analogy, there are render farms for graphics - they're just used for CGI and non-realtime use cases. What there isn't a huge demand for is consumer-grade hardware at datacenter prices. Apple found this out the hard way shipping Xserve prematurely.
Apple's privacy stance is to do as much as possible on the user's device and as little as possible in the cloud. They have iCloud for storage to make inter-device sync easy, but even that is painful for them. They hate cloud. This is the direction they've had for some years now. It always makes me smile that so many commentators just can't understand it and insist that they're "so far behind" on AI.
All the recent academic literature suggests that LLM capability is beginning to plateau, and we don't have ideas on what to do next (and no, we can't ask the LLMs).
As you get more capable SLMs or LLMs, and the hardware gets better and better (who _really_ wants to be long on NVIDIA or Intel right now? Hmm?), people are going to find that they're "good enough" for a range of tasks, and Apple's customer demographic is going to be happy that it's all happening on the device in their hand and not on a server [waves hands] "somewhere", in the cloud.
Large issues: tokenizers exist, reasoning models are still next-token-prediction instead of having "internal thoughts", RL post-training destroys model calibration
Small issues: they're all trained to write Python instead of a good language, most of the benchmarks are bad, pretraining doesn't use document metadata (i.e. they have to learn from each document without being told the URL or that they're written by different people)
The Android crowd has been able to run LLMs on-device since llama.cpp first came out. But the magic is in the integration with the OS. As usual there will be hype around Apple, idk, inventing the very concept of LLMs or something. But the truth is neither Apple nor Android did this; only the wee team that wrote the "Attention Is All You Need" paper + the many open source/hobbyist contributors inventing creative solutions like LoRA and creating natural ecosystems for them.
That's why I find this memo so cool (and will once again repost the link): https://semianalysis.com/2023/05/04/google-we-have-no-moat-a...
Never could figure out what the heck the value proposition was supposed to be though. Pay full price for a game that you can't even pretend you own? I don't think so. And the game conservation implications were also dire, so I'm not sad it went away in the end.
But on technical merits? It worked great.
Is it 'Local'? 'Large'? ... 'Language'?
From Apple's point of view a local model would be the cheapest possible to run, as the end-user pays for hardware plus consumption...
I disagree.
There's a lot of interest in local LLMs in the LLM community. My internet was down for a few days and did I wish I had a local LLM on my laptop!
There's a big push for privacy; people are using LLMs for personal medical issues for example and don't want that going into the cloud.
Is it necessary to talk to a server just to check out a letter I wrote?
Obviously with Apple's release of iOS 26 and macOS 26 and the rest of their operating systems, tens of millions of devices are getting a local LLM with 3rd party apps that can take advantage of them.
I'm running Qwen 30B code on my Framework laptop to ask questions about Ruby vs. Python syntax because I can, and because the internet was flaky.
At some point, more doesn't mean I need it. LLMs will certainly get "good enough" and they'll be lower latency, no subscription, and no internet required.