I consider Gemma 4 31B (dense, no MoE) the new baseline for local models. It's obviously worse than the frontier models, but it feels less like a science experiment than any previous local model I’ve run, including GPT OSS 120B and Nemotron Super 120B.
On my M5 Max with 128 GB of RAM and the full 256K context window, I see RAM use spike to about 70 GB, with something like 14 GB of system overhead. A 64 GB Panther Lake machine with the full Arc B390, or a 48 GB Snapdragon X2 Elite machine, could probably run it with a 128K to 256K context window. Maybe you can squeeze it into 32GB (27.5GB usable) with a 32K context window?
Even last year, seeing this kinda performance on a mainstream-ish/plus configuration would have seemed like a pipe dream.
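A back-of-the-envelope way to sanity-check memory figures like these: weights take roughly parameters times bits per weight, and a full-attention KV cache grows linearly with context length. The layer count, KV head count, and head dimension below are hypothetical placeholders, not Gemma 4's published architecture, and models that use sliding-window or otherwise compressed attention will need less than this estimate.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Upper-bound KV cache size in GiB, assuming full attention in every layer.

    The leading factor of 2 accounts for storing both keys and values.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30


# Hypothetical architecture numbers, chosen only to illustrate the arithmetic.
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

print(f"weights at 8 bits/weight: {weights_gib(31, 8):.1f} GiB")
print(f"KV cache at 256K context, 16-bit values: "
      f"{kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, 256 * 1024):.1f} GiB")
```

Halving `bytes_per_value` (an 8-bit KV cache) or halving the context halves the cache term, which is why the context window is the first thing to shrink on smaller machines.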
https://thot-experiment.github.io/gradient-gemma4-31b/
This is a relatively complex piece of tooling built entirely by Gemma 4 inside OpenCode where I manually intervened maybe only 4 times over the course of a few hours.
running Q6_K_XL, 128k context @ q8 ~ 800tok/s read 16tok/sec write
eagerly awaiting turboquant and MTP in llama.cpp, should take me to 256k and 25-30tok/s if the rumors are true
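To put those throughput numbers in wall-clock terms, here is a minimal sketch that splits a turn into prefill (read) and generation (write). The 800 and 16 tok/s defaults are the figures quoted above; the example prompt and output sizes are made up, and the estimate ignores any prefix caching the runtime may do between turns.

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 read_tok_per_s: float = 800.0,
                 write_tok_per_s: float = 16.0) -> float:
    """Estimate seconds for one turn: prefill time plus generation time."""
    prefill = prompt_tokens / read_tok_per_s
    generation = output_tokens / write_tok_per_s
    return prefill + generation


# Illustrative turn: a 32K-token prompt followed by a 500-token reply.
print(turn_seconds(32_000, 500))  # 40 s prefill + 31.25 s generation = 71.25
```

At these rates a short reply costs about as much time as reading a large context, so faster generation (the hoped-for 25-30 tok/s) mostly helps on long outputs.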
I went to the store to buy mixers and while I was out Gemma 4 31b got pretty far along with reverse engineering the bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics and had made a dump of the bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the bluetooth characteristics and it got into an infinite loop. (local models aren't perfect and i never said they were) I turned on the websearch tool and told it to "pick up the project where it left off", it read the directory, did a couple googles and had a working script to print temperature, humidity and battery state in like 3 turns. Reading back through its chain of thought I'm pretty sure it would have been able to get it eventually without googling.
idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.
And we progress on so many different frontiers in parallel: Agent harness, Agent model, hardware etc.
Edit: For comparison with the other poster, same setup as above, but with Gemma 4 31B Instruct 8bit MLX (not sure if exactly the same model): time to first token 4.62s, 7.20 tokens/s; with a different prompt, 1.17s and 7.24 tokens/s.
Same laptop, and my contrived test was having it fix 50 or so lint errors in a small vibe-coded C++ repo. I wanted it to be able to handle a bunch of small tasks without getting stuck too often.
GPT OSS 20B was usable but slow, and actually frequently made mistakes like adding or duplicating statements unnecessarily, listing things as fixed without editing the code, and so on.
Qwen 3.5 9B with Opencode was much faster and actually able to work through a majority of the lint warnings without getting stuck, even through compaction, and every warning it fixed got a correct edit.
I tried 4bit MLX quants of Qwen 3.5 9B but it eventually would crash due to insufficient memory. I switched to GGUF, which I run with llama.cpp, and it runs without crashing.
It is absolutely not comparable to frontier models. It’s way slower and gets basic info wrong and really can’t handle non trivial tasks in one go. I asked it for an architecture summary of the project and it claimed use of a library that isn’t present anywhere in the repo. So YMMV, but it’s still nice to have and hopefully the local LLM story can get much better on modest hardware over time.
This is not said often enough.
Yes, local LLMs are great! But reading most HN posts on the subject, you'd think they're within reach of Opus 4.7.
There is a very small, very vocal, very passionate crowd that dramatically overstates the capabilities of local LLMs on HN.
That all being said, I've spent hundreds (maybe thousands?) of hours on this stuff over the past few years so I don't see a lot of the rough edges. I really believe the capability is there: Gemma 4 31B is a useful agent for all sorts of stuff, and anything you can reasonably expect an LLM to oneshot, Qwen 3.6 35b MoE will handle at like 90tok/sec, absolutely fantastic for tasks that don't require a huge amount of precision.
I find them useful in basic research, learning, and question-asking tasks. Although at the same time, reading a Wikipedia page or doing a few Google searches could likely accomplish the same, and has been able to for decades.
I have seen way too many people who are overly optimistic about local LLMs.
Having spent a decent amount of time playing with them on consumer nvidia GPUs, I understand well that they are not going to be widely usable any time soon. Unfortunately not many people share that view.
When were you trying local models? The model releases from April 2026 are a serious change in performance.
Relatively speaking local models might always be behind the curve compared to frontier ones. You can tell by the hardware needed to run each. But in absolute terms they're already past the performance threshold everyone praised in the past.
Right now in a lab somewhere there's a model that's probably better than anything else. There's a ChatGPT 5.6, an Opus 4.8. Knowing that do you suddenly feel a wave of disappointment at the current frontier models?
A local model is as good as a frontier model at responding in a Signal thread with you, which requires basic tool calling.
A local model is as good as a frontier model at writing a joke.
A local model is as good as a frontier model at responding to an email.
Not sure what needs to be said often enough; no one without a clue would play around with a local model setup and completely ignore frontier models and their capabilities?!
You can get good local LLM performance through agents, but you need fast inference. Generally, 2x 3090s or at minimum 2x 3080s (you need two to speed up prefill processing to build the KV cache). Ironically, you just need to be good at prompt engineering, which has a lot of real-world analogues in managing low-skilled people to complete tasks.
Edit: TIL it is MoE and only has 3.6B active, explains a lot.
In my experience (so far), I can’t let the LLM write too much in one go.
I need to test the hell out of what it gives me, and I can’t ask for too much, at one time.
I tend to ask it to “flesh out” functions, where I have a signature, and a detailed headerdoc comment. I will provide a lot of guidance about the context, often attaching relevant files.
Even then, it often doesn’t give me what I need, first time, unless it’s a small function, with extremely limited scope.
That said, it’s been extremely helpful. It has accelerated my development greatly.
I have found that it gives me much better PHP, than Swift.
I suspect that may be because PHP is extremely mature, and there’s millions and millions of lines of high-quality code out there, in open-source repos, while Swift is probably mostly in closed repos, with open stuff not really provided by experienced developers (it’s a proprietary language used for shipping commercial software, so that may also apply to other languages).
What it gives me in Swift, most closely resembles stuff that enthusiastic newer folks would do, and want to show off.
The same is true for rust-lang. Code that will immediately clone/re-allocate anything passed by reference and collect everything to the heap that is passed by `Iterator`/`IntoIterator`.
It is a massive performance anti-pattern and the hallmark of somebody "struggling" with the borrow checker. Naturally a lot of 1st & 2nd 'I just learned rust' projects lean on it. Which is totally fine for humans, you're learning. But with LLMs that pattern is now burned into their eigenvectors with the heat of a billion hours of H100 training time.
It has gotten to the point that for all code I generate with Opus or Codex, if there is an iterator or reference in the arguments, I start a fresh context with a prompt along the lines of `remove unnecessary clones, collections, and copies from the following code: {{code}}`
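If you want to script that second pass, filling the template is about all there is to it. A minimal sketch, assuming you paste the result into a fresh context yourself; the function and constant names are hypothetical, and only the template text comes from the comment above.

```python
# Template text quoted from the comment above; {{code}} is the placeholder.
CLEANUP_TEMPLATE = (
    "remove unnecessary clones, collections, and copies "
    "from the following code: {{code}}"
)


def build_cleanup_prompt(code: str) -> str:
    """Substitute generated code into the {{code}} placeholder.

    Plain str.replace is used instead of str.format because Rust source
    is full of braces that str.format would try to interpret.
    """
    return CLEANUP_TEMPLATE.replace("{{code}}", code)


print(build_cleanup_prompt("fn total(xs: &[u32]) -> u32 { xs.to_vec().iter().sum() }"))
```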
What does it do if you put "Avoid unnecessary clones, collections, and copies" in your CLAUDE.md/AGENTS.md?
Edit: With Opus prior to the context nerf, it worked more often than not. Current Opus 4.7 is practically unusable.
Seconded, but I've found a cheat code to get much farther with minimal intervention.
Step 1: tell them your goal, have them generate a doc, include design principles, system invariants, and acceptance criteria.
No amount of CLAUDE.md or skills beats re-iterating the focus points directly in the prompt.
Step 2: tell them to summarize the doc (pay close attention here). Have them save it somewhere (I use docs/agents) once you're happy with it.
Step 3: tell them to build a detailed plan to meet the objectives of the doc.
Step 4: let them go wild.
Step 5: once they declare "done", feed their progress to another LLM (Gemini is quite decent for review, and free) -> mindlessly feed the feedback back to the implementing LLM.
Step 6: Say the magic words: https://github.com/cuzzo/clear/blob/master/docs/retrospectiv...
Again, I've found no amount of skills or CLAUDE.md beats slightly modifying a prompt to meet your exact goals specific to the design and what you know of the implementation so far.
Step 7: Have them rebuild a plan to address feedback.
Step 8: Let them go wild. Loop back to Step 5 until the LLMs tell you there are no major action items.
Step 9: Tell them to remove anything from the commit that's not strictly necessary, get rid of comment changes that aren't strictly necessary, etc.
Step 10: here and only here do you invest your time (worth 100x what you're paying them) to look at what they did. Here you can give them feedback to address anything you saw.
Step 11: Review.
Step 12: Profit $$$
I got a quite decent implementation of a Finite State Machine and a Thunk + Trampoline transformation of code in a custom language I'm building in about 1 day, barely checking in while commuting to and from work on the train...
Occasionally, at step 11, you will find a gigantic turd and wonder how the LLMs converged on this. But, typically, it's at least good enough at that stage.
I don't even waste my time looking at anything they've done until they've converged on a good design and implementation with no holes, no feedback, no notes that does what a minimal, summarized doc clearly states and follows the design principles. Because they DEFINITELY haven't in a one-shot.
And LLMs tend to converge on mediocrity. Which is totally fine.
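The implement/review loop in Steps 4 through 8 can be sketched as a small driver. This is only an outline under assumptions: `implement` and `review` are hypothetical callables standing in for whatever agent and reviewer you use, and the `max_rounds` cap is my addition so the loop cannot run forever.

```python
from typing import Callable, List, Tuple


def converge(implement: Callable[[str], str],
             review: Callable[[str], List[str]],
             plan: str,
             max_rounds: int = 5) -> Tuple[str, int]:
    """Alternate implementation and review until no major action items remain.

    Returns the last implementation result and the number of review rounds run.
    """
    result = implement(plan)                  # Step 4: let the implementer go wild
    for round_no in range(1, max_rounds + 1):
        action_items = review(result)         # Step 5: a second LLM reviews the work
        if not action_items:                  # Step 8: stop once the reviewer has no notes
            return result, round_no
        # Step 7: rebuild a plan that addresses the feedback
        plan = "Address the following review feedback:\n" + "\n".join(
            f"- {item}" for item in action_items
        )
        result = implement(plan)
    return result, max_rounds                 # cap reached; a human should step in
```

Only after this returns would the human review in Steps 10 and 11 start, which matches the point above about not looking at anything before the models have converged.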
PHP has come of age. Actually, it’s been a backbone technology for millions of professional sites and apps for many years, and people tend to work in the open. Sort of the nature of the language.
There’s a popular perception that PHP programmers are bad programmers, but that’s a dated point of view. Pros have been using it to make serious money, and create serious infrastructure, for many years.
But they are interesting and fun to play with! I do a LOT of work on local agent harnesses etc, mostly for fun.
My current project is a zero install agent: https://gemma-agent-explainer.nicklothian.com/ - Python, SQL and React all run completely in browser. Gemma E4B is recommended for the best experience!
This is under heavy development, needs Chrome for both HTML5 Filesystem API support and LiteRT (although most Chromium based browsers can be made to work with it)
It's different to most agents because it is zero install: the model runs in the browser using LiteRT/LiteLLM (which gives better performance than Transformers.js), and Filesystem API gives it optional sandbox access to a directory to read from.
It is self documenting - you can ask questions like "How is the system prompt used" in the live help pane and it has access to its own source code.
There's quite a lot there: press "Tour" to see it all.
Will be open source next week.
Actually....
I write and publish my own benchmark for this stuff. It's an agentic SQL benchmark which isn't in the training data yet and I've found can separate frontier models from close-followers (the only models to get 100% are Opus 4.6 and GPT 5.5).
The best small model I've found is a fine-tune of Qwen 3.5 9B which scores 18/25: https://sql-benchmark.nicklothian.com/?highlight=Jackrong_Qw...
Haiku 4.5 scores 20/25, and Haiku is certainly better than Sonnet 3.6. GPT 3.5 scores 13/25.
Case in point, JPMorgan London Whale incident, $6 billion loss caused by an excel error...
These are my current results for my models:
┌──────────────────────┬───────────┬─────────────┐
│ Model │ Size │ Tokens/sec │
├──────────────────────┼───────────┼─────────────┤
│ gemma-4-e4b-it-mlx │ ~4B (MLX) │ ~10.5 tok/s │
├──────────────────────┼───────────┼─────────────┤
│ qwen3-8b-uncensor-v2 │ 8B │ ~6.3 tok/s │
├──────────────────────┼───────────┼─────────────┤
│ qwen3-14b-uncensored │ 14B │ ~3.5 tok/s │
└──────────────────────┴───────────┴─────────────┘
I seem to be doing ok with the Gemma model for file parsing / handling.
As much fun as it is to run these things locally don’t forget that your time is not free. I am slowly migrating my use cases to openrouter and run the largest qwen model for < $2-3/day with serious use for personal projects.
I have a brand new M5 MacBook Pro - top end with all the specs and I've tried local models and they're barely functional.
1) control 2) privacy 3) transparent cost model
Cloud has tremendous value for speed, plug and play, and performance. You need to decide how those compete with the benefits of local - both today, and a year from now, e.g.
Some reference code if you want to throw your agent at it. https://github.com/rapatel0/rq-models
I assumed turboquant optimizations are already everywhere - in llama-cpp, or the quantization machinery of unsloth and the likes.
In this case, picking out "semantic" css classes on single dom nodes.
Was able to run it on my 4(?) year old M2 mbp with 16GB of ram and it runs in only 100ms or so per query. Probably it can run much faster, but haven't experimented with batching etc
With tight and targeted context control, you can use extremely small models for useful things. Ideally with problems where the harness can be mostly deterministic and you have known bounds on what you're trying to do
Makes me feel we are nowhere near the optimum yet.
Examples: https://dasroot.net/posts/2026/05/gemma-4-speed-hacks-mtp-df...
Agree but only for small projects. SOTA from a year ago still wins on larger projects
How long do people realistically expect a laptop to stay competitive with SOTA local models? Especially in a space where model sizes, context windows, and inference requirements keep moving every year.
And even if the hardware lasts, the local experience usually doesn’t. A heavily quantized local model running at tolerable speeds on consumer hardware is still nowhere near frontier hosted models in reasoning, coding, multimodal capability, tool use, or reliability.
The economics just don’t make sense to me unless you specifically need offline inference, privacy guarantees, or low latency for a niche workflow. Otherwise you’re tying up $10k upfront to run an approximation of what you can already access through a subscription that continuously improves over time.
You could literally put the difference into index funds and probably cover the subscription indefinitely from the returns alone, even accounting for gradual price increases.
In the UK, it's currently an extra £800 to get a 128 GB vs the 64 GB equivalent. So that's more like 3 years of Claude - I think? - assuming current prices stay the same.
Or: you might just feel like £800 isn't an unjustifiable amount of money (one way or another), and tick the box, on the basis that it might just work out. As the saying goes, in for 459,900 pennies, in for £5,399...
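The arithmetic in this back-and-forth is easy to sketch. The £800 upgrade and the $10k machine are figures from the comments above; the 20-per-month subscription price is an assumption on my part (taken in the same currency as the hardware figure in each case), and the sketch ignores electricity, resale value, and future price changes.

```python
def break_even_months(upfront_cost: float, monthly_subscription: float) -> float:
    """Months of subscription that the upfront hardware cost would pay for."""
    return upfront_cost / monthly_subscription


def required_annual_yield(capital: float, monthly_subscription: float) -> float:
    """Annual return (as a fraction) needed for invested capital to cover the subscription."""
    return 12 * monthly_subscription / capital


# Assumed subscription price of 20 per month.
months = break_even_months(800, 20)
print(f"RAM upgrade equals {months:.0f} months ({months / 12:.1f} years) of subscription")
print(f"Yield needed on 10k of capital: {required_annual_yield(10_000, 20):.1%}")
```

With those assumptions the £800 upgrade works out to a bit over three years of subscription, in line with the "more like 3 years" estimate, and the index-fund argument only needs a yield of a few percent on $10k to hold.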
I don't think that's true. Plenty of people can run basic workflows at 8GB on the MacBook Neo and most others are fine at 16 GB.
I just hate paying money for cloud subscriptions, and work has given me a decent laptop
It seems like cache layers like https://omlx.ai make more RAM better than more GPU cores or faster CPUs cores, but I'm curious if someone has tested both.
Also minor note: the M4/5 Pros come in multiples of 12, so it's a 24/36 or 48GB set up.
Simple test failed: sending "1","2","3" as separate messages using an openclaw harness.
I tested a few other "follow these instructions" tests. Qwen3.5/6 were able to follow along, gemma was not able to.
- reliance on US technologies is not so good, but on Chinese is not discussed, just chosen
- environmental cost is of concern
- so are the energy costs
In the end, there are some clear tips on how to configure the LLM, but overall the article is a bit thin and rather biased.
I use it occasionally for very easy tasks: fixing typos or updating metadata in blog posts. So yeah, it improves productivity. But coding-wise it's far away from Codex, Claude et al.
https://bsky.app/profile/mooresolutions.io/post/3mliilyf2i22...
Could somebody please provide some tokens-per-second numbers for example for Qwen 3.6 35B/A3B, specifically for Q4 and Q6 quants?
The local inference space is leaning to MoE models, and a lot of them have decent tokens / second, but horrible TTFT.
https://www.techpowerup.com/gpu-specs/tesla-m40-24-gb.c3838
and wanted to ask what version of nvidia driver and cuda...
If you're already doing big boy stuff with big boy models, then... just carry on trucking!
Only place I'd differ is for vision/OCR tasks. Small/medium open weights models are as good as SoTa, and token prices for prefill are kinda very not worth it for larger batch tasks.
The other thing people forget is that if you want to have even a smallish LLM as a reliable personal service, you've got to carve out 16-24 GB of (V)RAM and leave it permanently running.
The main problem is finding the money :/
For instance, if you are an independent inventor trying to write a patent while keeping your patent lawyer expenses to a minimum, you want to write as much of the first draft(s) of the patent as possible yourself. (You’ll save billable hours with your patent lawyer, and you’ll end up with a better patent because you’ll communicate your innovations more clearly to your lawyer.)
However, and this is the big thing, you absolutely do not want to be asking a SOTA LLM for help with the language in your patent application. This is because describing your invention to a web based LLM could be considered a public “disclosure” of your invention, which, (after a one year grace period goes by), could put your invention in the public domain, basically… and thereby prevent you (or anyone else) from being able to ever patent the invention. Plus, you know, a random unscrupulous employee at the SOTA company could be reviewing logs and notice your great idea, and file a patent on it before you do. Remember, the United States patent office went to “first inventor to file” in 2013.
Oh and don’t take legal advice from random people on the internet by the way.
Imagine you're a contractor. You have a client who knows nothing about software development that wants you to write some software for them. They give you some code they generated with an LLM to get you started. Would you use the code or start over?
- private
- local
- no internet required
- works well enough for most tasks
- "free"
- will pop this "AI" bubble as word spreads
I got pretty good results with the model in the article on my machine. Sure, it took forever, but that doesn't matter to me as much, and it's kind of cool just watching it do its thing through LM studio. The result was also impressive enough for me that I would actually use it. Why pay $20/mo when local is good enough?
I understand that multiple things can be true at the same time. Is the concern for centralized AI monopolization? Or is the concern for the art of software engineering?
If you are spending $800/month on tokens you are likely to notice degradation for local models compared to near-frontier models. The models I can run locally are consistently worse than Claude Sonnet 4.6 (again for the work I give them), although Qwen3.6 does feel almost like magic for its size because it can do a lot. The really big open-weight models should be better, but they want 200+GB RAM, which will need a correspondingly expensive CPU.
Was quite disappointed to see that the PC side hasn't kept up. The unified architecture on Macs makes it very hard to justify spending money on a Linux machine for inference workloads.
This sort of thing is key to knowing what's going on and not having your brain fully atrophy.