Can I run AI locally? (opens in new tab)

(canirun.ai)

1520 pointsricardbejarano2mo ago353 comments

353 comments

I have spent a HUGE amount of time the last two years experimenting with local models.

A few lessons learned:

1. small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.

2. For coding tools, just use Google Antigravity and gemini-cli, or, Anthropic Claude, or...

Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lot's of efforts trying to maximize local only results. I don't recommend it for others.

I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.

sdrinf2mo ago

Just want to echo the recommendation for qwen3.5:9b. This is a smol, thinking, agentic tool-using, text-image multimodal creature, with very good internal chains of thought. CoT can be sometimes excessive, but it leads to very stable decision-making process, even across very large contexts -something we haven't seen models of this size before.

What's also new here, is VRAM-context size trade-off: for 25% of it's attention network, they use the regular KV cache for global coherency, but for 75% they use a new KV cache with linear(!!!!) memory-token-context size expansion! which means, eg ~100K token -> 1.5gb VRAM use -meaning for the first time you can do extremely long conversations / document processing with eg a 3060.

Strong, strong recommend.

steve_adams_862mo ago

I've been building a harness for qwen3.5:9b lately (to better understand how to create agentic tools/have fun) and I'm not going to use it instead of Opus 4.6 for my day job but it's remarkably useful for small tasks. And more than snappy enough on my equipment. It's a fun model to experiment with. I was previously using an old model from Meta and the contrast in capability is pretty crazy.

I like the idea of finding practical uses for it, but so far haven't managed to be creative enough. I'm so accustomed to using these things for programming.

tempoponet2mo ago

What kind of small tasks do you find it's good at? My non-coding use of agents has been related to server admin, and my local-llm use-case is for 24/7 tasks that would be cost-prohibitive. So my best guess for this would be monitoring logs, security cameras, and general home automation tasks.

1 more reply

threecheese2mo ago

You can really see the limitations of qwen3.5:9b in reasoning traces- it’s fascinating. When a question “goes bad”, sometimes the thinking tokens are WILD - it’s like watching the Poirot after a head injury.

Example: “what is the air speed velocity of a swallow?” - qwen knew it was a Monty Python gag, but couldnt and didnt figure out which one.

scottmf2mo ago

As a person who also knows there's a connection between that phrase and Monty Python and not much more information beyond that, I'm not sure how to feel.

cassianoleal2mo ago

African or European?

1 more reply

8note2mo ago

could that be some of the RL trying to get it to not regurgitate?

the gag is giving in detail which one

1 more reply

kingo552mo ago

How's it compare in quality with larger models in the same series? E.g 122b?

buzzin_2mo ago

The chart on this link compares all qwen3.5 models down to 0.8B.

https://www.reddit.com/r/LocalLLaMA/comments/1ro7xve/qwen35_...

ggsp2mo ago

How much difference are you seeing between standard and Q4 versions in terms of degradation, and is it constant across tasks or more noticeable in some vs others?

rnewme2mo ago

Less than expected, search for unsloths recent benchmark

RRRA2mo ago

I'd be curious to see people give their opinion on embedded models for less tech focused needs, say what's that bug killing spray chemistry like or what is the history of this or that...

I'd also be curious to see if people have started doing censorship analysis of various models, like Qwen differing Tiananmen square to government documments while Llama straights up answers the question.

boppo12mo ago

Is qwen 3.5 any good for chatting? I use chatgpt for 'light therapy' (basically sounding out confusing social situations my friends don't want to walk me through) and it's honestly been amazing. But I would rather not give all that to openai.

flutetornado2mo ago

My experience with qwen3.5 9b has not been the same. It’s definitely good at agentic responses but it hallucinates a lot. 30%-50% of the content it generated for a research task (local code repo exploration) turned out to be plain wrong to the extent of made up file names and function names. I ran its output through KimiK2 and asked it to verify its output - which found out that much of what it had figured out after agentic exploration was plain wrong. So use smaller models but be very cautious how much you depend on their output.

mongrelion2mo ago

At what temperature did you run it and what was your context limit?

mongrelion2mo ago

I don't understand why I'm getting downvoted.

I am legitimately curious about the parameters that the person used for running the model locally to get the results they got because I am myself currently experimenting with running models locally myself. You can see I am asking similar questions to others in this same thread and correlate the timestamps.

johnmaguire2mo ago

I'd love to know how you fit smaller models into your workflow. I have an M4 Macbook Pro w/ 128GB RAM and while I have toyed with some models via ollama, I haven't really found a nice workflow for them yet.

philipkglass2mo ago

It really depends on the tasks you have to perform. I am using specialized OCR models running locally to extract page layout information and text from scanned legal documents. The quality isn't perfect, but it is really good compared to desktop/server OCR software that I formerly used that cost hundreds or thousands of dollars for a license. If you have similar needs and the time to try just one model, start with GLM-OCR.

If you want a general knowledge model for answering questions or a coding agent, nothing you can run on your MacBook will come close to the frontier models. It's going to be frustrating if you try to use local models that way. But there are a lot of useful applications for local-sized models when it comes to interpreting and transforming unstructured data.

mandeepj2mo ago

> I formerly used that cost hundreds or thousands of dollars for a license

Azure Doc Intelligence charges $1.50 for 1000 pages. Was that an annual/recurring license?

Would you mind sharing your OCR model? I'm using Azure for now, as I want to focus on building the functionality first, but would later opt for a local model.

1 more reply

tempaccount50502mo ago

Not OP but I had an XML file with inconsistent formatting for album releases. I wanted to extract YouTube links from it, but the formatting was different from album to album. Nothing you could regex or filter manually. I shoved it all into a DB, looked up the album, then gave the xml to a local LLM and said "give me the song/YouTube pairs from this DB entry". Worked like a charm.

Bluecobra2mo ago

I didn’t realize that you can get 128GB of memory in a notebook, that is impressive!

lambda2mo ago

I've got a 128 GiB unified memory Ryzen Ai Max+ 395 (aka Strix Halo) laptop.

Trying to run LLM models somehow makes 128 GiB of memory feel incredibly tight. I'm frequently getting OOMs when I'm running models that are pushing the limits of what this can fit, I need to leave more memory free for system memory than I was expecting. I was expecting to be able to run models of up to ~100 GiB quantized, leaving 28 GiB for system memory, but it turns out I need to leave more room for context and overhead. ~80 GiB quantized seems like a better max limit when trying not running on a headless system so I'm running a desktop environment, browser, IDE, compilers, etc in addition to the model.

And memory bandwidth limitations for running the models is real! 10B active parameters at 4-6 bit quants feels usable but slow, much more than that and it really starts to feel sluggish.

So this can fit models like Qwen3.5-122B-A10B but it's not the speediest and I had to use a smaller quant than expected. Qwen3-Coder-Next (80B/3B active) feels quite on speed, though not quite as smart. Still trying out models, Nemotron-3-Super-120B-A12B just came out, but looks like it'll be a bit slower than Qwen3.5 while not offering up any more performance, though I do really like that they have been transparent in releasing most of its training data.

1 more reply

AzN1337c0d3r2mo ago

Most workstation class laptops (i.e. Lenovo P-series, Dell Precision) have 4 DIMM slots and you can get them with 256 GB (at least, before the current RAM shortages).

There's also the Ryzen AI Max+ 395 that has 128GB unified in laptop form factor.

Only Apple has the unique dynamic allocation though.

3 more replies

saltwounds2mo ago

I use Raycast and connect it to LM Studio to run text clean up and summaries often. The models are small enough I keep them in memory more often than not

echelon2mo ago

Shouldn't we prioritize large scale open weights and open source cloud infra?

An OpenRunPod with decent usage might encourage more non-leading labs to dump foundation models into the commons. We just need infra to run it. Distilling them down to desktop is a fool's errand. They're meant to run on DC compute.

I'm fine with running everything in the cloud as long as we own the software infra and the weights.

This is conceivably the only way we could catch up to Claude Code is to have the Chinese start releasing their best coding models and for them to get significant traction with companies calling out to hosted versions. Otherwise, we're going to be stuck in a take off scenario with no bridge.

girvo2mo ago

I run Qwen3.5-plus through Alibaba’s coding plan (Model Studio): incredibly cheap, pretty fast, and decent. I can’t compare it to the highest released weight one though.

2 more replies

adamkittelson2mo ago

Anecdotal but for some reason I had a pretty bad time with qwen3.5 locally for tool usage. I've been using GPT-OSS-120B successfully and switched to qwen so that I could process images as well (I'm using this for a discord chat bot).

Everything worked fine on GPT but Qwen as often as not preferred to pretend to call a tool and not actually call it. After much aggravation I wound up just setting my bot / llama swap to use gpt for chat and only load up qwen when someone posts an image and just process / respond to the image with qwen and pop back over to gpt when the next chat comes in.

1 more reply

eek21212mo ago

Qwen is actually really good at code as well. I used qwen3-coder-next a while back and it was every bit as good as claude code in the use cases I tested it in. Both made the same amount of mistakes, and both did a good job of the rest.

cyanydeez2mo ago

Cline (https://marketplace.visualstudio.com/items?itemName=saoudriz...) in vscode, inside a code-server run within docker (https://docs.linuxserver.io/images/docker-code-server/) using lmstudio (https://lmstudio.ai/) to access unsloth models (https://unsloth.ai/docs/get-started/unsloth-model-catalog) speficially (https://unsloth.ai/docs/models/qwen3-coder-next) appears to be right at the edge of productivity, as long as you realize what complexity means when issuing tasks.

dataflow2mo ago

Thanks for sharing this, it's super helpful. I have a question if you don't mind: I want a model that I can feed, say, my entire email mailbox to, so that I can ask it questions later. (Just the text content, which I can clean and preprocess offline for its use.) Have any offline models you've dealt with seemed suitable for that sort of use case, with that volume of content?

lilactown2mo ago

If your inbox is as big as mine, you won’t be able to load all the text content into a prompt even with SotA cloud hosted models.

Instead you should give it tools to search over the mailbox for terms, labels, addresses, etc. so that the model can do fine grained filters based on the query, read the relevant emails it finds, then answer the question.

dataflow2mo ago

Thanks, yeah. I think strong prefiltering is pretty much always doable because, if nothing else, I usually know the time range of the relevant emails and probably the sender/recipient or some keywords, plus I know how to filter out a big chunk of the irrelevant emails (like mailing lists, etc.), so I'm hoping it's not actually that much data for each search. What I don't know is which models would be most suitable even in the case where I can fit the data.

As an example of the kind of query I'm interested in, I want a model that can tell me all the flights I took within a given time range (so that means it'd have to filter out cancellations). Or, for a given flight, the arrival and departure times and time zones (or the city and country so I can look up the time zone). Stuff like that. (Travel is just an example obviously, I have other topics to ask about.) It's not a terribly large number of emails to search through in each query, but the email structures are too heterogeneous across senders to write custom tooling for each case.

perbu2mo ago

Prompt injection is a problem if your agent has access to anything.

The local models are quite weak here.

dataflow2mo ago

Security is not a concern for the purpose of my question here, please ignore that for now. I'm just looking for text summary and search functionality here, not looking to give it full system access and let it loose on my computer or network. I can easily set up VM/sandboxing/airgapping/etc. as needed.

My question is really just about what can handle that volume of data (ideally, with the quoted sections/duplications/etc. that come with email chains) and still produce useful (textual) output.

1 more reply

dhblumenfeld12mo ago

Have you found that using a frontier model for planning and small local model for writing code to be a solid workflow? Been wanting to experiment with relying less on Claude Code/Codex and more on local models.

storus2mo ago

Coding locally with Qwen3-Coder-Next or Qwen-3.5 is a piece of cake on a workstation card (RTX Pro 6000); set it up in llama.cpp or vLLM in 1 hour, install Claude Code, force local API hostname and fake secret key, and just run it like regular setup with Claude4 but on Qwen.

kylehotchkiss2mo ago

I've been really interested in the difference between 3.5 9b and 14b for information extraction. Is there a discernible difference in quality of capability?

Western02mo ago

I only put this https://unsloth.ai/docs/models/qwen3.5 (please look at old gemma 3n 4eb, and small IBM granite)

chrisweekly2mo ago

Thanks for this, Mark. And for your website and books and generosity of spirit. Signal in the noise. Have an awesome weekend!

manmal2mo ago

What about running e.g. Qwen3.5 128B on a rented RTX Pro 6000?

girvo2mo ago

IMO you’re better off using qwen3.5-plus through the model studio coding plan, but ymmv

sakesun2mo ago

Becoming a retired builder is the ultimate bliss.

nine_k2mo ago

What kind of hardware did you use? I suppose that a 8GB gaming GPU and a Mac Pro with 512 GB unified RAM give quite different results, both formally being local.

fzzzy2mo ago

A Mac Pro with 512 gb unified ram does not exist.

nine_k2mo ago

Mac Studio Ultra, my bad. The 512 GB option existed up until March 2026: https://macdailynews.com/2026/03/06/apple-drops-512gb-m3-ult...

meatmanek2mo ago

This seems to be estimating based on memory bandwidth / size of model, which is a really good estimate for dense models, but MoE models like GPT-OSS-20b don't involve the entire model for every token, so they can produce more tokens/second on the same hardware. GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.

(In terms of intelligence, they tend to score similarly to a dense model that's as big as the geometric mean of the full model size and the active parameters, i.e. for GPT-OSS-20B, it's roughly as smart as a sqrt(20b*3.6b) ≈ 8.5b dense model, but produces tokens 2x faster.)

lambda2mo ago

Yeah, I looked up some models I have actually run locally on my Strix Halo laptop, and its saying I should have much lower performance than I actually have on models I've tested.

For MoE models, it should be using the active parameters in memory bandwidth computation, not the total parameters.

pbronez2mo ago

The docs page addresses this:

> A Mixture of Experts model splits its parameters into groups called "experts." On each token, only a few experts are active — for example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token. This means you get the quality of a larger model with the speed of a smaller one. The tradeoff: the full model still needs to fit in memory, even though only part of it runs at inference time.

> A dense model activates all its parameters for every token — what you see is what you get. A MoE model has more total parameters but only uses a subset per token. Dense models are simpler and more predictable in terms of memory/speed. MoE models can punch above their weight in quality but need more VRAM than their active parameter count suggests.

https://www.canirun.ai/docs

lambda2mo ago

It discusses it, and they have data showing that they know the number of active parameters on an MoE model, but they don't seem to use that in their calculation. It gives me answers far lower than my real-world usage on my setup; its calculation lines up fairly well for if I were trying to run a dense model of that size. Or, if I increase my memory bandwidth in the calculator by a factor of 10 or so which is the ratio between active and total parameters in the model, I get results that are much closer to real world usage.

littlestymaar2mo ago

While your remark is valid, there's two small inaccuracies here:

> GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.

First, the token generation speed is going to be comparable, but not the prefil speed (context processing is going to be much slower on a big MoE than on a small dense model).

Second, without speculative decoding, it is correct to say that a small dense model and a bigger MoE with the same amount of active parameters are going to be roughly as fast. But if you use a small dense model you will see token generation performance improvements with speculative decoding (up to x3 the speed), whereas you probably won't gain much from speculative decoding on a MoE model (because two consecutive tokens won't trigger the same “experts”, so you'd need to load more weight to the compute units, using more bandwidth).

lambda2mo ago

So, this is all true, but this calculation isn't that nuanced. It's trying to get you into a ballpark range, and based on my usage on my real hardware (if I put in my specs, since it's not in their hardware list), the results are fairly close to my real experience if I compensate for the issue where it's calculating based on total params instead of active.

So by doing so, this calculator is telling you that you should be running entirely dense models, and sparse MoE models that maybe both faster and perform better are not recommended.

littlestymaar2mo ago

I agree, and I even started my response expressing my agreement with the whole point.

But since this is a tech forum, I assumed some people would be interested by the correction on the details that were wrong.

tommy_axle2mo ago

I'm guessing this is also calculating based on the full context size that the model supports but depending on your use case it will be misleading. Even on a small consumer card with Qwen 3 30B-A3B you probably don't need 128K context depending on what you're doing so a smaller context and some tensor overrides will help. llama.cpp's llama-fit-params is helpful in those cases.

mopierotti2mo ago

This (+ llmfit) are great attempts, but I've been generally frustrated by how it feels so hard to find any sort of guidance about what I would expect to be the most straightforward/common question:

"What is the highest-quality model that I can run on my hardware, with tok/s greater than <x>, and context limit greater than <y>"

(My personal approach has just devolved into guess-and-check, which is time consuming.) When using TFA/llmfit, I am immediately skeptical because I already know that Qwen 3.5 27B Q6 @ 100k context works great on my machine, but it's buried behind relatively obsolete suggestions like the Qwen 2.5 series.

I'm assuming this is because the tok/s is much higher, but I don't really get much marginal utility out of tok/s speeds beyond ~50 t/s, and there's no way to sort results by quality.

0xbadcafebee2mo ago

Too generic question. Gotta be more specific:

   "what is the best open weight model for high-quality coding that fits in 8GB VRAM and 32GB system RAM with t/s >= 30 and context >= 32768" -> Qwen2.5-Coder-7B-Instruct

   "what is the best open weight model for research w/web search that fits in 24GB VRAM and 32GB system RAM with t/s >= 60 and context >= 400k" -> Qwen3-30B-A3B-Instruct-2507

   "what is the best open weight embedding model for RAG on a collection of 100,000 documents that fits in 40GB VRAM and 128GB system RAM with t/s >= 50 and context >= 200k" -> Qwen3-Embedding-8B

Specific models & sizes for specific use cases on specific hardware at specific speeds.

comboy2mo ago

What is the $/Mtok that would make you choose your time vs savings of running stuff locally?

Just to be clear, it may sound like a snarky comment but I'm really curious from you or others how do you see it. I mean there are some batches long running tasks where ignoring electricity it's kind of free but usually local generation is slower (and worse quality) and we all kind of want some stuff to get done.

Or is it not about the cost at all, just about not pushing your data into the clouds.

mopierotti2mo ago

Good question. I agree with what I think you're implying, which is that local generation is not the right choice if you want to maximize results per time/$ spent. In my experience, hosted models like Claude Opus 4.6 are just so effective that it's hard to justify using much else.

Nevertheless, I spend a lot of time with local models because of:

1. Pure engineering/academic curiosity. It's a blast to experiment with low-level settings/finetunes/lora's/etc. (I have a Cog Sci/ML/software eng background.)

2. I prefer not to share my data with 3rd party services, and it's also nice to not have to worry too much about accidentally pasting sensitive data into prompts (like personal health notes), or if I'm wasting $ with silly experiments, or if I'm accidentally poisoning some stateful cross-session 'memories' linked to an account.

3. It's nice to be able solve simple tasks without having to reason about any external 'side-effects' outside my machine.

wilkystyle2mo ago

For me it's a combination of privacy and wanting to be able to experiment as much as I want without limits. I'd happily take something that is 80% as good as SOTA but I can run it locally 24/7. I don't think there's anything out there yet that would 100% obviate my desire to at least occasionally fall back to e.g. Claude, but I think most of it could be done locally if I had infinite tokens to throw at it.

phillmv2mo ago

i can think of some tasks (classification, structured info extraction) that i _imagine_ even small meh models could do quite well at

on data i would never ever want to upload to any vendor if i can avoid it

J_Shelby_J2mo ago

It’s a hard problem. I’ve been working on it for the better part of a year.

Well, granted my project is trying to do this in a way that works across multiple devices and supports multiple models to find the best “quality” and the best allocation. And this puts an exponential over the project.

But “quality” is the hard part. In this case I’m just choosing the largest quants.

mopierotti2mo ago

Supporting all the various devices does sound quite challenging.

I wouldn't expect a perfect single measurement of "quality" to exist, but it seems like it could be approximated enough to at least be directionally useful. (e.g. comparing subsequent releases of the same model family)

downrightmike2mo ago

LLMs are just special purpose calculators, as opposed to normal calculators which just do numbers and MUST be accurate. There aren't very good ways of knowing what you want because the people making the models can't read your mind and have different goals

twampss2mo ago

Is this just llmfit but a web version of it?

https://github.com/AlexsJones/llmfit

deanc2mo ago

Yes. But llmfit is far more useful as it detects your system resources.

Someone12342mo ago

I feel like they both solve different issues well:

- If you already HAVE a computer and are looking for models: LLMFit

- If you are looking to BUY a computer/hardware, and want to compare/contrast for local LLM usage: This

You cannot exactly run LLMFit on hardware you don't have.

shrinks992mo ago

Yes, but you can get LLMFit to recommend hardware requirements with `llmfit plan --context <TOKENS> <MODEL>`.

dgrin912mo ago

Honestly I was surprised about this. It accurately got my GPU and specs without asking for any permissions. I didnt realize I was exposing this info.

johnisgood2mo ago

Why were you surprised?

You can check out here how it does that: https://github.com/AlexsJones/llmfit/blob/main/llmfit-core/s...

To detect NVIDIA GPUs, for example: https://github.com/AlexsJones/llmfit/blob/main/llmfit-core/s...

In this case it just runs the command "nvidia-smi".

Note: llmfit is not web-based.

spudlyo2mo ago

I run LibreWolf, which is configured to ask me before a site can use WebGL, which is commonly used for fingerprinting. I got the popup on this site, so I assume that's how they're doing it.

1 more reply

dekhn2mo ago

How could it not? That information is always available to userspace.

1 more reply

rithdmc2mo ago

Do you mean the OPs website? Mine's way off.

> Estimates based on browser APIs. Actual specs may vary

rootusrootus2mo ago

That's super handy, thanks for sharing the link. Way more useful than the web site this post is about, to be honest.

It looks like I can run more local LLMs than I thought, I'll have to give some of those a try. I have decent memory (96GB) but my M2 Max MBP is a few years old now and I figured it would be getting inadequate for the latest models. But llmfit thinks it's a really good fit for the vast majority of them. Interesting!

1 more reply

LeifCarrotson2mo ago

This lacks a whole lot of mobile GPUs. It also does not understand that you can share CPU memory with the GPU, or perform various KV cache offloading strategies to work around memory limits.

It says I have an Arc 750 with 2 GB of shared RAM, because that's the GPU that renders my browser...but I actually have an RTX1000 Ada with 6 GB of GDDR6. It's kind of like an RTX 4050 (not listed in the dropdowns) with lower thermal limits. I also have 64 GB of LPDDR5 main memory.

It works - Qwen3 Coder Next, Devstral Small, Qwen3.5 4B, and others can run locally on my laptop in near real-time. They're not quite as good as the latest models, and I've tried some bigger ones (up to 24GB, it produces tokens about half as fast as I can type...which is disappointingly slow) that are slower but smarter.

But I don't run out of tokens.

rahimnathwani2mo ago

This site presents models in an incomplete and misleading way.

When I visit the site with an Apple M1 Max with 32GB RAM, the first model that's listed is Llama 3.1 8B, which is listed as needing 4.1GB RAM.

But the weights for Llama 3.1 8B are over 16GB. You can see that here in the official HF repo: https://huggingface.co/meta-llama/Llama-3.1-8B/tree/main

The model this site calls 'Llama 3.1 8B' is actually a 4-bit quantized version ( Q4_K_M) available on ollama.com/library: https://ollama.com/library/llama3.1:8b

If you're going to recommend a model to someone based on their hardware, you have to recommend not only a specific model, but a specific version of that model (either the original, or some specific quantized version).

This matters because different quantized versions of the model will have different RAM requirements and different performance characteristics.

Another thing I don't like is that the model names are sometimes misleading. For example, there's a model with the name 'DeepSeek R1 1.5B'. There's only one architecture for DeepSeek R1, and it has 671B parameters. The model they call 'DeepSeek R1 1.5B' does not use that architecture. It's a qwen2 1.5B model that's been finetuned on DeepSeek R1's outputs. (And it's a Q4_K_M quantized version.)

zargon2mo ago

They appear to be using Ollama as a data source. Ollama does that sort of thing regularly.

sxates2mo ago

Cool thing!

A couple suggestions:

1. I have an M3 Ultra with 256GB of memory, but the options list only goes up to 192GB. The M3 Ultra supports up to 512GB. 2. It'd be great if I could flip this around and choose a model, and then see the performance for all the different processors. Would help making buying decisions!

utopcell2mo ago

Unfortunately, Apple retired the 512GiB models.

ProllyInfamous2mo ago

Sure, but those already sold still exist.

ActorNightly2mo ago

>. I have an M3 Ultra with 256GB of memory,

Im sorry but spending this kind of money when you could have just built yourself a dual 3090 workstation that would have been better for pretty much everything including local models is just plain stupid.

Hell, even one 3090 can now run Gemma 3 27b qat very fast.

brulard2mo ago

Are you aware that your 3090s have nowhere close to 256GB of VRAM? Or maybe you are not aware that on macs you have unified memory (working both as RAM and VRAM).

ActorNightly2mo ago

Are you aware that having ram doesn't matter when your tokens/second is slow as shit?

You don't need to run large models, Gemma QAT 27B fits on one GPU and is quite good. Other models like Qwen3 are great for coding.

3090 gets 100+ tokens/second for QWEN, very close to what you would see with a cloud based model.

M3 ultra gets ~30.

Congrats, you played yourself.

1 more reply

nozzlegear2mo ago

> a dual 3090 workstation that would have been better for pretty much everything

Doesn't run macOS

xiconfjs2mo ago

Except if you are living in a region where electricity is quite expensive :/

gentleman112mo ago

ask apple to graciously allow you to install your own ram in the computer you "own"

torginus2mo ago

Huh, I never knew my browser just volunteers my exact hardware specs to any website without so much as even notifying me about it.

Jaxan2mo ago

It doesn’t really. The website thinks I’m on a iPhone 19 pro, although I’m actually on a iPhone SE 1st gen. So it’s off by roughly a decade.

weikju2mo ago

> on a iPhone 19 pro

I wish the website could tell us how life is like in 2027!

torginus2mo ago

Maybe that's one of Safari's numerous 'quirks' our frontend devs keep bitching about.

Which in this case Im thankful that Apple isn't too keen on following standards like these.

tstrimple2mo ago

Mine is radically off as well. Says I've got a GeForce 980 or equivalent with 4GB instead of a 5090. I'm guessing the detection only really works on Chromium based browsers.

DanielHB2mo ago

This stuff is used a lot in browser fingerprinting for tracking purposes. More privacy-focused browsers usually feed randomized info.

hotsalad2mo ago

The latest Librewolf prompted me to allow the site permission to make a WebGL context. That's what it used for hardware detection.

nozzlegear2mo ago

It couldn't guess any of my hardware specs when I opened the page in Safari on my Mac.

ebbi2mo ago

I thought that's how airlines do the whole trickery around having different pricing if you access the site from Windows or Mac...

dxxvi2mo ago

Not sure if there's anybody like me. I use AI for only 2 purposes: to replace Google Search to learn something and to generate images. I wonder where there are not many models that do only 1 thing and do it well. For example, there's this one https://huggingface.co/Fortytwo-Network/Strand-Rust-Coder-14... for Rust coding. I haven't used it yet, so don't know how it's compared to the free models that Kilo Code provides.

dyauspitr2mo ago

For learning and general searching I find ChatGPT to be the best.

Nano Banana Pro for anything image and video related.

Grok Imagine for pretty decent porn generation.

never_inline2mo ago

Didn't want to hear the grok thing from handle named "dyauspitr". My day is ruined.

dyauspitr2mo ago

I would imagine the Dyaus was pretty randy if Zeus is anything to go by.

1 more reply

mmaunder2mo ago

OP can you please make it not as dark and slightly larger. Super useful otherwise. Qwen 3.5 9B is going to get a lot of love out of this.

ProllyInfamous2mo ago

I'm not usually one to whine, but agreed; additionally, add contrast to the modifiers (e.g. processor select). First thing I did when I visited was scale the website to 150%

Super impressive comparisons, and correlates with my perception having three seperate generations of GPU (from your list pulldown). Thanks for including the "old AMD" Polaris chipsets, which are actually still much faster than lower-spec Apple silicon. I have Ollama3.1 on a VEGA64 and it really is twice as fast as an M2Pro...

----

For anybody that thinks installing a local LLM is complicated: it's not (so long as you have more than one computer, don't tinker on your primary workhorse). I am a blue collar electrician (admittedly: geeky); no more difficult than installing linux. I used an online LLM to help me install both =D

aanet2mo ago

The website is super useful. That theme though... low-contrast text on too-dark theme is, uh, barely readable for me.

duskdozer2mo ago

Have to disagree in part at least. Text is pretty small which isn't good, but I'm glad to see it when sites don't succumb to the make-dark-mode-lighter trend.

nozzlegear2mo ago

I can't see shit on this website lol. It'd be nice if they had a switch to toggle a light mode.

ricardbejaranoOP2mo ago

OP here, it's not mine though!

andy_ppp2mo ago

Is it correct that there's zero improvement in performance between M4 (+Pro/Max) and M5 (+Pro/Max) the data looks identical. Also the memory does not seem to improve performance on larger models when I thought it would have?

Love the idea though!

EDIT: Okay the whole thing is nonsense and just some rough guesswork or asking an LLM to estimate the values. You should have real data (I'm sure people here can help) and put ESTIMATE next to any of the combinations you are guessing.

GeekyBear2mo ago

> Is it correct that there's zero improvement in performance between M4 (+Pro/Max) and M5 (+Pro/Max)

Preliminary testing did not come to that conclusion.

> Apple’s New M5 Max Changes the Local AI Story

https://www.youtube.com/watch?v=XGe7ldwFLSE

lostmsu2mo ago

From the video: 4.4k is "almost" 4x times 1.8k because 4.4k has "number 4" in the beginning, and the other one - number 1.

For the lazy: that's less then 3x: 1.8 * 3 = 5.4

andy_ppp2mo ago

It’s not even the largest part, just prefill so I think maybe M5 Max is 30% faster overall. Still pretty good I think but the 4x nonsense is just marketing!

carra2mo ago

Having the rating of how well the model will run for you is cool. I miss to also have some rating of the model capabilities (even if this is tricky). There are way too many to choose. And just looking at the parameter number or the used memory is not always a good indication of actual performance.

phelm2mo ago

This is awesome, it would be great to cross reference some intelligence benchmarks so that I can understand the trade off between RAM consumption, token rate and how good the model is

metalliqaz2mo ago

Hugging Face can already do this for you (with much more up-to-date list of available models). Also LM Studio. However they don't attempt to estimate tok/sec, so that's a cool feature. However I don't really trust those numbers that much because it is not incorporating information about the CPU, etc. True GPU offload isn't often possible on consumer PC hardware. Also there are different quants available that make a big difference.

cafed00d2mo ago

Open with multiple browsers (safari vs chrome) to get more "accurate + glanceable" rankings.

Its using WebGPU as a proxy to estimate system resource. Chrome tends to leverage as much resources (Compute + Memory) as the OS makes available. Safari tends to be more efficient.

Maybe this was obvious to everyone else. But its worth re-iterating for those of us skimmers of HN :)

azmenak2mo ago

From my personal testing, running various agentic tasks with a bunch of tool calls on an M4 Max 128GB, I've found that running quantized versions of larger models to produce the best results which this site completely ignores.

Currently, Nemotron 3 Super using Unsloth's UD Q4_K_XL quant is running nearly everything I do locally (replacing Qwen3.5 122b)

DrAwdeOccarim2mo ago

Totally doing this today! Have you tried OpenJarvis or NemoClaw (is it out yet?). I want to use my computer “through” the LLM.

amdivia2mo ago

I found this to be inaccurate, I can run OSS GPT 120B (4 bit quant) on my 5090 and 64 ram system with around 40 t/s. Yet here the site claims it won't work

freediddy2mo ago

i think the perplexity is more important than tokens per second. tokens per second is relatively useless in my opinion. there is nothing worse than getting bad results returned to you very quickly and confidently.

ive been working with quite a few open weight models for the last year and especially for things like images, models from 6 months would return garbage data quickly, but these days qwen 3.5 is incredible, even the 9b model.

sroussey2mo ago

No, getting bad results slowly is much worse. Bad results quickly and you can make adjustments.

But yes, if there is a choice I want quality over speed. At same quality, I definitely want speed.

orthoxerox2mo ago

For some reason it doesn't react to changing the RAM amount in the combo box at the top. If I open this on my Ryzen AI Max 395+ with 32 GB of unified memory, it thinks nothing will fit because I've set it up to reserve 512MB of RAM for the GPU.

bityard2mo ago

Yeah, this site is iffy at best. I didn't even see Strix Halo on the list, but I selected 128GB and bumped up the memory bandwidth. It says gpt-oss-120b "barely runs" at ~2 t/s.

In reality, gpt-oss-120b fits great on the machine with plenty of room to spare and easily runs inference north of 50 t/s depending on context.

John238322mo ago

RTX Pro 6000 is a glaring omission.

embedding-shape2mo ago

Yeah, that's weird, seems it has later models, and earlier, but specifically not Pro 6000? Also, based on my experience, the given numbers seems to be at least one magnitude off, which seems like a lot, when I use the approx values for a Pro 6000 (96GB VRAM + 1792 GB/s)

schaefer2mo ago

No Nvidia Spark workstation is another omission.

am17an2mo ago

You can still run larger MoE models using expert weight off-loading to the CPU for token generation. They are by and large useable, I get ~50 toks/second on a kimi linear 48B (3B active) model on a potato PC + a 3090

GrayShade2mo ago

This feels a bit pessimistic. Qwen 3.5 35B-A3B runs at 38 t/s tg with llama.cpp (mmap enabled) on my Radeon 6800 XT.

Aurornis2mo ago

At what quantization and with what size context window?

GrayShade2mo ago

Looks like it's a bit slower today. Running llama.cpp b8192 Vulkan.

$ ./llama-cli unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 65536 -p "Hello"

[snip 73 lines]

[ Prompt: 86,6 t/s | Generation: 34,8 t/s ]

$ ./llama-cli unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 262144 -p "Hello"

[snip 128 lines]

[ Prompt: 78,3 t/s | Generation: 30,9 t/s ]

I suspect the ROCm build will be faster, but it doesn't work out of the box for me.

zahirbmirza2mo ago

This was depressing. But, also, I can't figure why AI companies are valued so high. The models will reach a limit (ie for what most people want to use a model for), and compute will increase over time.

zahirbmirza2mo ago

Also, I have to add, this project is an excellent piece of work.

StefanoC2mo ago

Can anybody share their setup using 64GB macs? I have an M2 Ultra studio and I'm trying Qwen 3.5 MLX models hosting them from the CLI, but I'm a bit stuck picking bigger models, more context, 4/8 bits, Opus-Reasoning-Distilled, coder... There are a bit too many permutations between mlx CLI flags, env variables, and models.

At the moment I'm exploring:

- nightmedia/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-qx64-hi-mlx

- BeastCode/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit

- mlx-community/Qwen3-Coder-Next-4bit

johneth2mo ago

Re: the design of the site. Please use higher contrast colours, especially the barely visible grey text on black background. It's annoying to try to read.

0xbadcafebee2mo ago

Couple thoughts:

- The t/s estimation per machine is off. Some of these models run generation at twice the speed listed (I just checked on a couple macs & an AMD laptop). I guess there's no way around that, but some sort of sliding scale might be better.

- Ollama vs Llama.cpp vs others produce different results. I can run gpt-oss 20b with Ollama on a 16GB Mac, but it fails with "out of memory" with the latest llama.cpp (regardless of param tuning, using their mxfp4). Otoh, when llama.cpp does work, you can usually tweak it to be faster, if you learn the secret arts (like offloading only specific MoE tensors). So the t/s rating is even more subjective than just the hardware.

- It's great that they list speed and size per-quant, but that needs to be a filter for the main list. It might be "16 t/s" at Q4, but if it's a small model you need higher quant (Q5/6/8) to not lose quality, so the advertised t/s should be one of those

- Why is there an initial section which is all "performs poorly", and then "all models" below it shows a ton of models that perform well?

gopalv2mo ago

Chrome runs Gemini Nano if you flip a few feature flags on [1].

The model is not great, but it was the "least amount of setup" LLM I could run on someone else's machine.

Including structured output, but has a tiny context window I could use.

[1] - https://notmysock.org/code/voice-gemini-prompt.html

adamhsn2mo ago

Cool project!!

It would be useful to filter which model to use based on the objective or usage (i.e., for data extraction vs. coding).

Also, just looking at VRAM kind of misses that a lot of CPU memory can be shared with the GPU via layer offloading. I think there is ultimately a need for a native client, like a CPU/GPU benchmark, to figure out how the model will actually perform more precisely.

sdingi2mo ago

When running models on my phone - either through the web browser or via an app - is there any chance it uses the phone's NPU, or will these be GPU only?

I don't really understand how the interface to the NPU chip looks from the perspective of a non-system caller, if it exists at all. This is a Samsung device but I am wondering about the general principle.

RagnarD2mo ago

I have an RTX 6000 Pro Max-Q, which has 96GB VRAM. It identified the hardware correctly but incorrectly thought it had 4GB, at least if I interpret the RAM dropdown correctly.

Then it shows the full resolution models, which are completely unnecessary to run quality inference. Quantized models are routine for local inference and it should realize that.

Needs work.

kpw942mo ago

People complaining about how hard to get simple answer is don't appreciate the complexity in figuring out optimal models...

There's so many knobs to tweak, it's a non trivial problem

- Average/median length of your Prompts

- prompt eval speed (tok/s)

- token generation speed (tok/s)

- Image/media encoding speed for vision tasks

- Total amount of RAM

- Max bandwidth of ram (ddr4, ddr5, etc.?)

- Total amount of VRAM

- "-ngl" (amount of layers offloaded to GPU)

- Context size needed (you may need sub 16k for OCR tasks for instance)

- Size of billion parameters

- Size of active billion parameters for MoE

- Acceptable level of Perplexity for your use case(s)

- How aggressive Quantization you're willing to accept (to maintain low enough perplexity)

- even finer grain knobs: temperature, penalties etc.

Also, Tok/s as a metric isn't enough then because there's:

- thinking vs non-thinking: which mode do you need?

- models that are much more "chatty" than others in the same area (i remember testing few models that max out my modest desktop specs, qwen 2.5 non-thinking was so much faster than equivalent ministral non-thinking even though they had equivalent tok/s... Qwen would respond to the point quickly)

At the end, final questions are: are you satisfied with how long getting an answer took? and was the answer good enough?

The same exercise with paid APIs exists too, obviously less knobs but depending on your use case, there's still differences between providers and models. You can abstract away a lot of the knobs , just add "are you satisfied with how much it cost" on top of the other 2 questions

rando12342mo ago

On a related question, I'm in the market to buy a new laptop for development and want to get something with good support for local models. What is a good recommendation in terms of GPU support etc? I currently have a Dell XPS 13. Should I just get a MacBook? Or are there good non-Mac options?

varispeed2mo ago

Does it make any sense? I tried few models at 128GB and it's all pretty much rubbish. Yes they do give coherent answers, sometimes they are even correct, but most of the time it is just plain wrong. I find it massive waste of time.

boutell2mo ago

I'm not sure how long ago you tried it, but look at Qwen 3.5 32b on a fast machine. Usually best to shut off thinking if you're not doing tool use.

mongrelion2mo ago

Apparently there is a whole science behind running models. I have seen the instructions that unsloth publishes for their quants and depending on the model they'll tweak things like the temperature, top k, etc.

The size of the quantization you chose also makes a difference.

The GPU driver also plays an important role.

What was your approach? What software did you use to run the models?

casey22mo ago

Something notable is that Qwen3.5:0.8B does better on benchmarks than GPT3.5. Runs much faster on local hardware than GPT3.5 on release. However Qwen3.5:0.8B dumber and slower than GPT3.5. It's dumber: it can do 3*3, but if asked to explain it in terms of the definition (i.e. 3+3+3=9) it fails. It's slower: It's a thinking model so your 900T/S are mainly spent "thinking" most of the time it will just repeat until it hangs.

It pretty obvious that this reasoning scaling is a mirage, parameters are all you need. Everything else is mostly just wasting time while hardware get better.

Western02mo ago

I tried using agents on a small Orange Pi and other small machines, and I’ve come to the conclusion that, unfortunately, it’s not feasible to do this in a practical way. Of course, I started writing an agent that could run in such an environment (long timeouts, retries, etc.), but it’s a real pain. For this to make sense, you need much more powerful hardware (a 5-year-old Mac mini is fine); the other issue is power consumption. Unfortunately, until you can run Mavericks 2 on your own hardware, it’s pretty expensive.

charcircuit2mo ago

On mobile it does not show the name of the model in favor of the other stats.

rcarmo2mo ago

This is kind of bogus since some of the S and A tier models are pretty useless for reasoning or tool calls and can’t run with any sizable system prompt… it seems to be solely based on tokens per second?

AstroBen2mo ago

This doesn't look accurate to me. I have an RX9070 and I've been messing around with Qwen 3.5 35B-A3B. According to this site I can't even run it, yet I'm getting 32tok/s ^.-

misnome2mo ago

It seems to be missing a whole load of the quantized Qwen models, Qwen3.5:122b works fine in the 96GB GH200 (a machine that is also missing here....)

mongrelion2mo ago

Which quantization are you running and what context size? 32tok/s for that model on that card sounds pretty good to me!

paxys2mo ago

I wish creators of local model inference tools (LM Studio, Ollama etc.) would release these numbers publicly, because you can be sure they are sitting on a large dataset of real-world performance.

pants22mo ago

This really highlights the impracticality of local models:

My $3k Macbook can run `GPT-OSS 20B` at ~16 tok/s according to this guide.

Or I can run `GPT-OSS 120B` (a 6X larger model) at 360 tok/s (30X faster) on Groq at $0.60/Mtok output tokens.

To generate $3k worth of output tokens on my local Mac at that pricing it would have to run 10 years continuously without stopping.

There's virtually no economic break-even to running local models, and no advantage in intelligence or speed. The only thing you really get is privacy and offline access.

danny_codes2mo ago

A million tokens is like 5 minutes of inference for heavy coding use.

girvo2mo ago

At work I regularly hit my 7.5mil tokens per hour limit one of our tools has, and have to switch model of tool, and I’m not even really a remotely heavy user. I think people don’t realise how many tokens get burned with CoT and tool calls these days

At 7.5mil per hour hard limit, 84 days to hit the grandparents $3k

That said local models really are slow still, or fast enough and not that great

reverius422mo ago

They already stated they can only generate 57,600 tokens per hour locally (expressed as 16 tokens per second). So that's the limiting factor here.

xandrius2mo ago

You're saying it as if privacy was worthless? Also not many people would consider the price of buying a macbook and put it strictly towards running a local model.

Instead if you wanted to get a macbook anyway, you get to run local models for free on top. Very different story.

pants22mo ago

The privacy angle is not that interesting to me.

- You can find inference providers with whatever privacy terms you're looking for

- If you're using LLMs with real data (let's say handling GMail) then Google has your data anyway so might as well use Gemini API

- Even if you're a hardcore roll-your-own-mail-server type, you probably still use a hosted search engine and have gotten comfortable with their privacy terms

Also on cost the point is you can use an API that's many times smarter and faster for a rounding error in cost compared to your Mac. So why bother with local except for the cool factor?

throwdbaaway2mo ago

90% of what you pay in agentic coding is for cached reads, which are free with local inference serving one user. This is well known in r/LocalLLaMA for ages, and an article about this also hit HN front page few weeks ago.

dexterlagan2mo ago

Same as top comment, have spent a lot of time on local models. IMHO, qwen3.5 is the very first model that is actually usable for serious work, ever - and I've tried them all. The 35B 3B is very smart. It understands things no other local model I've ever used does, it's that good. The 9B runs on my slow Mac, and it's also very 'smart'. I can say with confidence that 2026 is the year of the local model, at last.

mkagenius2mo ago

Literally made the same app, 2 weeks back - https://news.ycombinator.com/item?id=47171499

mongrelion2mo ago

What front-end framework did you use? I find the UI so visually appealing

hatthew2mo ago

FWIW, while I find it appealing, I also strongly associate it with "vibe coded webapp of dubious quality," so personally I'm not gonna try to replicate it myself.

mkagenius2mo ago

Thanks. I actually used Google AI Studio for this. Prompted with my color choices and let it do the rest, turned out pretty good.

coinexpert2mo ago

Most of the friction around local AI comes from juggling different runtimes for different providers. We built Milady specifically to solve that — one unified runtime that works with Ollama, OpenAI, Anthropic, and others. Switch providers without rewriting a line of code. Fully offline capable, zero telemetry. Happy to answer questions if anyone's curious: milady.ai

scorpioxy2mo ago

Besides trying to run on your own hardware, anybody have recommendations for running some decent models on one of the many "AI clouds" providers? This is for sporadic use and so maybe one of the "serverless" providers that bill by the hour or minute or similar as opposed to monthly renting GPUs.

There are quite a few of them but their marketing is just confusing and full of buzz words. I've been tinkering with OpenRouter that acts as a middleman.

metrix2mo ago

use openrouter, and call it a day. auto switching between providers, connectivity to all clouds and even works with free models

scorpioxy2mo ago

Yeah, that's what I've been doing. But in terms of privacy policies, I have to review(and trust) 2 providers instead of 1. OpenRouter and whatever provider is used for any particular model. I agree with you that it is more convenient though.

ActorNightly2mo ago

I mean AWS bedrock fits your use case pretty much. They have a bunch of models that are serverless that you can use on a per token pricing cost.

Gemini api use also comes with a free tier.

scorpioxy2mo ago

Thanks, I'll check out Bedrock. I was under the impression they only provide "enterprise" access as OpenRouter uses them as one of the providers but I didn't actually check. Looking at their docs now, seems I was wrong.

suheilaaita2mo ago

The simplest way to really start, use anything like claude code, vs code, cursor, antigratvity, (or any other IDE) ask them to install ollama and pull the latest solid local model that was released that you can run based on your computer specs.

Wait 5-10 minutes, and should be done.

It genuinely is that simple.

You can even use local models using claude code or codex infrastrucutre (MASSIVE UNLOCK), but you need solid GPU(s) to run decent models. So that's the downside.

cloogshicer2mo ago

Genuine question, will this actually give you the latest solid local model?

I would've thought no, because of the knowledge cutoff in whatever model you use to download it.

suheilaaita2mo ago

I think it will give you a good "starter model". But then, it ultimately depends on what you want to do with the model exactly and your computer's specs.

For example, I needed a local model to review some transactions and output structured output in .json format. Not all local models are necesserily good at structured outputs, so I asked grok (becuase it has solid web search and is up to date), what are the best recommended models given this use case and my laptop's specs. It suggested a few models, I chose one and went for it and now it's working.

To summarise, - find model given use case and specs. - trial and error - test other models (if needed) - rinse repeat - because models are always coming out and getting better

jiggunjer2mo ago

What knowledge cutoff? They all have web agents to Google it.

suheilaaita2mo ago

They all do, true. But some are better than the others in how they retrieve, digest and present you with the information. Boils down to personal preferences and experimenting.

amelius2mo ago

It would be great if something like this was built into ollama, so you could easily list available models based on your current hardware setup, from the CLI.

rootusrootus2mo ago

Someone linked to llmfit. That would be a great tool to integrate with ollama. Just highlight the one you want and tell it to install.

Quick, someone go vibe code that.

dugidugout2mo ago

The latest level of abstraction! You just release your ideas half baked in some internet connected box and wake up with products! Yahoo! Onwards into the Gestell!

rootusrootus2mo ago

Okay, now I’m tempted to set up a bluesky account that takes requests and spits out working software.

I’m certain this has already been done. It’s too obvious, and too hilarious.

1 more reply

modernerd2mo ago

Would love it more if it could help me to answer:

- Which models in the list are the best for my selected task? (If you don't track these things regularly, the list is a little overwhelming.) Sorting by various benchmark scores might be useful?

- How much more system resources do I need to run the models currently listed at F, D or C at B, A, or S-tier levels? (Perhaps if you hover over the score, it could tell you?)

sshagent2mo ago

I don't see my beloved 5060ti. looks great though

xylon2mo ago

nor my plain 5060

dzink2mo ago

This would be wonderful if it is accurate - instead of guesstimating, let people report their actual findings. I can confirm GLM 4.7 is possible on M1 Max and it can do nice comprehensive answers (albeit at 12 min an answer) locally. You can also easily do Mistral7B and OSS 20B and others. Structure it as a way to report accruals, similarly to Levels.xyz for salaries, instead of guestimating.

starkeeper2mo ago

This is awesome!!!

Could you please add title="explanation" over each selected item at the top. For example, when I choose my video card the ram changes... I'm not sure if the RAM selection is GPU RAM? The GRAM was already listed with the graphics card. SO I choose 96GB which is my Main memory? And the GB/s I am assuming it's GPU -> CPU speed?

TheCapn2mo ago

@OP are you the creator? Could you add my GPU to the list?

Radeon VII

https://www.amd.com/en/support/downloads/drivers.html/graphi...

sidchilling2mo ago

I have been trying to run Qwen Coder models (8B at 4bit) on my M3 Pro 18GB behind Ollama and connecting codex CLI to it. The tool usage seems practically zero, like it returns the tool call in text JSON and codex CLI doesn’t run the tool (just displays the tool call in text). Has anyone succeeded in doing something like this? What am I missing?

mongrelion2mo ago

It might be that the system prompt sent by codex is not optimal for that model. Try with open code and see if your results improve

MikeNotThePope2mo ago

I have the same hardware. Been curious about trying it with Opencode.

rurban2mo ago

No, not yet. Up to a single H100 there is no single local model, which doesn't make your code worse. (excluding trivial stuff like ruby, python, typescript). Implement features, fix bugs.

Right now we started experimenting with 2 H100's, 160GB models. But even a single one is wide out of anyone others league.

dirk940182mo ago

We wrote the linuxtoaster inference engine, toasted, and are getting 400 prefill, 100 gen on a M4 Max w 128GB RAM on Qwen3-next-coder 6bit, 8bit runs too. KV caching means it feels snappy in chat mode. Local can work. For pro work, programming, I'd still prefer SOTA models, or GLM 4.7 via Cerebras.

vova_hn22mo ago

It says "RAM - unknown", but doesn't give me an option to specify how much RAM I have. Why?

arjie2mo ago

Cool website. The one that I'd really like to see there is the RTX 6000 Pro Blackwell 96 GB, though.

SXX2mo ago

Sorry if already been answered, but will there be a metric for latency aka time to first token?

Since I considered buying M3 Ultra and feel like it the most often discussed regarding using Apple hardware for runninh local LLMs. Where speed might be okay, but prompt processing can take ages.

teaearlgraycold2mo ago

Wait for the M5 Ultra. It will get the 4x prompt processing speeds from the rest of the M5 product line. I hear rumors it will be released this year.

Stronz2mo ago

One thing I noticed with local models is that conversation behavior tends to drift over time as well.

Even when running locally, the model often starts structured but gradually becomes more verbose or explanatory in longer threads.

Curious if others have seen similar behavior when using local setups.

Decabytes2mo ago

Does anyone use the super tiny models for anything ? Like in the 2billion or lower parameter level?

genpfault2mo ago

Speculative decoding[1]?

[1]: https://github.com/ggml-org/llama.cpp/blob/master/docs/specu...

tristor2mo ago

This does not seem accurate based on my recently received M5 Max 128GB MBP. I think there's some estimates/guesswork involved, and it's also discounting that you can move the memory divider on Unified Memory devices like Apple Silicon and AMD AI Max 395+.

mrdependable2mo ago

This is great, I've been trying to figure this stuff out recently.

One thing I do wonder is what sort of solutions there are for running your own model, but using it from a different machine. I don't necessarily want to run the model on the machine I'm also working from.

cortesoft2mo ago

Ollama runs a web server that you use to interact with the models: https://docs.ollama.com/quickstart

You can also use the kubernetes operator to run them on a cluster: https://ollama-operator.ayaka.io/pages/en/

rebolek2mo ago

ssh?

kylehotchkiss2mo ago

My Mac mini rocks qwen2.5 14b at a lightning fast 11/tokens a second. Which is actually good enough for the long term data processing I make it spend all day doing. It doesn’t lock up the machine or prevent its primary purpose as webserver from being fulfilled.

ge962mo ago

Raspberry pi? Say 4B with 4GB of ram.

I also want to run vision like Yocto and basic LLM with TTS/STT

boutell2mo ago

I've been trying to get speech to text to work with a reasonable vocabulary on pis for a while. It's tough. All the modern models just need more GPU than is available

meatmanek2mo ago

For ASR/STT on a budget, you want https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 - it works great on CPU.

I haven't tried on a raspberry pi, but on Intel it uses a little less than 1s of CPU time per second of audio. Using https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/a... for chunked streaming inference, it takes 6 cores to process audio ~5x faster than realtime. I expect with all cores on a Pi 4 or 5, you'd probably be able to at least keep up with realtime.

(Batch inference, where you give it the whole audio file up front, is slightly more efficient, since chunked streaming inference is basically running batch inference on overlapping windows of audio.)

EDIT: there are also the multitalker-parakeet-streaming-0.6b-v1 and nemotron-speech-streaming-en-0.6b models, which have similar resource requirements but are built for true streaming inference instead of chunked inference. In my tests, these are slightly less accurate. In particular, they seem to completely omit any sentence at the beginning or end of a stream that was partially cut off.

ge962mo ago

Whispr?

For wakewords I have used pico rhino voice

I want to use these I2S breakout mics

winterismute2mo ago

Oddly, the website lists "M4 Ultra" which however does not exist... Also, it does not account for Apple Silicon chips to have up to 512GB of memory in some cases, but that might be only a limitation of the gathered data.

intrasight2mo ago

Your LLM visited the future

hotsalad2mo ago

This says I can't run anything, because it's missing some of the smallest models. I know that I can run Qwen3.5 up to 4B, Ministral 3B, Qwen3VL up to 4B, and I know there are some Gemmas and Llamas in my size range.

manlymuppet2mo ago

Would be useful if comparable scores for performance are added, perhaps from arena.ai or ARC. I know scores can be imperfect, but it would be nice to be able to easily see what the best model your machine can handle is.

eichin2mo ago

I'm surprised that this shows anything running usefully on my 2021-era thinkpad (with "Iris Xe"'TigerLake graphics) which inspires me to ask - are external GPUs useful for this sort of thing?

comrade12342mo ago

I can't tell at a glance what this page is showing, but I am curious about the licenses on the various models that let me run it locally and make money off it. Awhile ago only deepseek let you do that - not sure now.

mind_heist2mo ago

nice, this is an interesting idea. Can you elaborate on the licensing issue ? how do you get blocked for using the models commercially ?

comrade12342mo ago

Just read the license agreement. Last time I looked into this the only model I could run locally and do what I want was deepseek. I think it was the MIT license. The others had various restrictions that just didn't make it worth it.

I stopped researching this because buying the hardware to run deepseek full model just isn't practical right now. Our customers will have to be happy with us sending data to OpenAI/deepseek/etc if they want to use those features.

singpolyma32mo ago

qwen3.5 is just apache

dale_glass2mo ago

It's missing Ryzen AI MAX+, which is sort of the Apple Silicon equivalent.

ThrowawayTestr2mo ago

For image generation or even video generation, local models are totally feasible. I can generate a 5 second clip with wan 2.2 in about 30 minutes on my 3060 12G. Plus, I have full control on the loras used.

brcmthrowaway2mo ago

If anyone hasn't tried Qwen3.5 on Apple Silicon, I highly suggest you to! Claude level performance on local hardware. If the Qwen team didn't get fired, I would be bullish on Local LLM.

zitterbewegung2mo ago

The M4 Ultra doesn't exist and there is more credible rumors for an M5 Ultra. I wouldn't put a projection like that without highlighting that this processor doesn't exist yet.

kuon2mo ago

I have amd 9700 and it is not listed while it is great llm hardware because it has 32Gb for a reasonable price. I tried doing "custom" but it didn't seem to work.

The tool is very nice though.

stared2mo ago

I would love to see on this list some (any) benchmark.

"I can run a model" is mildly interesting. I can run OSS-20B on my M1 Pro. It works, I tried it, just I don't find any application.

raiph_ai2mo ago

Great site, I have an M2 and M3pro and was thinking about getting and Ultra M4 and wanted to know if it was going to be worth it. Now I can see exactly what models I can run locally.

adithyassekhar2mo ago

This just reminded me of this https://www.systemrequirementslab.com/cyri.

Not sure if it still works.

butILoveLife2mo ago

This is borderline irresponsible. Conflating first tokens with all tokens is terrible. Apple looks far better than it actually is.

Just ask any Apple user, they don't actually use local models.

opengrass2mo ago

It would be more broad if you entered a Passmark score. Us CPU/integrated suckers can only see if we beat a RPI4, and same goes for GPUs.

havaloc2mo ago

Missing the A18 Neo! :)

debatem12mo ago

For me the "can run" filter says "S/A/B" but lists S, A, B, and C and the "tight fit" filter says "C/D" but lists F.

Just FYI.

starkparker2mo ago

Every time I refresh the page, I get a higher tokens/second value, presumably because of the keying off memory bandwidth.

golem142mo ago

Has anyone actually built anything with this tool?

The website says that code export is not working yet.

That’s a very strange way to advertise yourself.

urba_2mo ago

Man, I wonder when there will be AI server farms made from iCloud locked jailbroken iPhone 16s with backported MacOS

anigbrowl2mo ago

Useful tool, although some of the dark grey text is dark that I had to squint to make it out against the background.

lrpe2mo ago

This site desperately needs a light mode.

fraywing2mo ago

This is amazing. Still waiting for the "Medusa" class AMD chips to build my own AI machine.

storus2mo ago

Missing latest Nvidia cards like RTX Pro 6000; M3 Ultra can have at most 192GB selected etc.

bearjaws2mo ago

So many people have vibe coded these websites, they are posted to Reddit near daily.

3Sophons2mo ago

a lighter-weight alternative of docker and python is the Rust+Wasm stack https://github.com/LlamaEdge/LlamaEdge

GTP2mo ago

If I'm sorting models by score (the default), which kind of score is it?

amelius2mo ago

What is this S/A/B/C/etc. ranking? Is anyone else using it?

relaxing2mo ago

Apparently S being a level above A comes from Japanese grading. I’ve been confused by that, too.

swiftcoder2mo ago

It's very common in Japanese-developed video games as well

bitexploder2mo ago

Common in gaming culture. Kind of a meme template. S tier is the best tier of something. People make tier lists of all sorts of things with that grading.

vikramkr2mo ago

Just a tier list I think

jrmg2mo ago

Is there a reliable guide somewhere to setting up local AI for coding (please don’t say ‘just Google it’ - that just results in a morass of AI slop/SEO pages with out of date, non-self-consistent, incorrect or impossible instructions).

I’d like to be able to use a local model (which one?) to power Copilot in vscode, and run coding agent(s) (not general purpose OpenClaw-like agents) on my M2 MacBook. I know it’ll be slow.

I suspect this is actually fairly easy to set up - if you know how.

AstroBen2mo ago

Ollama or LM Studio are very simple to setup.

You're probably not going to get anything working well as an agent on an M2 MacBook, but smaller models do surprisingly well for focused autocomplete. Maybe the Qwen3.5 9B model would run decently on your system?

jrmg2mo ago

Right - setting up LM studio is not hard. But how do I connect LM Studio to Copilot, or set up an agent?

NortySpock2mo ago

I tried the Zed editor and it picked up Ollama with almost no fiddling, so that has allowed me to run Qwen3.5:9B just by tweaking the ollama settings (which had a few dumb defaults, I thought, like assuming I wanted to run 3 LLMs in parallel, initially disabling Flash Attention, and having a very short context window...).

Having a second pair of "eyes" to read a log error and dig into relevant code is super handy for getting ideas flowing.

brcmthrowaway2mo ago

Basically LM Studio has a server that serves models over HTTP (localhost). Configure/enable the server and connect OpenCode to it.

Try this article https://advanced-stack.com/fields-notes/qwen35-opencode-lm-s...

I'm looking for an alternative to OpenCode though, I can barely see the UI.

1 more reply

AstroBen2mo ago

It looks like Copilot has direct support for Ollama if you're willing to set that up: https://docs.ollama.com/integrations/vscode

For LM Studio under server settings you can start a local server that has an OpenAI-compatible API. You'd need to point Copilot to that. I don't use Copilot so not sure of the exact steps there

kristianp2mo ago

https://github.com/ggml-org/llama.cpp/releases - has mac binaries

https://unsloth.ai/docs/models/qwen3.5 - running locally guide for the Qwen 3.5 family of models, which have a range of different sizes.

randusername2mo ago

Personally I'd start with llamafile [0] then move to compiling your own llama.cpp.

It's not as bad as you might think to compile llama.cpp for your target architecture and spin up an OpenAI compatible API endpoint. It even downloads the models for you.

[0]: https://github.com/mozilla-ai/llamafile

thexa42mo ago

I've created a llama.cpp integration with Copilot in vscode. The extension readme contains setup instructions: https://marketplace.visualstudio.com/items?itemName=delft-so...

chatmasta2mo ago

Any time I google something on this topic, the results are useful but also out of date, because this space is moving so absurdly fast.

ementally2mo ago

In mobile section it is missing Tensor chips (used by Google Pixel devices).

vednig2mo ago

Our work at DoShare is a lot of this stuff we've been on it for 2 years

markdown2mo ago

Protip: Website requires CMD+ a few times to increase font size to 200%.

spidrahedron2mo ago

This is super useful for people not having access to GPUs and Servers

amelius2mo ago

Why isn't there some kind of benchmark score in the list?

ipunchghosts2mo ago

What is S? Also, NVIDIA RTX 4500 Ada is missing.

Western02mo ago

I like a old gemma 3n 4EB for small device

tcbrah2mo ago

tbh i stopped caring about "can i run X locally" a while ago. for anything where quality matters (scripting, code, complex reasoning) the local models are just not there yet compared to API. where local shines is specific narrow tasks - TTS, embeddings, whisper for STT, stuff like that. trying to run a 70b model at 3 tok/s on your gaming GPU when you could just hit an API for like $0.002/req feels like a weird flex IMO

hatthew2mo ago

For me and probably many other people, local has nothing to do with cost and everything to do with privacy

tcbrah2mo ago

genuine question - what are you working on that needs that level of privacy? outside of NSFW stuff most API providers arent doing anything with your prompts

hatthew2mo ago

I would answer that, but it's private :)

I can think of several reasons: corporate policy, personal principles, NSFW stuff, illegal stuff

tencentshill2mo ago

Missing laptop versions of all these chips.

ryandrake2mo ago

Missing RTX A4000 20GB from the GPU list.

sand5002mo ago

How does it have details for M4 ultra?

nicklo2mo ago

the animation of the model name text when opening the detail view is so smooth and delightful

g_br_l2mo ago

could you add raspi to the list to see which ridiculously small models it can run?

Bydgoszczo2mo ago

ok, what buy a hardware for local ai for agent (coding, and other) in 2026

reactordev2mo ago

This shows no models work with my hardware but that’s furthest from the truth as I’m running Qwen3.5…

This isn’t nearly complete.

kennywinker2mo ago

Well… don’t keep us guessing -what hardware? And which size qwen3.5?

Readerium2mo ago

Qwen 3.5 4B is the goat then

d01002mo ago

Why is there no RTX 5060ti?

bheadmaster2mo ago

Missing 5060 Ti 16GB

lagrange772mo ago

Finally! I've been waiting for something like this.

S4phyre2mo ago

Oh how cool. Always wanted to have a tool like this.

nilslindemann2mo ago

1. More title attributes please ("S 16 A 7 B 7 C 0 D 4 F 34", huh?)

2. Add a 150% size bonus to your site.

Otherwise, cool site, bookmarked.

nazbasho2mo ago

its perfect

Akuehne2mo ago

Can we get some of the ancient Nvidia Teslas, like the p40 added?

tkfoss2mo ago

Nice UI, but crap data, probably llm generated.

polyterative2mo ago

awesome, needed this

vincentbusch2mo ago

Interesting point about #2. I've been doing something similar but from a different angle — running the same question through Claude, GPT-4o and Gemini to see where they disagree. Turns out they give completely different root causes about 30% of the time, which honestly surprised me.

What's your experience with qwen3.5 for debugging tasks? I've mostly stuck with the big models so far.

remote3body2mo ago

The 'spent 100 hours configuring' part hits home. That fragmentation is exactly why we started building Olares (https://github.com/beclab/Olares).

It’s basically an open-source OS layer that standardizes the local AI stack—Kubernetes (K3s) for orchestration, standardized model serving, and GPU scheduling. The goal is to stop fiddling with Python environments/drivers and just treat local agents like standardized containers. It runs on Mac Minis or dedicated hardware.

j / k navigate · click thread line to collapse

353 comments

mark_l_watson2mo ago

I have spent a HUGE amount of time the last two years experimenting with local models.

A few lessons learned:

1. small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.

2. For coding tools, just use Google Antigravity and gemini-cli, or, Anthropic Claude, or...

I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.

sdrinf2mo ago

Strong, strong recommend.

steve_adams_862mo ago

I like the idea of finding practical uses for it, but so far haven't managed to be creative enough. I'm so accustomed to using these things for programming.

tempoponet2mo ago

1 more reply

threecheese2mo ago

Example: “what is the air speed velocity of a swallow?” - qwen knew it was a Monty Python gag, but couldnt and didnt figure out which one.

scottmf2mo ago

As a person who also knows there's a connection between that phrase and Monty Python and not much more information beyond that, I'm not sure how to feel.

cassianoleal2mo ago

African or European?

1 more reply

8note2mo ago

could that be some of the RL trying to get it to not regurgitate?

the gag is giving in detail which one

1 more reply

kingo552mo ago

How's it compare in quality with larger models in the same series? E.g 122b?

buzzin_2mo ago

The chart on this link compares all qwen3.5 models down to 0.8B.

https://www.reddit.com/r/LocalLLaMA/comments/1ro7xve/qwen35_...

ggsp2mo ago

How much difference are you seeing between standard and Q4 versions in terms of degradation, and is it constant across tasks or more noticeable in some vs others?

rnewme2mo ago

Less than expected, search for unsloths recent benchmark

RRRA2mo ago

I'd be curious to see people give their opinion on embedded models for less tech focused needs, say what's that bug killing spray chemistry like or what is the history of this or that...

boppo12mo ago

flutetornado2mo ago

mongrelion2mo ago

At what temperature did you run it and what was your context limit?

mongrelion2mo ago

I don't understand why I'm getting downvoted.

johnmaguire2mo ago

philipkglass2mo ago

mandeepj2mo ago

> I formerly used that cost hundreds or thousands of dollars for a license

Azure Doc Intelligence charges $1.50 for 1000 pages. Was that an annual/recurring license?

Would you mind sharing your OCR model? I'm using Azure for now, as I want to focus on building the functionality first, but would later opt for a local model.

1 more reply

tempaccount50502mo ago

Bluecobra2mo ago

I didn’t realize that you can get 128GB of memory in a notebook, that is impressive!

lambda2mo ago

I've got a 128 GiB unified memory Ryzen Ai Max+ 395 (aka Strix Halo) laptop.

And memory bandwidth limitations for running the models is real! 10B active parameters at 4-6 bit quants feels usable but slow, much more than that and it really starts to feel sluggish.

1 more reply

AzN1337c0d3r2mo ago

Most workstation class laptops (i.e. Lenovo P-series, Dell Precision) have 4 DIMM slots and you can get them with 256 GB (at least, before the current RAM shortages).

There's also the Ryzen AI Max+ 395 that has 128GB unified in laptop form factor.

Only Apple has the unique dynamic allocation though.

3 more replies

saltwounds2mo ago

I use Raycast and connect it to LM Studio to run text clean up and summaries often. The models are small enough I keep them in memory more often than not

echelon2mo ago

Shouldn't we prioritize large scale open weights and open source cloud infra?

I'm fine with running everything in the cloud as long as we own the software infra and the weights.

girvo2mo ago

I run Qwen3.5-plus through Alibaba’s coding plan (Model Studio): incredibly cheap, pretty fast, and decent. I can’t compare it to the highest released weight one though.

2 more replies

adamkittelson2mo ago

1 more reply

eek21212mo ago

cyanydeez2mo ago

dataflow2mo ago

lilactown2mo ago

If your inbox is as big as mine, you won’t be able to load all the text content into a prompt even with SotA cloud hosted models.

dataflow2mo ago

perbu2mo ago

Prompt injection is a problem if your agent has access to anything.

The local models are quite weak here.

dataflow2mo ago

My question is really just about what can handle that volume of data (ideally, with the quoted sections/duplications/etc. that come with email chains) and still produce useful (textual) output.

1 more reply

dhblumenfeld12mo ago

storus2mo ago

kylehotchkiss2mo ago

I've been really interested in the difference between 3.5 9b and 14b for information extraction. Is there a discernible difference in quality of capability?

Western02mo ago

I only put this https://unsloth.ai/docs/models/qwen3.5 (please look at old gemma 3n 4eb, and small IBM granite)

chrisweekly2mo ago

Thanks for this, Mark. And for your website and books and generosity of spirit. Signal in the noise. Have an awesome weekend!

manmal2mo ago

What about running e.g. Qwen3.5 128B on a rented RTX Pro 6000?

girvo2mo ago

IMO you’re better off using qwen3.5-plus through the model studio coding plan, but ymmv

sakesun2mo ago

Becoming a retired builder is the ultimate bliss.

nine_k2mo ago

What kind of hardware did you use? I suppose that a 8GB gaming GPU and a Mac Pro with 512 GB unified RAM give quite different results, both formally being local.

fzzzy2mo ago

A Mac Pro with 512 gb unified ram does not exist.

nine_k2mo ago

Mac Studio Ultra, my bad. The 512 GB option existed up until March 2026: https://macdailynews.com/2026/03/06/apple-drops-512gb-m3-ult...

meatmanek2mo ago

lambda2mo ago

Yeah, I looked up some models I have actually run locally on my Strix Halo laptop, and its saying I should have much lower performance than I actually have on models I've tested.

For MoE models, it should be using the active parameters in memory bandwidth computation, not the total parameters.

pbronez2mo ago

The docs page addresses this:

https://www.canirun.ai/docs

lambda2mo ago

littlestymaar2mo ago

While your remark is valid, there's two small inaccuracies here:

> GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.

First, the token generation speed is going to be comparable, but not the prefil speed (context processing is going to be much slower on a big MoE than on a small dense model).

lambda2mo ago

So by doing so, this calculator is telling you that you should be running entirely dense models, and sparse MoE models that maybe both faster and perform better are not recommended.

littlestymaar2mo ago

I agree, and I even started my response expressing my agreement with the whole point.

But since this is a tech forum, I assumed some people would be interested by the correction on the details that were wrong.

tommy_axle2mo ago

mopierotti2mo ago

This (+ llmfit) are great attempts, but I've been generally frustrated by how it feels so hard to find any sort of guidance about what I would expect to be the most straightforward/common question:

"What is the highest-quality model that I can run on my hardware, with tok/s greater than <x>, and context limit greater than <y>"

I'm assuming this is because the tok/s is much higher, but I don't really get much marginal utility out of tok/s speeds beyond ~50 t/s, and there's no way to sort results by quality.

0xbadcafebee2mo ago

Too generic question. Gotta be more specific:

   "what is the best open weight model for high-quality coding that fits in 8GB VRAM and 32GB system RAM with t/s >= 30 and context >= 32768" -> Qwen2.5-Coder-7B-Instruct

   "what is the best open weight model for research w/web search that fits in 24GB VRAM and 32GB system RAM with t/s >= 60 and context >= 400k" -> Qwen3-30B-A3B-Instruct-2507

   "what is the best open weight embedding model for RAG on a collection of 100,000 documents that fits in 40GB VRAM and 128GB system RAM with t/s >= 50 and context >= 200k" -> Qwen3-Embedding-8B

Specific models & sizes for specific use cases on specific hardware at specific speeds.

comboy2mo ago

What is the $/Mtok that would make you choose your time vs savings of running stuff locally?

Or is it not about the cost at all, just about not pushing your data into the clouds.

mopierotti2mo ago

Nevertheless, I spend a lot of time with local models because of:

1. Pure engineering/academic curiosity. It's a blast to experiment with low-level settings/finetunes/lora's/etc. (I have a Cog Sci/ML/software eng background.)

3. It's nice to be able solve simple tasks without having to reason about any external 'side-effects' outside my machine.

wilkystyle2mo ago

phillmv2mo ago

i can think of some tasks (classification, structured info extraction) that i _imagine_ even small meh models could do quite well at

on data i would never ever want to upload to any vendor if i can avoid it

J_Shelby_J2mo ago

It’s a hard problem. I’ve been working on it for the better part of a year.

But “quality” is the hard part. In this case I’m just choosing the largest quants.

mopierotti2mo ago

Supporting all the various devices does sound quite challenging.

downrightmike2mo ago

twampss2mo ago

Is this just llmfit but a web version of it?

https://github.com/AlexsJones/llmfit

deanc2mo ago

Yes. But llmfit is far more useful as it detects your system resources.

Someone12342mo ago

I feel like they both solve different issues well:

- If you already HAVE a computer and are looking for models: LLMFit

- If you are looking to BUY a computer/hardware, and want to compare/contrast for local LLM usage: This

You cannot exactly run LLMFit on hardware you don't have.

shrinks992mo ago

Yes, but you can get LLMFit to recommend hardware requirements with `llmfit plan --context <TOKENS> <MODEL>`.

dgrin912mo ago

Honestly I was surprised about this. It accurately got my GPU and specs without asking for any permissions. I didnt realize I was exposing this info.

johnisgood2mo ago

Why were you surprised?

You can check out here how it does that: https://github.com/AlexsJones/llmfit/blob/main/llmfit-core/s...

To detect NVIDIA GPUs, for example: https://github.com/AlexsJones/llmfit/blob/main/llmfit-core/s...

In this case it just runs the command "nvidia-smi".

Note: llmfit is not web-based.

spudlyo2mo ago

I run LibreWolf, which is configured to ask me before a site can use WebGL, which is commonly used for fingerprinting. I got the popup on this site, so I assume that's how they're doing it.

1 more reply

dekhn2mo ago

How could it not? That information is always available to userspace.

1 more reply

rithdmc2mo ago

Do you mean the OPs website? Mine's way off.

> Estimates based on browser APIs. Actual specs may vary

rootusrootus2mo ago

That's super handy, thanks for sharing the link. Way more useful than the web site this post is about, to be honest.

1 more reply

LeifCarrotson2mo ago

This lacks a whole lot of mobile GPUs. It also does not understand that you can share CPU memory with the GPU, or perform various KV cache offloading strategies to work around memory limits.

But I don't run out of tokens.

rahimnathwani2mo ago

This site presents models in an incomplete and misleading way.

When I visit the site with an Apple M1 Max with 32GB RAM, the first model that's listed is Llama 3.1 8B, which is listed as needing 4.1GB RAM.

But the weights for Llama 3.1 8B are over 16GB. You can see that here in the official HF repo: https://huggingface.co/meta-llama/Llama-3.1-8B/tree/main

The model this site calls 'Llama 3.1 8B' is actually a 4-bit quantized version ( Q4_K_M) available on ollama.com/library: https://ollama.com/library/llama3.1:8b

This matters because different quantized versions of the model will have different RAM requirements and different performance characteristics.

zargon2mo ago

They appear to be using Ollama as a data source. Ollama does that sort of thing regularly.

sxates2mo ago

Cool thing!

A couple suggestions:

utopcell2mo ago

Unfortunately, Apple retired the 512GiB models.

ProllyInfamous2mo ago

Sure, but those already sold still exist.

ActorNightly2mo ago

>. I have an M3 Ultra with 256GB of memory,

Hell, even one 3090 can now run Gemma 3 27b qat very fast.

brulard2mo ago

Are you aware that your 3090s have nowhere close to 256GB of VRAM? Or maybe you are not aware that on macs you have unified memory (working both as RAM and VRAM).

ActorNightly2mo ago

Are you aware that having ram doesn't matter when your tokens/second is slow as shit?

You don't need to run large models, Gemma QAT 27B fits on one GPU and is quite good. Other models like Qwen3 are great for coding.

3090 gets 100+ tokens/second for QWEN, very close to what you would see with a cloud based model.

M3 ultra gets ~30.

Congrats, you played yourself.

1 more reply

nozzlegear2mo ago

> a dual 3090 workstation that would have been better for pretty much everything

Doesn't run macOS

xiconfjs2mo ago

Except if you are living in a region where electricity is quite expensive :/

gentleman112mo ago

ask apple to graciously allow you to install your own ram in the computer you "own"

torginus2mo ago

Huh, I never knew my browser just volunteers my exact hardware specs to any website without so much as even notifying me about it.

Jaxan2mo ago

It doesn’t really. The website thinks I’m on a iPhone 19 pro, although I’m actually on a iPhone SE 1st gen. So it’s off by roughly a decade.

weikju2mo ago

> on a iPhone 19 pro

I wish the website could tell us how life is like in 2027!

torginus2mo ago

Maybe that's one of Safari's numerous 'quirks' our frontend devs keep bitching about.

Which in this case Im thankful that Apple isn't too keen on following standards like these.

tstrimple2mo ago

Mine is radically off as well. Says I've got a GeForce 980 or equivalent with 4GB instead of a 5090. I'm guessing the detection only really works on Chromium based browsers.

DanielHB2mo ago

This stuff is used a lot in browser fingerprinting for tracking purposes. More privacy-focused browsers usually feed randomized info.

hotsalad2mo ago

The latest Librewolf prompted me to allow the site permission to make a WebGL context. That's what it used for hardware detection.

nozzlegear2mo ago

It couldn't guess any of my hardware specs when I opened the page in Safari on my Mac.

ebbi2mo ago

I thought that's how airlines do the whole trickery around having different pricing if you access the site from Windows or Mac...

dxxvi2mo ago

dyauspitr2mo ago

For learning and general searching I find ChatGPT to be the best.

Nano Banana Pro for anything image and video related.

Grok Imagine for pretty decent porn generation.

never_inline2mo ago

Didn't want to hear the grok thing from handle named "dyauspitr". My day is ruined.

dyauspitr2mo ago

I would imagine the Dyaus was pretty randy if Zeus is anything to go by.

1 more reply

mmaunder2mo ago

OP can you please make it not as dark and slightly larger. Super useful otherwise. Qwen 3.5 9B is going to get a lot of love out of this.

ProllyInfamous2mo ago

I'm not usually one to whine, but agreed; additionally, add contrast to the modifiers (e.g. processor select). First thing I did when I visited was scale the website to 150%

----

aanet2mo ago

The website is super useful. That theme though... low-contrast text on too-dark theme is, uh, barely readable for me.

duskdozer2mo ago

Have to disagree in part at least. Text is pretty small which isn't good, but I'm glad to see it when sites don't succumb to the make-dark-mode-lighter trend.

nozzlegear2mo ago

I can't see shit on this website lol. It'd be nice if they had a switch to toggle a light mode.

ricardbejaranoOP2mo ago

OP here, it's not mine though!

andy_ppp2mo ago

Love the idea though!

GeekyBear2mo ago

> Is it correct that there's zero improvement in performance between M4 (+Pro/Max) and M5 (+Pro/Max)

Preliminary testing did not come to that conclusion.

> Apple’s New M5 Max Changes the Local AI Story

https://www.youtube.com/watch?v=XGe7ldwFLSE

lostmsu2mo ago

From the video: 4.4k is "almost" 4x times 1.8k because 4.4k has "number 4" in the beginning, and the other one - number 1.

For the lazy: that's less then 3x: 1.8 * 3 = 5.4

andy_ppp2mo ago

It’s not even the largest part, just prefill so I think maybe M5 Max is 30% faster overall. Still pretty good I think but the 4x nonsense is just marketing!

carra2mo ago

phelm2mo ago

This is awesome, it would be great to cross reference some intelligence benchmarks so that I can understand the trade off between RAM consumption, token rate and how good the model is

metalliqaz2mo ago

cafed00d2mo ago

Open with multiple browsers (safari vs chrome) to get more "accurate + glanceable" rankings.

Its using WebGPU as a proxy to estimate system resource. Chrome tends to leverage as much resources (Compute + Memory) as the OS makes available. Safari tends to be more efficient.

Maybe this was obvious to everyone else. But its worth re-iterating for those of us skimmers of HN :)

azmenak2mo ago

Currently, Nemotron 3 Super using Unsloth's UD Q4_K_XL quant is running nearly everything I do locally (replacing Qwen3.5 122b)

DrAwdeOccarim2mo ago

Totally doing this today! Have you tried OpenJarvis or NemoClaw (is it out yet?). I want to use my computer “through” the LLM.

amdivia2mo ago

I found this to be inaccurate, I can run OSS GPT 120B (4 bit quant) on my 5090 and 64 ram system with around 40 t/s. Yet here the site claims it won't work

freediddy2mo ago

sroussey2mo ago

No, getting bad results slowly is much worse. Bad results quickly and you can make adjustments.

But yes, if there is a choice I want quality over speed. At same quality, I definitely want speed.

orthoxerox2mo ago

bityard2mo ago

Yeah, this site is iffy at best. I didn't even see Strix Halo on the list, but I selected 128GB and bumped up the memory bandwidth. It says gpt-oss-120b "barely runs" at ~2 t/s.

In reality, gpt-oss-120b fits great on the machine with plenty of room to spare and easily runs inference north of 50 t/s depending on context.

John238322mo ago

RTX Pro 6000 is a glaring omission.

embedding-shape2mo ago

schaefer2mo ago

No Nvidia Spark workstation is another omission.

am17an2mo ago

GrayShade2mo ago

This feels a bit pessimistic. Qwen 3.5 35B-A3B runs at 38 t/s tg with llama.cpp (mmap enabled) on my Radeon 6800 XT.

Aurornis2mo ago

At what quantization and with what size context window?

GrayShade2mo ago

Looks like it's a bit slower today. Running llama.cpp b8192 Vulkan.

$ ./llama-cli unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 65536 -p "Hello"

[snip 73 lines]

[ Prompt: 86,6 t/s | Generation: 34,8 t/s ]

$ ./llama-cli unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 262144 -p "Hello"

[snip 128 lines]

[ Prompt: 78,3 t/s | Generation: 30,9 t/s ]

I suspect the ROCm build will be faster, but it doesn't work out of the box for me.

zahirbmirza2mo ago

Also, I have to add, this project is an excellent piece of work.

StefanoC2mo ago

At the moment I'm exploring:

- nightmedia/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-qx64-hi-mlx

- BeastCode/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit

- mlx-community/Qwen3-Coder-Next-4bit

johneth2mo ago

Re: the design of the site. Please use higher contrast colours, especially the barely visible grey text on black background. It's annoying to try to read.

0xbadcafebee2mo ago

Couple thoughts:

- Why is there an initial section which is all "performs poorly", and then "all models" below it shows a ton of models that perform well?

gopalv2mo ago

Chrome runs Gemini Nano if you flip a few feature flags on [1].

The model is not great, but it was the "least amount of setup" LLM I could run on someone else's machine.

Including structured output, but has a tiny context window I could use.

[1] - https://notmysock.org/code/voice-gemini-prompt.html

adamhsn2mo ago

Cool project!!

It would be useful to filter which model to use based on the objective or usage (i.e., for data extraction vs. coding).

sdingi2mo ago

When running models on my phone - either through the web browser or via an app - is there any chance it uses the phone's NPU, or will these be GPU only?

RagnarD2mo ago

I have an RTX 6000 Pro Max-Q, which has 96GB VRAM. It identified the hardware correctly but incorrectly thought it had 4GB, at least if I interpret the RAM dropdown correctly.

Then it shows the full resolution models, which are completely unnecessary to run quality inference. Quantized models are routine for local inference and it should realize that.

Needs work.

kpw942mo ago

People complaining about how hard to get simple answer is don't appreciate the complexity in figuring out optimal models...

There's so many knobs to tweak, it's a non trivial problem

- Average/median length of your Prompts

- prompt eval speed (tok/s)

- token generation speed (tok/s)

- Image/media encoding speed for vision tasks

- Total amount of RAM

- Max bandwidth of ram (ddr4, ddr5, etc.?)

- Total amount of VRAM

- "-ngl" (amount of layers offloaded to GPU)

- Context size needed (you may need sub 16k for OCR tasks for instance)

- Size of billion parameters

- Size of active billion parameters for MoE

- Acceptable level of Perplexity for your use case(s)

- How aggressive Quantization you're willing to accept (to maintain low enough perplexity)

- even finer grain knobs: temperature, penalties etc.

Also, Tok/s as a metric isn't enough then because there's:

- thinking vs non-thinking: which mode do you need?

At the end, final questions are: are you satisfied with how long getting an answer took? and was the answer good enough?

rando12342mo ago

varispeed2mo ago

boutell2mo ago

I'm not sure how long ago you tried it, but look at Qwen 3.5 32b on a fast machine. Usually best to shut off thinking if you're not doing tool use.

mongrelion2mo ago

The size of the quantization you chose also makes a difference.

The GPU driver also plays an important role.

What was your approach? What software did you use to run the models?

casey22mo ago

It pretty obvious that this reasoning scaling is a mirage, parameters are all you need. Everything else is mostly just wasting time while hardware get better.

Western02mo ago

charcircuit2mo ago

On mobile it does not show the name of the model in favor of the other stats.

rcarmo2mo ago

AstroBen2mo ago

This doesn't look accurate to me. I have an RX9070 and I've been messing around with Qwen 3.5 35B-A3B. According to this site I can't even run it, yet I'm getting 32tok/s ^.-

misnome2mo ago

It seems to be missing a whole load of the quantized Qwen models, Qwen3.5:122b works fine in the 96GB GH200 (a machine that is also missing here....)

mongrelion2mo ago

Which quantization are you running and what context size? 32tok/s for that model on that card sounds pretty good to me!

paxys2mo ago

I wish creators of local model inference tools (LM Studio, Ollama etc.) would release these numbers publicly, because you can be sure they are sitting on a large dataset of real-world performance.

pants22mo ago

This really highlights the impracticality of local models:

My $3k Macbook can run `GPT-OSS 20B` at ~16 tok/s according to this guide.

Or I can run `GPT-OSS 120B` (a 6X larger model) at 360 tok/s (30X faster) on Groq at $0.60/Mtok output tokens.

To generate $3k worth of output tokens on my local Mac at that pricing it would have to run 10 years continuously without stopping.

There's virtually no economic break-even to running local models, and no advantage in intelligence or speed. The only thing you really get is privacy and offline access.

danny_codes2mo ago

A million tokens is like 5 minutes of inference for heavy coding use.

girvo2mo ago

At 7.5mil per hour hard limit, 84 days to hit the grandparents $3k

That said local models really are slow still, or fast enough and not that great

reverius422mo ago

They already stated they can only generate 57,600 tokens per hour locally (expressed as 16 tokens per second). So that's the limiting factor here.

xandrius2mo ago

You're saying it as if privacy was worthless? Also not many people would consider the price of buying a macbook and put it strictly towards running a local model.

Instead if you wanted to get a macbook anyway, you get to run local models for free on top. Very different story.

pants22mo ago

The privacy angle is not that interesting to me.

- You can find inference providers with whatever privacy terms you're looking for

- If you're using LLMs with real data (let's say handling GMail) then Google has your data anyway so might as well use Gemini API

- Even if you're a hardcore roll-your-own-mail-server type, you probably still use a hosted search engine and have gotten comfortable with their privacy terms

Also on cost the point is you can use an API that's many times smarter and faster for a rounding error in cost compared to your Mac. So why bother with local except for the cool factor?

throwdbaaway2mo ago

dexterlagan2mo ago

mkagenius2mo ago

Literally made the same app, 2 weeks back - https://news.ycombinator.com/item?id=47171499

mongrelion2mo ago

What front-end framework did you use? I find the UI so visually appealing

hatthew2mo ago

FWIW, while I find it appealing, I also strongly associate it with "vibe coded webapp of dubious quality," so personally I'm not gonna try to replicate it myself.

mkagenius2mo ago

Thanks. I actually used Google AI Studio for this. Prompted with my color choices and let it do the rest, turned out pretty good.

coinexpert2mo ago

scorpioxy2mo ago

There are quite a few of them but their marketing is just confusing and full of buzz words. I've been tinkering with OpenRouter that acts as a middleman.

metrix2mo ago

use openrouter, and call it a day. auto switching between providers, connectivity to all clouds and even works with free models

scorpioxy2mo ago

ActorNightly2mo ago

I mean AWS bedrock fits your use case pretty much. They have a bunch of models that are serverless that you can use on a per token pricing cost.

Gemini api use also comes with a free tier.

scorpioxy2mo ago

suheilaaita2mo ago

Wait 5-10 minutes, and should be done.

It genuinely is that simple.

You can even use local models using claude code or codex infrastrucutre (MASSIVE UNLOCK), but you need solid GPU(s) to run decent models. So that's the downside.

cloogshicer2mo ago

Genuine question, will this actually give you the latest solid local model?

I would've thought no, because of the knowledge cutoff in whatever model you use to download it.

suheilaaita2mo ago

I think it will give you a good "starter model". But then, it ultimately depends on what you want to do with the model exactly and your computer's specs.

To summarise, - find model given use case and specs. - trial and error - test other models (if needed) - rinse repeat - because models are always coming out and getting better

jiggunjer2mo ago

What knowledge cutoff? They all have web agents to Google it.

suheilaaita2mo ago

They all do, true. But some are better than the others in how they retrieve, digest and present you with the information. Boils down to personal preferences and experimenting.

amelius2mo ago

It would be great if something like this was built into ollama, so you could easily list available models based on your current hardware setup, from the CLI.

rootusrootus2mo ago

Someone linked to llmfit. That would be a great tool to integrate with ollama. Just highlight the one you want and tell it to install.

Quick, someone go vibe code that.

dugidugout2mo ago

The latest level of abstraction! You just release your ideas half baked in some internet connected box and wake up with products! Yahoo! Onwards into the Gestell!

rootusrootus2mo ago

Okay, now I’m tempted to set up a bluesky account that takes requests and spits out working software.

I’m certain this has already been done. It’s too obvious, and too hilarious.

1 more reply

modernerd2mo ago

Would love it more if it could help me to answer:

- Which models in the list are the best for my selected task? (If you don't track these things regularly, the list is a little overwhelming.) Sorting by various benchmark scores might be useful?

- How much more system resources do I need to run the models currently listed at F, D or C at B, A, or S-tier levels? (Perhaps if you hover over the score, it could tell you?)

sshagent2mo ago

I don't see my beloved 5060ti. looks great though

xylon2mo ago

nor my plain 5060

dzink2mo ago

starkeeper2mo ago

This is awesome!!!

TheCapn2mo ago

@OP are you the creator? Could you add my GPU to the list?

Radeon VII

https://www.amd.com/en/support/downloads/drivers.html/graphi...

sidchilling2mo ago

mongrelion2mo ago

It might be that the system prompt sent by codex is not optimal for that model. Try with open code and see if your results improve

MikeNotThePope2mo ago

I have the same hardware. Been curious about trying it with Opencode.

rurban2mo ago

No, not yet. Up to a single H100 there is no single local model, which doesn't make your code worse. (excluding trivial stuff like ruby, python, typescript). Implement features, fix bugs.

Right now we started experimenting with 2 H100's, 160GB models. But even a single one is wide out of anyone others league.

dirk940182mo ago

vova_hn22mo ago

It says "RAM - unknown", but doesn't give me an option to specify how much RAM I have. Why?

arjie2mo ago

Cool website. The one that I'd really like to see there is the RTX 6000 Pro Blackwell 96 GB, though.

SXX2mo ago

Sorry if already been answered, but will there be a metric for latency aka time to first token?

Since I considered buying M3 Ultra and feel like it the most often discussed regarding using Apple hardware for runninh local LLMs. Where speed might be okay, but prompt processing can take ages.

teaearlgraycold2mo ago

Wait for the M5 Ultra. It will get the 4x prompt processing speeds from the rest of the M5 product line. I hear rumors it will be released this year.

Stronz2mo ago

One thing I noticed with local models is that conversation behavior tends to drift over time as well.

Even when running locally, the model often starts structured but gradually becomes more verbose or explanatory in longer threads.

Curious if others have seen similar behavior when using local setups.

Decabytes2mo ago

Does anyone use the super tiny models for anything ? Like in the 2billion or lower parameter level?

genpfault2mo ago

Speculative decoding[1]?

[1]: https://github.com/ggml-org/llama.cpp/blob/master/docs/specu...

tristor2mo ago

mrdependable2mo ago

This is great, I've been trying to figure this stuff out recently.

cortesoft2mo ago

Ollama runs a web server that you use to interact with the models: https://docs.ollama.com/quickstart

You can also use the kubernetes operator to run them on a cluster: https://ollama-operator.ayaka.io/pages/en/

rebolek2mo ago

ssh?

kylehotchkiss2mo ago

ge962mo ago

Raspberry pi? Say 4B with 4GB of ram.

I also want to run vision like Yocto and basic LLM with TTS/STT

boutell2mo ago

I've been trying to get speech to text to work with a reasonable vocabulary on pis for a while. It's tough. All the modern models just need more GPU than is available

meatmanek2mo ago

For ASR/STT on a budget, you want https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 - it works great on CPU.

(Batch inference, where you give it the whole audio file up front, is slightly more efficient, since chunked streaming inference is basically running batch inference on overlapping windows of audio.)

ge962mo ago

Whispr?

For wakewords I have used pico rhino voice

I want to use these I2S breakout mics

winterismute2mo ago

intrasight2mo ago

Your LLM visited the future

hotsalad2mo ago

manlymuppet2mo ago

eichin2mo ago

I'm surprised that this shows anything running usefully on my 2021-era thinkpad (with "Iris Xe"'TigerLake graphics) which inspires me to ask - are external GPUs useful for this sort of thing?

comrade12342mo ago

mind_heist2mo ago

nice, this is an interesting idea. Can you elaborate on the licensing issue ? how do you get blocked for using the models commercially ?

comrade12342mo ago

singpolyma32mo ago

qwen3.5 is just apache

dale_glass2mo ago

It's missing Ryzen AI MAX+, which is sort of the Apple Silicon equivalent.

ThrowawayTestr2mo ago

brcmthrowaway2mo ago

If anyone hasn't tried Qwen3.5 on Apple Silicon, I highly suggest you to! Claude level performance on local hardware. If the Qwen team didn't get fired, I would be bullish on Local LLM.

zitterbewegung2mo ago

The M4 Ultra doesn't exist and there is more credible rumors for an M5 Ultra. I wouldn't put a projection like that without highlighting that this processor doesn't exist yet.

kuon2mo ago

I have amd 9700 and it is not listed while it is great llm hardware because it has 32Gb for a reasonable price. I tried doing "custom" but it didn't seem to work.

The tool is very nice though.

stared2mo ago

I would love to see on this list some (any) benchmark.

"I can run a model" is mildly interesting. I can run OSS-20B on my M1 Pro. It works, I tried it, just I don't find any application.

raiph_ai2mo ago

Great site, I have an M2 and M3pro and was thinking about getting and Ultra M4 and wanted to know if it was going to be worth it. Now I can see exactly what models I can run locally.

adithyassekhar2mo ago

This just reminded me of this https://www.systemrequirementslab.com/cyri.

Not sure if it still works.

butILoveLife2mo ago

This is borderline irresponsible. Conflating first tokens with all tokens is terrible. Apple looks far better than it actually is.

Just ask any Apple user, they don't actually use local models.

opengrass2mo ago

It would be more broad if you entered a Passmark score. Us CPU/integrated suckers can only see if we beat a RPI4, and same goes for GPUs.

havaloc2mo ago

Missing the A18 Neo! :)

debatem12mo ago

For me the "can run" filter says "S/A/B" but lists S, A, B, and C and the "tight fit" filter says "C/D" but lists F.

Just FYI.

starkparker2mo ago

Every time I refresh the page, I get a higher tokens/second value, presumably because of the keying off memory bandwidth.

golem142mo ago

Has anyone actually built anything with this tool?

The website says that code export is not working yet.

That’s a very strange way to advertise yourself.

urba_2mo ago

Man, I wonder when there will be AI server farms made from iCloud locked jailbroken iPhone 16s with backported MacOS

anigbrowl2mo ago

Useful tool, although some of the dark grey text is dark that I had to squint to make it out against the background.

lrpe2mo ago

This site desperately needs a light mode.

fraywing2mo ago

This is amazing. Still waiting for the "Medusa" class AMD chips to build my own AI machine.

storus2mo ago

Missing latest Nvidia cards like RTX Pro 6000; M3 Ultra can have at most 192GB selected etc.

bearjaws2mo ago

So many people have vibe coded these websites, they are posted to Reddit near daily.

3Sophons2mo ago

a lighter-weight alternative of docker and python is the Rust+Wasm stack https://github.com/LlamaEdge/LlamaEdge

GTP2mo ago

If I'm sorting models by score (the default), which kind of score is it?

amelius2mo ago

What is this S/A/B/C/etc. ranking? Is anyone else using it?

relaxing2mo ago

Apparently S being a level above A comes from Japanese grading. I’ve been confused by that, too.

swiftcoder2mo ago

It's very common in Japanese-developed video games as well

bitexploder2mo ago

Common in gaming culture. Kind of a meme template. S tier is the best tier of something. People make tier lists of all sorts of things with that grading.

vikramkr2mo ago

Just a tier list I think

jrmg2mo ago

I’d like to be able to use a local model (which one?) to power Copilot in vscode, and run coding agent(s) (not general purpose OpenClaw-like agents) on my M2 MacBook. I know it’ll be slow.

I suspect this is actually fairly easy to set up - if you know how.

AstroBen2mo ago

Ollama or LM Studio are very simple to setup.

jrmg2mo ago

Right - setting up LM studio is not hard. But how do I connect LM Studio to Copilot, or set up an agent?

NortySpock2mo ago

Having a second pair of "eyes" to read a log error and dig into relevant code is super handy for getting ideas flowing.

brcmthrowaway2mo ago

Basically LM Studio has a server that serves models over HTTP (localhost). Configure/enable the server and connect OpenCode to it.

Try this article https://advanced-stack.com/fields-notes/qwen35-opencode-lm-s...

I'm looking for an alternative to OpenCode though, I can barely see the UI.

1 more reply

AstroBen2mo ago

It looks like Copilot has direct support for Ollama if you're willing to set that up: https://docs.ollama.com/integrations/vscode

For LM Studio under server settings you can start a local server that has an OpenAI-compatible API. You'd need to point Copilot to that. I don't use Copilot so not sure of the exact steps there

kristianp2mo ago

https://github.com/ggml-org/llama.cpp/releases - has mac binaries

https://unsloth.ai/docs/models/qwen3.5 - running locally guide for the Qwen 3.5 family of models, which have a range of different sizes.

randusername2mo ago

Personally I'd start with llamafile [0] then move to compiling your own llama.cpp.

It's not as bad as you might think to compile llama.cpp for your target architecture and spin up an OpenAI compatible API endpoint. It even downloads the models for you.

[0]: https://github.com/mozilla-ai/llamafile

thexa42mo ago

I've created a llama.cpp integration with Copilot in vscode. The extension readme contains setup instructions: https://marketplace.visualstudio.com/items?itemName=delft-so...

chatmasta2mo ago

Any time I google something on this topic, the results are useful but also out of date, because this space is moving so absurdly fast.

ementally2mo ago

In mobile section it is missing Tensor chips (used by Google Pixel devices).

vednig2mo ago

Our work at DoShare is a lot of this stuff we've been on it for 2 years

markdown2mo ago

Protip: Website requires CMD+ a few times to increase font size to 200%.

spidrahedron2mo ago

This is super useful for people not having access to GPUs and Servers

amelius2mo ago

Why isn't there some kind of benchmark score in the list?

ipunchghosts2mo ago

What is S? Also, NVIDIA RTX 4500 Ada is missing.

Western02mo ago

I like a old gemma 3n 4EB for small device

tcbrah2mo ago

hatthew2mo ago

For me and probably many other people, local has nothing to do with cost and everything to do with privacy

tcbrah2mo ago

genuine question - what are you working on that needs that level of privacy? outside of NSFW stuff most API providers arent doing anything with your prompts

hatthew2mo ago

I would answer that, but it's private :)

I can think of several reasons: corporate policy, personal principles, NSFW stuff, illegal stuff

tencentshill2mo ago

Missing laptop versions of all these chips.

ryandrake2mo ago

Missing RTX A4000 20GB from the GPU list.

sand5002mo ago

How does it have details for M4 ultra?

nicklo2mo ago

the animation of the model name text when opening the detail view is so smooth and delightful

g_br_l2mo ago