Models that are worth writing home about:
EXAONE-3.5-7.8B-Instruct - It was excellent at taking podcast transcriptions and generating show notes and summaries.
Rocinante-12B-v2i - Fun for stories and D&D
Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks
OpenThinker-7B - Good and fast reasoning
The DeepSeek distills - Able to handle more complex tasks while still being fast
DeepHermes-3-Llama-3-8B - A really good vLLM
Medical-Llama3-v2 - Very interesting but be careful
Plus more but not Gemma.
One of the downsides of open models is that there are a gazillion little parameters at inference time (sampling strategy, prompt template, etc.) that can easily impair a model's performance. It takes some time for the community to iron out the wrinkles.
I mean, hell, even Mistral added system prompts in their last release; Google seems to be the only one that still doesn't bother with them.
The Gemma 3 Instruct 4B model that was released today matches the output of the larger models for some of the stuff I am trying.
Recently, I compared 13 different online and local LLMs in a test where they tried to recreate Saki's "The Open Window" from a prompt.[1] Claude wins hands down IMO, but the other models are not bad.
[1] Variations on a Theme of Saki (https://gist.github.com/s-i-e-v-e/b4d696bfb08488aeb893cce3a4...)
BTW mistral-small:24b is also worth mentioning (IMO the best local model), and phi4:14b is also pretty strong for its size.
mistral-small was my previous local go-to model; testing now to see if gemma3 can replace it.
I see a lot of talk about good and not good here, but (and this is a question for everyone) what are people using the non-local big boys for that the local ones CAN'T do? I mean, real-life tasks?
> Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks
> OpenThinker-7B - Good and fast reasoning
Any chance you could be more specific, ie give an example of a concrete coding task or reasoning problem you used them for?
I would actually be happy to see an R1-distilled version; it might perform better with less resource usage.
The first prompt I tested out I got from this video; https://www.youtube.com/watch?v=0Cq-LuJnaRg
It was OK but produced shallow adventures.
The second one I tried was from this site; https://www.rpgprompts.com/post/dungeons-dragons-chatgpt-pro...
A bit better and easier to modify, but still shallow.
The best one I have tried so far is this one from reddit; https://old.reddit.com/r/ChatGPT/comments/zoiqro/most_improv...
It is a super long prompt and I had to edit it a lot and manually extract the data from some of the links, but it has been the best experience by far. I even became "friends" with an NPC who accompanied me on a quest; it was a lot of fun and I was fully engaged.
The model of choice matters but even llama 1B and 2B can handle some stories.
afaik they are more for roleplaying a D&D style adventure than planning it, but I've heard good things.
Also lobotomized LLMs ("abliterated") can be a lot of fun.
The real issue with local models is managing context. Smaller models let you have a longer context without losing performance; bigger models are smarter, but if you want to keep them fast you have to reduce the context length.
Also all of the models have their own "personalities" and they still manifest in the finetunes.
Do you have any recommendations for a "general AI assistant" model, not focused on a specific task, but more a jack-of-all-trades?
IME Qwen2.5-3B-Instruct (or even 1.5B) has been quite remarkable, but I haven't done much heavy testing.
- EXAONE-3.5-2.4B-Instruct
- Llama-3.2-3B-Instruct-uncensored
- qwq-lcot-3b-instruct
- qwen2.5-3b-instruct
These have been very interesting tiny models; they can do text processing tasks and can handle storytelling. Llama-3.2 is way too sensitive to random stuff, so get the uncensored or abliterated versions.
The recommended settings according to the Gemma team are:
temperature = 0.95
top_p = 0.95
top_k = 64
Also beware of double BOS tokens! You can run my uploaded GGUFs with the recommended chat template and settings via ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M
EDIT: 27b size
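A minimal sketch of passing those recommended settings explicitly instead of trusting client defaults, assuming Ollama's HTTP chat endpoint and its `options` object (the model tag is the GGUF mentioned above; the prompt is just a placeholder):

```python
import json

# Sketch: build a request payload for Ollama's POST /api/chat endpoint
# carrying the Gemma team's recommended sampling settings.
def gemma_chat_payload(prompt: str) -> dict:
    return {
        "model": "hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M",
        "messages": [{"role": "user", "content": prompt}],
        "options": {
            "temperature": 0.95,  # recommended by the Gemma team
            "top_p": 0.95,
            "top_k": 64,
        },
        "stream": False,
    }

print(json.dumps(gemma_chat_payload("Write a haiku about local LLMs."), indent=2))
```

POST that to http://localhost:11434/api/chat with curl or requests; letting the server apply the model's own chat template should also help avoid the double-BOS problem, since you aren't hand-building the prompt string.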
I tried your version and when I ask it to create a tetris game in python, the resulting file has syntax errors. I see strange things like a space in the middle of a variable name/reference or weird spacing in the code output.
Try maybe the 8bit quant if you have the hardware for it? ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q8_0
>>> who is president
The বর্তমানpresident of the United States is Джо Байден (JoeBiden).
The Gemma series of models has historically been pretty poor when it comes to coding and tool calling - two things that are very important to agentic systems, so it will be interesting to see how 3 does in this regard.
[0] https://github.com/ollama/ollama/issues/9680
[1] https://github.com/ollama/ollama/issues/9680#issuecomment-27...
Finally just finished downloading (gemma3:27b). Requires the latest version of Ollama to use, but now working, getting about 21 tok/s on my local 2x A4000.
From my few test prompts looks like a quality model, going to run more tests to compare against mistral-small:24b to see if it's going to become my new local model.
I would much rather have specific tailored models to use in different scenarios, that could be loaded into the GPU when needed. It’s a waste of parameters to have half of the VRAM loaded with parts of the model targeting image generation when all I want to do is write code.
[0] https://huggingface.co/open-r1/OlympicCoder-7B?local-app=vll...
[1] https://pbs.twimg.com/media/GlyjSTtXYAAR188?format=jpg&name=...
My prompt to Gemma 27b (q4) on open webui + ollama: "Can you create the game tetris in python?"
It immediately starts writing code. After the code is finished, I noticed something very strange, it starts a paragraph like this:
"Key improvements and explanations:
Clearer Code Structure: The code is now organized into a Tetris class, making it much more maintainable and readable. This is essential for any non-trivial game."
Followed by a bunch of fixes/improvements, as if this was not the first iteration of the script.
I also notice a very obvious error: In the `if __name__ == '__main__':` block, it tries to instantiate a `Tetris` class, when the name of the class it created was "TetrisGame".
Nevertheless, I try to run it and paste the `NameError: name 'Tetris' is not defined` error along with stack trace specifying the line. Gemma then gives me this response:
"The error message "NameError: name 'Tetris' is not defined" means that the Python interpreter cannot find a class or function named Tetris. This usually happens when:"
Then continues with a generic explanation with how to fix this error in arbitrary programs. It seems like it completely ignored the code it just wrote.
Other than that, the experience was completely different:
- The game worked on first try
- I iterated with the model making enhancements. The first version worked but didn't show scores, levels or next piece, so I asked it to implement those features. It then produced a new version which almost worked: The only problem was that levels were increasing whenever a piece fell, and I didn't notice any increase in falling speed.
- So I reported the problems with level tracking and falling speed and it produced a new version which crashed immediately. I pasted the error and it was able to fix it in the next version
- I kept iterating with the model, fixing issues until it finally produced a perfectly working tetris game which I played and eventually lost due to high falling speed.
- As a final request, I asked it to port the latest working version of the game to JS/HTML with the implementation self contained in a file. It produced a broken implementation, but I was able to fix it after tweaking it a little bit.
Gemma 3 27b on Google AI studio is easily one of the best LLMs I've used for coding.
Unfortunately I can't seem to reproduce the same results in ollama/open webui, even when running the full fp16 version.
By default, Ollama uses a context window size of 2048 tokens.
I suspect the Ollama version might have wrong default settings, such as conversation delimiters. The experience of Gemma 3 in AI studio is completely different.
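One way to rule out the default context window as the culprit: bake a larger `num_ctx` into a derived model with a Modelfile, so every session gets it. A sketch (the 8192 value is an assumption; size it to your VRAM):

```
FROM gemma3:27b
PARAMETER num_ctx 8192
```

Build and run it with `ollama create gemma3-8k -f Modelfile` followed by `ollama run gemma3-8k`.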
[0] https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
We'll see if the quantization-aware versions are any better this time around, but I doubt any inference framework will even support them. Gemma.cpp never got a standard-compatible server API so people could actually use it, and as a result it got absolutely zero adoption.
All the above is subjective so maybe that’s true for you, but claiming there’s a lack of inference framework for gemma 2 is really off the mark.
Obviously ollama supports it. Also llama.cpp. Also mlx. I've listed 3 frameworks that support quantized versions of gemma 2.
llama.cpp support for gemma-3 is out; the PR was merged a couple of hours after Google's announcement. Obviously ollama supports it as well, as you can see in TFA here.
I’m really curious how you’d get to the conclusions you’ve made. Are we living in different alternate universes?
Suddenly, after the arrival of reasoning models, it looks like OSS models have lost their charm.
DeepSeek R1 hosting is out of reach for most, but it being open is a game changer if you are building a business that needs the SoTA capabilities of such a large model, not because you will necessarily host it yourself, but because you can't be locked out of using it.
If you build your business on top of OpenAI, and they decide they don't like you, they can shut you down. If you use an open model like R1, you always have the option to self host even if it can be costly, and not be at the mercy of a third party being able to just kill your business by shutting down your access to their service.
They've had years to provide the needed memory but can't/won't.
The future of local LLMs is APUs such as Apple M series and AMD Strix Halo.
Within 12 months everyone will have relegated discrete GPUs to the AI dustbin and be running 128GB to 512GB of delicious local RAM, vastly more than any discrete GPU could dream of.
Ollama silently (!!!) drops messages if the context window is exceeded (instead of, you know, just erroring? who in the world made this decision).
The workaround until now was to (not use ollama or) make sure to only send a single message. But now they seem to silently truncate single messages as well, instead of erroring! (this explains the sibling comment where a user could not reproduce the results locally).
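A defensive sketch for this, assuming you know the `num_ctx` you configured: estimate the prompt's size client-side and fail loudly before sending, instead of letting the server truncate silently. The 4-characters-per-token heuristic is a rough approximation, not the model's real tokenizer:

```python
# Sketch: refuse to send a prompt that likely exceeds the configured context,
# since Ollama truncates silently instead of erroring.

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 chars/token for English); not a real tokenizer.
    return max(1, len(text) // 4)

def check_fits(messages: list[dict], num_ctx: int = 2048,
               reserve_for_reply: int = 512) -> None:
    used = sum(estimate_tokens(m["content"]) for m in messages)
    budget = num_ctx - reserve_for_reply
    if used > budget:
        raise ValueError(
            f"~{used} prompt tokens exceeds budget of {budget} "
            f"(num_ctx={num_ctx} minus {reserve_for_reply} reserved for the "
            "reply); raise num_ctx or trim the history yourself."
        )

check_fits([{"role": "user", "content": "short prompt"}])  # passes quietly
```

It's approximate, but an error you see beats a truncation you don't.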
Use LM Studio, llama.cpp, openrouter or anything else, but stay away from ollama!
I know they won't, but a man can dream.