Models that are worth writing home about:
EXAONE-3.5-7.8B-Instruct - It was excellent at taking podcast transcriptions and generating show notes and summaries.
Rocinante-12B-v2i - Fun for stories and D&D
Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks
OpenThinker-7B - Good and fast reasoning
The DeepSeek distills - Able to handle more complex tasks while still being fast
DeepHermes-3-Llama-3-8B - A really good vLLM
Medical-Llama3-v2 - Very interesting but be careful
Plus more but not Gemma.
One of the downsides of open models is that there are a gazillion little parameters at inference time (sampling strategy, prompt template, etc.) that can easily impair a model's performance. It takes some time for the community to iron out the wrinkles.
I mean, hell, even Mistral added system prompts in their last release; Google seems to be the only one that still doesn't bother with them.
The Gemma 3 Instruct 4B model that was released today matches the output of the larger models for some of the stuff I am trying.
Recently, I compared 13 different online and local LLMs in a test where they tried to recreate Saki's "The Open Window" from a prompt.[1] Claude wins hands down IMO, but the other models are not bad.
[1] Variations on a Theme of Saki (https://gist.github.com/s-i-e-v-e/b4d696bfb08488aeb893cce3a4...)
BTW mistral-small:24b is also worth mentioning (IMO the best local model), and phi4:14b is also pretty strong for its size.
mistral-small was my previous local go-to model; testing now to see if gemma3 can replace it.
I see a lot of talk about good and not good here, but (and this is a question for everyone) what are people using the non-local big boys for that the local ones CAN'T do? I mean, real-life tasks?
> Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks
> OpenThinker-7B - Good and fast reasoning
Any chance you could be more specific, ie give an example of a concrete coding task or reasoning problem you used them for?
I would actually be happy to see an R1-distilled version; it might perform better with less resource usage.
The first prompt I tested out I got from this video; https://www.youtube.com/watch?v=0Cq-LuJnaRg
It was OK but produced shallow adventures.
The second one I tried was from this site; https://www.rpgprompts.com/post/dungeons-dragons-chatgpt-pro...
A bit better and easier to modify, but still shallow.
The best one I have tried so far is this one from reddit; https://old.reddit.com/r/ChatGPT/comments/zoiqro/most_improv...
It is a super long prompt and I had to edit it a lot and manually extract the data from some of the links, but it has been the best experience by far. I even became "friends" with an NPC who accompanied me on a quest; it was a lot of fun and I was fully engaged.
The model of choice matters but even llama 1B and 2B can handle some stories.
afaik they are more for roleplaying a D&D style adventure than planning it, but I've heard good things.
Also lobotomized LLMs ("abliterated") can be a lot of fun.
The real issue with local models is managing context. Smaller models let you have a longer context without losing performance; bigger models are smarter, but if you want to keep them fast you have to reduce the context length.
Also all of the models have their own "personalities" and they still manifest in the finetunes.
Do you have any recommendations for a "general AI assistant" model, not focused on a specific task, but more a jack-of-all-trades?
IME Qwen2.5-3B-Instruct (or even 1.5B) has been quite remarkable, but I haven't done much heavy testing.
- EXAONE-3.5-2.4B-Instruct
- Llama-3.2-3B-Instruct-uncensored
- qwq-lcot-3b-instruct
- qwen2.5-3b-instruct
These have been very interesting tiny models; they can do text processing tasks and can handle storytelling. Llama-3.2 is way too sensitive to random stuff, so get the uncensored or abliterated versions.
The recommended settings according to the Gemma team are:
temperature = 0.95
top_p = 0.95
top_k = 64
Also beware of double BOS tokens! You can run my uploaded GGUFs with the recommended chat template and settings via ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M
EDIT: 27b size
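A minimal sketch of passing those recommended settings explicitly instead of trusting client defaults, assuming Ollama's HTTP chat endpoint and its `options` object (the model tag is the GGUF mentioned above; the prompt is just a placeholder):

```python
import json

# Sketch: build a request payload for Ollama's POST /api/chat endpoint
# carrying the Gemma team's recommended sampling settings.
def gemma_chat_payload(prompt: str) -> dict:
    return {
        "model": "hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M",
        "messages": [{"role": "user", "content": prompt}],
        "options": {
            "temperature": 0.95,  # recommended by the Gemma team
            "top_p": 0.95,
            "top_k": 64,
        },
        "stream": False,
    }

print(json.dumps(gemma_chat_payload("Write a haiku about local LLMs."), indent=2))
```

POST that to http://localhost:11434/api/chat with curl or requests; letting the server apply the model's own chat template should also help avoid the double-BOS problem, since you aren't hand-building the prompt string.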
I tried your version and when I ask it to create a tetris game in python, the resulting file has syntax errors. I see strange things like a space in the middle of a variable name/reference or weird spacing in the code output.
Try maybe the 8bit quant if you have the hardware for it? ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q8_0
>>> who is president
The বর্তমানpresident of the United States is Джо Байден (JoeBiden).
The Gemma series of models has historically been pretty poor when it comes to coding and tool calling - two things that are very important to agentic systems, so it will be interesting to see how 3 does in this regard.
[0] https://github.com/ollama/ollama/issues/9680
[1] https://github.com/ollama/ollama/issues/9680#issuecomment-27...
Finally just finished downloading (gemma3:27b). Requires the latest version of Ollama to use, but now working, getting about 21 tok/s on my local 2x A4000.
From my few test prompts looks like a quality model, going to run more tests to compare against mistral-small:24b to see if it's going to become my new local model.
I would much rather have specific tailored models to use in different scenarios, that could be loaded into the GPU when needed. It’s a waste of parameters to have half of the VRAM loaded with parts of the model targeting image generation when all I want to do is write code.
[0] https://huggingface.co/open-r1/OlympicCoder-7B?local-app=vll...
[1] https://pbs.twimg.com/media/GlyjSTtXYAAR188?format=jpg&name=...
My prompt to Gemma 27b (q4) on open webui + ollama: "Can you create the game tetris in python?"
It immediately starts writing code. After the code is finished, I noticed something very strange, it starts a paragraph like this:
"Key improvements and explanations:
Clearer Code Structure: The code is now organized into a Tetris class, making it much more maintainable and readable. This is essential for any non-trivial game."
Followed by a bunch of fixes/improvements, as if this was not the first iteration of the script.
I also notice a very obvious error: In the `if __name__ == '__main__':` block, it tries to instantiate a `Tetris` class, when the name of the class it created was "TetrisGame".
Nevertheless, I try to run it and paste the `NameError: name 'Tetris' is not defined` error along with stack trace specifying the line. Gemma then gives me this response:
"The error message "NameError: name 'Tetris' is not defined" means that the Python interpreter cannot find a class or function named Tetris. This usually happens when:"
Then continues with a generic explanation with how to fix this error in arbitrary programs. It seems like it completely ignored the code it just wrote.
Other than that, the experience was completely different:
- The game worked on first try
- I iterated with the model making enhancements. The first version worked but didn't show scores, levels or next piece, so I asked it to implement those features. It then produced a new version which almost worked: The only problem was that levels were increasing whenever a piece fell, and I didn't notice any increase in falling speed.
- So I reported the problems with level tracking and falling speed and it produced a new version which crashed immediately. I pasted the error and it was able to fix it in the next version
- I kept iterating with the model, fixing issues until it finally produced a perfectly working tetris game which I played and eventually lost due to high falling speed.
- As a final request, I asked it to port the latest working version of the game to JS/HTML with the implementation self contained in a file. It produced a broken implementation, but I was able to fix it after tweaking it a little bit.
Gemma 3 27b on Google AI studio is easily one of the best LLMs I've used for coding.
Unfortunately I can't seem to reproduce the same results in ollama/open webui, even when running the full fp16 version.
By default, Ollama uses a context window size of 2048 tokens.
I suspect the Ollama version might have wrong default settings, such as conversation delimiters. The experience of Gemma 3 in AI studio is completely different.
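One way to rule out the default context window as the culprit: bake a larger `num_ctx` into a derived model with a Modelfile, so every session gets it. A sketch (the 8192 value is an assumption; size it to your VRAM):

```
FROM gemma3:27b
PARAMETER num_ctx 8192
```

Build and run it with `ollama create gemma3-8k -f Modelfile` followed by `ollama run gemma3-8k`.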
[0] https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
We'll see if the quantization-aware versions are any better this time around, but I doubt any inference framework will even support them. Gemma.cpp never got a standard-compatible server API so people could actually use it, and as a result it got absolutely zero adoption.
All the above is subjective so maybe that’s true for you, but claiming there’s a lack of inference framework for gemma 2 is really off the mark.
Obviously ollama supports it. Also llama.cpp. Also mlx. I've listed 3 frameworks that support quantized versions of gemma 2.
llama.cpp support for gemma-3 is out; the PR was merged a couple of hours after Google's announcement. Obviously ollama supports it as well, as you can see in TFA here.
I’m really curious how you’d get to the conclusions you’ve made. Are we living in different alternate universes?
Suddenly, after the arrival of reasoning models, it looks like OSS models have lost their charm.
DeepSeek R1 hosting is out of reach for most, but it being open is a game changer if you are building a business that needs the SoTA capabilities of such a large model, not because you will necessarily host it yourself, but because you can't be locked out of using it.
If you build your business on top of OpenAI, and they decide they don't like you, they can shut you down. If you use an open model like R1, you always have the option to self host even if it can be costly, and not be at the mercy of a third party being able to just kill your business by shutting down your access to their service.
They've had years to provide the needed memory but can't/won't.
The future of local LLMs is APUs such as Apple M series and AMD Strix Halo.
Within 12 months everyone will have relegated discrete GPUs to the AI dustbin and be running 128GB to 512GB of delicious local RAM, vastly more than any discrete GPU could dream of.
Ollama silently (!!!) drops messages if the context window is exceeded (instead of, you know, just erroring? who in the world made this decision).
The workaround until now was to (not use ollama or) make sure to only send a single message. But now they seem to silently truncate single messages as well, instead of erroring! (this explains the sibling comment where a user could not reproduce the results locally).
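A defensive sketch for this, assuming you know the `num_ctx` you configured: estimate the prompt's size client-side and fail loudly before sending, instead of letting the server truncate silently. The 4-characters-per-token heuristic is a rough approximation, not the model's real tokenizer:

```python
# Sketch: refuse to send a prompt that likely exceeds the configured context,
# since Ollama truncates silently instead of erroring.

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 chars/token for English); not a real tokenizer.
    return max(1, len(text) // 4)

def check_fits(messages: list[dict], num_ctx: int = 2048,
               reserve_for_reply: int = 512) -> None:
    used = sum(estimate_tokens(m["content"]) for m in messages)
    budget = num_ctx - reserve_for_reply
    if used > budget:
        raise ValueError(
            f"~{used} prompt tokens exceeds budget of {budget} "
            f"(num_ctx={num_ctx} minus {reserve_for_reply} reserved for the "
            "reply); raise num_ctx or trim the history yourself."
        )

check_fits([{"role": "user", "content": "short prompt"}])  # passes quietly
```

It's approximate, but an error you see beats a truncation you don't.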
Use LM Studio, llama.cpp, openrouter or anything else, but stay away from ollama!
I know they won't, but a man can dream.