For any model where you notice looping, tune the LLM settings. Reduce temperature and top_p, increase presence/frequency penalty, reduce context size. If you have a specific task to do, fine-tuning is the absolute best way to both reduce memory usage and boost performance and quality. Remember that tiny models are not designed for 0-shot/1-shot, they need lots of specific instruction and context in the prompt, with multi-shot prompts having a dramatic effect on output quality. Try to keep your prompt to specific tasks. Think of small models as children, SOTA models as experienced professionals, and middle-of-the-road models as an average adult; you give the bigger ones more responsibility/agency, but more rules and guardrails to the little ones.
For coding you do want the biggest model you can fit, so this is where larger RAM shines (32GB+ iGPU). If you can fit a dense model, do that. MoE is ok but will perform better on narrower tasks. Use the bleeding edge forks of llamacpp for turboquant/etc and Multi-Token Prediction.
The last thing is quants. If you're running something that isn't the bare model (like an unsloth dynamic quant), model performance is gonna suffer the smaller you go, and smaller models will be much more affected. So try to max out the amount of memory you can dedicate to the model, and pick larger quants like Q6/Q8. You can quant the k/v cache but that also may have a negative effect. And again, if you can fine-tune for a task, you will gain much more performance and quality and reduce memory.