There are so many varieties, specialized for different tasks or simply different in performance.
Maybe we’ll get to a one-size-fits-all model at some point, but for now trying out a few can pay off. It also helps you build a better sense of the ecosystem as a whole.
For running them: if you have an Nvidia GPU with 8GB of VRAM, you can probably run a bunch of them, quantized. It gets a bit esoteric once you get into the quantization varieties, but generally speaking you should find out which integer and float math your GPU has optimized support for, then choose the largest quantized model that matches that support and still fits in VRAM. Most often that’s what will perform best in both speed and quality, unless you need to run more than one model at a time.
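If you want a rough way to sanity-check the "largest quantized model that still fits in VRAM" part, here is a back-of-the-envelope sketch in Python. It assumes weight size is roughly parameters times bits per weight divided by 8, and it reserves a flat 1.5 GB for context cache and runtime buffers. That overhead number, the list of candidate bit widths, and the function names are my own illustrative assumptions, not something any particular runtime defines. Real quantization formats also have effective bit widths that aren't whole numbers, so treat the result as a starting point only.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def largest_fit(params_billion: float, vram_gb: float,
                candidate_bits=(16, 8, 6, 5, 4, 3, 2),
                overhead_gb: float = 1.5):
    """Return the highest bit width whose weights plus overhead fit, else None.

    overhead_gb is an assumed flat allowance for context cache and buffers;
    it grows with context length in practice.
    """
    for bits in sorted(candidate_bits, reverse=True):
        if weight_gb(params_billion, bits) + overhead_gb <= vram_gb:
            return bits
    return None


if __name__ == "__main__":
    # (model size in billions of parameters, VRAM in GB)
    for params, vram in [(7, 8), (30, 16)]:
        bits = largest_fit(params, vram)
        print(f"{params}B model on {vram} GB VRAM: ~{bits}-bit")
```

This ignores the math-support side entirely (which integer and float types your GPU accelerates), so it only answers the "does it fit" half of the question.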
To give you a reference point on model choice, performance, GPU, etc.: one of my systems runs an Nvidia 4080 with 16GB of VRAM. Using Qwen 3 Coder 30B, heavily quantized, I get about 60 tokens per second.