> what model can i run on 1TB
With 1TB of RAM you can run nearly anything available (405B is essentially the largest at the moment). For reference, Llama 405B at FP8 precision fits on 8x H100, i.e. 640GB of total VRAM. Quantization is a very deep and involved well (far too much for an HN comment).
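The napkin math for why FP8 fits: at low batch sizes the weights dominate memory, at roughly params x bits / 8 bytes, plus headroom for KV cache and activations. A rough sketch (the figures are estimates, not measurements):

```python
# Back-of-the-envelope memory for model weights alone: params * bits / 8.
# Illustrative estimate; real deployments also need KV cache + activations.
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for bits, label in [(16, "FP16/BF16"), (8, "FP8/INT8"), (4, "INT4")]:
    print(f"Llama 405B @ {label}: ~{weight_memory_gb(405, bits):.0f} GB of weights")

# FP8 lands around 405 GB of weights, which is why it fits on
# 8x H100 (640 GB) with headroom for KV cache, while FP16 (~810 GB) doesn't.
```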
I'm aware it "works" but I don't bother with CPU, GGUF, even llama.cpp so I can't really speak to it. They're just not even remotely usable for my applications.
> tokens per second
Sloooowwww. With 405B it could very well be seconds per token, but this is where a lot of system factors come in. You can find benchmarks out there, but you'll see things like a very high-spec AMD EPYC bare-metal system with very fast DDR4/5 and tons of memory channels doing low single-digit tokens per second with a 70B model.
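To see why it's so slow: batch-1 decode is memory-bandwidth bound, since every generated token has to stream all the weights through RAM once, so tok/s is capped at roughly bandwidth / weight bytes. A rough ceiling estimate (the bandwidth number is an assumed spec for a many-channel DDR5 server, not a benchmark):

```python
# Upper bound on batch-1 decode speed for a dense model:
#   tokens/sec <= memory_bandwidth / bytes_read_per_token (~= weight size).
def decode_ceiling_tok_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / weight_gb

# Assumed theoretical peak for a ~12-channel DDR5 EPYC box (illustrative):
epyc_ddr5_gb_s = 460.0

print(f"70B @ FP16 (~140 GB): <= {decode_ceiling_tok_s(140, epyc_ddr5_gb_s):.1f} tok/s")
print(f"405B @ FP8 (~405 GB): <= {decode_ceiling_tok_s(405, epyc_ddr5_gb_s):.1f} tok/s")

# Real systems land well below these theoretical ceilings, hence the
# low single-digit tok/s in 70B CPU benchmarks and ~1 tok/s or worse for 405B.
```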
> I'll get a GPU too but not sure how much VRAM I need for the best value for your buck
Most of my experience is with top-end GPUs, so I can't really speak to this. You may want to pop in at https://www.reddit.com/r/LocalLLaMA/ - there's much more expertise there for this range of hardware (CPU-based and/or more VRAM-limited GPU configs).