The latest Genoa has 12-channel DDR5-4800 support (and boosted AVX-512), and I'd imagine it should perform quite well, but if you primarily want to run inference on a quantized 65B model, I think your best bang/buck (for local hardware) would be 2 x RTX 3090s (each has 24GB of GDDR6X with just shy of 1TB/s of memory bandwidth).
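Back-of-envelope, that bandwidth figure is what sets the ceiling: generating a token streams the full weight tensor once, so tokens/s tops out around bandwidth / weight bytes. A rough sketch of the arithmetic (the 39GB and 936GB/s numbers are my assumptions for a 4-bit 65B split across the two cards, not measurements):

    // Back-of-envelope: token generation is roughly memory-bandwidth-bound,
    // since every weight is read once per generated token.
    #include <cstdio>

    int main() {
        const double weights_gb    = 39.0;  // ~65B params at 4-bit + overhead (assumed)
        const double bandwidth_gbs = 936.0; // RTX 3090 GDDR6X, per card
        // With layers split across the two cards, the halves run back to back,
        // so per-token latency is still total bytes / per-card bandwidth.
        const double sec_per_token = weights_gb / bandwidth_gbs;
        printf("~%.0f ms/token, ~%.0f tokens/s upper bound\n",
               sec_per_token * 1e3, 1.0 / sec_per_token);
        return 0;
    }

That prints roughly 42 ms/token, ~24 tokens/s. Real throughput lands lower once you add the KV cache, activations, and cross-card traffic, but it's the right order of magnitude.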
With my LLaMA AVX implementation on 32-bit floats [0] there's no performance gain after 2 threads, so the remaining 14 available threads are of no use; there's no memory bandwidth left to load them with work :)
The primary bottleneck for now is compute.
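One quick way to tell which wall you're hitting is to sweep thread counts over a pure streaming kernel: if effective GB/s stops climbing after 2 threads, extra cores won't help a bandwidth-bound matmul either. A minimal sketch (illustrative only, not code from the linked repo; compile with -O2 -pthread):

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const size_t n = 1ull << 28;  // ~1 GiB of floats
        std::vector<float> v(n, 1.0f);
        for (int nt = 1; nt <= 16; nt *= 2) {
            auto t0 = std::chrono::steady_clock::now();
            std::vector<std::thread> pool;
            for (int t = 0; t < nt; ++t)
                pool.emplace_back([&v, n, nt, t] {
                    // each thread streams read+write over its own slice
                    for (size_t i = n / nt * t; i < n / nt * (t + 1); ++i)
                        v[i] *= 1.000001f;
                });
            for (auto& th : pool) th.join();
            double s = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - t0).count();
            printf("%2d threads: %6.1f GB/s\n",
                   nt, 2.0 * n * sizeof(float) / s / 1e9);
        }
        printf("%g\n", v[0]);  // keep v live past the timing loops
        return 0;
    }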
They've recently made a big improvement to performance by introducing partial GPU acceleration if you compile with a GPU-accelerated variant of BLAS: either cuBLAS (Nvidia) or CLBlast (slightly slower, but supports almost everything: Nvidia, Apple, AMD, mobile, Raspberry Pi, etc.).
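For reference, the build switches looked something like this at the time (from memory, so the exact flag names may have drifted since), with layers offloaded at run time via -ngl / --n-gpu-layers; the model path is just an example:

    # NVIDIA, links against cuBLAS:
    make LLAMA_CUBLAS=1
    # or OpenCL via CLBlast (works on most GPUs):
    make LLAMA_CLBLAST=1

    # offload e.g. 40 transformer layers to the GPU at run time:
    ./main -m ./models/65B/ggml-model-q4_0.bin -ngl 40 -p "Hello"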
When did you test this? Maybe llama.cpp has had some improvements since I last used it (which was at the start of the project).