If you are running a commercial service that uses AI, you buy a few dozen A100s, spend half a million dollars, and you are good for a while.
If you are running a commercial inferencing service, you spend tens of millions or get a cloud sponsor.
I have done experiments with 8B Llama 3 Q8 models on an M3 MBP. They run faster than I can read, and only occasionally fall off the rails.
The 3.8B Phi-3 mini is almost instantaneous for simple responses on my MBP.
When I want longer context windows, I use a hosted service somewhere else, but if I only need 8,000 tokens (99% of the time that is MORE than I need), any of my computers from the last 3 years handles it just fine.
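A back-of-envelope sketch of why this works on a laptop: at Q8, weights take roughly one byte per parameter, and the KV cache for an 8K context adds about a gibibyte. The model-shape numbers below are Llama 3 8B's published configuration (32 layers, 8 KV heads via GQA, head dim 128); treat this as an estimate, not an exact allocator trace.

```python
# Rough memory footprint for a quantized LLM running locally.

def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Bytes needed to hold the quantized weights."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_val: int = 2) -> int:
    """Bytes for the K and V caches at full context (fp16 values)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_val

GIB = 1024 ** 3

weights = weight_bytes(8.0e9, 8)       # 8B params at 8 bits each
kv = kv_cache_bytes(32, 8, 128, 8192)  # Llama 3 8B shape, 8K context

print(f"weights:  {weights / GIB:.2f} GiB")  # ~7.45 GiB
print(f"kv cache: {kv / GIB:.2f} GiB")       # ~1.00 GiB
print(f"total:    {(weights + kv) / GIB:.2f} GiB")
```

Call it roughly 8.5 GiB total, which is why this fits comfortably in the unified memory of any recent MBP while a 70B model at the same quantization would not.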
But also check out the 8B and 70B Llama-3.1 models, which show improved benchmarks over the Llama-3 models released in April.