(This is a generous argument: it also ignores the massive software stack optimization the cloud companies do that doesn't trickle down to local-rig-sized deployments; for example, prefill/decode disaggregation, which would double the VRAM requirements for a local rig — if you could even do it on a local rig, which you can't, because local rigs don't have Infiniband. But at scale, prefill/decode disaggregation improves capital efficiency, since you can tune the compute-bound prefill node differently than the memory-bound decode node.)
The advantage of local rigs is not capital-efficient tokens. It's privacy. But then again, you can get zero-data-retention options from many inference companies, so for many use cases it may not matter unless you need strict guarantees the data never leaves the building...