The issue that I see is that Nvidia etc. are incentivised to perpetuate that so the open source community gets the table scraps of distills, fine-tunes etc.
The tradeoff with these unified LPDDR machines is compute and memory throughput. You'll have to live with the ~50 token/sec rate, and compact your prefix aggressively. That said, I'd take the effortless local model capability over outright speed any day.
Hope the popularity of these machines could prompt future models to offer perfect size fits: 80 GiB quantized on 128 GiB box, 480 GiB quantized on 512 GiB box, etc.
It's probably a trade secret, but what's the actual per-user resource requirement to run the model?
If the open weights models are good, there are people looking to sell commodity access to it, much like a cloud provider selling you compute.
Plus, most users don't want to host their own models. Most users don't care that OpenAI, Anthropic and Google have a monopoly on LLMs. ChatGPT is a household name, and most of the big businesses are forcing Copilot and/or Claude onto their employees for "real work."
This is "everyone will have an email server/web server/Diaspora node/lemmy instance/Mastodon server" all over again.