
0 points | hnuser123456 | 8mo ago | 0 comments

Lots of people already have RTX 3090/4090/5090 for gaming and they can run 30b-class models at 40+ tok/sec. There is a huge field of models and finetunes of this size on huggingface. They are a little bit dumber than the big cloud models but not by much. And being able to run them 24/7 for just the price of electricity (and the privacy) is a big pull.


nomel | 8mo ago

> they can run 30b-class models at 40+ tok/sec.

No, they can run quantized versions of those models, which are dumber than the base 30b models, which are much dumber than > 400b models (from my use).
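For anyone wondering why quantization comes into it at all, here is a rough back-of-the-envelope sketch. It counts weights only and ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width.
    KV cache, activations and framework overhead are NOT included.
    """
    return params_billion * bits_per_weight / 8


VRAM_GB = 24  # RTX 3090 and 4090 ship with 24 GB; the 5090 has 32 GB

for label, bits in [("bf16", 16), ("q8", 8), ("q4", 4)]:
    need = weight_memory_gb(30, bits)
    verdict = "fits" if need < VRAM_GB else "does not fit"
    print(f"30b @ {label}: ~{need:.0f} GB of weights, {verdict} in {VRAM_GB} GB")
```

On these numbers a 30b model at bf16 needs roughly 60 GB for weights alone, so on a single 24 GB card only the 4-bit version leaves room for context.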

> They are a little bit dumber than the big cloud models but not by much.

If this were true, we wouldn't see people paying the premiums for the bigger models (like Claude).

For every use case I've thrown at them, it's not a question of "a little dumber", it's the binary fact that the smaller models are incapable of doing what I need with any sort of consistency, and hallucinate at extreme rates.

What's the actual use case for these local models?

hnuser123456 | OP | 8mo ago

With quantization-aware-training techniques, q4 models are less than 1% off from bf16 models. And yes, if your use case hinges on the very latest and largest cloud-scale models, there are things they can do that the local ones just can't. But having them spitting tokens 24/7 for you would have you paying off a whole enterprise-scale GPU in a few months, too.
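The 24/7 arithmetic can be sketched roughly like this. Only the 40 tok/sec figure comes from the thread. Every price and wattage parameter is a placeholder you would have to fill in yourself, and the helper names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_day(tok_per_sec: float) -> float:
    """Tokens generated in one day of uninterrupted output."""
    return tok_per_sec * SECONDS_PER_DAY


def breakeven_days(hardware_cost: float,
                   cloud_price_per_mtok: float,
                   tok_per_sec: float,
                   power_watts: float,
                   electricity_per_kwh: float) -> float:
    """Days until avoided cloud spend covers the hardware cost.

    All inputs are assumptions supplied by the caller. Returns infinity
    when running locally saves nothing per day.
    """
    cloud_per_day = tokens_per_day(tok_per_sec) / 1e6 * cloud_price_per_mtok
    power_per_day = power_watts / 1000 * 24 * electricity_per_kwh
    saving_per_day = cloud_per_day - power_per_day
    if saving_per_day <= 0:
        return float("inf")
    return hardware_cost / saving_per_day


# 40 tok/sec sustained for a full day:
print(f"{tokens_per_day(40):,.0f} tokens per day")
```

Whether "a few months" holds depends entirely on the cloud price you compare against and on whether you really have a workload that keeps the card busy around the clock.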

If anyone has a gaming GPU with gobs of VRAM, I highly encourage them to experiment with creating long-running local-LLM apps. We need more independent tinkering in this space.

nomel | 8mo ago

> But having them spitting tokens 24/7 for you would have you paying off a whole enterprise-scale GPU in a few months, too.

Again, what's the use case? What would make sense to run, at high rates, where output quality isn't much of a concern? I'm genuinely interested in this question, because answering it always seems to be avoided.

hnuser123456 | OP | 8mo ago

Any sort of business that might want to serve from a customized LLM at scale and doesn't need the smartest model possible, or hobbyist/researcher experiments. If you can get an agentic framework to work on a problem with a local model, it'll almost certainly work just as well on a cloud model. Again, I'm speaking mostly to people who already have an xx90-class GPU sitting around. Smoke 'em if you've got 'em. If you don't have a 3090/4090/5090 already, and don't care about privacy, then just enjoy how the improvements in local models are driving down the price per token of non-bleeding-edge cloud models.


datameta | 8mo ago

What kind of interactions do you have? Brainstorming, knowledge framework, rubber-duck debugging plus? Please help me understand, if you will, because I have a 3090 sitting around without a suitable rest of the system, and I wonder whether to invest or not.
