undefined | Better HN

0 pointsbuu7001y ago0 comments

I appreciate the additional information, but I'm not sure what you're claiming is a fundamental misunderstanding on my part. I was referring to running the model with quantization, and was clear that I hadn't verified the accuracy of the claims.

The comment about the MPS PyTorch backend was related to performance, not whether the model would fit at all. I can't say whether it's accurate that the MPS backend has significant room for optimization, but it is still publicly listed as in beta.

0 comments

Klaus231y ago

Yes my mistake, I read your answer to mean that you think that the model could fit into the memory with the help of efficiency gains.

I would be sceptical about increasing efficiency. I'm not that familiar with the subject, but as far as I know, LLMs for single users (i.e. with batch size 1) are practically always limited by the memory bandwidth. The whole LLM (if it is monolytic) has to be completely loaded from memory once for each new token (which is about 4 characters). With 400GB per second memory bandwidth and 4-bit quantisation, you are limited to 2 tokens per second, no matter how efficiently the software works. This is not unusable, but still quite slow compared to online services.

buu700OP1y ago

Got it, thanks, that makes sense. I was aware that memory was the primary bottleneck, but wasn't clear on the specifics of how model sizes mapped to memory requirements or the exact implications of quantization in practice. It sounds like we're pretty far from a model of this size running on any halfway common consumer hardware in a useful way, even if some high-end hardware might technically be able to initialize it in one form or another.

Klaus231y ago

GPU memory costs about $2.5/GB on the spot market, so that is $500 for 200GB. I would speculate that it might be possible to build such a LLM card for $1-2k, but I suspect that the market for running larger LLMs locally is just too small to consider, especially now that the datacentre is so lucrative.

Maybe we'll get really good LLMs on local hardware when the hype has died down a bit, memory is cheaper and the models are more efficient.

j / k navigate · click thread line to collapse

0 pointsbuu7001y ago0 comments

0 comments

Klaus231y ago

Yes my mistake, I read your answer to mean that you think that the model could fit into the memory with the help of efficiency gains.

buu700OP1y ago

Klaus231y ago

Maybe we'll get really good LLMs on local hardware when the hype has died down a bit, memory is cheaper and the models are more efficient.

j / k navigate · click thread line to collapse