Better HN

doctorpangloss 15d ago

you can run it today with mlx if you have a 256 GB or 512 GB Mac Studio. no "antirez" fork needed.

it isn't that large a model, and the compressed kv implementation is not that complicated.

the problem is that they released the model in a quantized format that is more complex than it appears, and people make a lot of mistakes working with it. it is quantization-aware-trained, so you can't "just" upscale it and then scale it back down.
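a toy sketch of the "upscale and scale down" trap, for anyone who wants to see it in numbers. this assumes a made-up symmetric per-block integer quantizer (`quantize_blocks` and the weight values are hypothetical, and this is not the format the model was actually released in). it only shows that weights sitting exactly on one quantization grid survive a round trip through the same grid, but pick up error as soon as the block size changes:

```python
def quantize_blocks(values, block_size, levels=7):
    """Toy symmetric per-block quantizer; returns the dequantized floats.

    Hypothetical scheme for illustration only: one scale per block,
    integer codes in [-levels, levels].
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        peak = max(abs(v) for v in block)
        scale = peak / levels if peak > 0 else 1.0
        out.extend(round(v / scale) * scale for v in block)
    return out


def max_abs_error(a, b):
    """Largest element-wise difference between two weight lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up weights that already sit on a grid with blocks of 4 elements.
released = [0.7, -0.7, 0.1, 0.0, 7.0, -3.0, 1.0, 0.0]

# Round trip through the same grid: nothing moves (up to float noise).
same_grid_error = max_abs_error(released, quantize_blocks(released, block_size=4))

# Requantize with one scale over all 8 elements: the small block is crushed.
other_grid_error = max_abs_error(released, quantize_blocks(released, block_size=8))

print(same_grid_error, other_grid_error)
```

the real format has more moving parts than this, which is the point: the trained weights only make sense on the grid they were trained for.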

vllm runs dsv4 flash fine right now.

dgx sparks cannot really run it correctly right now with the released vllm, but there are PRs, so it's just a matter of time. you would need 3 of them, and even then they will be a bit less than half as fast as the mac studio.

so the punchline is, well, this is why the 512g mac studio is such a hot commodity right now.


alfiedotwtf 15d ago

Unfortunately, I didn't get a Mac with lots of RAM back when it was cheap, and I'd personally rather focus on moving away from Apple and going Linux full-time at work and home (I currently use a MacBook as a laptop connected to my big rig, though it's not that big compared to what the AI people in here have).

zozbot234 15d ago

What kind of RAM does your MacBook have? It might still be worth experimenting w/ DS4 using disk offload, though it would be dog slow at best and the RAM would be much too limited for meaningful parallelism, especially for larger contexts.

alfiedotwtf 15d ago

This might be my only hope until RAM prices come down to human levels again

zozbot234 15d ago

If you have a 256 GB or 512 GB Mac Studio, the real game is to run multiple sessions in parallel in order to make the best use of your limited memory bandwidth. You'd have plenty of excess RAM for that given how small the KV cache is even at max context.
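A back-of-the-envelope sketch of why parallel sessions help when decoding is limited by memory bandwidth. This is a deliberately simple model under stated assumptions: each decode step reads the active weights once for all sessions plus one KV cache per session, compute limits are ignored, and every session is assumed to reuse the same weight reads (optimistic for sparse models). The function name and all of the numbers are hypothetical placeholders, not measurements of any Mac or any model:

```python
def aggregate_tokens_per_second(sessions, weight_gb, kv_gb_per_session, bandwidth_gb_s):
    """Estimate total decode throughput across sessions, bandwidth-bound only.

    Each step streams the weights once (shared by all sessions) and each
    session's own KV cache, and produces one token per session.
    """
    gb_read_per_step = weight_gb + sessions * kv_gb_per_session
    step_seconds = gb_read_per_step / bandwidth_gb_s
    return sessions / step_seconds


# Placeholder figures chosen only to show the shape of the curve.
WEIGHT_GB = 20.0         # hypothetical weights read per decode step
KV_GB = 0.5              # hypothetical KV cache read per session per step
BANDWIDTH_GB_S = 800.0   # hypothetical memory bandwidth

one_session = aggregate_tokens_per_second(1, WEIGHT_GB, KV_GB, BANDWIDTH_GB_S)
four_sessions = aggregate_tokens_per_second(4, WEIGHT_GB, KV_GB, BANDWIDTH_GB_S)

for n in (1, 2, 4, 8):
    total = aggregate_tokens_per_second(n, WEIGHT_GB, KV_GB, BANDWIDTH_GB_S)
    print(f"{n} sessions: {total:.1f} tokens/s total, {total / n:.1f} per session")
```

In this model the total keeps climbing as sessions are added because the weight read is amortized, and the smaller the per-session KV cache is relative to the weights, the closer the scaling gets to linear. Each individual session gets slightly slower, so it is a throughput win, not a latency win.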
