> I want to peel back the layers of the onion and other gluey-mess to gain insight into these models.
Then this is great.
If your goal is
> Run and explore Llama models locally with minimal dependencies on CPU
then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.
> such great performance that I've mostly given up on GPU for LLMs
I mean I used to run ollama on GPU, but llamafile was approximately the same performance on just CPU so I switched. Now that might just be because my GPU is weak by current standards, but that is in fact the comparison I was making.
Edit: Though to be clear, ollama would easily be my second pick; it also has minimal dependencies and is super easy to run locally.
Looks like there’s a typo: Windows is mentioned twice.
First time I've had an "it just works" experience with LLMs on my computer. Amazing. Thanks for the recommendation!
148 tokens predicted, 159 ms per token, 6.27 tokens per second.
It's impressive to realize how little code is needed to run these models at all.
Seems like torchchat is exactly what the author was looking for.
> And the 8B model typically gets killed by the OS for using too much memory.
Torchchat also provides some quantization options so you can reduce the model size to fit into memory.
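For intuition on why quantization helps here, a generic int8 round-trip in numpy shows the memory math (this is just an illustrative sketch of symmetric per-tensor quantization, not torchchat's actual scheme, and the matrix is a made-up stand-in for one weight tensor):

```python
import numpy as np

# Hypothetical stand-in for a single float32 weight matrix.
w = np.random.randn(1024, 1024).astype(np.float32)

# Symmetric per-tensor int8 quantization: pick a scale so the largest
# magnitude maps to 127, then round to 8-bit integers.
scale = np.abs(w).max() / 127.0
w_q = np.round(w / scale).astype(np.int8)

# Dequantize back to float32 for use at inference time.
w_dq = w_q.astype(np.float32) * scale

print(w.nbytes // w_q.nbytes)         # 4: int8 storage is 4x smaller than float32
print(np.abs(w - w_dq).max() <= scale / 2 + 1e-6)  # True: worst-case rounding error is ~scale/2
```

That 4x (or more, with 4-bit schemes) is what lets an 8B model that the OS would otherwise kill fit in RAM.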
FYI, this just imports the Llama reference implementation and patches the device.
There are more robust implementations out there.
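"Patching the device" in a reference implementation usually amounts to monkeypatching a module attribute before calling into it. Here's a generic Python sketch of that pattern, with entirely hypothetical names (`reference`, `DEVICE`, `load_model`), not the project's actual code:

```python
import types

# Hypothetical stand-in for a reference implementation that hard-codes "cuda".
reference = types.ModuleType("reference")
reference.DEVICE = "cuda"

def load_model():
    # Reads the module-level device setting at call time.
    return f"model on {reference.DEVICE}"

reference.load_model = load_model

# The "patch": overwrite the attribute before calling in, instead of
# forking and editing the upstream code.
reference.DEVICE = "cpu"
print(reference.load_model())  # model on cpu
```

It's a quick way to repurpose upstream code, but it's fragile (it breaks if upstream captures the device at import time), which is part of why more robust standalone implementations exist.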