> I want to peel back the layers of the onion and other gluey-mess to gain insight into these models.
Then this is great.
If your goal is
> Run and explore Llama models locally with minimal dependencies on CPU
then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.
> such great performance that I've mostly given up on GPU for LLMs
I mean I used to run ollama on GPU, but llamafile was approximately the same performance on just CPU so I switched. Now that might just be because my GPU is weak by current standards, but that is in fact the comparison I was making.
Edit: Though to be clear, ollama would easily be my second pick; it also has minimal dependencies and is super easy to run locally.
Looks like there’s a typo: Windows is mentioned twice.
First time I've had an "it just works" experience with LLMs on my computer. Amazing. Thanks for the recommendation!
148 tokens predicted, 159 ms per token, 6.27 tokens per second.
It's impressive to realize how little code is needed to run these models at all.
Seems like torchchat is exactly what the author was looking for.
> And the 8B model typically gets killed by the OS for using too much memory.
Torchchat also provides some quantization options so you can reduce the model size to fit into memory.
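For intuition on why quantization helps here, a generic int8 round-trip in numpy shows the memory math (this is just an illustrative sketch of symmetric per-tensor quantization, not torchchat's actual scheme, and the matrix is a made-up stand-in for one weight tensor):

```python
import numpy as np

# Hypothetical stand-in for a single float32 weight matrix.
w = np.random.randn(1024, 1024).astype(np.float32)

# Symmetric per-tensor int8 quantization: pick a scale so the largest
# magnitude maps to 127, then round to 8-bit integers.
scale = np.abs(w).max() / 127.0
w_q = np.round(w / scale).astype(np.int8)

# Dequantize back to float32 for use at inference time.
w_dq = w_q.astype(np.float32) * scale

print(w.nbytes // w_q.nbytes)         # 4: int8 storage is 4x smaller than float32
print(np.abs(w - w_dq).max() <= scale / 2 + 1e-6)  # True: worst-case rounding error is ~scale/2
```

That 4x (or more, with 4-bit schemes) is what lets an 8B model that the OS would otherwise kill fit in RAM.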
FYI, this just imports the Llama reference implementation and patches the device.
There are more robust implementations out there.
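"Patching the device" in a reference implementation usually amounts to monkeypatching a module attribute before calling into it. Here's a generic Python sketch of that pattern, with entirely hypothetical names (`reference`, `DEVICE`, `load_model`), not the project's actual code:

```python
import types

# Hypothetical stand-in for a reference implementation that hard-codes "cuda".
reference = types.ModuleType("reference")
reference.DEVICE = "cuda"

def load_model():
    # Reads the module-level device setting at call time.
    return f"model on {reference.DEVICE}"

reference.load_model = load_model

# The "patch": overwrite the attribute before calling in, instead of
# forking and editing the upstream code.
reference.DEVICE = "cpu"
print(reference.load_model())  # model on cpu
```

It's a quick way to repurpose upstream code, but it's fragile (it breaks if upstream captures the device at import time), which is part of why more robust standalone implementations exist.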