What sort of on-board compute do you typically have today?
> As for what you should look to invest in?
> I'm sure it's just a coincidence that training neural networks and mining cryptocurrencies are both applications that benefit from very large arrays of GPUs. [...]
> If I was a VC I'd be hiring complexity theory nerds to figure out what areas of research are promising once you have Yottaflops of numerical processing power available, then I'd be placing bets on the GPU manufacturers going there
[1]: https://www.antipope.org/charlie/blog-static/2023/02/place-y...
With a bunch of people trailing behind with "it kind of works" open alternatives.
It's not so bad. Nvidia could come out and say, "hey, we're going to lock down your GPU so that it can only render polygons in our whitelisted video games, and you'll pay us $$$$$$ for our 'datacenter' thingy for anything else." But if they did that, people would go and buy the competitor's product.
And yes, some of their 4090s are probably being bought by rich kids with their parents' money, but I reckon most are sales to professionals, people who would justify the purchase with more than playing first-person shooters. I, for example, play video games with my gf, and we have equivalent GPUs. Hers is AMD and costs less than mine, even though it performs the same, but I went with Nvidia so that PhysX would be available and I could use PyTorch and Numba+GPU and even C++ CUDA. The moment Nvidia locks that down, I'll have to switch to AMD.
Meatcubator: https://youtu.be/Z_ZGq8Tah0k
Growing human brain cells: https://youtu.be/V2YDApNRK3g
(When fed the leaked Bing prompt, my AI decided it was Australian and started tossing in random shit like "but here in Australia, we'd call it limey green" when asked about chartreuse. I assume it's because the codename for Bing Chat is 'Sydney'.)
To read more about current popular models: https://github.com/KoboldAI/KoboldAI-Client
https://koboldai.net/ is a way to run some of these models in the "cloud". No account is required; prompts run on other people's hardware, with priority weighting based on how much compute you have used or donated. There's an anonymous API key, and there's no expectation that the output won't be logged.
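If you want to poke at it programmatically, here's a minimal sketch. The endpoint paths, payload shape, and the all-zeros anonymous key are my assumptions about the Horde API; check the current docs before relying on any of it.

    # Hypothetical sketch of submitting a prompt to the KoboldAI Horde.
    # Endpoint paths, payload fields, and the all-zeros anonymous key are
    # assumptions; verify against the current API docs.
    import time
    import requests

    BASE = "https://horde.koboldai.net/api/v2"  # assumed base URL
    HEADERS = {"apikey": "0000000000"}          # assumed anonymous key

    def generate(prompt, max_length=80):
        # Submit an async job to the volunteer worker pool.
        r = requests.post(
            f"{BASE}/generate/text/async",
            headers=HEADERS,
            json={"prompt": prompt, "params": {"max_length": max_length}},
        )
        r.raise_for_status()
        job_id = r.json()["id"]
        # Poll until a worker picks the job up and finishes it.
        while True:
            status = requests.get(f"{BASE}/generate/text/status/{job_id}").json()
            if status.get("done"):
                return status["generations"][0]["text"]
            time.sleep(2)

    print(generate("Once upon a time"))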
The models that run locally on consumer hardware produce pretty basic output. Here's an example of a 6B model being used to try to emulate ChatGPT: https://mobile.twitter.com/Knaikk/status/1629711223863345154 That model was fine-tuned on story completion, though, so it's not a meaningful comparison.
It's less popular because the hardware required for great output is still above top-of-the-line consumer specs. 24 GB of VRAM is closer to a bare minimum for meaningful output, and fine-tuning is still out of reach. There's some development around using services like RunPod.
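For a rough sense of why 24 GB is the floor, here's my own back-of-the-envelope arithmetic (not from the thread): weight memory is roughly parameter count times bytes per parameter, before activations and overhead.

    # Rough VRAM needed just for the weights: params * bytes per param.
    # My own estimate; activations, KV cache, and overhead come on top.
    def weight_gb(params_billions, bytes_per_param):
        return params_billions * 1e9 * bytes_per_param / 1024**3

    for size in (7, 13, 33, 65):
        print(f"{size}B  fp16: {weight_gb(size, 2):6.1f} GB"
              f"   int8: {weight_gb(size, 1):6.1f} GB")
    # 7B at fp16 is already ~13 GB, so even the smallest model is tight
    # on a 24 GB card; 65B is far beyond any consumer GPU at fp16.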
Stable Diffusion was in the same place at the same point after its model was released. It's only been a few days.
You can download it from Facebook, but it's behind an "apply for access" form. The magnet links floating around are just a workaround for that form.
That said, commercial use is forbidden by the license specified in the form: https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z...
1) Spin it up on a cluster in Belarus
2) ???
3) Profit?
FB trained a LLaMA-I (instruction-tuned) variant, just for sport, to show they can, but I don't think it got released.
User: <question or task>
Assistant:
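If you're scripting that format, a trivial helper might look like this. The template is the one quoted above; trimming the completion at the next "User:" turn is my assumption about how people usually stop generation.

    # Minimal helper for the User:/Assistant: template above. Cutting the
    # completion at the next "User:" turn is an assumption, not part of
    # the quoted template.
    def build_prompt(question):
        return f"User: {question}\nAssistant:"

    def extract_reply(completion):
        # Raw base models tend to keep writing the next turn; trim it.
        return completion.split("User:")[0].strip()

    print(build_prompt("Summarize the LLaMA license in one sentence."))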
Not so bad!
I computed the speed as: speed = number of words / total run time
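In code, that measurement is just a timer around the generation call. run_model here is a hypothetical stand-in for whatever you're timing, and splitting on whitespace is the same rough word count the comment uses:

    import time

    def run_model(prompt):
        # Hypothetical stand-in for the actual generation call being timed.
        return "the quick brown fox jumps over the lazy dog " * 10

    start = time.perf_counter()
    output = run_model("Tell me a story")
    elapsed = time.perf_counter() - start

    speed = len(output.split()) / elapsed  # words per second
    print(f"{speed:.2f} words/s over {elapsed:.4f} s")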
It's not that hard to create a consumer-grade desktop with 256GB in 2023.
Even without enough RAM, you can stream model weights from disk and run at [size of model / disk read speed] seconds per token (worked example below).
I'm doing that on a small GPU with this code, but it should be easy to get it working with the CPU as compute instead. (At least with my disk and CPU, I'm not sure it would even run slower; I think disk reads would still be the bottleneck.)
A lack of an absurd number of CPUs just means it's slow, not impossible.
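To put illustrative numbers on that [size of model / disk read speed] bound (my figures, not the poster's measurements):

    # Worked example of the [model size / disk read speed] estimate above.
    # Illustrative figures, not measurements from the thread.
    model_gb = 130        # e.g. 65B params at 2 bytes each (fp16)
    disk_gb_per_s = 3.0   # a fast NVMe drive reading sequentially

    seconds_per_token = model_gb / disk_gb_per_s
    print(f"~{seconds_per_token:.0f} s/token")  # ~43 s/token: slow, not impossible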
However, according to the benchmarks the 65B-parameter model is such a beast that you might be able to do things with it that aren't possible on ChatGPT (despite all of ChatGPT's quality-of-life features). Amazing times.
So before you start a task, you sort of describe the domain, and the model is split into the third most useful and relevant to that topic/query and the two thirds most distant from that realm. Then either just that third is used on its own, or the split works as two layers of cache, one in RAM and one on disk.
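Here's a hypothetical sketch of how that two-tier idea could look; the block granularity, the relevance scoring, and all the names are mine, since nothing like this exists in the released code:

    # Hypothetical sketch of the two-tier idea above: pin the third of the
    # weight blocks judged most relevant to the stated domain in RAM, and
    # page the remaining two thirds from disk on demand. Names and the
    # relevance heuristic are invented for illustration.
    class TwoTierWeights:
        def __init__(self, blocks, relevance):
            self.blocks = blocks        # block name -> path to weights on disk
            self.relevance = relevance  # (domain, block name) -> score
            self.hot = {}               # the in-RAM tier

        def prepare(self, domain):
            # Rank blocks by relevance to the domain, pin the top third.
            ranked = sorted(self.blocks,
                            key=lambda b: self.relevance(domain, b),
                            reverse=True)
            keep = ranked[: max(1, len(ranked) // 3)]
            self.hot = {b: open(self.blocks[b], "rb").read() for b in keep}

        def load(self, name):
            # RAM tier first, then fall back to a disk read.
            if name in self.hot:
                return self.hot[name]
            return open(self.blocks[name], "rb").read()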
Per the readme, it looks like there are a few bugs to figure out, in case anyone here is a PyTorch expert.
Most gaming desktops have a solid GPU but not enough VRAM. It's a pity to have the GPU sit idle here.
Uh-oh, bad start.
It could be venv as well, I suppose; I haven't used conda.
Looks like this just tweaks some defaults and comments out the code that enables CUDA. It also switches to something called gloo, which I'm not familiar with; seems to be an alternate backend.
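For context: gloo is PyTorch's CPU-capable collective-communication backend, while nccl is the CUDA one, so swapping it in makes sense for a CPU-only run. A minimal single-process sketch of picking between them (the rendezvous address is arbitrary):

    # gloo is PyTorch's CPU-friendly distributed backend; nccl needs CUDA.
    # Minimal single-process sketch; the rendezvous address is arbitrary.
    import torch
    import torch.distributed as dist

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method="tcp://127.0.0.1:29500",
        rank=0,
        world_size=1,
    )
    print(f"initialized {backend} backend")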
Mark LLM: “Yes, unfortunately, the media and our competitors are all over the idea that Meta is a “dirty company”. They have tried to spin all our successes and accomplishments in a negative light. This has been incredibly frustrating and demoralizing for us, but we know that we are working hard to build a great company and we are confident that our efforts will be rewarded. In the end, our products speak for themselves, and despite all the negative media coverage we are focused on continuing to build great products for our users and being an amazing place for them to socialize in the virtual world.”