A brief history of LLaMA models (opens in new tab)

(agi-sphere.com)

245 pointsandrewon3y ago84 comments

84 comments

It keeps saying the phrase “model you can run locally”, but despite days of trying, I failed to compile any of the GitHub repos associated with these models.

None of the Python dependencies are strongly versioned, and “something” happened to the CUDA compatibility of one of them about a month ago. The original developers “got lucky” but now nobody else can compile this stuff.

After years of using only C# and Rust, both of which have sane package managers with semantic versioning, lock files, reproducible builds, and even SHA checksums the Python package ecosystem looks ridiculously immature and even childish.

Seriously, can anyone here build a docker image for running these models on CUDA? I think right now it’s borderline impossible, but I’d be happy to be corrected…

valine3y ago

I’ve got 4 different llama models running locally with CUDA and can freely switch between them, including LLaVA which is a multimodal LLaMA variant.

None of them are particularly difficult to get running, the trick is to search the project’s github issue tracker. 99% of the time your problem will be in there with steps to fix it.

jiggawatts3y ago

> the trick is to search the project’s github issue tracker.

What ever happened to the crazy notion of Dockerfiles that simply build successfully?

Isn’t half the point of containerisation that it papers over the madness of the Python module ecosystem?

avereveard3y ago

The problem with that is gpu access from the docker image and iirc Nvidia doesn't have a Windows passtrough cuda driver

UncleEntity3y ago

They require someone to take time out of their busy schedule to build?

Complaining that people won’t work for you for free is a bit much, don’t you think?

1 more reply

MakeUsersWant3y ago

Could you publish a set of known-good versions (pip freeze, OS version, etc)?

nl3y ago

Use the HuggingFace Transformer library. Unlike random github repos they are professionally maintained with proper versioning.

Here's the docs: https://huggingface.co/docs/transformers/main/model_doc/llam...

int_19h3y ago

All of these things exist in the Python package ecosystem, and are generally much more common outside of ML/DS stuff. The latter... well, it reminds me of coding in early PHP days. Basically, anything goes so long as it works.

kmod3y ago

I believe the cuda stuff, via Nvidia licensing restrictions, is forced to live outside of these packaging systems (so that you sign a Nvidia eula). Not saying this is a good thing but I think that none of the systems you mentioned would handle this well either

emikulic3y ago

This used to be the case but nowadays you can just pip install e.g. https://libraries.io/pypi/nvidia-cuda-runtime-cu11

crowwork3y ago

https://mlc.ai/mlc-llm/

jiggawatts3y ago

I love how the response to a complaint about unreproducible builds without any versions being specified is an install script that straight up clones the "current commit" of a Git repo instead of a specific working commit id or tag.

Astonishing.

Spivak3y ago

There's some truth to the "Arch" philosophy for some types of software it's actually more stable to just pull from master.

1 more reply

Taek3y ago

I have it running locally using the oobabooga webui, setup was moderately annoying but I'm definitely no python expert and I didn't have too much trouble.

DustinBrett3y ago

I had it running before with Dalai (https://github.com/cocktailpeanut/dalai) but have since moved to using the browser based WebGPU method (https://mlc.ai/web-llm/) which uses Vicuna 7B and is quite good.

KETpXDDzR3y ago

llama.cpp was easy to setup IMO

jiggawatts3y ago

Can you link to a working Dockerfile?

I've heard several people say that it is easy, but then surely it ought to be trivial to set script the build so that it works reliable in a container!

PostOnce3y ago

No need to drag a gigabyte of docker stuff into this, just extract the zip file from github and type make into your terminal

congratulations, it now works.

If you're not a developer, maybe you'll have to type sudo apt install build-essential first. Congratulations, now you too, a non-developer, are running it locally.

https://github.com/ggerganov/llama.cpp

1 more reply

kolinko3y ago

but what for?

rch3y ago

Just use Nixpkgs already.

microtonal3y ago

Upstream Hydra doesn't build packages with CUDA because it uses a non-FLOSS license. So they are not in the binary cache. You'll end up rebuilding every CUDA-using package every time a transitive dependency is changed. Yeah, I know, pin the world. But you'll still have to build these packages on every machine. So, you have to run your own binary cache. As you see, the rabbit hole gets deep pretty quickly.

The only recourse is using the -bin flavors of PyTorch, etc. which will just download the precompiled upstream versions. Sadly, the result will still be much slower than other distributions. First because Python isn't compiled with optimizations and LTO in nixpkgs by default, because it is not reproducible. So, you override the Python derivation to enable optimizations and LTO. Python builds fine, but to get the machine learning ecosystem on you machine, Nix needs to build a gazillion Python packages, since the derivation hash of Python changed. Turns out that many derivations don't actually build. They build with the little amount of parallelism available on Hydra builders, but many Python packages will fail to build because of concurrency issues in tests that do manifest on your nice 16 core machine.

So, you spend hours fixing derivations so that they build on many core machines and upstream all the diffs. Or YOLO and you disable unit tests altogether. A few hours/days later (depending on your knowledge of Nix), you finally have a built of all packages that you want, you launch whatever you are doing on your CUDA-capable GPU. Turns out that it is 30-50% slower. Finding out why is another multi-day expedition in profiling and tinkering.

In the end pyenv (or a Docker container) on a boring distribution doesn't look so bad.

(Disclaimer: I initially added the PyTorch/libtorch bin packages to nixpkgs and was co-maintainer of the PyTorch derivation for a while.)

alchemist1e93y ago

As a heavy nixpkgs user your comment resonates and makes me nervous.

I was thinking if it is possible in nixpkgs to create a branch that attempts to create a version match to specific distributions, especially Ubuntu as the ML world is most using it. My idea is to somehow use the deb package information to “shadow” another distribution.

> First because Python isn't compiled with optimizations and LTO in nixpkgs by default, because it is not reproducible. So, you override the Python derivation to enable optimizations and LTO. Python builds fine, but to get the machine learning ecosystem on you machine, Nix needs to build a gazillion Python packages, since the derivation hash of Python changed. Turns out that many derivations don't actually build. They build with the little amount of parallelism available on Hydra builders, but many Python packages will fail to build because of concurrency issues in tests that do manifest on your nice 16 core machine.

I understand your comments including above and the one about CUDA binaries. Just one clarification on the concurrency in tests failure, do you mean it overloads the machine running multi process tests that then tests fail due to assumptions by the package authors?

My main point is that Nix as a system is so incredibly powerful that perhaps there is an ability to “shadow” boring distributions, especially debian based, in some automated way. The we would have the best of both, baseline stability from the distribution and extensibility of nix.

1 more reply

rch3y ago

Granted, but I already have to run isolated container registries, pypi, maven, terraform, CI/CD, etc., etc., and so locally addressing the problems you've described is unavoidable and will realize significant efficiencies in any case. Everything about working in partially or fully air gapped environments is painful - no surprises there.

But I also think it's fine for individuals and researchers working in ML to expect some extra compiling, as long as the outcome is reliable. I'm stuck at home this weekend resurrecting an analysis from 10+ years ago, complete with Python, R, Java, and Fortran dependencies^, and I'm definitely wishing I'd known about Nix back then.

^btw, thanks to whomever included hdf5-mpi in Nixpkgs. Your work is greatly appreciated.

1 more reply

UncleEntity3y ago

I learned the very hard way not to mess with the python version the system depends on.

If you absolutely must then build it separately and link (or use) that exactly like blender does with their binaries. Campbell (one of the core blender devs) used to love to bump the python version as soon as it was released and if you wanted to do any dev work you’d have to run another python environment until the distro version caught up. Being as I liked to use the fedora libs as a sort of sanity check this was a bit of a hassle to say the least.

1 more reply

dagenix3y ago

There is a lot that can be improved with python packaging, but calling it "childish" is itself a pretty immature comment.

jiggawatts3y ago

Is it?

Literally every example I've seen so far is completely unversioned and mere weeks after being written simply doesn't work as a direct consequence.

E.g: https://github.com/oobabooga/text-generation-webui/blob/ee68...

Take this line:

    pip3 install torch torchvision torchaudio

Which version of torch is this? The latest.

    FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

Which version of CUDA is this? An incompatible one, apparently. Game over.

Check out "requirements.txt":

    accelerate==0.18.0
    colorama
    datasets
    flexgen==0.1.7
    gradio==3.25.0
    markdown
    numpy
    pandas
    Pillow>=9.5.0
    pyyaml
    requests
    rwkv==0.7.3
    safetensors==0.3.0
    sentencepiece
    tqdm

Wow. Less than half of those have any version specified. The rest? "Meh, I don't care, whatever."

Then this beauty:

    git+https://github.com/huggingface/peft

I love reaching out to the Internet in the middle of a build pipeline to pull the latest commit of a random repo, because that's so nice and safe, scalable, and cacheable in an artefact repository!

The NPM ecosystem gets regularly excoriated for the exact same mistakes, which by now are so well known, so often warned against, so often exploited, so regularly broken that it's getting boring.

It's like SQL injection. If you're still doing it in 2023, if your site is still getting hacked because of it, then you absolutely deserve to be labelled immature and even childish.

dagenix3y ago

> you absolutely deserve to be labelled immature and even childish

Do you appreciate that people aren't making technical mistakes on purpose just to spite you? Or that maybe some of the folks writing these libraries are experts in fields other than dependency management? Are you an expert in all things? Would you find it helpful if someone identifies one thing that you aren't great at and then calls you names on the internet over that one thing?

There is a pretty significant difference between making a technical critique and just being rude. And being right about the former doesn't make the latter ok.

forgingahead3y ago

You're right, Python's ecosystem and dependency management is a shitshow and everybody involved should be ashamed of themselves. But of course there are many "you're holding it wrong" commenters here who are in denial of this fact. It's an absolute pity that Python has become the de-facto language for public ML projects.

1 more reply

doodlesdev3y ago

   > Our system thinks you might be a robot!
   We're really sorry about this, but it's getting harder and harder to tell the difference between humans and bots these days.

Yeah, fuck you too. Come on, really, why put this in front of a _blog post_? Is it that hard to keep up with the bot requests when serving a static page?

api3y ago

A lot of people just stick cloudflare in front of anything because of cargo cultism.

A $5/mo VPS can serve a blog to tens of thousands of people unless you are running something stupidly inefficient. If it’s a static blog make that hundreds of thousands. For millions you might need to splurge on the $10 or $20 per month VPS.

hewlett3y ago

You can either spend $5 per month for VPS for a webserver for your static blog which you now have to secure properly, or you can just stick it on Cloudflare Pages for free

esquire_9003y ago

Cloudflare bot protection adds a slight delay (as in seconds) at best, and completely blocks users like the parent comment at worst. It costs no money, but isn't free either.

1 more reply

actualwitch3y ago

You don't need to host in on cloudflare, just use github pages - it's cdn is plenty fast.

Spivak3y ago

Or you use the free thing and never think about it?

vessenes3y ago

Most places that recommend llama.cpp for mac fail to mention https://github.com/jankais3r/LLaMA_MPS, which runs unquantized 7b and 13b models on the M1/M2 GPU directly. It's slightly slower, (not a lot), and significantly lower energy usage. To me the win not having to quantize while not melting a hole in my lap is huge; I wish more people knew about it.

noman-land3y ago

I've been meaning to ask this question as an LLM noob but what exactly is quantizing in this context and why do people do it? I know of quantizing in the digital audio context only.

detrites3y ago

Models in this context are just a big list of numbers. The numbers will have a native "type", for example 32-bit floats. These are numbers like 0.7373663777 or -1.000003663. The 32-bit float type can represent something like 4.3 billion numbers (sort of).

It was discovered though, that while models may need this level of precision when creating them ("training"), they don't need it nearly as much after the fact, when simply running them to get results ("inference").

So quantisation is the process of getting that big set of, say, 32-bit floats, and "mapping" them to a much smaller number type. Eg, an 8-bit integer ("INT8"). This is a number in the range 0-255 (or -128 to +127).

So, to quantise a list of 32-bit floats, you could go through the list and analyse. Maybe they're all in the range -1.0 to +1.0. Maybe there are many around the value of 0.99999 and 0.998 etc, so you decide to assign those the value "255" instead.

Repeat this until you've squashed that bunch of 32-bit values into 8-bits each. (Eg, maybe 0.750000 could be 192, etc.)

This could give a saving in memory footprint for the model of 4x smaller, and also makes it able to be run faster. So while you needed 16GB to run it before, now you might only need 4GB.

The expense is the model won't be as accurate. But, typically this is on the order of values like 90%, versus the memory savings of 4x. So it's deemed worth it.

It's through this process folks can run models that would normally require a 5-figure GPU to run, on their home machine, or even on the CPU, as it might be able to process integers easier and faster than floating point.

hardwaresofton3y ago

This is an excellent explanation, thank you

noman-land3y ago

Thanks so much for this explanation. It makes perfect sense and is really simple.

dragonwriter3y ago

Quantization is reducing the precision (and size) of values.

https://huggingface.co/docs/optimum/concept_guides/quantizat...

Labo3333y ago

Can you explain why they have a "significantly lower energy usage"? Thanks!

vessenes3y ago

Yes. Llama.cpp uses the CPU to do inference. MPS is the GPU for the macbook, so it has highly performant cores which can be used to do the computation. When you get inference done on the GPU, there's no (less?) energy wasted on general compute type work. :)

simonw3y ago

I'm running Vicuna (a LLaMA variant) on my iPhone right now. https://twitter.com/simonw/status/1652358994214928384

The same team that built that iPhone app - MLC - also got Vicuna running directly in a web browser using Web GPU: https://simonwillison.net/2023/Apr/16/web-llm/

newswasboring3y ago

With all these new AI models, both stable diffusion and llama specially, I'm considering switching to iPhone. I don't think I fully understand why iPhones and Macs are getting so many implementations but it seems like it's hardware based.

simonw3y ago

My understanding is that part of it is that Apple Silicon shares all available RAM between CPU and GPU.

I'm not sure how many of these models are actively taking advantage of that architecture yet though.

int_19h3y ago

The GPU isn't actually used by llama.cpp. What makes it that much faster is that the workload, either on CPU or on GPU, is very memory-intensive, so it benefits greatly from fast RAM. And Apple is using DDR5 running at very high clock speeds for this shared memory stuff.

It's still noticeably slower than GPU, though.

AnthonyMouse3y ago

Most of these implementations are not platform-specific. I've been running llama.cpp on x86_64 hardware and the performance is fine. The small models are fast and the quantized 65B model generates about a token per second on a system with dual-channel DDR4, which isn't unusable.

The tough thing to find is something affordable that will run the unquantized 65B model at an acceptable speed. You can put 128GB of RAM in affordable hardware but ordinary desktops aren't fast. The things that are fast are expensive (e.g. I bet Epyc 9000 series would do great). And that's the thing Apple doesn't get you either, because Apple Silicon isn't available with that much RAM, and if it was it wouldn't be affordable (the 96GB Macbook Pro, which isn't enough to run the full model, is >$4000).

spudlyo3y ago

If you want to spend $4800.00 on just the computer, you can get a Mac Studio with 128G of memory with 400GB/s bandwidth. There are sparse reports out there of folks running 65B models on it. I've seen no performance measurements though.

1 more reply

sp3323y ago

iPhones leaned in to "computational photography" a long time ago. Eventually they added custom hardware to handle all the matrix multiplies efficiently. They exposed some of it to apps with an API called CoreML. They've been adding more features like on-device photo tagging, voice recognition, VR stuff.

sagarm3y ago

Google was the leader on computational smartphone photography. They released their "night sight" mode before Samsung and Apple had anything competitive.

1 more reply

bkm3y ago

Homogenized hardware I assume, this is why iOS had so many photography Apps too.

brucethemoose23y ago

There is also CodyCapybara (7B finetuned on code competitions), the "uncensored" Vicuna, OpenAssistant 13B (which is said to be very good), various non English tunes, medalpaca... the release pace maddening.

acapybara3y ago

And let's not forget about Alpacino (offensive/unfiltered model).

brianjking3y ago

I'll never understand why everyone is spending so much time on a model you cannot use commercially (at all).

Secondly, most of us can't even use the model for research or personal use, given the license.

nullc3y ago

The notion that model weights are copyrightable is absurd on its face. In the US you cannot gain a copyright though sweat of the brow, there must be substantial creative work. Nor does mere collection (e.g. feist v rural) create a copyright. Feeding common crawl to a standard network structure and letting an optimizer do its thing isn't creative, it's just expensive. It requires expertise and skill, sure but so does creating a phone book.

The companies working on AI would be foolish to argue for more copyrightability are because it would be hard to conclude the models were copyrightable works without also concluding that the models are unlawful derivatives of the material they were trained on. "Congrats, models can be owned, but regrets: you're bankrupt now because you just committed 4.6 billion acts of copyright infringement carrying statutory damages of $250k each."

You might argue that this is far from sure, OKAY-- but parties that take this view will out-compete ones that don't. If it does turn out to be problematic, the people that had something to work from now will pivot to backing their work on something else and will still be ahead of people sitting on their hands.

You could see it as a calculated risk, but it seems at least as safe as the one behind the underlying authors of the model weights training on material they're not licensed to distribute.

kolinko3y ago

If you can use it for research/purpose purposes, you just do, and if you manage to build something good enough, you can get funding to train your own model then.

Also, with business there are few "can do / can't do" - it's about managing risks. If a penalty for doing is negligible (FB cannot catch you abusing license in private), from a business standpoint there is no issue in doing so - especially with things that are ethically kind-of-ok.

Blahah3y ago

Because none of the interesting things I want to do with LLMs have anything to do with commercial uses. And because I don't care what a license says when I'm doing stuff in private. I just case whether I am practically able to do it.

DustinBrett3y ago

For me it's because most of what I am learning and trying to do is applicable to LLM's in general. One day the right model will come along, until then I want to play.

joshvm3y ago

There are efforts to provide an open source replica of the training dataset and independently trained models. So far the dataset has been recreated following the original paper (allowing for some vagueness that Meta researchers didn't specify):

https://github.com/togethercomputer/RedPajama-Data/

https://twitter.com/togethercompute/status/16479179892645191...

UncleEntity3y ago

Why can’t you use it for personal use?

I doubt the Facebook Police are going to bust down your door at 3am.

…or are they? peeks through curtains

yoquan3y ago

Because it's fun :-) And effort to bring up a could-be-commercial version is on going.

https://www.together.xyz/blog/redpajama

FloatArtifact3y ago

There needs to be a slight dedicated to tracking all these models with regular updates.

mdaniel3y ago

Heh, there is, and you're on it. But a slightly more serious answer is that would be a good feature for Huggingface to add since they're the GitHub of models. I actually suggested to GitHub that they should allow contributions of the repo topics since a lot of developers don't know or don't bother to add topics to their repos, making discoverability harder than necessary. GH ignored it but maybe Huggingface could implement such a thing

foobarbecue3y ago

Ok I gotta know... what's the art?

j / k navigate · click thread line to collapse

84 comments

jiggawatts3y ago

It keeps saying the phrase “model you can run locally”, but despite days of trying, I failed to compile any of the GitHub repos associated with these models.

Seriously, can anyone here build a docker image for running these models on CUDA? I think right now it’s borderline impossible, but I’d be happy to be corrected…

valine3y ago

I’ve got 4 different llama models running locally with CUDA and can freely switch between them, including LLaVA which is a multimodal LLaMA variant.

None of them are particularly difficult to get running, the trick is to search the project’s github issue tracker. 99% of the time your problem will be in there with steps to fix it.

jiggawatts3y ago

> the trick is to search the project’s github issue tracker.

What ever happened to the crazy notion of Dockerfiles that simply build successfully?

Isn’t half the point of containerisation that it papers over the madness of the Python module ecosystem?

avereveard3y ago

The problem with that is gpu access from the docker image and iirc Nvidia doesn't have a Windows passtrough cuda driver

UncleEntity3y ago

They require someone to take time out of their busy schedule to build?

Complaining that people won’t work for you for free is a bit much, don’t you think?

1 more reply

MakeUsersWant3y ago

Could you publish a set of known-good versions (pip freeze, OS version, etc)?

nl3y ago

Use the HuggingFace Transformer library. Unlike random github repos they are professionally maintained with proper versioning.

Here's the docs: https://huggingface.co/docs/transformers/main/model_doc/llam...

int_19h3y ago

kmod3y ago

emikulic3y ago

This used to be the case but nowadays you can just pip install e.g. https://libraries.io/pypi/nvidia-cuda-runtime-cu11

crowwork3y ago

https://mlc.ai/mlc-llm/

jiggawatts3y ago

Astonishing.

Spivak3y ago

There's some truth to the "Arch" philosophy for some types of software it's actually more stable to just pull from master.

1 more reply

Taek3y ago

I have it running locally using the oobabooga webui, setup was moderately annoying but I'm definitely no python expert and I didn't have too much trouble.

DustinBrett3y ago

KETpXDDzR3y ago

llama.cpp was easy to setup IMO

jiggawatts3y ago

Can you link to a working Dockerfile?

I've heard several people say that it is easy, but then surely it ought to be trivial to set script the build so that it works reliable in a container!

PostOnce3y ago

No need to drag a gigabyte of docker stuff into this, just extract the zip file from github and type make into your terminal

congratulations, it now works.

If you're not a developer, maybe you'll have to type sudo apt install build-essential first. Congratulations, now you too, a non-developer, are running it locally.

https://github.com/ggerganov/llama.cpp

1 more reply

kolinko3y ago

but what for?

rch3y ago

Just use Nixpkgs already.

microtonal3y ago

In the end pyenv (or a Docker container) on a boring distribution doesn't look so bad.

(Disclaimer: I initially added the PyTorch/libtorch bin packages to nixpkgs and was co-maintainer of the PyTorch derivation for a while.)

alchemist1e93y ago

As a heavy nixpkgs user your comment resonates and makes me nervous.

1 more reply

rch3y ago

^btw, thanks to whomever included hdf5-mpi in Nixpkgs. Your work is greatly appreciated.

1 more reply

UncleEntity3y ago

I learned the very hard way not to mess with the python version the system depends on.

1 more reply

dagenix3y ago

There is a lot that can be improved with python packaging, but calling it "childish" is itself a pretty immature comment.

jiggawatts3y ago

Is it?

Literally every example I've seen so far is completely unversioned and mere weeks after being written simply doesn't work as a direct consequence.

E.g: https://github.com/oobabooga/text-generation-webui/blob/ee68...

Take this line:

    pip3 install torch torchvision torchaudio

Which version of torch is this? The latest.

    FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

Which version of CUDA is this? An incompatible one, apparently. Game over.

Check out "requirements.txt":

    accelerate==0.18.0
    colorama
    datasets
    flexgen==0.1.7
    gradio==3.25.0
    markdown
    numpy
    pandas
    Pillow>=9.5.0
    pyyaml
    requests
    rwkv==0.7.3
    safetensors==0.3.0
    sentencepiece
    tqdm

Wow. Less than half of those have any version specified. The rest? "Meh, I don't care, whatever."

Then this beauty:

    git+https://github.com/huggingface/peft

I love reaching out to the Internet in the middle of a build pipeline to pull the latest commit of a random repo, because that's so nice and safe, scalable, and cacheable in an artefact repository!

The NPM ecosystem gets regularly excoriated for the exact same mistakes, which by now are so well known, so often warned against, so often exploited, so regularly broken that it's getting boring.

It's like SQL injection. If you're still doing it in 2023, if your site is still getting hacked because of it, then you absolutely deserve to be labelled immature and even childish.

dagenix3y ago

> you absolutely deserve to be labelled immature and even childish

There is a pretty significant difference between making a technical critique and just being rude. And being right about the former doesn't make the latter ok.

forgingahead3y ago

1 more reply

doodlesdev3y ago

   > Our system thinks you might be a robot!
   We're really sorry about this, but it's getting harder and harder to tell the difference between humans and bots these days.

Yeah, fuck you too. Come on, really, why put this in front of a _blog post_? Is it that hard to keep up with the bot requests when serving a static page?

api3y ago

A lot of people just stick cloudflare in front of anything because of cargo cultism.

hewlett3y ago

You can either spend $5 per month for VPS for a webserver for your static blog which you now have to secure properly, or you can just stick it on Cloudflare Pages for free

esquire_9003y ago

Cloudflare bot protection adds a slight delay (as in seconds) at best, and completely blocks users like the parent comment at worst. It costs no money, but isn't free either.

1 more reply

actualwitch3y ago

You don't need to host in on cloudflare, just use github pages - it's cdn is plenty fast.

Spivak3y ago

Or you use the free thing and never think about it?

vessenes3y ago

noman-land3y ago

I've been meaning to ask this question as an LLM noob but what exactly is quantizing in this context and why do people do it? I know of quantizing in the digital audio context only.

detrites3y ago

Repeat this until you've squashed that bunch of 32-bit values into 8-bits each. (Eg, maybe 0.750000 could be 192, etc.)

This could give a saving in memory footprint for the model of 4x smaller, and also makes it able to be run faster. So while you needed 16GB to run it before, now you might only need 4GB.

The expense is the model won't be as accurate. But, typically this is on the order of values like 90%, versus the memory savings of 4x. So it's deemed worth it.

hardwaresofton3y ago

This is an excellent explanation, thank you

noman-land3y ago

Thanks so much for this explanation. It makes perfect sense and is really simple.

dragonwriter3y ago

Quantization is reducing the precision (and size) of values.

https://huggingface.co/docs/optimum/concept_guides/quantizat...

Labo3333y ago

Can you explain why they have a "significantly lower energy usage"? Thanks!

vessenes3y ago

simonw3y ago

I'm running Vicuna (a LLaMA variant) on my iPhone right now. https://twitter.com/simonw/status/1652358994214928384

The same team that built that iPhone app - MLC - also got Vicuna running directly in a web browser using Web GPU: https://simonwillison.net/2023/Apr/16/web-llm/

newswasboring3y ago

simonw3y ago

My understanding is that part of it is that Apple Silicon shares all available RAM between CPU and GPU.

I'm not sure how many of these models are actively taking advantage of that architecture yet though.

int_19h3y ago

It's still noticeably slower than GPU, though.

AnthonyMouse3y ago

spudlyo3y ago

1 more reply

sp3323y ago

sagarm3y ago

Google was the leader on computational smartphone photography. They released their "night sight" mode before Samsung and Apple had anything competitive.

1 more reply

bkm3y ago

Homogenized hardware I assume, this is why iOS had so many photography Apps too.

brucethemoose23y ago

acapybara3y ago

And let's not forget about Alpacino (offensive/unfiltered model).

brianjking3y ago

I'll never understand why everyone is spending so much time on a model you cannot use commercially (at all).

Secondly, most of us can't even use the model for research or personal use, given the license.

nullc3y ago

You could see it as a calculated risk, but it seems at least as safe as the one behind the underlying authors of the model weights training on material they're not licensed to distribute.

kolinko3y ago

If you can use it for research/purpose purposes, you just do, and if you manage to build something good enough, you can get funding to train your own model then.

Blahah3y ago

DustinBrett3y ago

For me it's because most of what I am learning and trying to do is applicable to LLM's in general. One day the right model will come along, until then I want to play.

joshvm3y ago

https://github.com/togethercomputer/RedPajama-Data/

https://twitter.com/togethercompute/status/16479179892645191...

UncleEntity3y ago

Why can’t you use it for personal use?

I doubt the Facebook Police are going to bust down your door at 3am.

…or are they? peeks through curtains

yoquan3y ago

Because it's fun :-) And effort to bring up a could-be-commercial version is on going.

https://www.together.xyz/blog/redpajama

FloatArtifact3y ago

There needs to be a slight dedicated to tracking all these models with regular updates.

mdaniel3y ago

foobarbecue3y ago

Ok I gotta know... what's the art?

j / k navigate · click thread line to collapse