1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file from https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/...:
wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4
2. Make that binary executable by running this in a terminal: chmod 755 llamafile-server-0.1-llava-v1.5-7b-q4
3. Run your new executable, which will start a web server on port 8080: ./llamafile-server-0.1-llava-v1.5-7b-q4
4. Navigate to http://127.0.0.1:8080/ to upload an image and start chatting with the model about it in your browser. Screenshot here: https://simonwillison.net/2023/Nov/29/llamafile/
In case anyone here hasn't used chatgpt-web, I recommend trying it out. With the new GPT-4 models you can chat for way cheaper than paying for ChatGPT Plus, and you can also switch back to the older (non-nerfed) GPT-4 models that can still actually code.
For GPT-4 the pricing is double that (0.09 USD per 1K tokens). So only about 200 larger interactions to reach 20 USD in cost.
Or am I wrong?
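For what it's worth, the back-of-envelope arithmetic checks out (integer math in shell, so it's approximate; the 0.09 USD/1K figure is the one quoted above):

```shell
# 20 USD budget / 0.09 USD per 1K tokens ≈ number of ~1K-token interactions
echo $(( 2000 / 9 ))   # prints 222
```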
https://github.com/tluyben/llamafile-docker
to save even more keystrokes.
Looks like cargo cult to me.
> User: What is the third planet from the sun?
> Llama: The third planet from the sun is called Mars.
Ask it whether there is life on Mars in that parallel reality.
EDIT: never mind, other commenters here answered my question: yes, it does work. I have an M2 Pro with 32 GB of on-chip memory - amazingly effective for experimenting with open LLMs.
$ chmod +x llamafile-server-0.1-llava-v1.5-7b-q4
$ ./llamafile-server-0.1-llava-v1.5-7b-q4
run-detectors: unable to find an interpreter for ./llamafile-server-0.1-llava-v1.5-7b-q4
Hmm. Did I do something wrong? (Ubuntu 22.04)

Installing the portable binfmt_misc interpreter gets me further, but still:
$ ./llamafile-server-0.1-llava-v1.5-7b-q4
zsh: permission denied: ./llamafile-server-0.1-llava-v1.5-7b-q4
$ sh -c ./llamafile-server-0.1-llava-v1.5-7b-q4
sh: 1: ./llamafile-server-0.1-llava-v1.5-7b-q4: Permission denied

sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
You can solve the zsh permission-denied issue by either (1) upgrading to zsh 5.9+ (I upstreamed a fix for this bug in zsh two years ago) or (2) using the sh -c workaround you discovered. If that one doesn't work, then it likely needs chmod +x. If the execute bit is set and your sh still isn't working, then please let me know, because I'm not aware of any sh that still doesn't support APE. See the Gotchas section of the README: https://github.com/mozilla-Ocho/llamafile#gotchas
This seems like a great approach to compare multiple models, in particular.
[1]

```
$ xcode-select --install
xcode-select: error: command line tools are already installed, use "Software Update" in System Settings to install updates
```
$ ./llamafile-server-0.1-llava-v1.5-7b-q4
[2] 25224 illegal hardware instruction  ./llamafile-server-0.1-llava-v1.5-7b-q4

All the new AI toys seem to love beefy, newish hardware, and especially GPU hardware if available.
EDIT: oh, wait. Actually people usually have a handful to a few dozen of these models lying around. When they update their backend, you just redownload every single model again.
EDIT 2: right, you can release a program that automatically patches and updates the downloaded model+executables. Such an invention.
unzip -l llamafile-server-0.1-llava-v1.5-7b-q4 | grep llava-v1
Archive: llamafile-server-0.1-llava-v1.5-7b-q4
4081004224 11-15-2023 22:13 llava-v1.5-7b-Q4_K.gguf
  177415936  11-15-2023 22:13   llava-v1.5-7b-mmproj-Q4_0.gguf

> llamafile-server-0.1 -m llama-2-13b.Q8_0.gguf
You know, most people don't have 24+GB GPUs sitting around to train these models. So in my book this is a huge step forward. Personally, this is the first time i am able to run an LLM on my computer, and it's purely thanks to this.
But yes, unless there is a way to patch it, bundling the model with the executable like this is going to be more wasteful.
I make a small macOS app [1] which runs llama.cpp with a SwiftUI front-end. For the first version of the app I was obsessed with the single download -> chat flow and making 0 network connections. I bundled a model with the app and you could just download, open, and start using it. Easy! But as soon as I wanted to release a UI update to my TestFlight beta testers, I was causing them to download another 3GB. All 3 users complained :). My first change after that was decoupling the default model download and the UI so that I can ship app updates that are about 5MB. It feels like someone using this tool is going to hit the same problem pretty quick when they want to get the latest llama.cpp updates (ggerganov SHIIIIPS [2]). Maybe there are cases where that doesn't matter, would love to hear where people think this could be useful.
At least Makeayo asks where my models are now. It's obnoxious that I have to use symlinks for comfy/automatic....
All they need to do is ask me where my stuff is on first run, and an area in the config to update that setting. Not so hard!
./llamafile-server-0.1 -m llama-2-13b.Q8_0.gguf
See here: https://simonwillison.net/2023/Nov/29/llamafile/#llamafile-t...

My experience is that the world of Python dependency management is a mess which sometimes works, and sometimes forces you to spend hours-to-days searching for obscure error messages and trying maybe-fixes posted in GitHub issues for some other package, just in case it helps. This sometimes extends further - e.g. with hours-to-days spent trying to install just-the-right-version-of-CUDA on Linux...
Anyway, the (somewhat annoying but understandable) solution that some developers take is to make their utility/app/whatever as self-contained as possible with a fresh install of everything from Python downwards inside a venv - which results in (for example) multiple copies of PyTorch spread around your HDD. This is great for less technical users who just need a minimal-difficulty install (as IME it works maybe 80-90% of the time), good for people who don't want to spend their time debugging incompatibilities between different library versions, but frustrating for the more technically-inclined user.
This is just another approach to the same problem, which presumably also presents an even-lower level of work for the maintainers, since it avoids Python installs and packages altogether?
I think you’re imposing your own very specific (and solvable!) gripe onto an unrelated project and it makes for pretty clumsy commentary.
Well, that's on the MAS/TestFlight for not doing delta updates.
You could support multiple updates by using, say, ISO timestamps and doing a sort (so that ~/.freechat.run.20231127120000 would be overridden by ~/.freechat.run.20231129160000 without making the user delete anything).
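A minimal sketch of that idea (the file names are hypothetical): because fixed-width ISO-style timestamps sort lexicographically the same as chronologically, a plain `sort` picks the newest file.

```shell
# Create two hypothetical timestamped run files and select the newest one.
dir=$(mktemp -d)
touch "$dir/freechat.run.20231127120000" "$dir/freechat.run.20231129160000"
# lexicographic sort == chronological sort for fixed-width ISO timestamps
latest=$(ls "$dir" | sort | tail -n 1)
echo "$latest"   # freechat.run.20231129160000
rm -rf "$dir"
```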
._.
Pain.
> I'm kind of struggling with the use-cases for this one.
IMO cosmopolitan libc is a "really neat trick". And it deserves praise and it probably does have some real use cases. But it's not practical for most purposes. If we had a format like ELF that was so fat as to support as many architectures and OSs as desired, would we be using that? I have a feeling that we would not.
Then again -- after having used "zig cc" for a while, maybe it would be reasonable to have something like "one build" that produces a mega-fat binary.
And the microarch-specific dispatch is a nice touch.
...maybe I'm convincing myself of the alternative....
And of course, I can imagine where the whole cosmopolitan thing comes from... even as a manifesto of sorts for the idea of systems neutrality and potentially gender fluidity. But I really wonder whether GGUF actually needs this, since llama.cpp already compiles and runs pretty much everywhere.
Why introduce one more container? Who benefits from binary distribution of this sort?
Mass adoption comes from ease-of-use, which includes ease-of-deployment.
Most of the HN crowd is savvy enough to run HF models directly in their terminal, but there's a reason Ollama, LM Studio, and Faraday are so popular.
Until LLM runners are part of the base OS, methods like this are extremely helpful to reduce friction.
About your question in the third paragraph: this is totally orthogonal to GGUF, and a cursory reading of the README shows that it does use GGUF. This is not about a new universal LLM format; this is about packing it in a universal executable that runs everywhere, using Cosmopolitan.
Some examples do pack the executable and GGUF weights together in a single file, but that's not dissimilar from a self-extracting zip; the only difference is that this executable is not OS-specific, so you can use the same exact binary on macOS or Linux, for example.
Well, it simplifies things when you don't need to compile things.
Also, you literally can't download or compile the wrong binary by mistake - it's the same binary for the entire Cartesian product of supported processors and OSes.
> Why introduce one more container?
It makes stuff more convenient.
`application/zip` is also a ubiquitous standard. I doubt anyone is being "introduced to it".
I also appreciate the fact that tooling for handling `application/zip` is very widespread, so you don't need totally bespoke tooling to retrieve the models from inside a `llamafile`.
> Who benefits from binary distribution of this sort?
Anyone that doesn't have a compiler SDK on their computer.
[x] Doubt.
That user was caught stealing code and banned from llama.cpp by its creator (your [2] citation) https://news.ycombinator.com/item?id=35411909
Maybe the same thing is happening here. Plagiarism of code.
They made a PR that was built on top of another PR. The authorship information was preserved in the git history, and there was no attempt at deception. They also supposedly collaborated with the author of the original PR (which was never denied by either of them). All of this is totally normal working practice.
Those allegations of "stealing" just stem from a GH user piling onto the drama from the breaking change by pointing out where the initials from the new file format come from (which wasn't called into question on the original PR).
They were also not banned for those stealing allegations. They, as well as the author of the reversal PR, were banned because the maintainer deemed the resulting "drama" from the breaking changes a distraction from the project's goals. The maintainer accepted the PR, and the nature of the breaking changes was clearly stated, so that drama wasn't entirely on jart.
Got a chuckle out of this, though:

---
User: Describe the image in the most detailed way possible
Llama: The image is a close-up of an empty, dark room with no visible objects or people. It appears to be a black background photo that highlights the darkness and emptiness of the space. This could potentially represent an artistic representation or an actual setting where lighting has been used to emphasize the contrast between light and shadow. The focus on this particular area suggests that it might have some significance, whether symbolic or practical in nature. Overall, the image evokes a sense of mystery and intrigue due to its lack of context and content.
User: is it just a black square?
Llama: Yes, it is just a black square.
> User: Describe this image in the most detailed way possible
> Llama: The image displays a large, mostly white background with some small gray areas. There is also an orange dot on the right side of the scene, which stands out against the predominantly white and gray backdrop.
> User: Is it just a white square?
> Llama: No, it is not just a white square. The image also contains some gray and tan colors as well. It appears to be an abstract art piece with various shades of color, making the overall appearance more interesting than a simple all-white background.
User:
Llama: The image features a serene landscape with mountains in the background and a river flowing through it. A picturesque scene unfolds as two trees are visible near the water, adding to the natural beauty of the environment. In addition to these elements, there is an interesting detail: several cows can be seen standing on or around the grassy area next to the body of water. The presence of these animals creates a sense of tranquility and harmony with nature in this idyllic setting.
User: There are no cows. Could it be that you've mistaken rocks for cows?
Llama: No, the rocks in this image are not mistaken for cows. They can be seen near a lake with mountains in the background and trees surrounding it. The painting captures a serene landscape where nature is beautifully displayed. This picturesque scene showcases both the majestic mountain range and the tranquil body of water, creating an idyllic atmosphere that evokes feelings of peacefulness and relaxation.
User: There are no cows!
Llama: The image depicts a serene scene of mountains and water. A beautiful lake is surrounded by majestic snow-capped mountains, creating an idyllic landscape. In the foreground, there are trees adding to the natural beauty of the environment.
A few birds can be seen flying in the sky above the lake, further enhancing the peaceful atmosphere. The presence of a boat on the water adds another element of leisure and enjoyment to this picturesque scene. Overall, it is an inviting image that showcases nature's wonders with its combination of mountains, lake, trees, birds, and boats.
(via https://news.ycombinator.com/item?id=38463456 and https://news.ycombinator.com/item?id=38464759, but we merged the comments hither)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# M2 Max - 16 GB RAM
wget -P ./models https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b-16k.Q8_0.gguf
./server -m models/openhermes-2.5-mistral-7b-16k.Q8_0.gguf -c 16000 -ngl 32
# M1 - 8 GB RAM
wget -P ./models https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
./server -m models/openhermes-2.5-mistral-7b.Q4_K_M.gguf -c 2000 -ngl 32

I see it unfortunately requires the CUDA developer toolkit to be installed. It's totally possible to distribute CUDA apps that run without any dependencies installed other than the Nvidia driver. If they could figure that out it would be a game changer.
<3
The 4GB .exe ran fine on my Windows 10 64-bit system.
Thank you!
thx to Mozilla and Justine Tunney for this very easy, local experiment today!
I assume the answers are "because Justine can" and "sometimes it's easier to distribute a single file than two".
It is similar in my mind to the early days of Linux, where you had to compile it yourself and tweaked some compiler flags, compared to now, where most people don’t even think about the fact that their phone or Steam deck runs it.
That said, it does reduce the friction of getting an LLM up and running and the self-contained nature makes it sort of a dedicated program equivalent to Awk. It might open up possibilities, like the AI version of the Unix philosophy - do one thing well. A hyper optimized LLM could be used in pipeline of commands, for example.
But for anyone in a production/business setting, it would be tough to see this being viable. It seems like it would be a non-starter for most medium-to-large companies' IT teams. The great thing about a Dockerfile is that it can be inspected, and the install process is relatively easy to understand.
This is a bit like specifying how large your strings will be to a C program. That was maybe accepted in the old days, but not anymore really.
From the user's point of view: they just want to run the thing, and as quickly as possible. If multiple programs want to use the GPU, then the OS and/or the driver should figure it out.
Assuming increasing vram leads to an appreciable improvement in model speed, it should default to using all but 10% of the vram of the largest GPU, or all but 1GB, whichever is less.
If I've got 8GB of vram, the software should figure out the right number of layers to offload and a sensible context size, to not exceed 7GB of vram.
(Although I realise the authors are just doing what llama.cpp does, so they didn't design it the way it is)
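The suggested default above can be sketched in a few lines (a hypothetical heuristic, not what llama.cpp actually does; the 8GB figure is the example from the comment):

```shell
# Reserve 10% of VRAM or 1 GB, whichever leaves less in use,
# and treat the remainder as the budget for offloaded layers + context.
vram_mb=8192                                     # total VRAM of the largest GPU
reserve_mb=$(( vram_mb / 10 ))                   # 10% headroom...
[ "$reserve_mb" -lt 1024 ] && reserve_mb=1024    # ...but at least 1 GB
usable_mb=$(( vram_mb - reserve_mb ))
echo "$usable_mb"                                # 7168, i.e. "stay under 7GB" on an 8GB card
```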
MLC LLM?
I think the binary it compiles down to (probably the Vulkan and Metal ones for y'all) is separate from the weights, so you could ship a bunch in one file.
Seems like a dealbreaker for the whole idea.
Llamafile: "The image features a tall, stone-like structure with many levels and carved designs on it. It is situated in front of an Asian temple building that has several windows. In the vicinity, there are two cars parked nearby – one closer to the left side of the scene and another further back towards the right edge. . ."
ChatGPT4:"The photo depicts a traditional Korean stone pagoda, exhibiting a tiered tower with multiple levels, each diminishing in size as they ascend. It is an example of East Asian pagodas, which are commonly found within the precincts of Buddhist temples. . . The building is painted in vibrant colors, typical of Korean temples, with green being prominent."
No comparison, alas.
GPT4-Vision is running a MUCH larger model than the tiny 7B 4GB LLaVA file in this example.
LLaVA have a 13B model available which might do better, though there's no chance it will be anywhere near as good as GPT-4 Vision. https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZO...
Exllama + GPTQ is the way to go
llama.cpp && GGUF are great on CPUs
More data: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacp...
I wonder if someone has already tried https://github.com/jzhang38/TinyLlama, could save me some time :)
For those of us who swim in the Microsoft ecosystem and do not compile Linux apps from source, what Linux distro would run this without fixing a huge number of dependencies?
It seems like someone would have included Llama.cpp in their distro, ready-to-run.
Yes, I'm an idiot.
$ ./llava-v1.5-7b-q4-server.llamafile --help
./llava-v1.5-7b-q4-server.llamafile: line 60: /bin/mkdir: No such file or directory
Regardless, this (and Cosmopolitan) are amazing work - thank you!

Assuming you mean installable with a package manager, not preinstalled on a distro, that requires that some maintainer decide it's worthwhile to add and maintain it. Distros are pretty selective about what they add to their repos. There's probably a tool for building .deb or .rpm packages of llama.cpp, and probably a repository for it, but as far as I know no distro has llama.cpp in its repos.
Arch Linux's AUR system is much more open, though, and it indeed has llama-cpp (4 versions of it!), although it requires a helper such as yay if you want to install it and keep it up-to-date as if it were a normal package. So Arch has it installable with a package manager, if you count using yay to supplement pacman.
And for this, you need to have additional tools installed?
There are some gotchas to watch out for though: https://github.com/mozilla-Ocho/llamafile#gotchas
The model seems worse than the original ChatGPT at coding. However the model is quite small. It certainly could be a NPC in some game.
I guess I need to buy a new computer soon, to be able to run these in their big variants.
yes, there are other APIs (OpenGL, OpenCL) to access the GPU, but they're all deprecated.
technically, yes, this is using Metal.
This is great thank you, very user friendly (exhibit a: me)
Edit: No, actually a lot more than that, but not a bad tagline.
- Linux 2.6.18+ (arm64 or amd64) i.e. any distro RHEL5 or newer
- MacOS 15.6+ (arm64 or amd64, gpu only supported on arm64)
- Windows 8+ (amd64)
- FreeBSD 13+ (amd64, gpu should work in theory)
- NetBSD 9.2+ (amd64, gpu should work in theory)
- OpenBSD 7+ (amd64, no gpu support)
- AMD64 microprocessors must have SSSE3. Otherwise llamafile will print an error and refuse to run. This means, if you have an Intel CPU, it needs to be Intel Core or newer (circa 2006+), and if you have an AMD CPU, then it needs to be Bulldozer or newer (circa 2011+). If you have a newer CPU with AVX or better yet AVX2, then llamafile will utilize your chipset features to go faster. No support for AVX512+ runtime dispatching yet.
- ARM64 microprocessors must have ARMv8a+. This means everything from Apple Silicon to 64-bit Raspberry Pis will work, provided your weights fit into memory.
I've also tested GPU works on Google Cloud Platform and Nvidia Jetson, which has a somewhat different environment. Apple Metal is obviously supported too, and is basically a sure thing so long as xcode is installed.
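On Linux you can check which of the flags mentioned above your CPU advertises with a one-liner (GNU grep assumed; on other OSes the equivalent information lives elsewhere):

```shell
# Print which of the relevant instruction-set flags this CPU reports:
grep -o -w -e ssse3 -e avx -e avx2 /proc/cpuinfo | sort -u
```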
Of course, more interesting things happen when you get to the 32b and 70b param models, which will require high-end GPUs like 3090s.
In practice, you kinda need a GPU, even a small one. Otherwise prompt processing is really slow.
Please explain. This feels like a worthwhile effort to push LLMs towards mass-adoption.