1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file from https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/...:
wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4
2. Make that binary executable by running this in a terminal: chmod 755 llamafile-server-0.1-llava-v1.5-7b-q4
3. Run your new executable, which will start a web server on port 8080: ./llamafile-server-0.1-llava-v1.5-7b-q4
4. Navigate to http://127.0.0.1:8080/ to upload an image and start chatting with the model about it in your browser. Screenshot here: https://simonwillison.net/2023/Nov/29/llamafile/
In case anyone here hasn't used chatgpt-web, I recommend trying it out. With the new GPT-4 models you can chat for way cheaper than paying for ChatGPT Plus, and you can also switch back to the older (non-nerfed) GPT-4 models that can still actually code.
For GPT-4 the pricing is double that (0.09 USD per 1K tokens). So only about 200 larger interactions to reach 20 USD in cost.
Or am I wrong?
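For what it's worth, the back-of-envelope arithmetic checks out (integer math in shell, so it's approximate; the 0.09 USD/1K figure is the one quoted above):

```shell
# 20 USD budget / 0.09 USD per 1K tokens ≈ number of ~1K-token interactions
echo $(( 2000 / 9 ))   # prints 222
```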
https://github.com/tluyben/llamafile-docker
to save even more keystrokes.
Looks like cargo cult to me.
> User: What is the third planet from the sun?
> Llama: The third planet from the sun is called Mars.
Ask it whether there is life on Mars in that parallel reality.
EDIT: never mind, other commenters here answered my question: yes, it does work. I have an M2 Pro with 32 GB of on-chip memory - amazingly effective for experimenting with open LLMs.
$ chmod +x llamafile-server-0.1-llava-v1.5-7b-q4
$ ./llamafile-server-0.1-llava-v1.5-7b-q4
run-detectors: unable to find an interpreter for ./llamafile-server-0.1-llava-v1.5-7b-q4
Hmm. Did I do something wrong? (Ubuntu 22.04)

Installing the portable binfmt_misc interpreter gets me further, but still:
$ ./llamafile-server-0.1-llava-v1.5-7b-q4
zsh: permission denied: ./llamafile-server-0.1-llava-v1.5-7b-q4
$ sh -c ./llamafile-server-0.1-llava-v1.5-7b-q4
sh: 1: ./llamafile-server-0.1-llava-v1.5-7b-q4: Permission denied

sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
You can solve the zsh permission-denied issue by either (1) upgrading to zsh 5.9+ (I upstreamed a fix for this bug in zsh two years ago) or (2) using the sh -c workaround you discovered. If that one doesn't work, then it likely needs chmod +x. If the execute bit is set and your sh still isn't working, then please let me know, because I'm not aware of any sh that still doesn't support APE. See the Gotchas section of the README: https://github.com/mozilla-Ocho/llamafile#gotchas
This seems like a great approach to compare multiple models, in particular.
[1]

```
$ xcode-select --install
xcode-select: error: command line tools are already installed, use "Software Update" in System Settings to install updates
```
$ ./llamafile-server-0.1-llava-v1.5-7b-q4
[2] 25224 illegal hardware instruction  ./llamafile-server-0.1-llava-v1.5-7b-q4

All the new AI toys seem to love beefy, newish hardware, and especially GPU hardware if available.
EDIT: oh, wait. Actually people usually have a handful to a few dozen of these models lying around. When they update their backend, you just redownload every single model again.
EDIT 2: right, you can release a program that automatically patches and updates the downloaded model+executables. Such an invention.
unzip -l llamafile-server-0.1-llava-v1.5-7b-q4 | grep llava-v1
Archive: llamafile-server-0.1-llava-v1.5-7b-q4
4081004224 11-15-2023 22:13 llava-v1.5-7b-Q4_K.gguf
  177415936  11-15-2023 22:13   llava-v1.5-7b-mmproj-Q4_0.gguf

> llamafile-server-0.1 -m llama-2-13b.Q8_0.gguf
You know, most people don't have 24+GB GPUs sitting around to train these models. So in my book this is a huge step forward. Personally, this is the first time i am able to run an LLM on my computer, and it's purely thanks to this.
But yes, unless there is a way to patch it, bundling the model with the executable like this is going to be more wasteful.
I make a small macOS app [1] which runs llama.cpp with a SwiftUI front-end. For the first version of the app I was obsessed with the single download -> chat flow and making 0 network connections. I bundled a model with the app and you could just download, open, and start using it. Easy! But as soon as I wanted to release a UI update to my TestFlight beta testers, I was causing them to download another 3GB. All 3 users complained :). My first change after that was decoupling the default model download and the UI so that I can ship app updates that are about 5MB. It feels like someone using this tool is going to hit the same problem pretty quick when they want to get the latest llama.cpp updates (ggerganov SHIIIIPS [2]). Maybe there are cases where that doesn't matter, would love to hear where people think this could be useful.
At least Makeayo asks where my models are now. It's obnoxious that I have to use symlinks for comfy/automatic....
All they need to do is ask me where my stuff is on first run, and an area in the config to update that setting. Not so hard!
./llamafile-server-0.1 -m llama-2-13b.Q8_0.gguf
See here: https://simonwillison.net/2023/Nov/29/llamafile/#llamafile-t...

My experience is that the world of Python dependency management is a mess which sometimes works, and sometimes forces you to spend hours-to-days searching for obscure error messages and trying maybe-fixes posted in GitHub issues for some other package, just in case it helps. This sometimes extends further - e.g. with hours-to-days spent trying to install just-the-right-version-of-CUDA on Linux...
Anyway, the (somewhat annoying but understandable) solution that some developers take is to make their utility/app/whatever as self-contained as possible with a fresh install of everything from Python downwards inside a venv - which results in (for example) multiple copies of PyTorch spread around your HDD. This is great for less technical users who just need a minimal-difficulty install (as IME it works maybe 80-90% of the time), good for people who don't want to spend their time debugging incompatibilities between different library versions, but frustrating for the more technically-inclined user.
This is just another approach to the same problem, which presumably also presents an even-lower level of work for the maintainers, since it avoids Python installs and packages altogether?
I think you’re imposing your own very specific (and solvable!) gripe onto an unrelated project and it makes for pretty clumsy commentary.
Well, that's on the MAS/TestFlight for not doing delta updates.
You could support multiple updates by using, say, ISO timestamps and doing a sort (so that ~/.freechat.run.20231127120000 would be overridden by ~/.freechat.run.20231129160000 without making the user delete anything).
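A minimal sketch of that idea (the file names are hypothetical): because fixed-width ISO-style timestamps sort lexicographically the same as chronologically, a plain `sort` picks the newest file.

```shell
# Create two hypothetical timestamped run files and select the newest one.
dir=$(mktemp -d)
touch "$dir/freechat.run.20231127120000" "$dir/freechat.run.20231129160000"
# lexicographic sort == chronological sort for fixed-width ISO timestamps
latest=$(ls "$dir" | sort | tail -n 1)
echo "$latest"   # freechat.run.20231129160000
rm -rf "$dir"
```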
._.
Pain.
> I'm kind of struggling with the use-cases for this one.
IMO cosmopolitan libc is a "really neat trick". And it deserves praise and it probably does have some real use cases. But it's not practical for most purposes. If we had a format like ELF that was so fat as to support as many architectures and OSs as desired, would we be using that? I have a feeling that we would not.
Then again -- after having used "zig cc" for a while, maybe it would be reasonable to have something like "one build" that produces a mega-fat binary.
And the microarch-specific dispatch is a nice touch.
...maybe I'm convincing myself of the alternative....
And of course, I can imagine where the whole cosmopolitan thing comes from... even as a manifesto of sorts for the idea of systems neutrality and potentially gender fluidity. But I really wonder whether GGUF actually needs this, since llama.cpp already compiles and runs pretty much everywhere.
Why introduce one more container? Who benefits from binary distribution of this sort?
Mass adoption comes from ease-of-use, which includes ease-of-deployment.
Most of the HN crowd is savvy enough to run HF models directly in their terminal, but there's a reason Ollama, LM Studio, and Faraday are so popular.
Until LLM runners are part of the base OS, methods like this are extremely helpful to reduce friction.
About your question in the third paragraph: this is totally orthogonal to GGUF, and a cursory reading of the README shows that it does use GGUF. This is not about a new universal LLM format; this is about packing it in a universal executable that runs everywhere, using Cosmopolitan.
Some examples do pack the executable and GGUF weights together in a single file, but that's not dissimilar from a self-extracting zip; the only difference is that this executable is not OS-specific, so you can use the same exact binary on macOS or Linux, for example.
Well, it simplifies things when you don't need to compile things.
Also, you literally can't download or compile the wrong binary by mistake - it's the same binary for the entire Cartesian product of supported processors and OSes.
> Why introduce one more container?
It makes stuff more convenient.
`application/zip` is also a ubiquitous standard. I doubt anyone is being "introduced to it".
I also appreciate the fact that tooling for handling `application/zip` is very widespread, so you don't need totally bespoke tooling to retrieve the models from inside a `llamafile`.
> Who benefits from binary distribution of this sort?
Anyone that doesn't have a compiler SDK on their computer.
[x] Doubt.
That user was caught stealing code and banned from llama.cpp by its creator (your [2] citation) https://news.ycombinator.com/item?id=35411909
Maybe the same thing is happening here. Plagiarism of code.
They made a PR that was built on top of another PR. The authorship information was preserved in the git history, and there was no attempt at deception. They also supposedly collaborated with the author of the original PR (which was never denied by either of them). All of this is totally normal working practice.
Those allegations of "stealing" just stem from a GH user piling onto the drama from the breaking change by pointing out where the initials from the new file format come from (which wasn't called into question on the original PR).
They were also not banned for those stealing allegations. They, as well as the author of the reversal PR, were banned because the maintainer deemed the resulting "drama" from the breaking changes a distraction from the project's goals. The maintainer accepted the PR, and the nature of the breaking changes was clearly stated, so that drama wasn't entirely on jart.
Got a chuckle out of this, though:

---
User: Describe the image in the most detailed way possible
Llama: The image is a close-up of an empty, dark room with no visible objects or people. It appears to be a black background photo that highlights the darkness and emptiness of the space. This could potentially represent an artistic representation or an actual setting where lighting has been used to emphasize the contrast between light and shadow. The focus on this particular area suggests that it might have some significance, whether symbolic or practical in nature. Overall, the image evokes a sense of mystery and intrigue due to its lack of context and content.
User: is it just a black square?
Llama: Yes, it is just a black square.
> User: Describe this image in the most detailed way possible
> Llama: The image displays a large, mostly white background with some small gray areas. There is also an orange dot on the right side of the scene, which stands out against the predominantly white and gray backdrop.
> User: Is it just a white square?
> Llama: No, it is not just a white square. The image also contains some gray and tan colors as well. It appears to be an abstract art piece with various shades of color, making the overall appearance more interesting than a simple all-white background.
User:
Llama: The image features a serene landscape with mountains in the background and a river flowing through it. A picturesque scene unfolds as two trees are visible near the water, adding to the natural beauty of the environment. In addition to these elements, there is an interesting detail: several cows can be seen standing on or around the grassy area next to the body of water. The presence of these animals creates a sense of tranquility and harmony with nature in this idyllic setting.
User: There are no cows. Could it be that you've mistaken rocks for cows?
Llama: No, the rocks in this image are not mistaken for cows. They can be seen near a lake with mountains in the background and trees surrounding it. The painting captures a serene landscape where nature is beautifully displayed. This picturesque scene showcases both the majestic mountain range and the tranquil body of water, creating an idyllic atmosphere that evokes feelings of peacefulness and relaxation.
User: There are no cows!
Llama: The image depicts a serene scene of mountains and water. A beautiful lake is surrounded by majestic snow-capped mountains, creating an idyllic landscape. In the foreground, there are trees adding to the natural beauty of the environment.
A few birds can be seen flying in the sky above the lake, further enhancing the peaceful atmosphere. The presence of a boat on the water adds another element of leisure and enjoyment to this picturesque scene. Overall, it is an inviting image that showcases nature's wonders with its combination of mountains, lake, trees, birds, and boats.
(via https://news.ycombinator.com/item?id=38463456 and https://news.ycombinator.com/item?id=38464759, but we merged the comments hither)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# M2 Max - 16 GB RAM
wget -P ./models https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b-16k.Q8_0.gguf
./server -m models/openhermes-2.5-mistral-7b-16k.Q8_0.gguf -c 16000 -ngl 32
# M1 - 8 GB RAM
wget -P ./models https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
./server -m models/openhermes-2.5-mistral-7b.Q4_K_M.gguf -c 2000 -ngl 32

I see it unfortunately requires the CUDA developer toolkit to be installed. It's totally possible to distribute CUDA apps that run without any dependencies installed other than the Nvidia driver. If they could figure that out it would be a game changer.
<3
The 4GB .exe ran fine on my Windows 10 64-bit system.
Thank you!
thx to Mozilla and Justine Tunney for this very easy, local experiment today!
I assume the answers are "because Justine can" and "sometimes it's easier to distribute a single file than two".
It is similar in my mind to the early days of Linux, where you had to compile it yourself and tweaked some compiler flags, compared to now, where most people don’t even think about the fact that their phone or Steam deck runs it.
That said, it does reduce the friction of getting an LLM up and running and the self-contained nature makes it sort of a dedicated program equivalent to Awk. It might open up possibilities, like the AI version of the Unix philosophy - do one thing well. A hyper optimized LLM could be used in pipeline of commands, for example.
But for anyone in a production/business setting, it would be tough to see this being viable. It seems like it would be a non-starter for most medium-to-large companies' IT teams. The great thing about a Dockerfile is that it can be inspected, and the install process is relatively easy to understand.
This is a bit like specifying how large your strings will be to a C program. That was maybe accepted in the old days, but not anymore really.
From the user's point of view: they just want to run the thing, and as quickly as possible. If multiple programs want to use the GPU, then the OS and/or the driver should figure it out.
Assuming increasing vram leads to an appreciable improvement in model speed, it should default to using all but 10% of the vram of the largest GPU, or all but 1GB, whichever is less.
If I've got 8GB of vram, the software should figure out the right number of layers to offload and a sensible context size, to not exceed 7GB of vram.
(Although I realise the authors are just doing what llama.cpp does, so they didn't design it the way it is)
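The suggested default above can be sketched in a few lines (a hypothetical heuristic, not what llama.cpp actually does; the 8GB figure is the example from the comment):

```shell
# Reserve 10% of VRAM or 1 GB, whichever leaves less in use,
# and treat the remainder as the budget for offloaded layers + context.
vram_mb=8192                                     # total VRAM of the largest GPU
reserve_mb=$(( vram_mb / 10 ))                   # 10% headroom...
[ "$reserve_mb" -lt 1024 ] && reserve_mb=1024    # ...but at least 1 GB
usable_mb=$(( vram_mb - reserve_mb ))
echo "$usable_mb"                                # 7168, i.e. "stay under 7GB" on an 8GB card
```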
MLC LLM?
I think the binary it compiles down to (probably the Vulkan and Metal ones for y'all) is separate from the weights, so you could ship a bunch in one file.
Seems like a dealbreaker for the whole idea.
Llamafile: "The image features a tall, stone-like structure with many levels and carved designs on it. It is situated in front of an Asian temple building that has several windows. In the vicinity, there are two cars parked nearby – one closer to the left side of the scene and another further back towards the right edge. . ."
ChatGPT4:"The photo depicts a traditional Korean stone pagoda, exhibiting a tiered tower with multiple levels, each diminishing in size as they ascend. It is an example of East Asian pagodas, which are commonly found within the precincts of Buddhist temples. . . The building is painted in vibrant colors, typical of Korean temples, with green being prominent."
No comparison, alas.
GPT4-Vision is running a MUCH larger model than the tiny 7B 4GB LLaVA file in this example.
LLaVA have a 13B model available which might do better, though there's no chance it will be anywhere near as good as GPT-4 Vision. https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZO...
Exllama + GPTQ is the way to go
llama.cpp && GGUF are great on CPUs
More data: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacp...
I wonder if someone has already tried https://github.com/jzhang38/TinyLlama, could save me some time :)
For those of us who swim in the Microsoft ecosystem and do not compile Linux apps from source, what Linux distro would run this without fixing a huge number of dependencies?
It seems like someone would have included Llama.cpp in their distro, ready-to-run.
Yes, I'm an idiot.
$ ./llava-v1.5-7b-q4-server.llamafile --help
./llava-v1.5-7b-q4-server.llamafile: line 60: /bin/mkdir: No such file or directory
Regardless, this (and Cosmopolitan) are amazing work - thank you!

Assuming you mean installable with a package manager, not preinstalled on a distro, that requires that some maintainer decide it's worthwhile to add and maintain it. Distros are pretty selective about what they add to their repos. There's probably a tool for building .deb or .rpm packages of llama.cpp, and probably a repository for it, but as far as I know no distro has llama.cpp in its repos.
Arch Linux's AUR system is much more open, though, and it indeed has llama-cpp (4 versions of it!), although it requires a helper such as yay if you want to install it and keep it up-to-date as if it were a normal package. So Arch has it installable with a package manager, if you count using yay to supplement pacman.
And for this, you need to have additional tools installed?
There are some gotchas to watch out for though: https://github.com/mozilla-Ocho/llamafile#gotchas
The model seems worse than the original ChatGPT at coding. However the model is quite small. It certainly could be a NPC in some game.
I guess I need to buy a new computer soon, to be able to run these in their big variants.
yes, there are other APIs (OpenGL, OpenCL) to access the GPU, but they're all deprecated.
technically, yes, this is using Metal.
This is great thank you, very user friendly (exhibit a: me)
Edit: No, actually a lot more than that, but not a bad tagline.
- Linux 2.6.18+ (arm64 or amd64) i.e. any distro RHEL5 or newer
- MacOS 15.6+ (arm64 or amd64, gpu only supported on arm64)
- Windows 8+ (amd64)
- FreeBSD 13+ (amd64, gpu should work in theory)
- NetBSD 9.2+ (amd64, gpu should work in theory)
- OpenBSD 7+ (amd64, no gpu support)
- AMD64 microprocessors must have SSSE3. Otherwise llamafile will print an error and refuse to run. This means, if you have an Intel CPU, it needs to be Intel Core or newer (circa 2006+), and if you have an AMD CPU, then it needs to be Bulldozer or newer (circa 2011+). If you have a newer CPU with AVX or better yet AVX2, then llamafile will utilize your chipset features to go faster. No support for AVX512+ runtime dispatching yet.
- ARM64 microprocessors must have ARMv8a+. This means everything from Apple Silicon to 64-bit Raspberry Pis will work, provided your weights fit into memory.
I've also tested GPU works on Google Cloud Platform and Nvidia Jetson, which has a somewhat different environment. Apple Metal is obviously supported too, and is basically a sure thing so long as xcode is installed.
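On Linux you can check which of the flags mentioned above your CPU advertises with a one-liner (GNU grep assumed; on other OSes the equivalent information lives elsewhere):

```shell
# Print which of the relevant instruction-set flags this CPU reports:
grep -o -w -e ssse3 -e avx -e avx2 /proc/cpuinfo | sort -u
```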
Of course, more interesting things happen when you get to the 32b and 70b param models, which will require high-end GPUs like 3090s.
In practice, you kinda need a GPU, even a small one. Otherwise prompt processing is really slow.
Please explain. This feels like a worthwhile effort to push LLMs towards mass-adoption.