The enthusiasm around it reminds me of the JavaScript framework wars of ten years ago - tons of people innovating and debating approaches, lots of projects popping up, so much energy!
Hmm. If LLMs turned out like JS frameworks, that would mean that in ten years people would be saying:
“Maybe we don’t really need all this expensive ceremony, honestly this could be done with vanilla if/else heuristics…?”
By then, there could be complaints on Hacker News about messaging apps whose autocomplete models take up gigabytes.
I do that with orca-mini-3b in ggml format and it's pretty good at it, at twice the speed. Of all the LLMs I've tried, this one gave me the best results. It just requires a properly written prompt.
I kind of have the same feeling. With all this energy it's really hard to keep up with all the new ideas, implementations, frameworks and services.
Really excited for what this will bring us in the coming years.
Most of them are irrelevant, though. You just need to figure out which.
Not sure about this. atm, the cost of any cloud GPU (spot or not) far exceeds the cost of OpenAI's API. I'd be glad to be proven wrong because I, too, want to run L2 (the 70b model).
Also, buying a GPU, even a 4090, is not feasible for most people. And it's not just the GPU: you'd have to build a PC around it, and there's the hidden maintenance cost of running desktop Linux (to use GPTQ, for instance). It's not surprising that most users prefer someone else (OpenAI) to do it for them.
Sure, you can run something comparable to OpenAI's flagship product at home, but it's moderately expensive and slightly inconvenient so people will still pay for the convenience.
To run a `vllm`-backed Llama 2 7b model[1], start a Debian 11 spot instance with one Nvidia L4 (a g2-standard-8) and 100GB of SSD disk (ignoring the advice to use a CUDA installer image):
sudo apt-get update -y
sudo apt-get install build-essential -y
sudo apt-get install linux-headers-$(uname -r) -y
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run # ~5 minutes, install defaults, type 'accept'/return
sudo apt-get install python3-pip -y
sudo pip install --upgrade huggingface_hub
# skip using token as git credential
huggingface-cli login  # paste token from HF[2] for Meta model access
sudo pip install vllm # ~8 minutes
Then paste this test code for the 7b Llama 2 model into llama.py:
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")
output = llm.generate("The capital of Brazil is called")
print(output[0].outputs[0].text)  # generate() returns a list of RequestOutputs
Spot price for this deployment is ~$225/month. The instance will eventually be terminated by Google, so plan accordingly.

[1] https://vllm.readthedocs.io/en/latest/models/supported_model...
[2] https://huggingface.co/settings/tokens
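For anyone weighing that ~$225/month against a hosted API, here's a quick back-of-envelope sketch. The per-token API price below is an assumption for illustration, not a quote; check current pricing.

```python
# Break-even between a ~$225/month spot GPU and a pay-per-token API.
SPOT_MONTHLY_USD = 225.0
API_USD_PER_1K_TOKENS = 0.002  # assumed blended rate, illustrative only

# Tokens per month you'd need to push through the API to spend $225:
break_even_tokens = SPOT_MONTHLY_USD / API_USD_PER_1K_TOKENS * 1000
print(f"Break-even: {break_even_tokens / 1e6:.1f}M tokens/month")
```

Unless you're pushing on the order of a hundred million tokens a month, the API comes out cheaper, which matches the grandparent's point.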
See table 10 (page 22) of the whitepaper for the numbers: https://ai.meta.com/research/publications/llama-2-open-found...
Are there other downloadable models which can be used in a multilingual environment that people here are aware of?
edit: To make the implied question explicit, I guess it might do well on other similar Germanic languages (say Norwegian) but struggle beyond that? Or?
It is nowhere near usable.
Perhaps the 70B model performs better, but 13B produces translations that are garbage.
And 70B will no doubt be much better.
More projects in this space:
- llama.cpp which is a fast, low level runner (with bindings in several languages)
- llm by Simon Willison which supports different backends and has a really elegant CLI interface
- The MLC.ai and Apache TVM projects
Previous discussion on HN that might be helpful from an article by the great folks at replicate: https://news.ycombinator.com/item?id=36865495
Further, 70B support in llama.cpp is still under development as far as I know.
Ollama on macOS will use both the GPU and the Accelerate framework. It's built with the (amazing) llama.cpp project.
To run the 70B model you can try:
ollama run llama2:70b
Note you'll most likely need a Mac with 64GB of shared memory, and there's still a bit of work to do to make sure 70B works like a charm. You can even extend the context with RoPE!
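The RoPE trick works by rescaling the rotary position angles so that longer sequences map back into the range the model was trained on. A minimal sketch of linear scaling ("positional interpolation"); the function and defaults here are illustrative, not Ollama's actual implementation:

```python
# Minimal sketch of linear RoPE scaling: positions are compressed by
# `scale`, so a model trained on a 4096-token window can address a
# longer one. Names and defaults are illustrative.
def rope_angles(pos, dim=8, base=10000.0, scale=1.0):
    # One rotation angle per pair of embedding dimensions.
    return [(pos / scale) / base ** (2 * i / dim) for i in range(dim // 2)]

# With scale=2.0, position 4096 gets exactly the angles that position
# 2048 would get unscaled, keeping long inputs inside the trained range.
assert rope_angles(4096, scale=2.0) == rope_angles(2048)
```

The trade-off is that nearby positions become harder to tell apart, which is why scaled models are usually fine-tuned a bit at the longer context.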
Sadly 7b is not very good for SQL tasks. I think even with RAG it would struggle.
Here’s the stream - https://www.youtube.com/live/LitybCiLhSc?feature=share
One is with LoRA and the other with QLoRA, and I also do a breakdown of each fine-tuning method. I wanted to make these since I've had issues running LLMs locally myself, and Colab is the cheapest GPU I can find haha.
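Part of why LoRA/QLoRA fit on a Colab GPU is the tiny trainable-parameter count. A rough sketch for a single d x d weight matrix; the sizes are illustrative, not the exact shapes from the videos:

```python
# Full fine-tuning touches all d*d weights of a matrix, while a rank-r
# LoRA trains only the two adapter matrices A (r x d) and B (d x r).
def lora_params(d, r):
    return 2 * d * r  # A and B combined

d, r = 4096, 8            # a typical hidden size and a common LoRA rank
full = d * d              # 16,777,216 weights for full fine-tuning
lora = lora_params(d, r)  # 65,536 weights for the adapters
print(f"LoRA trains {lora / full:.2%} of the weights per matrix")
```

QLoRA then shrinks things further by holding the frozen base weights in 4-bit while training the adapters, which is what makes a single consumer GPU workable.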
Unfortunately, with all of the hype, it seems that unless you have a REALLY beefy machine the better 70B model feels out of reach for most to run locally, leaving the 7B and 13B as the only viable options outside of some quantization trickery. Or am I wrong about that?
I want to focus more on larger context windows. Since RAG seems to have a lot of promise, the 7B with a giant context window looks like a better path to explore than getting the 70B to work locally.
More reading on that problem if you're curious: https://arxiv.org/pdf/2307.03172.pdf
Running on a 3090. The 13b chat model quantized to fp8 is giving about 42 tok/s.
It gives very detailed answers to coding questions and tasks, just like GPT-4 does (though I did not do a proper comparison).
The 13b uses 13GB and gives 27 tokens per second; the 7b uses 0.5GB and I get 39 tokens per second on this machine. Both produce interesting results, even for CUDA code generation, for example.
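A rough rule of thumb for those memory numbers: weight memory is parameter count times bits per weight. A sketch, assuming dense quantized weights (figures are estimates, not measurements):

```python
# Weight-only memory estimate: parameter count times bits per weight.
# Ignores the KV cache and activations, so real usage runs higher.
def weight_gb(params_billion, bits):
    return params_billion * 1e9 * bits / 8 / 1e9

print(f"13B @ 8-bit: ~{weight_gb(13, 8):.0f} GB")  # ~13 GB
print(f"7B @ 4-bit: ~{weight_gb(7, 4):.1f} GB")    # ~3.5 GB
print(f"70B @ 4-bit: ~{weight_gb(70, 4):.0f} GB")  # ~35 GB
```

That's why 70B at 4-bit needs something like a 48GB card or 64GB of unified memory, while 13B at 8-bit just fits on a 24GB 3090.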
For example, I use only 6 cores out of 10 on my M1 Pro laptop.
Is it using mmap and concealing the actual memory usage?
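Quite possibly - llama.cpp mmaps the weights by default, and mapped-but-untouched pages don't count toward resident memory. A small sketch of the effect (the file size and names are illustrative):

```python
import mmap
import os
import tempfile

# Mapping a file reserves address space for the whole file, but a page
# only becomes resident (and shows up in RSS) once it is touched.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"\0" * (16 * 1024 * 1024))  # stand-in for 16 MB of weights

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_byte = mm[0]  # faults in just one page, not the whole 16 MB
    mm.close()
os.remove(path)
```

So tools that report RSS can make a mmap'd model look much smaller than the file it was loaded from until inference has actually touched most of the weights.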
I've tried Inference Endpoints and Replicate, but both would cost more than just using the OpenAI offering.
On Linux, ExLlama and MLC LLM have native ROCm support, and there is a HIPified fork of llama.cpp as well.
...But I am also a bit out of the loop. For instance, I have not kept up with the CFG/negative prompt or grammar implementations in the UIs.