git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build
python3 -m pip install -r requirements.txt
cd models && git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt/ && cd -
python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
./build/bin/quantize models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/ggml-model-f16.bin models/open_llama_7b_preview_200bt_q5_0.ggml q5_0
./build/bin/main -m models/open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 -p "Building a website can be done in 10 simple steps:" --mlock
Though I'm getting this error on an Intel MacBook (Monterey); it works fine on a Windows 11 box:
python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
Loading model file models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
Traceback (most recent call last):
File "/l/llama.cpp/convert-pth-to-ggml.py", line 11, in <module>
convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
File "/l/llama.cpp/convert.py", line 1129, in main
model_plus = load_some_model(args.model)
File "/l/llama.cpp/convert.py", line 1055, in load_some_model
models_plus.append(lazy_load_file(path))
File "/l/llama.cpp/convert.py", line 857, in lazy_load_file
raise ValueError(f"unknown format: {path}")
ValueError: unknown format: models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
…but, although it is true that for a fixed compute budget these small models can produce impressive results with good training data, it is also true that smaller models (7B) appear to have an upper performance bound that larger, well-trained models beat easily.
It’s just way more expensive to train larger models.
They specifically note they are training a smaller 3B model in the future.
So… it seems reasonable to assume that this is a proof of concept, and that no, the Berkeley AI lab will not be fielding the cost for training a larger model.
This is probably more about exploring the “can we make a cheap good-enough model?” than “here is your GPT4 replacement”.
30B is within reach, with compression techniques that seem to lose very little information from the overall network. Many argue that machine learning IS fundamentally a compression technique, but the topology of the trained network turns out to be more important, assuming an appropriate activation function after this transformation.
No… definitely not your GPT4 replacement. However this is the kind of PoC I keep following… every… 18 hours or so? Amazing.
They're kidding right, there's no way that thing will be more useful than one of those flan models.
[1]https://github.com/openlm-research/open_llama#future-plans
I have felt the same in the past, related to a completely different topic. I know how it feels, it's like people aren't calling things what they are, just using weird words.
"weights" - synapses in the AI brain
"tokens" - word fragments
"model" - of course, the model is the AI brain
"context" - the model can only handle a piece of text, can't put whole books in, so this limited window is the context
"GPT" - predicts the next word, trained on everything; if you feed its last predicted word back in, it can write long texts
"LoRA" - a lightweight plug-in model for tweaking the big model
"loss" - a score telling how bad is the output
"training" - change the model until it fits the data
"quantisation" - making a low precision version of the model because it still works, but now is much faster and needs less compute
"embedding" - just a vector, it stands for the meaning of a word token or a piece of image; these embeddings are learned
It's like generating code in a language that you know nothing about. You should check for bugs, but you can't.
I do not use ChatGPT as a search engine. Its ability to confidently hallucinate consistently places it much below a human expert on any topic that I care to understand correctly.
My advice to folks is, if you actually want to know how this stuff works at some basic level, put in some time learning how basic linear and logistic regression work, including how to train it using back propagation. From there you'll have a solid foundation that gives enough context to understand most deep learning concepts at a high level.
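To make that advice concrete, here is a minimal sketch of logistic regression trained by gradient descent, with a made-up four-point dataset. It's the smallest version of the "compute a loss, follow the gradient" loop that deep learning scales up.

```python
# Toy logistic regression trained by gradient descent. The dataset is
# invented: label is 1 when x > 2, else 0.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1)]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    grad_w = grad_b = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        # gradient of cross-entropy loss w.r.t. w and b
        grad_w += (p - y) * x
        grad_b += (p - y)
    w -= lr * grad_w / len(data)
    b -= lr * grad_b / len(data)

print(sigmoid(w * 0.0 + b) < 0.5, sigmoid(w * 4.0 + b) > 0.5)  # True True
```

Backpropagation in a deep network is this same chain-rule gradient computation, applied layer by layer.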
when it can hallucinate content, why do that instead of reading a blog post from an expert?
After going through this series I can say I basically understand weights, tokens, back-propagation, layers, embeddings, etc.
Just curious, didn't see any date...
A model is some architecture of how data will flow through these weight matrices, along with the values of each weight.
Tokens are sort of "words" in a sentence, but the ML may be translating the word itself into a more abstract concept in 'word space': eg, a bunch of floating point values.
At least some of what I just said is probably wrong, but now someone will correct me and we'll both be more right!
> A model is some architecture of how data will flow through these weight matrices, along with the values of each weight.
Because data doesn't really flow through weight matrices, though perhaps this is true if you squint at very simple models. Deep learning architectures are generally more complicated than multiplying values by weights and pushing the results to the next layer, though which architecture to use depends heavily on context.
> Tokens are sort of "words" in a sentence
Tokens are funny. What a token is depends on the context of the model you're using, but generally a token is a portion of a word. (Why? Efficiency is one reason; handling unknown words is another.)
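One way to see why tokens end up as word portions: a hypothetical greedy longest-match subword tokenizer over an invented vocabulary. Real tokenizers (BPE and friends) build their vocabularies from data, but the matching idea is similar.

```python
# Hypothetical greedy longest-match subword tokenizer; the vocab is invented.
vocab = {"un", "believ", "able", "a", "b", "l", "e", "u", "n", "i", "v"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # take the longest vocab entry matching at position i
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: fall back to 1 char
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

This also shows the "handling unknown words" point: any word the vocabulary has never seen still tokenizes, just into smaller pieces.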
To learn more deeply though, get started with getting it to work and when you are curious or something doesn't work, try to understand why and recursively go back to fill in the foundational details.
Example: download the code and try to get it to work. Why is it not working? Oh, it's trying to look for the model. Search for how to get the model and set it up. Then, key step, recursively look up every single thing in the guide or setup. Don't try to set something up or fix something without truly understanding what it is you are doing (e.g. copy and paste). This gives you a structured way to fill in the foundations of what it is you are trying to get to work, in a more focused and productive manner. At the end you might realize that their approach or yours is not optimal: "oh, it was telling me to download the 65B model when I can only run 7B on my machine bc ..."
> Overall we reach a throughput of over 1900 tokens / second / TPU-v4 chip in our training run
1 trillion / 1900 = 526315789 chip seconds ~= 150000 chip hours.
Assuming "on-demand" pricing [1] that's about $500,000 training cost.
Considering I could negotiate A100 for under a dollar/hr - 8 months ago, when they were in high demand, I wouldn't be surprised if the cost was close to 100k for this training run.
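The back-of-the-envelope above, spelled out. The dollars-per-chip-hour figure is an assumption (roughly TPU-v4 on-demand pricing at the time); check current cloud pricing before relying on it.

```python
# Reproducing the training-cost estimate from the thread.
tokens = 1_000_000_000_000   # 1 trillion tokens
throughput = 1900            # tokens / second / TPU-v4 chip (from the post)
price_per_chip_hour = 3.22   # assumed on-demand $/chip-hour, not from the thread

chip_seconds = tokens / throughput
chip_hours = chip_seconds / 3600
cost = chip_hours * price_per_chip_hour

print(f"{chip_hours:,.0f} chip-hours, ~${cost:,.0f}")
```

That lands around 146,000 chip-hours and roughly half a million dollars at on-demand rates, consistent with the estimate above; negotiated or spot pricing would bring it down substantially.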
The GPU marketplaces are nice for people who need smaller/single GPU setups, don't have huge reliability or SLA concerns, and where data privacy risks aren't an issue.
I might be misreading it. It might be just 12 GPUs.
1. Get a machine with decent GPU, probably rent cloud GPU.
2. On that machine download the weights/model/vocab files from https://huggingface.co/openlm-research/open_llama_7b_preview...
3. Install Anaconda. Clone https://github.com/young-geng/EasyLM/.
4. Install EasyLM:
conda env create -f scripts/gpu_environment.yml
conda activate EasyLM
5. Run this command, as per https://github.com/young-geng/EasyLM/blob/main/docs/llama.md:
python -m EasyLM.models.llama.llama_serve \
--mesh_dim='1,1,-1' \
--load_llama_config='13B' \
--load_checkpoint='params::path/to/easylm/llama/checkpoint' \
Am I even close?
If you want access to a serious GPU or TPU, then the sensible solution is to rent one in the cloud. If you just want to run smaller versions of these models, you can achieve impressive results at home on consumer-grade gaming hardware.
The FastChat framework supports the Vicuna LLM, along with several others: https://github.com/lm-sys/FastChat
The Oobabooga web interface aims to become the standard interface for chat models: https://github.com/oobabooga/text-generation-webui
I don't see any indication that OpenLLaMa will run on either of those without modification. But one of those, or some other framework may emerge as a de-facto standard for running these models.
https://github.com/modal-labs/modal-examples/blob/main/06_gp...
Many people have been struggling to reproduce the benchmark numbers included in the original llama paper.
Where it's slow is in tokenization -- it can be very, very slow to make an initial tokenization of a prompt. I think this has to do with how the network actually functions, like there's a forward loop that feeds each token in to the network sequentially.
I would guess if it had the same level of attention and work that the Llama stack is getting it would be pretty fantastic, but that's just a guess, I'm a hobbyist only.
Also, most people don't mind running LLaMA 7B at home because the license is hard to enforce against individuals, but a lot of commercial businesses would love to run a 65B parameter model and can't, because the license is meaningfully prohibitive in a business context. Open versions of the larger models are a lot more meaningful to society at this point.
Source: https://twitter.com/togethercompute/status/16527350961501757...
Here's another repo (with the same "open-llama" name) that has been available on hugging face as well for a few weeks. (different training dataset)
https://github.com/s-JoL/Open-Llama https://huggingface.co/s-JoL/Open-Llama-V1
What everyone is using are HPC grade low latency interconnects to make the cluster look as close as possible to a single big TPU.
Can someone explain what this means? I don't understand.
After setting up dalai, OpenAssistant, gpt4all and a bunch of other (albeit nonworking) LLM thingies, my current hunch is:
if the model somewhere has "GGML" in its name, it doesn't require a GPU.
GGML format is meant to be executed through llama.cpp, which doesn't use the GPU by default. You can often find these models in quantized form as well, which helps performance (at a cost of accuracy). Look for q4_0 for the fastest performance and lowest RAM requirements, and q5_1 for the best quality right now (well, among quantized models).
Oh yeah, textgen supports llama.cpp, and also provides API, so it looks like a clear winner. You might want to manually pull newer dependencies for torch and llama.cpp though:
pip install -U --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.h...
pip install -U llama-cpp-python
So that's 1.2 trillion tokens. Nice.
I have no formal idea how this is done, but my assumption is that "something like that" should work.
Please disabuse me of any silly ideas.
Refined training usually means updating the weights of what's called a foundational model with well-structured and numerous data. It's very expensive and can disrupt the usefulness of having all the generalizations baked in from training data [1].
While LLMs can generate text based on a wide range of inputs, they're not designed to retrieve specific pieces of information in the same way that a database or a search engine would. But I do think they hold a lot of promise in reasoning.
Small corollary: LLMs do not know ahead of time what they are generating. Secondly, they use the input from you and their own prior output to drive the next message.
This sets us up for a strategy called in-context learning [1]. We take advantage of the above corollary and prime the model with context to drive the next message. In your case, a query about some specific code base with knowledge about standard docs etc.
Only there is a big problem, context sizes. Damn. 4k tokens?
We can be clever about this but there is still a lot of work and research needed. We can take all that code and standard docs and create embeddings of them [2]. Embeddings are mathematical representations of words or phrases that capture some of their semantic meaning. Basically the state of a trained neural network given inputs.
This will allow us to group similar words and concepts together closer in what is called a vector space. We can then do the same for our query and iterate over each pair finding the top-k or whatever most similar pairs. Many ways to find the most similar pairs but what's nice is cosine similarity search. Basically a fancy dot product of the pairs with a higher score indicating greater similarity. This will allow us to prime our model with the most "relevant" information to deal with the context limit. We can hope that the LLM would reason about the information just right and voila.
So yeah basically create a fancy information retrieval system that picks the most relevant information to give your model to reason about (basically this [3]). That and while also skirting around the context limitations and not overfitting and narrowing the training information that allow them to reason (controversial).
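The retrieval step described above can be sketched in a few lines. The "embeddings" here are tiny hand-made vectors; a real system would get them from an embedding model, and the document names are invented.

```python
# Minimal embedding-retrieval sketch: rank documents by cosine similarity
# to a query vector, keep the top-k. Vectors here are made up for illustration.
import math

docs = {
    "how to open a file in python": [0.9, 0.1, 0.0],
    "git branching strategies":     [0.1, 0.9, 0.1],
    "python file read modes":       [0.8, 0.2, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)  # higher score = more similar

def top_k(query_vec, k=2):
    ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)
    return ranked[:k]

# Query embedding for something like "reading files in python"
print(top_k([0.85, 0.15, 0.05]))
```

The top-k documents are what you paste into the prompt ahead of the user's question, which is how the approach skirts the 4k-token context limit.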
1: "Language Models are Few-Shot Learners" Brown et al. https://arxiv.org/pdf/2005.14165.pdf
2: Embeddings https://arxiv.org/pdf/2201.10005.pdf
3: https://twitter.com/marktenenholtz/status/165156810719298355...
Hope that helps!