git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build
python3 -m pip install -r requirements.txt
cd models && git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt/ && cd -
python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
./build/bin/quantize models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/ggml-model-f16.bin models/open_llama_7b_preview_200bt_q5_0.ggml q5_0
./build/bin/main -m models/open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 -p "Building a website can be done in 10 simple steps:" --mlock
Though I'm getting this error on an Intel MacBook (Monterey); it works fine on a Windows 11 box:
python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
Loading model file models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
Traceback (most recent call last):
File "/l/llama.cpp/convert-pth-to-ggml.py", line 11, in <module>
convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
File "/l/llama.cpp/convert.py", line 1129, in main
model_plus = load_some_model(args.model)
File "/l/llama.cpp/convert.py", line 1055, in load_some_model
models_plus.append(lazy_load_file(path))
File "/l/llama.cpp/convert.py", line 857, in lazy_load_file
raise ValueError(f"unknown format: {path}")
ValueError: unknown format: models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
…but, although it is true that for a fixed compute budget these small models can produce impressive results with good training data, it is also true that smaller models (7B) appear to have an upper performance bound that larger, well-trained models beat easily.
It’s just way more expensive to train larger models.
They specifically note they are training a smaller 3B model in the future.
So… it seems reasonable to assume that this is a proof of concept, and that no, the Berkeley AI lab will not be fielding the cost for training a larger model.
This is probably more about exploring the “can we make a cheap good-enough model?” than “here is your GPT4 replacement”.
30B is within reach, with compression techniques that seem to lose very little information from the overall network. Many argue that machine learning IS fundamentally a compression technique, but the topology of the trained network turns out to be more important, assuming an appropriate activation function after this transformation.
No… definitely not your GPT4 replacement. However this is the kind of PoC I keep following… every… 18 hours or so? Amazing.
They're kidding right, there's no way that thing will be more useful than one of those flan models.
[1]https://github.com/openlm-research/open_llama#future-plans
I have felt the same in the past, related to a completely different topic. I know how it feels, it's like people aren't calling things what they are, just using weird words.
"weights" - synapses in the AI brain
"tokens" - word fragments
"model" - of course, the model is the AI brain
"context" - the model can only handle a piece of text, can't put whole books in, so this limited window is the context
"GPT" - predicts the next word, trained on everything; if you feed its last predicted word back in, it can write long texts
"LoRA" - a lightweight plug-in model for tweaking the big model
"loss" - a score telling how bad is the output
"training" - change the model until it fits the data
"quantisation" - making a low precision version of the model because it still works, but now is much faster and needs less compute
"embedding" - just a vector, it stands for the meaning of a word token or a piece of image; these embeddings are learned
It's like generating code in a language that you know nothing about. You should check for bugs, but you can't.
I do not use ChatGPT as a search engine. Its ability to confidently hallucinate consistently places it much below a human expert on any topic that I care to understand correctly.
My advice to folks is, if you actually want to know how this stuff works at some basic level, put in some time learning how basic linear and logistic regression work, including how to train it using back propagation. From there you'll have a solid foundation that gives enough context to understand most deep learning concepts at a high level.
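To make that advice concrete, here is a minimal sketch of logistic regression trained by gradient descent, with a made-up four-point dataset. It's the smallest version of the "compute a loss, follow the gradient" loop that deep learning scales up.

```python
# Toy logistic regression trained by gradient descent. The dataset is
# invented: label is 1 when x > 2, else 0.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1)]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    grad_w = grad_b = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        # gradient of cross-entropy loss w.r.t. w and b
        grad_w += (p - y) * x
        grad_b += (p - y)
    w -= lr * grad_w / len(data)
    b -= lr * grad_b / len(data)

print(sigmoid(w * 0.0 + b) < 0.5, sigmoid(w * 4.0 + b) > 0.5)  # True True
```

Backpropagation in a deep network is this same chain-rule gradient computation, applied layer by layer.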
when it can hallucinate content, why do that instead of reading a blog post from an expert?
After going through this series I can say I basically understand weights, tokens, back-propagation, layers, embeddings, etc.
Just curious, didn't see any date...
A model is some architecture of how data will flow through these weight matrices, along with the values of each weight.
Tokens are sort of "words" in a sentence, but the ML may be translating the word itself into a more abstract concept in 'word space': eg, a bunch of floating point values.
At least some of what I just said is probably wrong, but now someone will correct me and we'll both be more right!
> A model is some architecture of how data will flow through these weight matrices, along with the values of each weight.
Because data doesn't really flow through weight matrices, though perhaps this is true if you squint at very simple models. Deep learning architectures are generally more complicated than multiplying values by weights and pushing the results to the next layer, though which architecture to use depends heavily on context.
> Tokens are sort of "words" in a sentence
Tokens are funny. What a token is depends on the context of the model you're using, but generally a token is a portion of a word. (Why? Efficiency is one reason; handling unknown words is another.)
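One way to see why tokens end up as word portions: a hypothetical greedy longest-match subword tokenizer over an invented vocabulary. Real tokenizers (BPE and friends) build their vocabularies from data, but the matching idea is similar.

```python
# Hypothetical greedy longest-match subword tokenizer; the vocab is invented.
vocab = {"un", "believ", "able", "a", "b", "l", "e", "u", "n", "i", "v"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # take the longest vocab entry matching at position i
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: fall back to 1 char
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

This also shows the "handling unknown words" point: any word the vocabulary has never seen still tokenizes, just into smaller pieces.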
To learn more deeply though, get started with getting it to work and when you are curious or something doesn't work, try to understand why and recursively go back to fill in the foundational details.
Example: download the code and try to get it to work. Why is it not working? Oh, it's trying to look for the model. Search for how to get the model and set it up. Then, key step, recursively look up every single thing in the guide or setup. Don't try to set something up or fix something without truly understanding what it is you are doing (e.g. copy and paste). This gives you a structured way to fill in the foundations of what it is you are trying to get to work, in a more focused and productive manner. At the end you might realize that their approach or yours is not optimal: "oh, it was telling me to download the 65B model when I can only run 7B on my machine bc ..."
> Overall we reach a throughput of over 1900 tokens / second / TPU-v4 chip in our training run
1 trillion / 1900 = 526315789 chip seconds ~= 150000 chip hours.
Assuming "on-demand" pricing [1] that's about $500,000 training cost.
Considering I could negotiate A100 for under a dollar/hr - 8 months ago, when they were in high demand, I wouldn't be surprised if the cost was close to 100k for this training run.
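The back-of-the-envelope above, spelled out. The dollars-per-chip-hour figure is an assumption (roughly TPU-v4 on-demand pricing at the time); check current cloud pricing before relying on it.

```python
# Reproducing the training-cost estimate from the thread.
tokens = 1_000_000_000_000   # 1 trillion tokens
throughput = 1900            # tokens / second / TPU-v4 chip (from the post)
price_per_chip_hour = 3.22   # assumed on-demand $/chip-hour, not from the thread

chip_seconds = tokens / throughput
chip_hours = chip_seconds / 3600
cost = chip_hours * price_per_chip_hour

print(f"{chip_hours:,.0f} chip-hours, ~${cost:,.0f}")
```

That lands around 146,000 chip-hours and roughly half a million dollars at on-demand rates, consistent with the estimate above; negotiated or spot pricing would bring it down substantially.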
The GPU marketplaces are nice for people who need smaller/single GPU setups, don't have huge reliability or SLA concerns, and where data privacy risks aren't an issue.
I might be misreading it. It might be just 12 GPUs.
1. Get a machine with decent GPU, probably rent cloud GPU.
2. On that machine download the weights/model/vocab files from https://huggingface.co/openlm-research/open_llama_7b_preview...
3. Install Anaconda. Clone https://github.com/young-geng/EasyLM/.
4. Install EasyLM:
conda env create -f scripts/gpu_environment.yml
conda activate EasyLM
5. Run this command, as per https://github.com/young-geng/EasyLM/blob/main/docs/llama.md:
python -m EasyLM.models.llama.llama_serve \
--mesh_dim='1,1,-1' \
--load_llama_config='13B' \
--load_checkpoint='params::path/to/easylm/llama/checkpoint' \
Am I even close?
If you want access to a serious GPU or TPU, then the sensible solution is to rent one in the cloud. If you just want to run smaller versions of these models, you can achieve impressive results at home on consumer-grade gaming hardware.
The FastChat framework supports the Vicuna LLM, along with several others: https://github.com/lm-sys/FastChat
The Oobabooga web interface aims to become the standard interface for chat models: https://github.com/oobabooga/text-generation-webui
I don't see any indication that OpenLLaMa will run on either of those without modification. But one of those, or some other framework may emerge as a de-facto standard for running these models.
https://github.com/modal-labs/modal-examples/blob/main/06_gp...
Many people have been struggling to reproduce the benchmark numbers included in the original llama paper.
Where it's slow is in tokenization -- it can be very, very slow to make an initial tokenization of a prompt. I think this has to do with how the network actually functions, like there's a forward loop that feeds each token in to the network sequentially.
I would guess if it had the same level of attention and work that the Llama stack is getting it would be pretty fantastic, but that's just a guess, I'm a hobbyist only.
Also, most people don't mind running LLaMA 7B at home because the license is hard to enforce against individuals, but a lot of commercial businesses would love to run a 65B parameter model and can't, because the license is meaningfully prohibitive in a business context. Open versions of the larger models are a lot more meaningful to society at this point.
Source: https://twitter.com/togethercompute/status/16527350961501757...
Here's another repo (with the same "open-llama" name) that has been available on hugging face as well for a few weeks. (different training dataset)
https://github.com/s-JoL/Open-Llama https://huggingface.co/s-JoL/Open-Llama-V1
What everyone is using are HPC grade low latency interconnects to make the cluster look as close as possible to a single big TPU.
Can someone explain what this means? I don't understand.
After setting up dalai, OpenAssistant, gpt4all and a bunch of other (albeit nonworking) LLM thingies, my current hunch is:
if the model somewhere has "GGML" in its name, it doesn't require a GPU.
GGML format is meant to be executed through llama.cpp, which doesn't use the GPU by default. You can often find these models in quantized form as well, which helps performance (at a cost of accuracy). Look for q4_0 for the fastest performance and lowest RAM requirements, and q5_1 for the best quality right now (well, among quantized models).
Oh yeah, textgen supports llama.cpp, and also provides API, so it looks like a clear winner. You might want to manually pull newer dependencies for torch and llama.cpp though:
pip install -U --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.h...
pip install -U llama-cpp-python
So that's 1.2 trillion tokens. Nice.
I have no formal idea how this is done, but my assumption is that "something like that" should work.
Please disabuse me of any silly ideas.
Refined training usually means updating the weights of what's called a foundational model with well-structured and numerous data. It's very expensive and can disrupt the usefulness of having all the generalizations baked in from training data [1].
While LLMs can generate text based on a wide range of inputs, they're not designed to retrieve specific pieces of information in the same way that a database or a search engine would. But I do think they hold a lot of promise in reasoning.
Small corollary: LLMs do not know ahead of time what they are generating. Secondly, they use the input from you and their own prior output to drive the next message.
This sets us up for a strategy called in-context learning [1]. We take advantage of the above corollary and prime the model with context to drive the next message. In your case, a query about some specific code base with knowledge about standard docs etc.
Only there is a big problem, context sizes. Damn. 4k tokens?
We can be clever about this but there is still a lot of work and research needed. We can take all that code and standard docs and create embeddings of them [2]. Embeddings are mathematical representations of words or phrases that capture some of their semantic meaning. Basically the state of a trained neural network given inputs.
This will allow us to group similar words and concepts together closer in what is called a vector space. We can then do the same for our query and iterate over each pair finding the top-k or whatever most similar pairs. Many ways to find the most similar pairs but what's nice is cosine similarity search. Basically a fancy dot product of the pairs with a higher score indicating greater similarity. This will allow us to prime our model with the most "relevant" information to deal with the context limit. We can hope that the LLM would reason about the information just right and voila.
So yeah basically create a fancy information retrieval system that picks the most relevant information to give your model to reason about (basically this [3]). That and while also skirting around the context limitations and not overfitting and narrowing the training information that allow them to reason (controversial).
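The retrieval step described above can be sketched in a few lines. The "embeddings" here are tiny hand-made vectors; a real system would get them from an embedding model, and the document names are invented.

```python
# Minimal embedding-retrieval sketch: rank documents by cosine similarity
# to a query vector, keep the top-k. Vectors here are made up for illustration.
import math

docs = {
    "how to open a file in python": [0.9, 0.1, 0.0],
    "git branching strategies":     [0.1, 0.9, 0.1],
    "python file read modes":       [0.8, 0.2, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)  # higher score = more similar

def top_k(query_vec, k=2):
    ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)
    return ranked[:k]

# Query embedding for something like "reading files in python"
print(top_k([0.85, 0.15, 0.05]))
```

The top-k documents are what you paste into the prompt ahead of the user's question, which is how the approach skirts the 4k-token context limit.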
1: "Language Models are Few-Shot Learners" Brown et al. https://arxiv.org/pdf/2005.14165.pdf
2: Embeddings https://arxiv.org/pdf/2201.10005.pdf
3: https://twitter.com/marktenenholtz/status/165156810719298355...
Hope that helps!