The enthusiasm around it reminds me of the JavaScript framework wars of ten years ago - tons of people innovating and debating approaches, lots of projects popping up, so much energy!
Hmm. If LLMs turned out like JS frameworks, that would mean that in ten years people would be saying:
“Maybe we don’t really need all this expensive ceremony, honestly this could be done with vanilla if/else heuristics…?”
By then, there could be complaints on Hacker News about messaging apps whose autocomplete models take up gigabytes.
I do that with orca-mini-3b in ggml format and it's pretty good at it, at twice the speed. Of all the LLMs I've tried, this one gave me the best results. It just requires a properly written prompt.
I kind of have the same feeling. With all this energy it's really hard to keep up with all the new ideas, implementations, frameworks and services.
Really excited for what this will bring us in the coming years.
Most of them are irrelevant, though. You just need to figure out which.
Not sure about this. atm, the cost of any cloud GPU (spot or not) far exceeds the cost of OpenAI's API. I'd be glad to be proven wrong because I, too, want to run L2 (the 70b model).
Also, buying a GPU, even a 4090, is not feasible for most people. And it's not just the GPU: you'd have to build a PC around it, and there's the hidden maintenance cost of running desktop Linux (to use GPTQ, for instance). It's not surprising that most users prefer someone else (OpenAI) to do it for them.
Sure, you can run something comparable to OpenAI's flagship product at home, but it's moderately expensive and slightly inconvenient so people will still pay for the convenience.
To run a `vllm`-backed Llama 2 7b model[1], start a Debian 11 spot instance with one Nvidia L4 (a g2-standard-8) and 100GB of SSD disk (ignoring the advice to use a CUDA installer image):
sudo apt-get update -y
sudo apt-get install build-essential -y
sudo apt-get install linux-headers-$(uname -r) -y
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run # ~5 minutes, install defaults, type 'accept'/return
sudo apt-get install python3-pip -y
sudo pip install --upgrade huggingface_hub
# skip using token as git credential
huggingface-cli login  # paste token from HF[2] for Meta model access
sudo pip install vllm # ~8 minutes
Then paste this test code for the 7b Llama 2 model into llama.py:
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")
output = llm.generate("The capital of Brazil is called")
print(output[0].outputs[0].text)  # generate() returns a list of RequestOutputs
Spot price for this deployment is ~$225/month. The instance will eventually be terminated by Google, so plan accordingly.

[1] https://vllm.readthedocs.io/en/latest/models/supported_model...
[2] https://huggingface.co/settings/tokens
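For anyone weighing that ~$225/month against a hosted API, here's a quick back-of-envelope sketch. The per-token API price below is an assumption for illustration, not a quote; check current pricing.

```python
# Break-even between a ~$225/month spot GPU and a pay-per-token API.
SPOT_MONTHLY_USD = 225.0
API_USD_PER_1K_TOKENS = 0.002  # assumed blended rate, illustrative only

# Tokens per month you'd need to push through the API to spend $225:
break_even_tokens = SPOT_MONTHLY_USD / API_USD_PER_1K_TOKENS * 1000
print(f"Break-even: {break_even_tokens / 1e6:.1f}M tokens/month")
```

Unless you're pushing on the order of a hundred million tokens a month, the API comes out cheaper, which matches the grandparent's point.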
See table 10 (page 22) of the whitepaper for the numbers: https://ai.meta.com/research/publications/llama-2-open-found...
Are there other downloadable models which can be used in a multilingual environment that people here are aware of?
edit: To make the implied question explicit, I guess it might do well on other similar Germanic languages (say Norwegian) but struggle beyond that? Or?
It is nowhere near usable.
Perhaps the 70B model performs better, but 13B produces translations that are garbage.
And 70B will no doubt be much better.
More projects in this space:
- llama.cpp which is a fast, low level runner (with bindings in several languages)
- llm by Simon Willison which supports different backends and has a really elegant CLI interface
- The MLC.ai and Apache TVM projects
Previous discussion on HN that might be helpful from an article by the great folks at replicate: https://news.ycombinator.com/item?id=36865495
Further, 70B support in llama.cpp is still under development as far as I know.
Ollama on macOS will use both the GPU and the Accelerate framework. It's built with the (amazing) llama.cpp project.
To run the 70B model you can try:
ollama run llama2:70b
Note you'll most likely need a Mac with 64GB of shared memory, and there's still a bit of work to do to make sure 70B works like a charm. You can even extend the context with RoPE!
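The RoPE trick works by rescaling the rotary position angles so that longer sequences map back into the range the model was trained on. A minimal sketch of linear scaling ("positional interpolation"); the function and defaults here are illustrative, not Ollama's actual implementation:

```python
# Minimal sketch of linear RoPE scaling: positions are compressed by
# `scale`, so a model trained on a 4096-token window can address a
# longer one. Names and defaults are illustrative.
def rope_angles(pos, dim=8, base=10000.0, scale=1.0):
    # One rotation angle per pair of embedding dimensions.
    return [(pos / scale) / base ** (2 * i / dim) for i in range(dim // 2)]

# With scale=2.0, position 4096 gets exactly the angles that position
# 2048 would get unscaled, keeping long inputs inside the trained range.
assert rope_angles(4096, scale=2.0) == rope_angles(2048)
```

The trade-off is that nearby positions become harder to tell apart, which is why scaled models are usually fine-tuned a bit at the longer context.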
Sadly 7b is not very good for SQL tasks. I think even with RAG it would struggle.
Here’s the stream - https://www.youtube.com/live/LitybCiLhSc?feature=share
One is with LoRA and the other with QLoRA, and I also do a breakdown of each fine-tuning method. I wanted to make these since I've had issues running LLMs locally myself, and Colab is the cheapest GPU I can find haha.
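Part of why LoRA/QLoRA fit on a Colab GPU is the tiny trainable-parameter count. A rough sketch for a single d x d weight matrix; the sizes are illustrative, not the exact shapes from the videos:

```python
# Full fine-tuning touches all d*d weights of a matrix, while a rank-r
# LoRA trains only the two adapter matrices A (r x d) and B (d x r).
def lora_params(d, r):
    return 2 * d * r  # A and B combined

d, r = 4096, 8            # a typical hidden size and a common LoRA rank
full = d * d              # 16,777,216 weights for full fine-tuning
lora = lora_params(d, r)  # 65,536 weights for the adapters
print(f"LoRA trains {lora / full:.2%} of the weights per matrix")
```

QLoRA then shrinks things further by holding the frozen base weights in 4-bit while training the adapters, which is what makes a single consumer GPU workable.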
Unfortunately, with all of the hype, it seems that unless you have a REALLY beefy machine the better 70B model feels out of reach for most to run locally, leaving the 7B and 13B as the only viable options outside of some quantization trickery. Or am I wrong about that?
I want to focus more on larger context windows. Since RAG seems to have a lot of promise, the 7B with a giant context window looks like a better path to explore than getting the 70B to work locally.
More reading on that problem if you're curious: https://arxiv.org/pdf/2307.03172.pdf
Running on a 3090. The 13b chat model quantized to fp8 is giving about 42 tok/s.
It gives very detailed answers to coding questions and tasks, just like GPT-4 does (though I did not do a proper comparison).
The 13b uses 13GB and gives 27 tokens per second; the 7b uses 0.5GB and I get 39 tokens per second on this machine. Both produce interesting results, even for CUDA code generation, for example.
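A rough rule of thumb for those memory numbers: weight memory is parameter count times bits per weight. A sketch, assuming dense quantized weights (figures are estimates, not measurements):

```python
# Weight-only memory estimate: parameter count times bits per weight.
# Ignores the KV cache and activations, so real usage runs higher.
def weight_gb(params_billion, bits):
    return params_billion * 1e9 * bits / 8 / 1e9

print(f"13B @ 8-bit: ~{weight_gb(13, 8):.0f} GB")  # ~13 GB
print(f"7B @ 4-bit: ~{weight_gb(7, 4):.1f} GB")    # ~3.5 GB
print(f"70B @ 4-bit: ~{weight_gb(70, 4):.0f} GB")  # ~35 GB
```

That's why 70B at 4-bit needs something like a 48GB card or 64GB of unified memory, while 13B at 8-bit just fits on a 24GB 3090.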
For example, I use only 6 cores out of 10 on my M1 Pro laptop.
Is it using mmap and concealing the actual memory usage?
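Quite possibly - llama.cpp mmaps the weights by default, and mapped-but-untouched pages don't count toward resident memory. A small sketch of the effect (the file size and names are illustrative):

```python
import mmap
import os
import tempfile

# Mapping a file reserves address space for the whole file, but a page
# only becomes resident (and shows up in RSS) once it is touched.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"\0" * (16 * 1024 * 1024))  # stand-in for 16 MB of weights

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_byte = mm[0]  # faults in just one page, not the whole 16 MB
    mm.close()
os.remove(path)
```

So tools that report RSS can make a mmap'd model look much smaller than the file it was loaded from until inference has actually touched most of the weights.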
I've tried Inference Endpoints and Replicate, but both would cost more than just using the OpenAI offering.
On Linux, ExLlama and MLC LLM have native ROCm support, and there is a HIPified fork of llama.cpp as well.
...But I am also a bit out of the loop. For instance, I have not kept up with the CFG/negative prompt or grammar implementations in the UIs.