Open source AI is the path forward - https://news.ycombinator.com/item?id=41046773 - July 2024 (278 comments)
Quick comparison with GPT-4o:
+-----------+--------+----------------+
| Metric    | GPT-4o | Llama 3.1 405B |
+-----------+--------+----------------+
| MMLU      | 88.7   | 88.6           |
| GPQA      | 53.6   | 51.1           |
| MATH      | 76.6   | 73.8           |
| HumanEval | 90.2   | 89.0           |
| MGSM      | 90.5   | 91.6           |
+-----------+--------+----------------+

I'd think of the 405B model as the equivalent of a big rig tractor trailer. It's not for home use. But also check out the benchmark improvements for the 70B and 8B models.
I agree with you on open source in the original, home tinkerer sense.
Not in the slightest. They even have a table of cloud providers where you can host the 405B model and the associated cost to do so on their website: https://llama.meta.com/ (Scroll down)
"Open Source" doesn't mean "You can run this on consumer hardware". It just means that it's open source. They also released 8B and 70B models for people to use on consumer gear.
This time, I just copy-pasted the raw metrics I found and asked an LLM to format it as an ASCII table.
GPT-4o 30.7
GPT-4 turbo (2024-04-09) 29.7
Llama 3.1 405B Instruct 29.5
Claude 3.5 Sonnet 27.9
Claude 3 Opus 27.3
Llama 3.1 70B Instruct 26.4
Gemini Pro 1.5 0514 22.3
Gemma 2 27B Instruct 21.2
Mistral Large 17.7
Gemma 2 9B Instruct 16.3
Qwen 2 Instruct 72B 15.6
Gemini 1.5 Flash 15.3
GPT-4o mini 14.3
Llama 3.1 8B Instruct 14.0
DeepSeek-V2 Chat 236B (0628) 13.4
Nemotron-4 340B 12.7
Mixtral-8x22B Instruct 12.2
Yi Large 12.1
Command R Plus 11.1
Mistral Small 9.3
Reka Core-20240501 9.1
GLM-4 9.0
Qwen 1.5 Chat 32B 8.7
Phi-3 Small 8k 8.4
DBRX 8.0
If you want to learn more, there is a writeup at https://wow.groq.com/now-available-on-groq-the-largest-and-m....
(disclaimer, I am a Groq employee)
Free trial gets you 50 messages, no credit card required - https://double.bot
(disclaimer, I am the co-founder)
I gave a seminar about the overall approach recently, abstract: https://shorturl.at/E7TcA, recording: https://shorturl.at/zBcoL.
This two-part AMA has a lot more detail if you're already familiar with what we do:
Statement from Mark: https://about.fb.com/news/2024/07/open-source-ai-is-the-path...
Where the right hardware is ten 4090s, even at 4-bit quantization. I'm hoping we'll see these models get smaller, but the GPT-4-competitive one isn't really accessible for home use yet.
Still amazing that it's available at all, of course!
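For a rough sense of where the "ten 4090s" figure comes from, here's a back-of-the-envelope sketch; the ~15% overhead for KV cache, activations, and framework buffers is an assumption, and real quantized builds vary by format:

    # Back-of-the-envelope VRAM estimate for a quantized 405B model.
    # Assumes a uniform bits-per-weight figure plus ~15% overhead for
    # KV cache, activations, and framework buffers; real numbers vary by format.

    def vram_needed_gb(params_billions: float, bits_per_weight: float, overhead: float = 0.15) -> float:
        weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
        return weight_gb * (1 + overhead)

    for bits in (16, 8, 4):
        need = vram_needed_gb(405, bits)
        print(f"{bits}-bit: ~{need:.0f} GB -> ~{need / 24:.1f}x 24GB cards")

At 4 bits that works out to roughly ten 24GB cards, and around twenty at 8 bits, which matches the figures floating around this thread.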
https://about.fb.com/news/2024/07/open-source-ai-is-the-path...
[1]: https://opensource.org/blog/metas-llama-2-license-is-not-ope...
As I have stated time and again, it is perfectly fine for them to slap on whatever license they see fit, as it is their work. But it would be nice if they used appropriate terms so as not to disrupt the discourse further than they have already done. I have written several walls of text on why I, as a researcher, find Facebook's behaviour problematic, so I will fall back on an old link [2] this time rather than writing it all over again.
Is it? Has there been a ruling on the enforceability of the license they attach to their models yet? Just because you say what you release can only be used for certain things doesn't mean that restriction will actually hold up.
It's a "Google and Apple can't use this model in production" clause that, frankly, we can all be relatively okay with.
If you want a playground to test this model locally or want to quickly build some applications with it, you can try LLMStack (https://github.com/trypromptly/LLMStack). I wrote last week about how to configure and use Ollama with LLMStack at https://docs.trypromptly.com/guides/using-llama3-with-ollama.
Disclaimer: I'm the maintainer of LLMStack
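If you go the Ollama route, a minimal sketch of querying a locally pulled Llama 3.1 model from Python might look like the following; the `ollama` package, the `llama3.1` model tag, and the response shape are assumptions to check against the Ollama docs:

    # Minimal sketch: chat with a locally served Llama 3.1 model via Ollama's Python client.
    # Assumes `pip install ollama` and that `ollama pull llama3.1` has already been run;
    # the model tag and response shape may differ between Ollama versions.
    import ollama

    response = ollama.chat(
        model="llama3.1",  # local model tag; adjust to the variant you pulled
        messages=[{"role": "user", "content": "Summarize the Llama 3.1 release in one sentence."}],
    )
    print(response["message"]["content"])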
In theory the benchmarks should be a pretty close proxy for quality, but that doesn't match my experience at all.
Examples: OpenAI's GPT-4o mini is second only to 4o on LMSys Overall, but is 6.7 points behind 4o on MMLU. It's "punching above its weight" in real-world contexts. The Gemma series (9B and 27B) are similar, both beating the mean in terms of Elo per MMLU point. Microsoft's Phi series are all below the mean, meaning they have strong MMLU scores but aren't preferred in real-world contexts.
Llama 3 8B previously did substantially better than the mean on LMSys Overall, so hopefully Llama 3.1 8B will be even better! The 70B variant was interestingly right on the mean. Hopefully the 405B variant won't fall below!
Don't expect any meaningful score there before they wipe results.
For my use of the chat interface, I don't think lmsys is very useful. lmsys mainly evaluates relatively simple, low token count questions. Most (if not all) are single prompts, not conversations. The small models do well in this context. If that is what you are looking for, great. However, it does not test longer conversations with high token counts.
Just saying that all benchmarks, including lmsys, have issues and are focused on specific use cases.
Open source models are very exciting for self hosting, but the per-token hosted inference pricing hasn't been competitive with OpenAI and Anthropic, at least for a given tier of quality. (E.g.: Llama 3 70B costing between $1 and $10 per million tokens on various platforms, but Claude Sonnet 3.5 is $3 per million.)
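To put those per-token prices in perspective, here's a quick sketch of what a hypothetical monthly workload would cost at the price points mentioned above; the 50M tokens/month figure is an illustrative assumption, and real pricing splits input and output tokens at different rates:

    # Rough monthly spend at the blended per-million-token price points mentioned above.
    # The 50M tokens/month workload is an illustrative assumption; real pricing
    # also charges input and output tokens at different rates.
    monthly_tokens = 50_000_000

    for name, usd_per_million in [
        ("Llama 3 70B (cheapest host)", 1.0),
        ("Claude 3.5 Sonnet", 3.0),
        ("Llama 3 70B (priciest host)", 10.0),
    ]:
        cost = monthly_tokens / 1_000_000 * usd_per_million
        print(f"{name}: ${cost:,.0f}/month")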
[1]: https://github.com/meta-llama/llama-models/blob/main/models/...
[2]: https://github.com/meta-llama/llama-recipes/blob/main/recipe...
Have other major models explicitly communicated that they're trained on synthetic data?
Why do you think he is surprised? I think very few are surprised.
We had a brief, abnormal, and special moment in time after the crypto wars ended in the mid-2000s where software products were truly global, and the internet was more or less unregulated and completely open (at least in most of the world). Sadly it seems that this era has come to a close, and people have not yet updated their understanding of the world to account for that fact.
People are also not great at thinking through the second order effects of the policies they advocate for (e.g. the GDPR), and are often surprised by the results.
Other than that, and GDPR (which is generally now regarded as a good thing), I'm not sure what requirements you've got in mind.
The only solution is a worldwide government that can impose laws in all countries at once, but that's unlikely to happen any time soon.
A Gibsonesque global Turing Police is a sure sign of Dystopia.
Let's hope the next moustached guy that tries to do this ends up dying in a bunker just like the last one.
https://aider.chat/docs/leaderboards/
77.4% claude-3.5-sonnet
75.2% DeepSeek Coder V2 (whole)
72.9% gpt-4o
69.9% DeepSeek Chat V2 0628
68.4% claude-3-opus-20240229
67.7% gpt-4-0613
66.2% llama-3.1-405b-instruct (whole)

Llama 3 Training System
- Two clusters of 24,000 GPUs each, 9.6 exaFLOPS per cluster
- 400+ TFLOPS per GPU
- Total system: 19,200,000 TFLOPS (19.2 exaFLOPS)

405B is hopelessly out of reach for running in a homelab without spending thousands of dollars. For most people wanting to try out the 405B model, the best option is to rent compute from a datacenter. Looking forward to seeing what it can accomplish.
On a related note, for those interested in experimenting with large language models locally, I've been working on an app called Msty [1]. It allows you to run models like this with just one click and features a clean, functional interface. Just added support for both 8B and 70B. Still in development, but I'd appreciate any feedback.
[1]: https://msty.app
Can you add GCP Vertex AI API support? Then one key would enable Claude, Llama herd, Gemini, Gemma etc
Let us know if you have other needs!
edit: If the AI bubble pops we will be swimming in GPUs... but no new models.
Too bad, too, I don't think my PC will fit 20 4090s (480GiB).
Open Source AI Is the Path Forward
https://about.fb.com/news/2024/07/open-source-ai-is-the-path...
Seems like the biggest GPU node they have is the p5.48xlarge @ 640GB (8xH100s). At bf16 the 405B weights alone are roughly 810GB, so it won't fit in a single node without fp8 or further quantization. Routing between multiple nodes would be too slow unless there's an InfiniBand fabric you can leverage. Interested to know if anyone else is exploring this.
For home users, 7B models (which can fit on an 8GB GPU when quantized) and 13B models (which can fit on a 16GB GPU) are in far more demand. If you're a researcher, you want a 70B model to get the best performance, and so your benchmarks are comparable to everyone else's.
The perplexity per parameter is higher and the delta grows as it scales.
Not per bit, but per parameter.
Why this is happening really needs more attention and more consideration for pretrained model development right now.
A sleeping giant of a difference in a space where even marginal gains make headlines.
And answer queries like:
Give all <myObject> which refer to <location> which refer to an Indo-European <language>.
https://github.com/meta-llama/llama-models/blob/main/models/...
Unless...
You have a couple hundred thousand dollars sitting around collecting dust... then all you need is a DGX or HGX level of VRAM, the power to run it, the power to keep it cool, and a place for it to sit.
* You'll be running a Q5(ish) quantized model, not the full model
* You're OK with buying used hardware
* You have two separate 120V circuits available to plug it into (I assume you're in the US), or alternatively a single 240V dryer/oven/RV-style plug.
The build would look something like (approximate secondary market prices in parentheses):
* Asrock ROMED8-2T motherboard ($700)
* A used Epyc Rome CPU ($300-$1000 depending on how many cores you want)
* 256GB of DDR4, 8x 32GB modules ($550)
* nvme boot drive ($100)
* Ten RTX 3090 cards ($700 each, $7000 total)
* Two 1500 watt power supplies. One will power the mobo and four GPUs, and the other will power the remaining six GPUs ($500 total)
* An open frame case, the kind made for crypto miners ($100?)
* PCIe splitters, cables, screws, fans, other misc parts ($500)
Total is about $10k, give or take. You'll be limiting the GPUs (using `nvidia-smi` or similar) to run at 200-225W each, which drastically reduces their top-end power draw for a minimal drop in performance. Plug each power supply into a different AC circuit, or use a dual 120V adapter with a 240V outlet to effectively accomplish the same thing.
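A minimal sketch of that power limiting, assuming ten GPUs at indices 0-9 and a driver/permission setup that lets you change the limit (typically requires root); the 225W value is just the figure above:

    # Sketch: cap each GPU to a 225W power limit via nvidia-smi.
    # Assumes ten GPUs at indices 0-9 and that the driver allows changing the
    # limit (typically run as root); adjust the wattage to taste.
    import subprocess

    for gpu_index in range(10):
        subprocess.run(
            ["nvidia-smi", "-i", str(gpu_index), "-pl", "225"],
            check=True,
        )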
When actively running inference you'll likely be pulling ~2500-2800W from the wall, but at idle, the whole system should use about a tenth of that.
It will heat up the room it's in, especially if you use it frequently, but since it's in an open frame case there are lots of options for cooling.
I realize that this setup is still out of the reach of the "average Joe" but for a dedicated (high-end) hobbyist or someone who wants to build a business, this is a surprisingly reasonable cost.
Edit: the other cool thing is that if you use fast DDR4 and populate all 8 RAM slots as I recommend above, the memory bandwidth of this system is competitive with that of Apple silicon -- 204.8GB/sec, with DDR4-3200. Combined with a 32+ core Epyc, you could experiment with running many models completely on the CPU, though Llama 405B will probably still be excruciatingly slow.
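The 204.8GB/sec figure is just channel count times transfer rate times bus width; a quick sketch, assuming all 8 channels are populated with DDR4-3200:

    # DDR4-3200 across 8 channels: transfer rate * bytes per transfer * channels.
    channels = 8
    transfers_per_sec = 3.2e9   # DDR4-3200 = 3200 MT/s
    bytes_per_transfer = 8      # 64-bit channel
    print(channels * transfers_per_sec * bytes_per_transfer / 1e9, "GB/s")  # 204.8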
Would love to hear your feedback!
Meta's goal from the start was to target OpenAI and the other proprietary model players with a "scorched earth" approach by releasing powerful open models to disrupt the competitive landscape.
Meta can likely outspend any other AI lab on compute and talent:
- OpenAI makes an estimated revenue of $2B and is likely unprofitable. Meta generated a revenue of $134B and profits of $39B in 2023.
- Meta's compute resources likely outrank OpenAI by now.
- Open source likely attracts better talent and researchers.
- One possible outcome could be the acquisition of OpenAI by Microsoft to catch up with Meta.
The big winners of this: devs and AI product startups
I work at OpenAI and used to work at Meta. Almost every person from Meta that I know has asked me for a referral to OpenAI. I don't know anyone who left OpenAI to go to Meta.
There is no defensible moat unless a player truly develops some secret sauce on training. As of now, it seems that the most meaningful techniques are already widely known and understood.
The money will be made on compute and on applications of the base model (that are sufficiently novel/differentiated).
Investors will lose big on OpenAI and its competitors (outside of a greater-fool approach).
This is why Altman has gone all out pushing for regulation and playing up safety concerns while simultaneously pushing out the people in his company that actually deeply worry about safety. Altman doesn't care about safety, he just wants governments to build him a moat that doesn't naturally exist.
https://github.com/meta-llama/llama-models/blob/main/models/...
Classic strategy.