I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job, incomplete but still unbelievable for a model that tiny: https://gist.github.com/simonw/64c5f5b111fe473999144932bef42...
More of my notes here: https://simonwillison.net/2024/Sep/25/llama-32/
I've been trying out the larger image models using the versions hosted on https://lmarena.ai/ - navigate to "Direct Chat" and you can select them from the dropdown and upload images to run prompts.
Molmo models: https://huggingface.co/collections/allenai/molmo-66f379e6fe3..., also seem to perform better than Llama-3.2 models while being smaller and Apache 2.0.
2. The tokenization/adapter method is novel and uses far fewer tokens than comparable CLIP/SigLIP-adapter models, making it _much_ faster - attention is O(n^2) in memory/compute with respect to sequence length (rough sketch below).
[1] https://simonwillison.net/2024/Sep/4/qwen2-vl/ [2] https://huggingface.co/spaces/GanymedeNil/Qwen2-VL-7B
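To make the quadratic point above concrete, here's a trivial back-of-envelope sketch in plain Python (illustrative numbers only):

    # Self-attention materializes an n x n score matrix, so cost grows
    # quadratically with sequence length. Halving the image token count
    # quarters the attention work spent on those tokens.
    def attn_pairs(n_tokens):
        return n_tokens * n_tokens

    for n in (512, 1024, 2048):
        print(n, attn_pairs(n))
    # 512  262144
    # 1024 1048576   <- 2x the tokens, 4x the work
    # 2048 4194304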
Gemini Flash is fast, with up to a 4 million token context.
Gemini Flash 002 improved in math and logical reasoning, surpassing Claude and GPT-4o.
You can use Gemini Flash for code completion, as a git review tool, and much more.
Llama3.2 on the other hand runs locally, no data is ever sent to a 3rd party, so I can freely use it to summarize all my notes regardless of one of them being from my most recent therapy session and another being my thoughts on how to solve a delicate problem involving politics at work. I don't need to pre-classify all the input to make sure it's safe to share. Same with images, I can use Llama3.2 11B locally to interpret any photo I've taken without having to worry about getting consent from the people in the photo to share it with a 3rd party, or whether the photo is of my passport for some application I had to file or a receipt of something I bought that I don't want Google to train their next vision model OCR on.
TL;DR - Google's free-of-charge models are irrelevant when talking about local models.
I'm pretty excited about what all these services adopting free tiers will do to the landscape, as that should allow for a lot more experimentation, and for more hobby projects to transition into full-time projects - something that previously felt a lot more risky/unpredictable with pricing.
About the only thing I need to look further abroad for is when I'm working multi-modally -- I know Simon and the community are mainly noodling over the best command line UX for that: https://github.com/simonw/llm/issues/331
And it looks very handy! I'll use this myself, because I do want to invoke OpenAI and other cloud providers the same way I do with ollama, piping things around - this accomplishes that, and more.
https://llm.datasette.io/en/stable/
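For anyone who hasn't tried it, llm also has a Python API alongside the CLI - a minimal sketch based on its docs (the model name is just an example; any installed model alias works):

    import llm  # pip install llm

    model = llm.get_model("gpt-4o-mini")  # or a local model via a plugin
    response = model.prompt("Summarize this repo's README in two sentences.")
    print(response.text())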
I guess you can also accomplish similar results, if you're just looking for `/chat/completions` and such, by configuring something like LiteLLM and connecting that to ollama or any other service.
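Something like this, assuming LiteLLM's `ollama/` prefix routing (a sketch, not tested against the latest release):

    from litellm import completion  # pip install litellm

    # LiteLLM exposes one OpenAI-style interface over many backends;
    # the "ollama/" prefix routes the call to a local ollama server.
    response = completion(
        model="ollama/llama3.2",
        messages=[{"role": "user", "content": "Hello!"}],
        api_base="http://localhost:11434",
    )
    print(response.choices[0].message.content)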
It's worth listening to for context on how that tool is used.
Since I'm a n00b, does this just mean Llama 3.2 3B instruct was "tuned more softly" than Gemma 2 instruct? That is, could one expect to be able to further fine-tune it to more closely follow instructions?
Unfortunately it only uses the OpenAI tokenizers at the moment (via tiktoken), so counts for other models may be inaccurate. I find they tend to be close enough though.
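For reference, counting tokens with tiktoken looks like this (cl100k_base is the GPT-4-era encoding; counts for Llama-family tokenizers will differ somewhat):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = open("notes.txt").read()
    print(len(enc.encode(text)), "tokens")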
We discover gold and you think of gold pickaxes.
What could be short sighted about using tools to improve your daily work?
He's hoping to control AI as the next platform through which users interact with apps. Free AI is then fine if the surplus value created by not having a gatekeeper to his apps exceeds the cost of the free AI.
That's the strategy. No values here - just strategy folks.
The thing about giant companies is they never want there to be more giant companies.
You can’t say that for the other guys.
If I didn’t have context I’d assume this was about Google.
But still, Kudos to Zuck/Meta for doing it anyway.
They're clearly majorly scrubbing things somehow
With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB!
I believe this unforgiving dynamic is why model distillation works well. The original teacher model had to learn via the "hot or cold" game on text answers. But when the child instead imitates the teacher's predictions, it learns semantically rich answers. That strikes me as vastly more compute-efficient. So to me, it makes sense why these Llama 3.2 edge models punch so far above their weight(s). But it still blows my mind thinking how far models have advanced from a year or two ago. Kudos to Meta for these releases.
Is that true tho? During training, the model predicts {"wall": 0.65, "fence": 0.25, "river": 0.03}. Then backprop modifies the weights such that it produces {"wall": 0.67, "fence": 0.24, "river": 0.02} next time.
But it does that with a much richer feedback than WRONG! because we're also telling the model how much more likely "fence" is than "wall" in an indirect way. It's likely most of the neurons that supported "wall" also supported "fence", so the average neuron that supported "river" gets penalised much more than a neuron that supported "fence".
I agree that distillation is more efficient for exactly the same reason, but I think even models as old as GPT-3 use this trick to work as well as they do.
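For the curious, the standard (Hinton-style) distillation loss captures exactly the difference between the two signals discussed above - a self-contained PyTorch sketch with dummy tensors (temperature and mixing weight are tunable choices, not anything from Meta's recipe):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    vocab, batch = 32000, 4
    student_logits = torch.randn(batch, vocab, requires_grad=True)
    teacher_logits = torch.randn(batch, vocab)
    target_ids = torch.randint(0, vocab, (batch,))

    # Hard labels: one-hot cross-entropy - only the "correct" token gets credit.
    hard_loss = F.cross_entropy(student_logits, target_ids)

    # Soft labels: match the teacher's full distribution at temperature T.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    T = 2.0
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

    loss = 0.5 * hard_loss + 0.5 * soft_loss
    loss.backward()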
For anyone looking for a simple way to test Llama3.2 3B locally with a UI: install nexa-sdk (https://github.com/NexaAI/nexa-sdk) and type in your terminal:
nexa run llama3.2 --streamlit
Disclaimer: I am from Nexa AI and nexa-sdk is open source. We'd love your feedback.
I ended up testing it with Llama3.1, which was really easy. At first glance Llama3.2 didn't seem to be available: the command you provided did not work, raising "An error occurred while pulling the model: not enough values to unpack (expected 2, got 1)".
- The 1B is extremely coherent (feels something like maybe Mistral 7B at 4 bits), and with flash attention and 4 bit KV cache it only uses about 4.2 GB of VRAM for 128k context
- A Pi 5 runs the 1B at 8.4 tok/s. I haven't tested the 3B yet; it might need a lower quant to fit, and with 9T training tokens it'll probably degrade pretty badly when quantized that far
- The 3B is a certified Gemma-2-2B killer
Given that llama.cpp doesn't support any multimodality (they removed the old implementation), it might be a while before the 11B and 90B become runnable. Doesn't seem like they outperform Qwen-2-VL at vision benchmarks though.
It's super fast with a lot of knowledge, a large context and great understanding. Really impressive model.
It's not a perfect comparison, and Llama does a lot more than English, but I would say 6.5GB of data can certainly contain a lot of knowledge.
Though I wouldn't treat it as a domain expert on anything. For example, when I asked about the safety advantages of Rust over Python, it oversold Rust a bit and claimed Python had issues it doesn't actually have.
For Ancient Greek I just asked it (in German) to translate its previous answer to Ancient Greek, and the answer looks like Greek and according to google translate is a serviceable translation. However Llama did add a cheeky "Πηγή: Google Translate" at the end (Πηγή means source). I know little about the differences between ancient and modern Greek, but it did struggle to translate modern terms like "climate change" or "Hawaii" and added them as annotations in brackets. So I'll assume it at least tried to use Ancient Greek.
However, it doesn't like switching language mid-conversation. If you start a conversation in German and switch to English after a couple of messages, it will understand you but answer in German. Most models switch to answering in English in that situation.
Yeah, chatting more, it's confusing Spanish and Greek. Half the words are Spanish, half are Greek, but the words are more or less the correct ones, if you speak both languages.
EDIT: Now it's doing Portuguese:
> Εντάξει, πού ξεκίνησα? Εγώ είναι ένα κigneurnative πρόγραμμα ονομάζεται "Chatbot" ή "Μάquina Γλωσσής", που δέχθηκε να μοιράσει τη βραδύτητα με σένα. Φυσικά, não sono um essere humano, así que não tengo sentimentos ou emoções como vocês.
I just removed my install of 3.1-8b.
my ollama list is currently:
$ ollama list
NAME ID SIZE MODIFIED
llama3.2:3b-instruct-q8_0 e410b836fe61 3.4 GB 2 hours ago
gemma2:9b-instruct-q4_1 5bfc4cf059e2 6.0 GB 3 days ago
phi3.5:3.8b-mini-instruct-q8_0 8b50e8e1e216 4.1 GB 3 days ago
mxbai-embed-large:latest 468836162de7 669 MB 3 months ago
The others are for text generation / instruction following, for various writing tasks.
It gets "which is larger: 9.11 or 9.9?" right if it manages to mention that decimals need to be compared first in its step-by-step thinking. If it skips mentioning decimals, then it says 9.11 is larger.
It gets the strawberry question wrong even after enumerating all the letters correctly, probably because it can't properly count.
A good answer would explain that and state both results if the context is not a hundred percent clear.
The 7/8B models are great for PoCs and for moving to the edge for minor use cases … but there's a big, empty gap up to 70B, which most people can't run.
The tin foil hat in me says this is the compromise the powers that be have agreed to: being nominally "open" but practically gimped for the average joe techie. Basically arms control.
So we really need a ~40B model (two cards), or a ~20B one with some room left over for the context window.
5090 has ??G - still unreleased
It's a good model, too.
Livebench and LMSYS are weeks behind and sometimes refuse to add major models. And press releases like this cherry-pick their benchmarks and ignore better models like Qwen2.5.
If it doesn't exist, I'm willing to create it.
"LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models
Comparison and ranking the performance of over 30 AI models (LLMs) across key metrics including quality, price, performance and speed (output speed - tokens per second & latency - TTFT), context window & others. For more details including relating to our methodology, see our FAQs."
In general, you'll get a ton of mileage out of constraining token generation to valid JSON - I've seen models as small as 800M handle JSON with that. It's ~impossible to train constraining into a model with remotely the same reliability -- you'd have to erase a ton of the conversational training that makes it say e.g. "Sure! Here's the JSON you requested:"
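A toy illustration of decode-time constraining (deliberately crude and O(vocab) per step - real implementations like llama.cpp grammars or the outlines library compile the constraint into a token-level automaton instead; `valid_json_prefix` here is a naive stand-in):

    import json
    import torch

    def valid_json_prefix(s: str) -> bool:
        # Naive check: does some plausible completion of s parse as JSON?
        for suffix in ("", '"', '"}', "}", "]", '"]}'):
            try:
                json.loads(s + suffix)
                return True
            except json.JSONDecodeError:
                continue
        return False

    def constrained_step(logits, prefix, decode):
        # Mask every token that would break JSON validity, then pick greedily.
        mask = torch.full_like(logits, float("-inf"))
        for tok_id in range(logits.shape[-1]):
            if valid_json_prefix(prefix + decode(tok_id)):
                mask[tok_id] = 0.0
        return int(torch.argmax(logits + mask))

    toy_vocab = ["}", " banana", '"b": 2}']
    logits = torch.randn(len(toy_vocab))
    next_id = constrained_step(logits, '{"a": 1', lambda i: toy_vocab[i])
    print(toy_vocab[next_id])  # never " banana" - it can't extend valid JSON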
https://www.llama.com/docs/model-cards-and-prompt-formats/ll...
> With text-only inputs, the Llama 3.2 Vision Models can do tool-calling exactly like their Llama 3.1 Text Model counterparts. You can use either the system or user prompts to provide the function definitions.
> Currently the vision models don’t support tool-calling with text+image inputs.
They support it, but not when an image is submitted in the prompt. I'd be curious to see what the model does. Meta typically sets conservative expectations around this type of behavior (e.g., they say that the 3.1 8b model won't do multiple tool calls, but in my experience it does so just fine).
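For reference, the text-only tool-calling flow the model card describes boils down to putting function definitions in the prompt and parsing a JSON reply. An illustrative sketch only - the function schema and prompt wording here are made up, and Meta's docs should be treated as authoritative on the exact format:

    import json

    # Hypothetical function definition placed in the system (or user) prompt.
    system_prompt = """You have access to the following function:
    {"name": "get_weather",
     "description": "Get the current weather for a city",
     "parameters": {"city": {"type": "string", "required": true}}}
    If you decide to call the function, respond ONLY with JSON of the form:
    {"name": <function-name>, "parameters": <arguments>}"""

    # If the model answers with a function call, parse it out:
    model_output = '{"name": "get_weather", "parameters": {"city": "Reykjavik"}}'
    call = json.loads(model_output)
    assert call["name"] == "get_weather"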
Do they require GPU or can they be deployed on VPS with dedicated CPU?
I'd like to see more head-to-head comparisons with community created multi-modal LLMs as done in these papers:
https://arxiv.org/abs/2408.05334
https://arxiv.org/abs/2408.03326
I look forward to reading the technical report, once it's available. I couldn't find a link to one yet.
Meaning it's doing impressively badly compared to other models I've tried at similar sizes (for vision).
But this is relevant because I'm curious how this new model accepts image inputs. Do you paste a base64 image into the prompt?
It feels like these models can start not only providing the text generation backend, but start to replace the infrastructure for the API as well.
Can you input images without something in front of it like openwebui?
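If/when ollama wires these up: its REST API already takes base64 images in an `images` array (that's how its llava support works today), so no UI is strictly needed. A sketch - the model tag is a guess on my part:

    import base64, json, urllib.request

    with open("photo.jpg", "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({
            "model": "llama3.2-vision",   # hypothetical tag
            "prompt": "Describe this image.",
            "images": [img_b64],          # same field llava models use
            "stream": False,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(json.loads(urllib.request.urlopen(req).read())["response"])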
Next I want to try to use Aider with it and see how this would work.
e.g. `ollama run llama3.2:1b-instruct-q4_0`
Literally everything about the quote below is wrong. The problem with these "amazing ground-breaking models" is that you never know which knowledge they actually possess.
--- start quote ---
>>> what is the last version of Ash Framework for Elixir you know

My training data goes up to December 2023, and according to the Ash Framework releases, the last stable version that I'm familiar with is:
* *Ash 2.14*: Released on April 8, 2021.
However, it's worth noting that Ash has been merged into Elixir itself, and most of its features have been integrated into the Elixir language and standard libraries.
In fact, Ash was one of the first frameworks to be included in Elixir, and it played a significant role in shaping the language. Many of Ash's core concepts, such as `defstruct/1` and `defmodule/1`, were later adopted by Elixir itself.
...
Keep in mind that Ash is no longer actively maintained as a separate framework, but its legacy continues to influence Elixir and its community.
--- end quote ---
TL;DR:

* 90B-Vision: 4.3% hallucination rate

* 11B-Vision: 5.5% hallucination rate
> Llama 3.2 Multimodal is not available in your region.
It sounds like they feed the continuous output of an image encoder into a transformer, similar to Transfusion [0]? Does anyone know where to find more details?
Edit:
> Regarding the licensing terms, Llama 3.2 comes with a very similar license to Llama 3.1, with one key difference in the acceptable use policy: any individual domiciled in, or a company with a principal place of business in, the European Union is not being granted the license rights to use multimodal models included in Llama 3.2. [1]
What a bummer.
0. https://www.arxiv.org/abs/2408.11039
1. https://huggingface.co/blog/llama32#llama-32-license-changes...
> To add image input support, we trained a set of adapter weights that integrate the pre-trained image encoder into the pre-trained language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the language model. We trained the adapter on text-image pairs to align the image representations with the language representations. During adapter training, we also updated the parameters of the image encoder, but intentionally did not update the language-model parameters. By doing that, we keep all the text-only capabilities intact, providing developers a drop-in replacement for Llama 3.1 models.
What this crudely means is that they extended the base Llama 3.1 to include image-based weights and inference. You can do that if you freeze the existing weights and add new ones, which are then updated during training runs (adapter training). Then they did SFT and RLHF runs on the composite model (for lack of a better word). This is a little-known technique, and very effective. I just had a paper accepted about a similar technique, and will share a blog post once it's published, if you're interested (though it's not on this scale, and probably not as effective). Side note: this is also why the 11B and 90B param counts are additions on top of the text-only models.
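A minimal PyTorch sketch of that recipe, to make it concrete (illustrative only - the dimensions, module names, and stand-in encoder/LM are made up, not Meta's code):

    import torch
    import torch.nn as nn

    d_model, d_image = 512, 256  # made-up sizes

    class CrossAttnAdapter(nn.Module):
        def __init__(self):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads=8,
                                              kdim=d_image, vdim=d_image,
                                              batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, hidden, image_feats):
            # Text hidden states cross-attend to image features; the residual
            # path preserves text-only behavior when the image adds nothing.
            out, _ = self.attn(hidden, image_feats, image_feats)
            return self.norm(hidden + out)

    language_model = nn.Linear(d_model, d_model)  # stand-in for the frozen LM
    image_encoder = nn.Linear(d_image, d_image)   # stand-in for the ViT encoder

    for p in language_model.parameters():
        p.requires_grad = False  # frozen -> drop-in replacement for 3.1
    for p in image_encoder.parameters():
        p.requires_grad = True   # encoder is updated during adapter training

    adapter = CrossAttnAdapter()  # only these weights (and the encoder) train
    fused = adapter(torch.randn(1, 10, d_model), torch.randn(1, 49, d_image))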
In the Transfusion paper, they use both discrete (text tokens) and continuous (images) signals to train a single transformer. To do this, they use a VAE to create a latent representation of the images (split into patches), which is fed into the transformer in one linear sequence alongside the text tokens - they trained the whole model from scratch (the largest being a 7B model trained on 2T tokens with a 1:1 text:image split). The loss they trained the model on was a combination of the normal language-modeling (LM) loss (cross-entropy on tokens) and the diffusion (DDPM) loss on the images.
There was some prior art on this, but models like Chameleon discretized the images into a token codebook of a certain size - so there were special tokens representing the images. However, this incurred severe information loss, which Transfusion claims to have alleviated by using continuous latent vectors for the images.
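For concreteness, the combined objective described above is just a weighted sum of the two losses - a toy sketch with dummy tensors (the balancing coefficient is a made-up value, not the paper's):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    vocab, seq, latent_dim = 32000, 16, 64
    text_logits = torch.randn(seq, vocab)
    text_targets = torch.randint(0, vocab, (seq,))
    pred_noise = torch.randn(8, latent_dim)  # model's noise estimate per image patch
    true_noise = torch.randn(8, latent_dim)

    lm_loss = F.cross_entropy(text_logits, text_targets)  # discrete text tokens
    ddpm_loss = F.mse_loss(pred_noise, true_noise)        # continuous image latents
    lambda_img = 5.0                                      # made-up weighting
    loss = lm_loss + lambda_img * ddpm_loss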
Training a single set of weights (shared weights) on different modalities seems more interesting looking forward, in particular for emergent phenomena imo.
Some of the authors of the transfusion paper work at meta so I was hoping they trained a larger-scale model. Or released any transfusion-based weights at all.
Anyways, exciting stuff either way.
https://github.com/meta-llama/llama-models/blob/main/models/...
https://github.com/meta-llama/llama-models/blob/main/models/...
> With respect to any multimodal models included in Llama 3.2, the rights granted under Section 1(a) of the Llama 3.2 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models.
Edit: the larger 72B model is not under Apache 2.0 but under its own license: https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct/blob/main/...
Qwen2-VL-72B seems to perform better than llama-3.2-90B on visual tasks.
If there's an algorithmic penalty against the news for whatever reason, that may be a flaw in the HN ranking algorithm.
AIUI exact dupes just get counted as upvotes, which hasn’t happened in my case.
- The 11B and 90B vision models are competitive with leading closed models like Claude 3 Haiku on image understanding tasks, while being open and customizable.
- Llama 3.2 comes with official Llama Stack distributions to simplify deployment across environments (cloud, on-prem, edge), including support for RAG and safety features.
- The lightweight 1B and 3B models are optimized for on-device use cases like summarization and instruction following.
Could someone try giving the 90b model this word search problem [0] and tell me how it performs? So far with every model I've tried, none has ever managed to find a single word correctly.
Anyways, I think there just isn't a lot of non-left-to-right English in the training data. A word search is pretty different from the usual completion, chat, and QA tasks these models are oriented towards; you might be able to get somewhere with fine-tuning though.
> There are two words in this word puzzle: "soup" and "mix". The word "soup" is located in the top row, and the word "mix" is located in the bottom row.

Edit: Tried a bit more probing, like asking it to find "spoon" or any other word. It just makes up a row and column.
Would be interesting to see a model just working on raw input though.
Or Gemini Flash for code completion and generation.