I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job, incomplete but still unbelievable for a model that tiny: https://gist.github.com/simonw/64c5f5b111fe473999144932bef42...
More of my notes here: https://simonwillison.net/2024/Sep/25/llama-32/
I've been trying out the larger image models using the versions hosted on https://lmarena.ai/ - navigate to "Direct Chat" and you can select them from the dropdown and upload images to run prompts.
Molmo models: https://huggingface.co/collections/allenai/molmo-66f379e6fe3..., also seem to perform better than Llama-3.2 models while being smaller and Apache 2.0.
2. The tokenization/adapter method is novel and uses far fewer tokens than comparable CLIP/SigLIP-adapter models, making it _much_ faster - attention is O(n^2) in memory/compute with respect to sequence length (rough sketch below).
[1] https://simonwillison.net/2024/Sep/4/qwen2-vl/ [2] https://huggingface.co/spaces/GanymedeNil/Qwen2-VL-7B
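To make the quadratic point above concrete, here's a trivial back-of-envelope sketch in plain Python (illustrative numbers only):

    # Self-attention materializes an n x n score matrix, so cost grows
    # quadratically with sequence length. Halving the image token count
    # quarters the attention work spent on those tokens.
    def attn_pairs(n_tokens):
        return n_tokens * n_tokens

    for n in (512, 1024, 2048):
        print(n, attn_pairs(n))
    # 512  262144
    # 1024 1048576   <- 2x the tokens, 4x the work
    # 2048 4194304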
Gemini Flash is fast, with up to a 4 million token context.
Gemini Flash 002 improved in math and logical reasoning, surpassing Claude and GPT-4o.
You can use Gemini Flash for code completion, as a git review tool, and much more.
Llama3.2 on the other hand runs locally, no data is ever sent to a 3rd party, so I can freely use it to summarize all my notes regardless of one of them being from my most recent therapy session and another being my thoughts on how to solve a delicate problem involving politics at work. I don't need to pre-classify all the input to make sure it's safe to share. Same with images, I can use Llama3.2 11B locally to interpret any photo I've taken without having to worry about getting consent from the people in the photo to share it with a 3rd party, or whether the photo is of my passport for some application I had to file or a receipt of something I bought that I don't want Google to train their next vision model OCR on.
TL;DR - Google's free-of-charge models are irrelevant when talking about local models.
I'm pretty excited about what all these services adopting free tiers will do to the landscape, as that should allow for a lot more experimentation, and for more hobby projects to transition into full-time projects - something that previously felt a lot more risky/unpredictable with pricing.
About the only thing I need to look further abroad for is when I'm working multi-modally -- I know Simon and the community are mainly noodling over the best command line UX for that: https://github.com/simonw/llm/issues/331
And it looks very handy! I'll use this myself, because I do want to invoke OpenAI and other cloud providers the same way I do with ollama, piping things around - this accomplishes that, and more.
https://llm.datasette.io/en/stable/
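For anyone who hasn't tried it, llm also has a Python API alongside the CLI - a minimal sketch based on its docs (the model name is just an example; any installed model alias works):

    import llm  # pip install llm

    model = llm.get_model("gpt-4o-mini")  # or a local model via a plugin
    response = model.prompt("Summarize this repo's README in two sentences.")
    print(response.text())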
I guess you can also accomplish similar results, if you're just looking for `/chat/completions` and such, by configuring something like LiteLLM and connecting that to ollama or any other service.
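Something like this, assuming LiteLLM's `ollama/` prefix routing (a sketch, not tested against the latest release):

    from litellm import completion  # pip install litellm

    # LiteLLM exposes one OpenAI-style interface over many backends;
    # the "ollama/" prefix routes the call to a local ollama server.
    response = completion(
        model="ollama/llama3.2",
        messages=[{"role": "user", "content": "Hello!"}],
        api_base="http://localhost:11434",
    )
    print(response.choices[0].message.content)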
It's worth listening to for context on how that tool is used.
Since I'm a n00b, does this just mean Llama 3.2 3B instruct was "tuned more softly" than Gemma 2 instruct? That is, could one expect to be able to further fine-tune it to more closely follow instructions?
Unfortunately it only uses the OpenAI tokenizers at the moment (via tiktoken), so counts for other models may be inaccurate. I find they tend to be close enough though.
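For reference, counting tokens with tiktoken looks like this (cl100k_base is the GPT-4-era encoding; counts for Llama-family tokenizers will differ somewhat):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = open("notes.txt").read()
    print(len(enc.encode(text)), "tokens")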
We discover gold and you think of gold pickaxes.
What could be short sighted about using tools to improve your daily work?
He's hoping to control AI as the next platform through which users interact with apps. Free AI is then fine if the surplus value created by not having a gatekeeper to his apps exceeds the cost of the free AI.
That's the strategy. No values here - just strategy folks.
The thing about giant companies is they never want there to be more giant companies.
You can’t say that for the other guys.
If I didn’t have context I’d assume this was about Google.
But still, Kudos to Zuck/Meta for doing it anyway.
They're clearly majorly scrubbing things somehow
With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB!
I believe this unforgiving dynamic is why model distillation works well. The original teacher model had to learn via the "hot or cold" game on text answers. But when the child instead imitates the teacher's predictions, it learns semantically rich answers. That strikes me as vastly more compute-efficient. So to me, it makes sense why these Llama 3.2 edge models punch so far above their weight(s). But it still blows my mind thinking how far models have advanced from a year or two ago. Kudos to Meta for these releases.
Is that true tho? During training, the model predicts {"wall": 0.65, "fence": 0.25, "river": 0.03}. Then backprop modifies the weights such that it produces {"wall": 0.67, "fence": 0.24, "river": 0.02} next time.
But it does that with a much richer feedback than WRONG! because we're also telling the model how much more likely "fence" is than "wall" in an indirect way. It's likely most of the neurons that supported "wall" also supported "fence", so the average neuron that supported "river" gets penalised much more than a neuron that supported "fence".
I agree that distillation is more efficient for exactly the same reason, but I think even models as old as GPT-3 use this trick to work as well as they do.
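For the curious, the standard (Hinton-style) distillation loss captures exactly the difference between the two signals discussed above - a self-contained PyTorch sketch with dummy tensors (temperature and mixing weight are tunable choices, not anything from Meta's recipe):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    vocab, batch = 32000, 4
    student_logits = torch.randn(batch, vocab, requires_grad=True)
    teacher_logits = torch.randn(batch, vocab)
    target_ids = torch.randint(0, vocab, (batch,))

    # Hard labels: one-hot cross-entropy - only the "correct" token gets credit.
    hard_loss = F.cross_entropy(student_logits, target_ids)

    # Soft labels: match the teacher's full distribution at temperature T.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    T = 2.0
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

    loss = 0.5 * hard_loss + 0.5 * soft_loss
    loss.backward()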
For anyone looking for a simple way to test Llama3.2 3B locally with a UI: install nexa-sdk (https://github.com/NexaAI/nexa-sdk) and type in your terminal:
nexa run llama3.2 --streamlit
Disclaimer: I am from Nexa AI and nexa-sdk is open source. We'd love your feedback.
I ended up testing it with Llama3.1, which was really easy. At first glance Llama3.2 didn't seem to be available: the command you provided did not work, raising "An error occurred while pulling the model: not enough values to unpack (expected 2, got 1)".
- The 1B is extremely coherent (feels something like maybe Mistral 7B at 4 bits), and with flash attention and 4 bit KV cache it only uses about 4.2 GB of VRAM for 128k context
- A Pi 5 runs the 1B at 8.4 tok/s. I haven't tested the 3B yet; it might need a lower quant to fit, and with 9T training tokens it'll probably degrade pretty badly when quantized that far
- The 3B is a certified Gemma-2-2B killer
Given that llama.cpp doesn't support any multimodality (they removed the old implementation), it might be a while before the 11B and 90B become runnable. Doesn't seem like they outperform Qwen-2-VL at vision benchmarks though.
It's super fast with a lot of knowledge, a large context and great understanding. Really impressive model.
It's not a perfect comparison, and Llama does a lot more than English, but I would say 6.5GB of data can certainly contain a lot of knowledge.
Though I wouldn't treat it as a domain expert on anything. For example, when I asked about the safety advantages of Rust over Python, it oversold Rust a bit and claimed Python had issues it doesn't actually have.
For Ancient Greek I just asked it (in German) to translate its previous answer to Ancient Greek, and the answer looks like Greek and according to google translate is a serviceable translation. However Llama did add a cheeky "Πηγή: Google Translate" at the end (Πηγή means source). I know little about the differences between ancient and modern Greek, but it did struggle to translate modern terms like "climate change" or "Hawaii" and added them as annotations in brackets. So I'll assume it at least tried to use Ancient Greek.
However, it doesn't like switching language mid-conversation. If you start a conversation in German and switch to English after a couple of messages, it will understand you but answer in German. Most models switch to answering in English in that situation.
Yeah, chatting more, it's confusing Spanish and Greek. Half the words are Spanish, half are Greek, but the words are more or less the correct ones, if you speak both languages.
EDIT: Now it's doing Portuguese:
> Εντάξει, πού ξεκίνησα? Εγώ είναι ένα κigneurnative πρόγραμμα ονομάζεται "Chatbot" ή "Μάquina Γλωσσής", που δέχθηκε να μοιράσει τη βραδύτητα με σένα. Φυσικά, não sono um essere humano, así que não tengo sentimentos ou emoções como vocês.
I just removed my install of 3.1-8b.
my ollama list is currently:
$ ollama list
NAME ID SIZE MODIFIED
llama3.2:3b-instruct-q8_0 e410b836fe61 3.4 GB 2 hours ago
gemma2:9b-instruct-q4_1 5bfc4cf059e2 6.0 GB 3 days ago
phi3.5:3.8b-mini-instruct-q8_0 8b50e8e1e216 4.1 GB 3 days ago
mxbai-embed-large:latest 468836162de7 669 MB 3 months ago
The others are for text generation / instruction following, for various writing tasks.
It gets "which is larger: 9.11 or 9.9?" right if it manages to mention that decimals need to be compared first in its step-by-step thinking. If it skips mentioning decimals, then it says 9.11 is larger.
It gets the strawberry question wrong even after enumerating all the letters correctly, probably because it can't properly count.
A good answer would explain that and state both results if the context is not a hundred percent clear.
The 7/8B models are great for PoCs and for moving to the edge for minor use cases … but there's a big, empty gap up to 70B, which most people can't run.
The tin foil hat in me says this is the compromise the powers that be have agreed to: being nominally "open" but practically gimped for the average joe techie. Basically arms control.
So we really need a ~40B model (two cards), or a ~20B one with some room left over for the context window.
5090 has ??G - still unreleased
It's a good model, too.
Livebench and LMSYS are weeks behind and sometimes refuse to add major models. And press releases like this cherry-pick their benchmarks and ignore better models like Qwen2.5.
If it doesn't exist, I'm willing to create it.
"LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models
Comparison and ranking the performance of over 30 AI models (LLMs) across key metrics including quality, price, performance and speed (output speed - tokens per second & latency - TTFT), context window & others. For more details including relating to our methodology, see our FAQs."
In general, you'll get a ton of mileage out of constraining token generation to valid JSON - I've seen models as small as 800M handle JSON with that. It's ~impossible to train constraining into a model with remotely the same reliability -- you'd have to erase a ton of the conversational training that makes it say e.g. "Sure! Here's the JSON you requested:"
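A toy illustration of decode-time constraining (deliberately crude and O(vocab) per step - real implementations like llama.cpp grammars or the outlines library compile the constraint into a token-level automaton instead; `valid_json_prefix` here is a naive stand-in):

    import json
    import torch

    def valid_json_prefix(s: str) -> bool:
        # Naive check: does some plausible completion of s parse as JSON?
        for suffix in ("", '"', '"}', "}", "]", '"]}'):
            try:
                json.loads(s + suffix)
                return True
            except json.JSONDecodeError:
                continue
        return False

    def constrained_step(logits, prefix, decode):
        # Mask every token that would break JSON validity, then pick greedily.
        mask = torch.full_like(logits, float("-inf"))
        for tok_id in range(logits.shape[-1]):
            if valid_json_prefix(prefix + decode(tok_id)):
                mask[tok_id] = 0.0
        return int(torch.argmax(logits + mask))

    toy_vocab = ["}", " banana", '"b": 2}']
    logits = torch.randn(len(toy_vocab))
    next_id = constrained_step(logits, '{"a": 1', lambda i: toy_vocab[i])
    print(toy_vocab[next_id])  # never " banana" - it can't extend valid JSON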
https://www.llama.com/docs/model-cards-and-prompt-formats/ll...
> With text-only inputs, the Llama 3.2 Vision Models can do tool-calling exactly like their Llama 3.1 Text Model counterparts. You can use either the system or user prompts to provide the function definitions.
> Currently the vision models don’t support tool-calling with text+image inputs.
They support it, but not when an image is submitted in the prompt. I'd be curious to see what the model does. Meta typically sets conservative expectations around this type of behavior (e.g., they say that the 3.1 8b model won't do multiple tool calls, but in my experience it does so just fine).
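For reference, the text-only tool-calling flow the model card describes boils down to putting function definitions in the prompt and parsing a JSON reply. An illustrative sketch only - the function schema and prompt wording here are made up, and Meta's docs should be treated as authoritative on the exact format:

    import json

    # Hypothetical function definition placed in the system (or user) prompt.
    system_prompt = """You have access to the following function:
    {"name": "get_weather",
     "description": "Get the current weather for a city",
     "parameters": {"city": {"type": "string", "required": true}}}
    If you decide to call the function, respond ONLY with JSON of the form:
    {"name": <function-name>, "parameters": <arguments>}"""

    # If the model answers with a function call, parse it out:
    model_output = '{"name": "get_weather", "parameters": {"city": "Reykjavik"}}'
    call = json.loads(model_output)
    assert call["name"] == "get_weather"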
Do they require GPU or can they be deployed on VPS with dedicated CPU?
I'd like to see more head-to-head comparisons with community created multi-modal LLMs as done in these papers:
https://arxiv.org/abs/2408.05334
https://arxiv.org/abs/2408.03326
I look forward to reading the technical report, once it's available. I couldn't find a link to one yet.
Meaning it's doing impressively badly compared to other models I've tried at similar sizes (for vision).
But this is relevant because I'm curious how this new model accepts image inputs. Do you paste a base64 image into the prompt?
It feels like these models can start not only providing the text generation backend, but start to replace the infrastructure for the API as well.
Can you input images without something in front of it like openwebui?
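If/when ollama wires these up: its REST API already takes base64 images in an `images` array (that's how its llava support works today), so no UI is strictly needed. A sketch - the model tag is a guess on my part:

    import base64, json, urllib.request

    with open("photo.jpg", "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({
            "model": "llama3.2-vision",   # hypothetical tag
            "prompt": "Describe this image.",
            "images": [img_b64],          # same field llava models use
            "stream": False,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(json.loads(urllib.request.urlopen(req).read())["response"])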
Next I want to try to use Aider with it and see how this would work.
e.g. `ollama run llama3.2:1b-instruct-q4_0`
Literally everything about the quote below is wrong. The problem with these "amazing ground-breaking models" is that you never know which knowledge they actually possess.
--- start quote ---
>>> what is the last version of Ash Framework for Elixir you know

My training data goes up to December 2023, and according to the Ash Framework releases, the last stable version that I'm familiar with is:
* *Ash 2.14*: Released on April 8, 2021.
However, it's worth noting that Ash has been merged into Elixir itself, and most of its features have been integrated into the Elixir language and standard libraries.
In fact, Ash was one of the first frameworks to be included in Elixir, and it played a significant role in shaping the language. Many of Ash's core concepts, such as `defstruct/1` and `defmodule/1`, were later adopted by Elixir itself.
...
Keep in mind that Ash is no longer actively maintained as a separate framework, but its legacy continues to influence Elixir and its community.
--- end quote ---
TL;DR:

* 90B-Vision: 4.3% hallucination rate

* 11B-Vision: 5.5% hallucination rate
> Llama 3.2 Multimodal is not available in your region.
It sounds like they feed the continuous output of an image encoder into a transformer, similar to Transfusion [0]? Does anyone know where to find more details?
Edit:
> Regarding the licensing terms, Llama 3.2 comes with a very similar license to Llama 3.1, with one key difference in the acceptable use policy: any individual domiciled in, or a company with a principal place of business in, the European Union is not being granted the license rights to use multimodal models included in Llama 3.2. [1]
What a bummer.
0. https://www.arxiv.org/abs/2408.11039
1. https://huggingface.co/blog/llama32#llama-32-license-changes...
> To add image input support, we trained a set of adapter weights that integrate the pre-trained image encoder into the pre-trained language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the language model. We trained the adapter on text-image pairs to align the image representations with the language representations. During adapter training, we also updated the parameters of the image encoder, but intentionally did not update the language-model parameters. By doing that, we keep all the text-only capabilities intact, providing developers a drop-in replacement for Llama 3.1 models.
What this crudely means is that they extended the base Llama 3.1 to include image-based weights and inference. You can do that if you freeze the existing weights and add new ones, which are then updated during training runs (adapter training). Then they did SFT and RLHF runs on the composite model (for lack of a better word). This is a little-known technique, and very effective. I just had a paper accepted about a similar technique, and will share a blog post once it's published, if you're interested (though it's not on this scale, and probably not as effective). Side note: this is also why the 11B and 90B param counts are additions on top of the text-only models.
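A minimal PyTorch sketch of that recipe, to make it concrete (illustrative only - the dimensions, module names, and stand-in encoder/LM are made up, not Meta's code):

    import torch
    import torch.nn as nn

    d_model, d_image = 512, 256  # made-up sizes

    class CrossAttnAdapter(nn.Module):
        def __init__(self):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads=8,
                                              kdim=d_image, vdim=d_image,
                                              batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, hidden, image_feats):
            # Text hidden states cross-attend to image features; the residual
            # path preserves text-only behavior when the image adds nothing.
            out, _ = self.attn(hidden, image_feats, image_feats)
            return self.norm(hidden + out)

    language_model = nn.Linear(d_model, d_model)  # stand-in for the frozen LM
    image_encoder = nn.Linear(d_image, d_image)   # stand-in for the ViT encoder

    for p in language_model.parameters():
        p.requires_grad = False  # frozen -> drop-in replacement for 3.1
    for p in image_encoder.parameters():
        p.requires_grad = True   # encoder is updated during adapter training

    adapter = CrossAttnAdapter()  # only these weights (and the encoder) train
    fused = adapter(torch.randn(1, 10, d_model), torch.randn(1, 49, d_image))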
In the Transfusion paper, they use both discrete (text tokens) and continuous (images) signals to train a single transformer. To do this, they use a VAE to create a latent representation of the images (split into patches), which is fed into the transformer in one linear sequence alongside the text tokens - they trained the whole model from scratch (the largest being a 7B model trained on 2T tokens with a 1:1 text:image split). The loss they trained the model on was a combination of the normal language-modeling (LM) loss (cross-entropy on tokens) and the diffusion (DDPM) loss on the images.
There was some prior art on this, but models like Chameleon discretized the images into a token codebook of a certain size - so there were special tokens representing the images. However, this incurred severe information loss, which Transfusion claims to have alleviated by using continuous latent vectors for the images.
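For concreteness, the combined objective described above is just a weighted sum of the two losses - a toy sketch with dummy tensors (the balancing coefficient is a made-up value, not the paper's):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    vocab, seq, latent_dim = 32000, 16, 64
    text_logits = torch.randn(seq, vocab)
    text_targets = torch.randint(0, vocab, (seq,))
    pred_noise = torch.randn(8, latent_dim)  # model's noise estimate per image patch
    true_noise = torch.randn(8, latent_dim)

    lm_loss = F.cross_entropy(text_logits, text_targets)  # discrete text tokens
    ddpm_loss = F.mse_loss(pred_noise, true_noise)        # continuous image latents
    lambda_img = 5.0                                      # made-up weighting
    loss = lm_loss + lambda_img * ddpm_loss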
Training a single set of weights (shared weights) on different modalities seems more interesting looking forward, in particular for emergent phenomena imo.
Some of the authors of the transfusion paper work at meta so I was hoping they trained a larger-scale model. Or released any transfusion-based weights at all.
Anyways, exciting stuff either way.
https://github.com/meta-llama/llama-models/blob/main/models/...
https://github.com/meta-llama/llama-models/blob/main/models/...
> With respect to any multimodal models included in Llama 3.2, the rights granted under Section 1(a) of the Llama 3.2 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models.
Edit: the larger 72B model is not under Apache 2.0 but under its own license: https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct/blob/main/...
Qwen2-VL-72B seems to perform better than llama-3.2-90B on visual tasks.
If there's an algorithmic penalty against the news for whatever reason, that may be a flaw in the HN ranking algorithm.
AIUI exact dupes just get counted as upvotes, which hasn’t happened in my case.
- The 11B and 90B vision models are competitive with leading closed models like Claude 3 Haiku on image understanding tasks, while being open and customizable.
- Llama 3.2 comes with official Llama Stack distributions to simplify deployment across environments (cloud, on-prem, edge), including support for RAG and safety features.
- The lightweight 1B and 3B models are optimized for on-device use cases like summarization and instruction following.
Could someone try giving the 90b model this word search problem [0] and tell me how it performs? So far with every model I've tried, none has ever managed to find a single word correctly.
Anyways, I think there just isn't a lot of non-left-to-right English in the training data. A word search is pretty different from the usual completion, chat, and QA tasks these models are oriented towards; you might be able to get somewhere with fine-tuning though.
> There are two words in this word puzzle: "soup" and "mix". The word "soup" is located in the top row, and the word "mix" is located in the bottom row.

Edit: Tried a bit more probing, like asking it to find "spoon" or any other word. It just makes up a row and column.
Would be interesting to see a model just working on raw input though.
Or Gemini Flash for code completion and generation.