> We have released pre-trained base and instruction-tuned checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.
So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 refuses to show me files unless I login and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.
I still welcome this approach. 7B seems like a dead end in terms of reasoning and generalization. Those models are annoyingly close to statistical parrots, a world away from the moderate reasoning you get in 70B models. Any use case where that's good enough can increasingly be filled by even smaller models, so chasing slightly larger models to get a bit more "intelligence" might be the right move.
It doesn't have to be intelligent like we expect it from the top-tier, huge models, just capable of understanding some words in sentences, mostly commands, and how to react to them.
What's more irritating is that they decided to do quantization aware training for fp8. int8 quantization results in an imperceptible loss of quality that is difficult to pick up in benchmarks. They should have gone for something more aggressive like 4-bit, where quantization leads to a significant loss in quality.
any example use-cases or prompts? how do you define those?
When I run mistral-chat with Nemo-Instruct it crashes in 5 seconds with the error: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU"
This is on Ubuntu 22.04.4 with an NVIDIA GeForce RTX 3060 Ti with 8192MiB. I ran "nvidia-smi -lms 10" to see what it maxed out with, and it last recorded max usage of 7966MiB before the crash.
A 12B model will run on a 4090 with plenty room to spare, even with 8-bit quantisation.
Does anyone have a good answer why everyone went back to SentencePiece in the first place? Byte-pair encoding (which is what tiktoken uses: https://github.com/openai/tiktoken) was shown to be a more efficient encoding as far back as GPT-2 in 2019.
One of the main pulls of the SentencePiece library was the pre-tokenization being less reliant on white space and therefore more adaptable to non Western languages.
Tiktoken is a library which only supports BPE. It has also become synonymous with the tokenizer used by GPT-3, ChatGPT and GPT-4, even though this is actually just a specific tokenizer included in tiktoken.
What Mistral is saying here (in marketing speak) is that they trained a new BPE model on data that is more balanced multilingually than their previous BPE model. It so happens that they trained one with SentencePiece and the other with tiktoken, but that really shouldn't make any difference in tokenization quality or compression efficiency. The switch to tiktoken probably had more to do with latency, or something similar.
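For intuition about what "training a new BPE model" means: BPE training just iteratively merges the most frequent adjacent symbol pair in the corpus, starting from characters. A toy sketch of one merge step (illustrative only, not Mistral's or tiktoken's actual implementation; the corpus is the classic textbook example):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus as {word-as-symbol-tuple: frequency}, starting from characters.
corpus = {("h","u","g"): 10, ("p","u","g"): 5, ("p","u","n"): 12,
          ("b","u","n"): 4, ("h","u","g","s"): 5}
pair = most_frequent_pair(corpus)   # ("u", "g"): appears 20 times
corpus = merge_pair(corpus, pair)   # "hug" becomes ("h", "ug"), etc.
```

Run this for a few thousand iterations on a multilingually balanced corpus and you get a vocabulary whose merges are less skewed toward English, which is the gist of Mistral's claim.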
> Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines.
> *Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.
> The model was trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of NVIDIA AI architecture, including accelerated computing, network fabric and software to increase training efficiency.
But is pumping out models and putting artifacts on HuggingFace a business? What are these models being used for? New ones appear at a decent clip.
If you could run this on say, stock CPU that would increase the use cases dramatically, but if you still need a 4090 I’m either missing something or this is useless.
That's without the context window, so depending on how much context you want to use you'll need some more GB.
That is, assuming you'll be using llama.cpp (which is the standard for consumer inference; Ollama is also llama.cpp, as is kobold).
This thing will run fine on a 16GB card, and a q6 quantization will run fine on a 12GB card.
You'll still get good performance on an 8GB card with offloading, since you'll be running most of it on the gpu anyway.
Supposedly Mistral NeMo is better than Llama-3-8b, which is the more apt comparison, although benchmarks usually don't tell the full story; we'll see how it does on the LMSYS Chatbot Arena leaderboards. The other (huge) advantage of Mistral NeMo over Llama-3-8b is the massive context window: 128k (and supposedly 1MM with RoPE scaling, according to their HF repo), vs 8k.
Also, this was trained with 8bit quantization awareness, so it should handle quantization better than the Llama 3 series in general, which will help more people be able to run it locally. You don't need a 4090.
Maybe they could share a list of the content of their corpus. But that wouldn't be too helpful and makes it much easier for all affected parties to sue them for using their content in model training.
There was some research showing that training a model on facts like "the mother of John Smith is Alice" but in German allowed it to answer questions like "who's the mother of John Smith", but not questions like "what's the name of Alice's child", regardless of language. Not sure if this holds at larger model sizes though, it's the sort of problem that's usually fixable by throwing more parameters at it.
Language models definitely do generalize to some extent and they're not "stochastic parrots" as previously thought, but there are some weird ways in which we expect them to generalize and they don't.
Do you have any good sources that explain this? I always thought LLMs are indeed stochastic parrots, but language (that is, the unified corpus of all languages in the training data) already inherently contains the "generalization". So the intelligence is encoded in the language humans speak.
Performance improved across all benchmarks, including in English (the original language).
https://openreview.net/forum?id=KIPJKST4gw
Is symbolic language a fuzzy sort of code? Absolutely, because it conveys logic and information. TLDR: yes!
I.e., if you train an LLM on both English and French in general, but only teach it a specific fact in French, it can give you that fact in English.
I really don't get Nvidia's thinking with this. They basically have a hardware monopoly. I shelled out the $4,000 or so to buy two of their 4090 GPUs. Why are they still insisting on torturing me with jumping through these awful hoops? They should just be glad that they're winning and embrace freedom.
I wouldn't worry about that if I were them: it's been shown again and again that people will pay for convenience.
What I'd worry about is Amazon/Cloudflare repackaging my model and outcompeting my platform.
Why let Amazon/Cloudflare repackage it?
EDIT: This is a 1W light bulb moment for me, thank you!
Macs are so good at it because Apple solders the memory on top of the SoC for a really wide, low-latency connection.
Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D
I think it should also run well on a 36GB MacBook Pro, or probably a 24GB MacBook Air.
If you're on a Mac, check out LM Studio.
It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
and expose an OpenAI compatible server, or you can use their python bindings
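"OpenAI-compatible" just means the local server speaks the same HTTP chat-completions schema, so any OpenAI client can be pointed at it. A minimal sketch of what such a request looks like using only the standard library (port 1234 is LM Studio's default; the model name is hypothetical, whatever you have loaded):

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default local server

def build_chat_request(model, prompt):
    """Build an OpenAI-style chat-completions request (constructed, not sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("mistral-nemo", "Summarize RoPE scaling in one sentence.")
# With a local server actually running, you'd then do:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```

The same payload works against OpenAI's own endpoint, which is the whole point: swap the base URL and nothing else changes.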
If I were them I'd want to be the default source of the versions of my models that people use, rather than farming that out to whichever third party races to publish the GGUF (and other formats) first.
If it's a new architecture, there's also additional work needed to add support in llama.cpp, which means more dev time, more testing, and potentially losing the surprise of the model release if the development work has to be done out in the open.
Perhaps it's just confirmation bias, but programming really does seem to be the ideal usecase for LLMs in a way that other professions just haven't been able to crack. Compared to other types of work, it's relatively more straightforward to tell if code is "correct" or not.
It's because (for an unknown reason), having coding and software development in the training mix is really helpful at most other tasks. It improves everything to do with logical thinking by a large margin, and that seems to help with many other downstream tasks.
Even if you don't need the programming, you want it in the training mix to get that logical thinking, which is hard to get from other resources.
I don't know how much that is true for legal or financial resources.
Products that build on general LLM tech are already being used in other fields. For example, my lawyer friend has started using one by LexisNexis[0] and is duly impressed by how it works. It's only a matter of time before models like that get increasingly specialized for that kind of work, it's just harder for lawyers to drive that kind of change alone. Plus, there's a lot more resistance in 'legacy' professions to any kind of change, much less one that is perceived to threaten the livelihoods of established professionals.
Current LLMs are already not bad at a lot of things, but lawyer bots, accountant bots and more are likely coming.
[0] https://www.lexisnexis.com/en-us/products/lexis-plus-ai.page
An AI spitting back bad code won't compile. An AI spitting back bad financial/legal advice bankrupts people.
Programming is "weird" in that it requires both specialized knowledge and specialized languages, and the languages are very different from any language that humans speak.
Legal requires specialized knowledge, but legal writing is still just English and it follows English grammar rules, although it's sometimes a very strange "dialect" of English.
Finance is weird in its own way, as that requires a lot more boring, highly-precise calculations, and LLMs are notoriously bad at those. I suspect that finance is always going to be some hybrid of an LLM driving an "old school" computer to do the hard math, via a programming language or some other, yet-unenvisioned protocol.
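That hybrid already exists in rough form as "tool calling": the model emits a structured request and ordinary deterministic code does the arithmetic. A hand-rolled sketch (the tool name and JSON schema here are made up for illustration, not any vendor's actual API):

```python
import json

def run_tool_call(raw):
    """Dispatch a model-emitted JSON tool call to exact, deterministic code."""
    call = json.loads(raw)
    if call["tool"] == "compound_interest":
        a = call["args"]
        # The precise math an LLM should not be trusted to do token-by-token.
        return a["principal"] * (1 + a["rate"] / a["periods"]) ** (a["periods"] * a["years"])
    raise ValueError(f"unknown tool: {call['tool']}")

# Pretend the model produced this instead of guessing the number itself:
model_output = ('{"tool": "compound_interest", "args": '
                '{"principal": 1000, "rate": 0.05, "periods": 12, "years": 10}}')
value = run_tool_call(model_output)  # ~1647.01
```

The LLM's job reduces to choosing the tool and filling in the arguments, which is a language problem it's actually good at.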
> programming really does seem to be the ideal usecase for LLMs in a way that other professions just haven't been able to crack.
This is true, mostly because of programmers' love of textual languages, textual protocols, CLI interfaces and generally all things text. If we were all coding in Scratch, this would be a lot harder.
I remain very sceptical that a chat-like interface is the ideal form for LLMs, yet it seems very optimal for programming specifically, along with Copilot-like interfaces of just outputting text.
Understandably they're all quite secretive about their tooling because they don't want the competition to have access to the same competitive advantages, and an open source model / third party developing a model doesn't really make sense.
The models are trained on a vast set of whatever is available on the internet. They are developed by tech people/programmers who are surprisingly blind to their own biases and interests. There's no surprise that one of the main things they want to try and do is programming, using vast open quantities of Stack Overflow, GitHub and various programming forums.
For finance and legal you need to:
- think a bit outside the box
- be interested in finance and legal
- be prepared to carry actual legal liability for the output of your models
> We first document a significant decline in stock trading volume during ChatGPT outages and find that the effect is stronger for firms with corporate news released immediately before or during the outages. We further document similar declines in the short-run price impact, return variance, and bid-ask spreads, consistent with a reduction in informed trading during the outage periods. Lastly, we use trading volume changes during outages to construct a firm-level measure of the intensity of GAI-assisted trading and provide early evidence of a positive effect of GAI-assisted trading on long-run stock price informativeness.
They're being used, but nobody is really saying anything because the stock market is a zero sum game these days and letting anyone else know that this holds water is a recipe for competition. Programming is about the opposite, the more you give, the more you get, so it makes sense to popularize it as a feature.
Section 230.
It's been argued that a response by an LLM, to user input, is "user-generated content" and hence the platform generally has no liability (except CSAM).
Nobody has successfully sued.
edit: e.g. I wouldn't know the correct parameters for this calculator, but going from 8k window to 128k window goes from 1.5 GB to 23 GB: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...
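The jump comes from the KV cache, which grows linearly with context length. A back-of-the-envelope sketch; the layer/head numbers below are my assumptions about Mistral NeMo's config (40 layers, 8 KV heads via GQA, head dim 128), so treat the result as an estimate:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Keys + values (factor of 2), per layer, per KV head, per token, fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Assumed Mistral NeMo-ish config: 40 layers, 8 KV heads (GQA), head_dim 128.
full  = kv_cache_bytes(40, 8, 128, 128 * 1024) / 2**30  # ~20 GiB at 128k
small = kv_cache_bytes(40, 8, 128, 8 * 1024) / 2**30    # ~1.25 GiB at 8k
```

The calculator's 23 GB figure is a bit higher, presumably because it assumes full multi-head attention rather than GQA (or includes overhead), but the shape of the scaling is the same: 16x the context means 16x the cache.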
This is only realistic right now for people with those unified-memory MacBooks, or for enthusiasts with Epyc servers or a very high-end workstation built for inference.
Anything above that I don't consider "consumer" inference
It seems to me that while the game is very challenging for people it’s not necessarily an indicator of generalization. I can see how it’s useful - but I have trouble seeing how a low score on it would indicate low performance on most tasks.
Thanks and hopefully this isn’t perceived as offensive. Just trying to learn more about it.
edit: I realize you yourself indicate that it's "just one benchmark" - I am more asking about the broader usage I have seen here on HN comments from several people.
The same thing happened with gemma-27b, where they compared it to all the 7-9b models.
It seems like an easy way to boost benchmarks while coming off as "small" at first glance.
open-mistral-7b is 25c/M tokens; open-mistral-nemo-2407 is 30c/M tokens.
"It significantly outperforms existing models smaller or similar in size" is a statement that goes in that direction and would allow comparing a 1.7T param model with a 7b one.
- 3B for CPU inference or running on edge devices.
- 20-30B for maximizing single consumer GPU potential.
- 70B+ for those who can afford it.
7-9B never felt like an ideal size.
From Mistral's page about Tekken:
> Our newest tokenizer, tekken, uses the Byte-Pair Encoding (BPE) with Tiktoken.
Does that mean that Mistral found that BPE is more efficient than unigram models?
Because otherwise, I don't understand why AI companies keep using BPE for their token sets. Unigram methods leads to more legible tokens, fewer glitch tokens, fewer super-long outlier tokens, etc.
Typically this might mean that you simulate an 8bit forward pass to ensure that the model is robust to quantization ‘noise’. You still use FP16/32 for backward pass & weight updates for numerical stability.
It’s just a way to optimize the model in anticipation of future quantization. The experience of using an 8-bit Nemo quant should more closely mirror that of using the full-fat bf16 model compared to if they hadn’t used QAT.
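Concretely, QAT means "fake quantization" in the forward pass: quantize, dequantize, and let training adapt to the resulting noise. A toy sketch using symmetric int8 for clarity (the real training targets FP8 and uses a straight-through estimator so gradients flow through the rounding):

```python
import numpy as np

def fake_quantize_int8(w):
    """Simulate an 8-bit round trip: output has int8 precision but float dtype."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale  # dequantized: this is what the forward pass actually sees

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
w_q = fake_quantize_int8(w)
# Rounding error is bounded by half a quantization step.
max_err = np.abs(w - w_q).max()
```

Because the network sees quantization noise throughout training, its weights settle into configurations that stay accurate after the real quantization is applied at deployment time.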
1) Anyone have any idea of VRAM requirements?
2) When will this be available on ollama?
But 4bit precision is still pretty good, so 6GB VRAM is viable, not counting additional space for context. Usually about an extra 20% is needed, but 128K is a pretty huge context so more will be needed if you need the whole space.
> Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU
Also, are these small models OSS? Easier self-hosting seems to be the main benefit of small models.
The reason to use parameter count is that the final size in GB depends on quantization. A 12B model at 8-bit parameter width would be 12 GB (plus some % overhead), while at 16-bit it would be 24 GB.
Context length here is 128k, which is orthogonal to model size. You'll notice they specify both parameter count and context size, because you need both to characterize an LLM.
It's also interesting to know what parameter width it was trained on because you cannot get more information by "quantizing wider" -- it only makes sense to quantize into a narrower parameter width to save space.
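The size arithmetic as a quick sketch (the 10% overhead fraction is a rough assumption, not a measured figure):

```python
def weights_gb(n_params_billion, bits_per_param, overhead=0.10):
    """Raw weight storage in GB, plus a rough fixed overhead fraction."""
    raw = n_params_billion * bits_per_param / 8  # billions of params -> GB
    return raw * (1 + overhead)

fp16 = weights_gb(12, 16)  # ~26.4 GB
int8 = weights_gb(12, 8)   # ~13.2 GB
int4 = weights_gb(12, 4)   # ~6.6 GB
```

Which is why 16-bit weights alone blow past a 24 GB card, 8-bit fits comfortably, and 4-bit leaves plenty of headroom for context.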
Thanks, I confused those numbers!
There are two different things.
The context window is how many tokens its context can contain. With a big context model you could put a few books and articles into the context and then start your questions; with a small context model you can start a conversation and after a short time it will start forgetting the first prompts. A big context uses more memory and costs performance, but imagine you could give it your entire code project and then ask it questions. Often I know there's already a function somewhere that does something, but I can't remember its name.
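That "forgetting the first prompts" behaviour is usually literal truncation: the client drops the oldest messages to stay under the window. A minimal sketch, approximating token counts by whitespace words (real clients use the model's actual tokenizer):

```python
def trim_history(messages, max_tokens):
    """Keep the most recent messages whose total (approximate) token count fits."""
    kept, total = [], 0
    for msg in reversed(messages):       # walk newest-first
        n = len(msg["content"].split())  # crude stand-in for a real tokenizer
        if total + n > max_tokens:
            break
        kept.append(msg)
        total += n
    return list(reversed(kept))          # restore chronological order

history = [
    {"role": "user", "content": "first question about the project"},
    {"role": "assistant", "content": "a long detailed answer " * 50},
    {"role": "user", "content": "follow up question"},
]
trimmed = trim_history(history, 20)  # only the newest message fits the budget
```

With a 128k window that budget is large enough that a whole codebase can sit in `history` before anything gets dropped.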
From the very first paragraph on the page:
> released under the Apache 2.0 license.