> We have released pre-trained base and instruction-tuned checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.
So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 refuses to show me files unless I login and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.
I still welcome this approach. 7B seems like a dead end in terms of reasoning and generalization. Those models are annoyingly close to statistical parrots, a world away from the moderate reasoning you get in 70B models. Any use case where that's good enough can increasingly be filled by even smaller models, so chasing slightly larger models to get a bit more "intelligence" might be the right move.
It doesn't have to be intelligent like we expect it from the top-tier, huge models, just capable of understanding some words in sentences, mostly commands, and how to react to them.
What's more irritating is that they decided to do quantization aware training for fp8. int8 quantization results in an imperceptible loss of quality that is difficult to pick up in benchmarks. They should have gone for something more aggressive like 4-bit, where quantization leads to a significant loss in quality.
any example use-cases or prompts? how do you define those?
When I run mistral-chat with Nemo-Instruct it crashes in 5 seconds with the error: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU"
This is on Ubuntu 22.04.4 with an NVIDIA GeForce RTX 3060 Ti with 8192MiB. I ran "nvidia-smi -lms 10" to see what it maxed out with, and it last recorded max usage of 7966MiB before the crash.
A 12B model will run on a 4090 with plenty room to spare, even with 8-bit quantisation.
Does anyone have a good answer why everyone went back to SentencePiece in the first place? Byte-pair encoding (which is what tiktoken uses: https://github.com/openai/tiktoken) was shown to be a more efficient encoding as far back as GPT-2 in 2019.
One of the main pulls of the SentencePiece library was the pre-tokenization being less reliant on white space and therefore more adaptable to non Western languages.
Tiktoken is a library which only supports BPE. It has also become synonymous with the tokenizer used by GPT-3, ChatGPT and GPT-4, even though this is actually just a specific tokenizer included in tiktoken.
What Mistral is saying here (in marketing speak) is that they trained a new BPE model on data that is more balanced multilingually than their previous BPE model. It so happens that they trained one with SentencePiece and the other with tiktoken, but that really shouldn't make any difference in tokenization quality or compression efficiency. The switch to tiktoken probably had more to do with latency, or something similar.
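For intuition about what "training a new BPE model" means: BPE training just iteratively merges the most frequent adjacent symbol pair in the corpus, starting from characters. A toy sketch of one merge step (illustrative only, not Mistral's or tiktoken's actual implementation; the corpus is the classic textbook example):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus as {word-as-symbol-tuple: frequency}, starting from characters.
corpus = {("h","u","g"): 10, ("p","u","g"): 5, ("p","u","n"): 12,
          ("b","u","n"): 4, ("h","u","g","s"): 5}
pair = most_frequent_pair(corpus)   # ("u", "g"): appears 20 times
corpus = merge_pair(corpus, pair)   # "hug" becomes ("h", "ug"), etc.
```

Run this for a few thousand iterations on a multilingually balanced corpus and you get a vocabulary whose merges are less skewed toward English, which is the gist of Mistral's claim.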
> Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines.
> *Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.
> The model was trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of NVIDIA AI architecture, including accelerated computing, network fabric and software to increase training efficiency.
But is pumping out models and putting artifacts on HuggingFace a business? What are these models being used for? New ones appear at a decent clip.
If you could run this on say, stock CPU that would increase the use cases dramatically, but if you still need a 4090 I’m either missing something or this is useless.
That's without the context window, so depending on how much context you want to use you'll need some more GB.
That is, assuming you'll be using llama.cpp (which is the standard for consumer inference; Ollama is also llama.cpp, as is kobold).
This thing will run fine on a 16GB card, and a q6 quantization will run fine on a 12GB card.
You'll still get good performance on an 8GB card with offloading, since you'll be running most of it on the gpu anyway.
Supposedly Mistral NeMo is better than Llama-3-8b, which is the more apt comparison, although benchmarks usually don't tell the full story; we'll see how it does on the LMSYS Chatbot Arena leaderboards. The other (huge) advantage of Mistral NeMo over Llama-3-8b is the massive context window: 128k (and supposedly 1MM with RoPE scaling, according to their HF repo), vs 8k.
Also, this was trained with 8bit quantization awareness, so it should handle quantization better than the Llama 3 series in general, which will help more people be able to run it locally. You don't need a 4090.
Maybe they could share a list of the content of their corpus. But that wouldn't be too helpful and makes it much easier for all affected parties to sue them for using their content in model training.
There was some research showing that training a model on facts like "the mother of John Smith is Alice" but in German allowed it to answer questions like "who's the mother of John Smith", but not questions like "what's the name of Alice's child", regardless of language. Not sure if this holds at larger model sizes though, it's the sort of problem that's usually fixable by throwing more parameters at it.
Language models definitely do generalize to some extent and they're not "stochastic parrots" as previously thought, but there are some weird ways in which we expect them to generalize and they don't.
Do you have any good sources that explain this? I always thought LLMs are indeed stochastic parrots, but language (that is, the unified corpus of all languages in the training data) already inherently contains the "generalization". So the intelligence is encoded in the language humans speak.
Performance improved across all benchmarks, including in English (the original language).
https://openreview.net/forum?id=KIPJKST4gw
Is symbolic language a fuzzy sort of code? Absolutely, because it conveys logic and information. TLDR: yes!
I.e., if you train an LLM on both English and French in general, but only teach it a specific fact in French, it can give you that fact in English.
I really don't get Nvidia's thinking with this. They basically have a hardware monopoly. I shelled out the $4,000 or so to buy two of their 4090 GPUs. Why are they still insisting on torturing me with jumping through these awful hoops? They should just be glad that they're winning and embrace freedom.
I wouldn't worry about that if I were them: it's been shown again and again that people will pay for convenience.
What I'd worry about is Amazon/Cloudflare repackaging my model and outcompeting my platform.
Why let Amazon/Cloudflare repackage it?
EDIT: This is a 1W light bulb moment for me, thank you!
Macs are so good at it because Apple solders the memory on top of the SoC for a really wide, low-latency connection.
Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D
I think it should also run well on a 36GB MacBook Pro, or probably a 24GB MacBook Air.
If you're on a Mac, check out LM Studio.
It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
and expose an OpenAI compatible server, or you can use their python bindings
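"OpenAI-compatible" just means the local server speaks the same HTTP chat-completions schema, so any OpenAI client can be pointed at it. A minimal sketch of what such a request looks like using only the standard library (port 1234 is LM Studio's default; the model name is hypothetical, whatever you have loaded):

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default local server

def build_chat_request(model, prompt):
    """Build an OpenAI-style chat-completions request (constructed, not sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("mistral-nemo", "Summarize RoPE scaling in one sentence.")
# With a local server actually running, you'd then do:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```

The same payload works against OpenAI's own endpoint, which is the whole point: swap the base URL and nothing else changes.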
If I were them I'd want to be the default source of the versions of my models that people use, rather than farming that out to whichever third party races to publish the GGUF (and other formats) first.
If it's a new architecture, there's also additional work needed to add support in llama.cpp, which means more dev time, more testing, and potentially losing the surprise of the model release if the development work has to be done out in the open.
Perhaps it's just confirmation bias, but programming really does seem to be the ideal usecase for LLMs in a way that other professions just haven't been able to crack. Compared to other types of work, it's relatively more straightforward to tell if code is "correct" or not.
It's because (for an unknown reason), having coding and software development in the training mix is really helpful at most other tasks. It improves everything to do with logical thinking by a large margin, and that seems to help with many other downstream tasks.
Even if you don't need the programming, you want it in the training mix to get that logical thinking, which is hard to get from other resources.
I don't know how much that is true for legal or financial resources.
Products that build on general LLM tech are already being used in other fields. For example, my lawyer friend has started using one by LexisNexis[0] and is duly impressed by how it works. It's only a matter of time before models like that get increasingly specialized for that kind of work, it's just harder for lawyers to drive that kind of change alone. Plus, there's a lot more resistance in 'legacy' professions to any kind of change, much less one that is perceived to threaten the livelihoods of established professionals.
Current LLMs are already not bad at a lot of things, but lawyer bots, accountant bots and more are likely coming.
[0] https://www.lexisnexis.com/en-us/products/lexis-plus-ai.page
An AI spitting back bad code won't compile. An AI spitting back bad financial/legal advice bankrupts people.
Programming is "weird" in that it requires both specialized knowledge and specialized languages, and the languages are very different from any language that humans speak.
Legal requires specialized knowledge, but legal writing is still just English and it follows English grammar rules, although it's sometimes a very strange "dialect" of English.
Finance is weird in its own way, as that requires a lot more boring, highly-precise calculations, and LLMs are notoriously bad at those. I suspect that finance is always going to be some hybrid of an LLM driving an "old school" computer to do the hard math, via a programming language or some other, yet-unenvisioned protocol.
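That hybrid already exists in rough form as "tool calling": the model emits a structured request and ordinary deterministic code does the arithmetic. A hand-rolled sketch (the tool name and JSON schema here are made up for illustration, not any vendor's actual API):

```python
import json

def run_tool_call(raw):
    """Dispatch a model-emitted JSON tool call to exact, deterministic code."""
    call = json.loads(raw)
    if call["tool"] == "compound_interest":
        a = call["args"]
        # The precise math an LLM should not be trusted to do token-by-token.
        return a["principal"] * (1 + a["rate"] / a["periods"]) ** (a["periods"] * a["years"])
    raise ValueError(f"unknown tool: {call['tool']}")

# Pretend the model produced this instead of guessing the number itself:
model_output = ('{"tool": "compound_interest", "args": '
                '{"principal": 1000, "rate": 0.05, "periods": 12, "years": 10}}')
value = run_tool_call(model_output)  # ~1647.01
```

The LLM's job reduces to choosing the tool and filling in the arguments, which is a language problem it's actually good at.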
> programming really does seem to be the ideal usecase for LLMs in a way that other professions just haven't been able to crack.
This is true, mostly because of programmers' love of textual languages, textual protocols, CLI interfaces and generally all things text. If we were all coding in Scratch, this would be a lot harder.
I remain very sceptical that a chat-like interface is the ideal form for LLMs, yet it seems very optimal for programming specifically, along with Copilot-like interfaces of just outputting text.
Understandably they're all quite secretive about their tooling because they don't want the competition to have access to the same competitive advantages, and an open source model / third party developing a model doesn't really make sense.
The models are trained on a vast set of whatever is available on the internet. They are developed by tech people/programmers who are surprisingly blind to their own biases and interests. There's no surprise that one of the main things they want to try and do is programming, using vast open quantities of Stack Overflow, GitHub and various programming forums.
For finance and legal you need to:
- think a bit outside the box
- be interested in finance and legal
- be prepared to carry actual legal liability for the output of your models
> We first document a significant decline in stock trading volume during ChatGPT outages and find that the effect is stronger for firms with corporate news released immediately before or during the outages. We further document similar declines in the short-run price impact, return variance, and bid-ask spreads, consistent with a reduction in informed trading during the outage periods. Lastly, we use trading volume changes during outages to construct a firm-level measure of the intensity of GAI-assisted trading and provide early evidence of a positive effect of GAI-assisted trading on long-run stock price informativeness.
They're being used, but nobody is really saying anything because the stock market is a zero sum game these days and letting anyone else know that this holds water is a recipe for competition. Programming is about the opposite, the more you give, the more you get, so it makes sense to popularize it as a feature.
Section 230.
It's been argued that a response by an LLM, to user input, is "user-generated content" and hence the platform generally has no liability (except CSAM).
Nobody has successfully sued.
edit: e.g. I wouldn't know the correct parameters for this calculator, but going from 8k window to 128k window goes from 1.5 GB to 23 GB: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...
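The jump comes from the KV cache, which grows linearly with context length. A back-of-the-envelope sketch; the layer/head numbers below are my assumptions about Mistral NeMo's config (40 layers, 8 KV heads via GQA, head dim 128), so treat the result as an estimate:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Keys + values (factor of 2), per layer, per KV head, per token, fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Assumed Mistral NeMo-ish config: 40 layers, 8 KV heads (GQA), head_dim 128.
full  = kv_cache_bytes(40, 8, 128, 128 * 1024) / 2**30  # ~20 GiB at 128k
small = kv_cache_bytes(40, 8, 128, 8 * 1024) / 2**30    # ~1.25 GiB at 8k
```

The calculator's 23 GB figure is a bit higher, presumably because it assumes full multi-head attention rather than GQA (or includes overhead), but the shape of the scaling is the same: 16x the context means 16x the cache.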
This is only realistic right now for people with those unified-memory MacBooks, or for enthusiasts with Epyc servers or a very high-end workstation built for inference.
Anything above that I don't consider "consumer" inference
It seems to me that while the game is very challenging for people it’s not necessarily an indicator of generalization. I can see how it’s useful - but I have trouble seeing how a low score on it would indicate low performance on most tasks.
Thanks and hopefully this isn’t perceived as offensive. Just trying to learn more about it.
edit: I realize you yourself indicate that it's "just one benchmark" - I am more asking about the broader usage I have seen here on HN comments from several people.
The same thing happened with gemma-27b, where they compared it to all the 7-9b models.
It seems like an easy way to boost benchmarks while coming off as "small" at first glance.
open-mistral-7b is 25c/M tokens; open-mistral-nemo-2407 is 30c/M tokens.
"It significantly outperforms existing models smaller or similar in size" is a statement that goes in that direction and would allow comparing a 1.7T param model with a 7b one.
- 3B for CPU inference or running on edge devices.
- 20-30B for maximizing single consumer GPU potential.
- 70B+ for those who can afford it.
7-9B never felt like an ideal size.
From Mistral's page about Tekken:
> Our newest tokenizer, tekken, uses the Byte-Pair Encoding (BPE) with Tiktoken.
Does that mean that Mistral found that BPE is more efficient than unigram models?
Because otherwise, I don't understand why AI companies keep using BPE for their token sets. Unigram methods leads to more legible tokens, fewer glitch tokens, fewer super-long outlier tokens, etc.
Typically this might mean that you simulate an 8bit forward pass to ensure that the model is robust to quantization ‘noise’. You still use FP16/32 for backward pass & weight updates for numerical stability.
It’s just a way to optimize the model in anticipation of future quantization. The experience of using an 8-bit Nemo quant should more closely mirror that of using the full-fat bf16 model compared to if they hadn’t used QAT.
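Concretely, QAT means "fake quantization" in the forward pass: quantize, dequantize, and let training adapt to the resulting noise. A toy sketch using symmetric int8 for clarity (the real training targets FP8 and uses a straight-through estimator so gradients flow through the rounding):

```python
import numpy as np

def fake_quantize_int8(w):
    """Simulate an 8-bit round trip: output has int8 precision but float dtype."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale  # dequantized: this is what the forward pass actually sees

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
w_q = fake_quantize_int8(w)
# Rounding error is bounded by half a quantization step.
max_err = np.abs(w - w_q).max()
```

Because the network sees quantization noise throughout training, its weights settle into configurations that stay accurate after the real quantization is applied at deployment time.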
1) Anyone have any idea of VRAM requirements?
2) When will this be available on ollama?
But 4bit precision is still pretty good, so 6GB VRAM is viable, not counting additional space for context. Usually about an extra 20% is needed, but 128K is a pretty huge context so more will be needed if you need the whole space.
> Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU
Also, are these small models OSS? Easier self-hosting seems to be the main benefit of small models.
The reason to use parameter count is that the final size in GB depends on quantization. A 12B model at 8-bit parameter width would be 12 GB (plus some % overhead), while at 16-bit it would be 24 GB.
Context length here is 128k, which is orthogonal to model size. You'll notice they specify both parameter count and context size, because you need both to characterize an LLM.
It's also interesting to know what parameter width it was trained on because you cannot get more information by "quantizing wider" -- it only makes sense to quantize into a narrower parameter width to save space.
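The size arithmetic as a quick sketch (the 10% overhead fraction is a rough assumption, not a measured figure):

```python
def weights_gb(n_params_billion, bits_per_param, overhead=0.10):
    """Raw weight storage in GB, plus a rough fixed overhead fraction."""
    raw = n_params_billion * bits_per_param / 8  # billions of params -> GB
    return raw * (1 + overhead)

fp16 = weights_gb(12, 16)  # ~26.4 GB
int8 = weights_gb(12, 8)   # ~13.2 GB
int4 = weights_gb(12, 4)   # ~6.6 GB
```

Which is why 16-bit weights alone blow past a 24 GB card, 8-bit fits comfortably, and 4-bit leaves plenty of headroom for context.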
Thanks, I confused those numbers!
There are two different things.
The context window is how many tokens its context can contain. With a big context model you could put a few books and articles into the context and then start your questions; with a small context model you can start a conversation and after a short time it will start forgetting the first prompts. A big context uses more memory and costs performance, but imagine you could give it your entire code project and then ask it questions. Often I know there's already a function somewhere that does something, but I can't remember its name.
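That "forgetting the first prompts" behaviour is usually literal truncation: the client drops the oldest messages to stay under the window. A minimal sketch, approximating token counts by whitespace words (real clients use the model's actual tokenizer):

```python
def trim_history(messages, max_tokens):
    """Keep the most recent messages whose total (approximate) token count fits."""
    kept, total = [], 0
    for msg in reversed(messages):       # walk newest-first
        n = len(msg["content"].split())  # crude stand-in for a real tokenizer
        if total + n > max_tokens:
            break
        kept.append(msg)
        total += n
    return list(reversed(kept))          # restore chronological order

history = [
    {"role": "user", "content": "first question about the project"},
    {"role": "assistant", "content": "a long detailed answer " * 50},
    {"role": "user", "content": "follow up question"},
]
trimmed = trim_history(history, 20)  # only the newest message fits the budget
```

With a 128k window that budget is large enough that a whole codebase can sit in `history` before anything gets dropped.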
From the very first paragraph on the page:
> released under the Apache 2.0 license.