Mistral Integration Improved in Llama.cpp (opens in new tab)

(github.com)

95 pointsdecide10009mo ago15 comments

15 comments

> We are using mistral-common internally for tokenization and want the community to use it to unlock full capacities of our models. As mistral-common is a Python library, we have opened a PR to add a REST API via FastAPI to make it easier for users who are not in the Python ecosystem.

A cpp binary depending on a python server is a bit sad.

I hope this is a stopgap measure and someone port it to C++ eventually:https://github.com/mistralai/mistral-common/blob/main/src/mi...

the_mitsuhiko9mo ago

Isn’t llama.cpp already depending on Python anyways for the templating?

Maxious9mo ago

It uses a cpp implementation of jinja https://github.com/google/minja

hodgehog119mo ago

I appreciate Mistral (and others) releasing their weights for free. But given how llama.cpp underpins a lot of the programs which allow users to run open weight models, it is a little frustrating to have companies which brag about releasing models to the community, leave the community to their own devices to slowly try and actually implement their models.

I hear the reason for this is that llama.cpp keeps breaking basic things, so they have become an unreliable partner. Seems this is what Ollama is trying to address by diluting their connections to llama.cpp and directly contacting companies training these models to have simultaneous releases (e.g. GPT-OSS).

mattnewton9mo ago

There are many different inference libraries and it's not clear which ones a small company like mistral should back yet IMO.

They do release high quality inference code, ie https://github.com/mistralai/mistral-inference

bastawhiz9mo ago

There's more to it, though. The inference code you linked to is Python. Unless my software is Python, I have to ship a CPython binary to run the inference code, then wire it up (or port it, if you're feeling spicy).

Ollama brings value by exposing an API (literally over sockets) with many client SDKs. You don't even need the SDKs to use it effectively. If you're writing Node or PHP or Elixir or Clojurescript or whatever else you enjoy, you're probably covered.

It also means that you can swap models trivially, since you're essentially using the same API for each one. You never need to worry about dependency hell or the issues involved in hosting more than one model at a time.

As far as I know, Ollama is really the only solution that does this. Or at the very least, it's the most mature.

refulgentis9mo ago

The relationship between Ollama and llama.cpp is massively closer than it must seem.

Ollama is llama.cpp with a nice little installer GUI and nice little server binary.

llama.cpp has a server binary as well, however, no nice installer GUI.

The only time recently Ollama had a feature llama.cpp didn't was they patched SWA in with Google, llama.cpp had it a couple weeks later.

Ollama is significantly behind llama.cpp in important areas, ex. the Gemma blog post, they note they'll get on tool calls and multimodal real soon now.

1 more reply

refulgentis9mo ago

Nah, llama.cpp is stable.

llama.cpp also got GPT-OSS early, like Ollama.

There's a lot of extremely subtle politics going on in the link.

Suffice it to say, as a commercial entity, there's a very clever way to put your thumb on the scale of what works and what doesn't without it being obvious to anyone involved, even the thumb.

hodgehog119mo ago

Stable for a power user, or stable for everyone? I don't have links on hand, but I could swear there have been instances where certain models rolled back support during llama.cpp development, and this was recent. Also llama.cpp adds features and support on a near-daily basis, how can this be LTS?

Don't get me wrong, llama.cpp is an amazing tool. But it's development is nowhere near as cautious as something like the Linux kernel, so there is room there for a more stable alternative. Not saying Ollama will do this, but llama.cpp won't be everything to everyone.

refulgentis9mo ago

I'd start by noting all software adds features and code on a near-daily basis. (* modulo weekends and holidays and lack of interest in further development)

I'm not sure comparing to Linux kernel sheds light: what is different? Just Ubuntu/Red Hat LTS type stuff? What does LTS mean in the context of not-support-contracts and not-operating systems?

Steelmaning, I could say we mean....named branches? I guess a branch isn't a necessary condition...named versions?...that get fixes backported, but no new features.

Software where that's a commonly used approach are at least ~3 OOMs larger (i.e. are much more separable in terms of bug fixes vs. features and components) and hard to upgrade, i.e. it's hard for IT to force all N changes on end users since the last time they upgraded Linux machines, just to get a 0 day fix.

Here, it's a FOSS software library that needs to be part of an app to be useful, the consumers of the library are the ones would want to offer LTS.

I'm all ears if you dig up more info on a rollback or similar nasty scandal, but as it stands, I've been involved with it near-daily for 2 years now, including CI tests on every platform you can think of, and I've never, ever, heard of such a thing.

A guiding light here may be that Ollama inference is 99% llama.cpp or its consituents. From there, we notice a contradiction: if thats the case, how can we claim Ollama fulfills these ideas but llama.cpp doesn't? We could wave it away as they have a miraculous nose for what parts of llama.cpp won't fall victim to the issues we're worried about, but...well, here's one of my favorite quotes: "When faced with a contradiction, first, check your premises"

mhitza9mo ago

llama.cpp still doesn't support gpt-oss tool calling. https://github.com/ggml-org/llama.cpp/pull/15158 (among other similar PRs)

But I also couldn't get vllm, or transformers serve, or ollama (400 response on /v1/chat/completions) working today with gpt-oss. OpenAI's cookbooks aren't really copy paste instructions. They probably tested on a single platform with preinstalled python packages which they forgot to mention :))

refulgentis9mo ago

Re: gpt-oss tool calls support, I don't think that's true, I've been using it for days. Then again, I did write my own harmony parser...(Noting for audience as you imply, neither does Ollama. Thing here is you either gotta hope all your users have nicely formed templates in their ggufs (they do not) or sometimes step in to ex. here, note the OpenAI chat completions-alike API llama.cpp provides will output a text response that you'll need to parse into a tool call yourself, until they implement a harmony parser)

electroglyph9mo ago

gpt-oss are still being actively fixed right this moment, and there have already been quite a few fixes.

baggiponte9mo ago

Wow I never realized how much mistral was “disconnected” from the ecosystem

j / k navigate · click thread line to collapse

15 comments

flakiness9mo ago

A cpp binary depending on a python server is a bit sad.

I hope this is a stopgap measure and someone port it to C++ eventually:https://github.com/mistralai/mistral-common/blob/main/src/mi...

the_mitsuhiko9mo ago

Isn’t llama.cpp already depending on Python anyways for the templating?

Maxious9mo ago

It uses a cpp implementation of jinja https://github.com/google/minja

hodgehog119mo ago

mattnewton9mo ago

There are many different inference libraries and it's not clear which ones a small company like mistral should back yet IMO.

They do release high quality inference code, ie https://github.com/mistralai/mistral-inference

bastawhiz9mo ago

As far as I know, Ollama is really the only solution that does this. Or at the very least, it's the most mature.

refulgentis9mo ago

The relationship between Ollama and llama.cpp is massively closer than it must seem.

Ollama is llama.cpp with a nice little installer GUI and nice little server binary.

llama.cpp has a server binary as well, however, no nice installer GUI.

The only time recently Ollama had a feature llama.cpp didn't was they patched SWA in with Google, llama.cpp had it a couple weeks later.

Ollama is significantly behind llama.cpp in important areas, ex. the Gemma blog post, they note they'll get on tool calls and multimodal real soon now.

1 more reply

refulgentis9mo ago

Nah, llama.cpp is stable.

llama.cpp also got GPT-OSS early, like Ollama.

There's a lot of extremely subtle politics going on in the link.

Suffice it to say, as a commercial entity, there's a very clever way to put your thumb on the scale of what works and what doesn't without it being obvious to anyone involved, even the thumb.

hodgehog119mo ago

refulgentis9mo ago

I'd start by noting all software adds features and code on a near-daily basis. (* modulo weekends and holidays and lack of interest in further development)

I'm not sure comparing to Linux kernel sheds light: what is different? Just Ubuntu/Red Hat LTS type stuff? What does LTS mean in the context of not-support-contracts and not-operating systems?

Steelmaning, I could say we mean....named branches? I guess a branch isn't a necessary condition...named versions?...that get fixes backported, but no new features.

Here, it's a FOSS software library that needs to be part of an app to be useful, the consumers of the library are the ones would want to offer LTS.

mhitza9mo ago

llama.cpp still doesn't support gpt-oss tool calling. https://github.com/ggml-org/llama.cpp/pull/15158 (among other similar PRs)

refulgentis9mo ago

electroglyph9mo ago

gpt-oss are still being actively fixed right this moment, and there have already been quite a few fixes.

baggiponte9mo ago

Wow I never realized how much mistral was “disconnected” from the ecosystem

j / k navigate · click thread line to collapse