What's in a GGUF, besides the weights – and what's still missing?

(nobodywho.ooo)

195 points · bashbjorn · 12d ago · 58 comments

Philpax 12d ago

I regret that the projection models ended up separate, and I too would have preferred for them to be in a single file. I'm not entirely sure why that ended up happening, but it very much runs counter to the single-file ethos I had in mind when I designed GGUF.

Hoping that someone will shepherd the cause of merging the two; I think I'm too out of the loop to do it this time around :-)

intothemild 12d ago

Well, MTP support is being developed right now, and there was a conversation in that work that threw around the idea of separating the MTP model out of the main GGUF, as with mmproj. This was rejected.

Which I'm happy about. So given that decision, I don't think it's unreasonable to think that they might be open to including mmproj files in the GGUF.

The only issue I can think of is: which one? BF16, F16, etc.?

Philpax 12d ago

Quantiser's choice, IMO. They're best-placed to decide what compromise to make for their particular model.

xoxos 6d ago

hi, first post. 56 year old world's foremost procedural audio programmer xoxos vst, wrote the world's first procedural lyrical engine in 1994.

just about every field has documented cancellation of my egalitarian work for the apron brethren. it would be nice if this species could explain how to use a text to image without leaving which mess of 30g worth of downloads to try.

please, just someone SAY what things i need just once simply without going "you need 5G 5G 5G 5G 5G 5G"

your species doesn't work since the Emm Kay heterodyning from orbit. since rely, natural kinda ears to west papua FOR A REASON

woctordho 9d ago

Currently not many people finetune mmproj, so mmproj is reusable. The mmproj for Qwen 3.6 27B can be reused on all its finetunes. The MTP model, by contrast, usually needs to be finetuned together with the main model to get the best performance, which is being studied in Heretic.

uyzstvqs 12d ago

GGML & GGUF have been extremely important to the open-source ML/AI space. Projects like llama.cpp, whisper.cpp, and stable-diffusion.cpp tend to just work perfectly, across a whole bunch of different platforms and hardware backends.

doublerabbit 12d ago

While llama.cpp is a Meta creation, and I loathe Meta with a passion, I do admit it's the easiest of the bunch. Compile it, give it a brain, run. And you get a web UI and an API.

packetlost 12d ago

llama.cpp doesn't really have much to do with Meta, other than that it was originally developed for the first Llama model released by Meta. The creator doesn't work for Meta, and didn't when it was written.

amelius 12d ago

> <|turn>user Hi there!<turn|><|turn>model Hi there, how can I help you today <turn|>

Good lord, they managed to invent a format that is even less readable than XML.

aktuel 12d ago

It is not supposed to be readable by humans. You rarely have to look at it. It is designed to not get confused with the actual content, where the content can be any random text from the internet. For that, you have to use a format that is not used anywhere else.

stavros 12d ago

Are these markers actual text? Or does the model "see" one token per marker?

woctordho 9d ago

Sadly, they need to be writable by humans. That's why we see the Unsloth guys fix bugs in chat templates again and again.

rexthonyy 11d ago

You're right. It does seem like a suboptimal format in terms of memory efficiency.

nixon_why69 11d ago

The tokens all have int IDs, this is just how they're rendered.
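To make that concrete, here is a toy sketch. It is not any real model's tokenizer: the two marker IDs and the one-ID-per-byte fallback for ordinary text are made up for illustration. The only point is that each marker costs a single integer, however long it looks when rendered.

```python
import re

# Hypothetical vocabulary: each control marker is one reserved token ID.
SPECIAL_TOKENS = {"<|turn>": 1, "<turn|>": 2}
ID_TO_SPECIAL = {v: k for k, v in SPECIAL_TOKENS.items()}
BYTE_OFFSET = 3  # in this toy, ordinary text becomes one ID per UTF-8 byte

# Split on the markers while keeping them in the result (capture group).
_SPLIT = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def encode(text):
    """Turn rendered text into a list of integer token IDs."""
    ids = []
    for piece in _SPLIT.split(text):
        if piece in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[piece])  # whole marker -> one ID
        else:
            ids.extend(b + BYTE_OFFSET for b in piece.encode("utf-8"))
    return ids


def decode(ids):
    """Render token IDs back into text, markers included."""
    chunks, buf = [], bytearray()
    for i in ids:
        if i in ID_TO_SPECIAL:
            chunks.append(buf.decode("utf-8"))
            buf = bytearray()
            chunks.append(ID_TO_SPECIAL[i])
        else:
            buf.append(i - BYTE_OFFSET)
    chunks.append(buf.decode("utf-8"))
    return "".join(chunks)


ids = encode("<|turn>user Hi<turn|>")
print(ids[0], ids[-1], len(ids))
print(decode(ids))
```

A real runtime would typically also refuse to parse markers out of untrusted user content, so that typing the marker text in a message does not produce the control token.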

halyconWays 12d ago

Fun lore, GGUFs were once called GGJTs until I caught the "JT" (Justine Tunney) stealing the memory map code from a user who did 99% of the work in a draft PR (slaren) and lying about it, and misrepresenting or not understanding how memory map worked. She wanted her initials in the file format for bragging rights because it was claimed that it caused 90% memory reduction (actually it was just lazy loading into memory). Gerganov was quite angry when he found out what happened. Jart (JT) was then banned from the llama.cpp repo but managed to get back in a year or so later.

jart 10d ago

Have you ever read my side of the story? https://justine.lol/dox/4chan.txt

halyconWays 9d ago

I recall reading it and you mischaracterize and conflate the core issues with accusations of hate and mean words, which is an inappropriate deflection and an attempt to control the narrative. The core issues were always plagiarism, misrepresentation of another user's work, refusal to give proper credit to the real author, and you bragging about it when you thought you could get away with it with quotes like "great artists steal." You never took accountability nor grasped the seriousness of what you attempted to do. You don't seem to understand WHY you faced blowback. You never expressed regret that the person you were victimizing was deeply depressed and pushed moreso by your betrayal. It's always "me, me, me." Your behavior and response is toxic.

theapadayo 12d ago

IMO the biggest thing still missing is an actual way to define the model architecture outside of being hard-coded into the current build. It doesn't need 1:1 performance parity with the fully supported models. Having proper, vendor-validated support on day 1 is the difference between people thinking a model is amazing and thinking it's horrible. See the recent Gemma vs Qwen releases.

Not sure what the solution is, other than writing a DSL to describe the model graphs which you then embed in the GGUF. The other fallback is to just read the PyTorch modules from the official model releases and convert that to GGML ops somehow.

Philpax 12d ago

Yeah, I intentionally left space for the computation graph to be included in the GGUF spec in the hopes that this would be picked up by someone. I would have loved to have it in the first version, but I was prioritising getting the MVP spec out and implemented.

I'd still love to see this, but it would need a cheerleader very familiar with the current state of the GGML IR.

LoganDark 12d ago

I feel like the computation graph could be embedded into the weights similarly to how ONNX works. Then you expose some common interfaces that accept some common parameters, and additional custom ones can practically be extensions, sort of like how Wayland works. So you can support not only transformer-ish models like LLaMa, but also RNN-ish models like RWKV and also multimodal models and more. Not sure how this would be implemented in practice but it sounds like a cool idea. I just worry that if the computation graph is baked into the model file, then improvements to the architecture or optimizations that don't require changes to the weights won't be applied to existing files without a conversion.

Sharlin 12d ago

> The really neat thing about GGUF is that it's just one file. Compare this to a typical safetensors repo on huggingface, where there's a pile of necessary JSON files scattered around [...]

Funny, to me AI models have "always" been single files, as that's what has been the norm in the local image gen business. Safetensors files allow stuffing all kinds of stuff inside them too, no GGUF needed for that. Though given that the text encoders of modern models are multi-gigabyte language models themselves, nobody includes redundant copies of those in every checkpoint.

Philpax 12d ago

Single-file deployments were an intentional design goal on my part. While most image models were/are single-file, LLM safetensors (at least at the time) were not, and I wanted to ensure that we enforced that at a structural level. I also didn't want to mandate a JSON reader for executors (e.g. llama.cpp), which the ST approach would have required. The bigger issue at the time, if I recall, was that ST couldn't support the new-and-upcoming quants that GGML had, and having our own file format offered us flexibility that ST couldn't.

embedding-shape 11d ago

> to me AI models have "always" been single files, as that's what has been the norm in the local image gen business

That doesn't even make sense in the "local image gen business". You don't use a single weights file; you need a bunch of encoders/decoders and whatnot to actually be able to run the architecture with the weights.

Maybe the tooling you use hides those things from you, but they're still there under the surface.

badsectoracula 12d ago

> not to be confused with the somewhat baffling llama_chat_apply_template exposed in the libllama API, which hardcodes a handful of chat formats directly in C++

As someone who is tinkering with a desktop-based inference app in FLTK[0], i wish this used the actual Jinja2 template parser llama.cpp uses (or there was another C function that did that since AFAICT for "proper" parsing you need to be able to pass a bunch of data to the template so it knows if you, e.g., do tool calling). Currently i'm using this adhocky function, but i guess i'll either write a Jinja2 interpreter or copy/paste the one from llama.cpp's code (depending on how i feel at the time :-P).

But yeah, GGUF's "all-in-one" approach is very convenient. And i agree that it feels odd to have the projection models as separate files - i remember when i first downloaded a vision-capable model, i just grabbed whatever GGUF looked appropriate, then llama.cpp told me it couldn't do model and it took me a bit to realize that i had to download an extra file. Literally my thought once i did was "wasn't GGUF supposed to contain everything?" :-P

[0] https://i.imgur.com/GiTBE1j.png

bitwize 12d ago

Oh my God I freaking love your app. The 90s Linux desktop vibes hit like a hammer. FLTK FTW!

prashantk_ 11d ago

I have always used the safetensors + metadata files format (similar to a Hugging Face repo). It is not a major pain point by any means, but it's good that GGUF has a compact format and good support.

ge96 12d ago

Nice, I recently pulled down TheBloke's Mistral 7B to try out; I have a 4070.

bashbjorn (OP) 12d ago

I love mistral, but that model is... not the best. Maybe try out Gemma 4 e4b, it's a similar size to Mistral 7B, and should run great on your 4070 ("E4B" is slightly misleading naming).

ge96 12d ago

Thanks for the tip, what do you use Gemma 4 e4b for?

mixtureoftakes 12d ago

Mistral 7B is quite outdated. On a 12 GB 4070 you can run Qwen 3.5 9B at Q4_K_M or Qwen 3.6 35B; the latter will be a lot smarter but also a lot slower due to RAM offload.

Try both in LM Studio; they really are surprisingly capable.
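For a rough sense of why one fits on the card and the other spills into system RAM, here is a back-of-the-envelope sketch. The figures are assumptions, not measurements: about 4.8 bits per weight as an average for a 4-bit K-quant, and a flat 1.5 GB allowance for KV cache and buffers. Real files and real context lengths vary.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_vram(n_params, bits_per_weight, vram_gb, overhead_gb=1.5):
    """True if the weights plus a rough allowance for KV cache and buffers fit."""
    return weight_size_gb(n_params, bits_per_weight) + overhead_gb <= vram_gb


BPW = 4.8  # assumed average bits per weight for a 4-bit K-quant
for name, params in [("9B", 9e9), ("35B", 35e9)]:
    size = round(weight_size_gb(params, BPW), 1)
    print(name, size, "GB of weights, fits in 12 GB:", fits_in_vram(params, BPW, 12))
```

Under these assumptions the 9B model leaves plenty of headroom on a 12 GB card, while the 35B weights alone are well past 12 GB, so part of the model has to sit in system RAM.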

ge96 12d ago

I have 80 GB of RAM, but it's slow: either it's capped by the i9 CPU or the specific ASUS mobo sucks. I think it runs at only 2400 MHz despite being DDR4.

Tried all the stuff: BIOS, volting.

ganelonhb 12d ago

I have a 2070 and can confirm it works amazingly fast.

I love TheBloke; I wish he still made stuff.

bashbjorn (OP) 12d ago

Yeah, the TheBloke era of local LLMs was a good time. TBF Unsloth are doing a fantastic job of publishing quants of the major models quickly - they just don't have nearly the volume of "weird" models that TheBloke did.

ge96 12d ago

What do you use it for? I'm still trying to use agents, I barely use copilot, only at work when I have to.

I didn't want to get personal with an LLM unless it was local so that's why I was setting this up but yeah. So far just research is what I was looking at.

paradox460 11d ago

A lot of the same spirit lives on in TheDrummer

They're mostly aimed at role play and sillytavern, but they're still generally good models, with lots of quants available

sbinnee 11d ago

Thanks, I learned something more about GGUF by seeing what's not there yet. Tool calling format makes so much sense. It's going to be a milestone transitioning from LLMs to agents.

monocasa 12d ago

I mean, one of the big issues I've had is that it doesn't really store the compute graph. It only stores a string naming the foundational architecture, along with parameter metadata to allow you to rebuild the compute graph.

That means that every foundational model architecture requires new code in whatever is consuming the gguf to support that model.
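A minimal sketch of what that looks like from the consumer's side. It assumes the little-endian GGUF v3 header layout (magic, version, tensor count, metadata count, then key/value pairs) and only handles string and uint32 values. The toy header, the `SUPPORTED` table, and `build_graph` are hypothetical stand-ins for illustration, not llama.cpp code.

```python
import io
import struct

GGUF_TYPE_UINT32 = 4
GGUF_TYPE_STRING = 8


def _read_string(f):
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def read_metadata(f):
    """Read the header and metadata; only uint32 and string values are handled."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    meta = {}
    for _ in range(n_kv):
        key = _read_string(f)
        (vtype,) = struct.unpack("<I", f.read(4))
        if vtype == GGUF_TYPE_STRING:
            meta[key] = _read_string(f)
        elif vtype == GGUF_TYPE_UINT32:
            (meta[key],) = struct.unpack("<I", f.read(4))
        else:
            raise NotImplementedError(f"value type {vtype} not handled in this sketch")
    return version, n_tensors, meta


def _pack_string(s):
    raw = s.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


def make_toy_header():
    """Build a tiny in-memory header with two metadata entries and no tensors."""
    kvs = (
        _pack_string("general.architecture")
        + struct.pack("<I", GGUF_TYPE_STRING) + _pack_string("llama")
        + _pack_string("llama.block_count")
        + struct.pack("<I", GGUF_TYPE_UINT32) + struct.pack("<I", 32)
    )
    return b"GGUF" + struct.pack("<IQQ", 3, 0, 2) + kvs


# Hypothetical: the consumer needs hand-written code per architecture string.
SUPPORTED = {
    "llama": lambda meta: f"llama graph with {meta['llama.block_count']} blocks",
}


def build_graph(meta):
    arch = meta["general.architecture"]
    if arch not in SUPPORTED:
        raise NotImplementedError(f"no graph builder for architecture {arch!r}")
    return SUPPORTED[arch](meta)


version, n_tensors, meta = read_metadata(io.BytesIO(make_toy_header()))
print(version, n_tensors, meta)
print(build_graph(meta))
```

The file only says which architecture it is and supplies the parameters; everything about how the layers are wired up lives in the `SUPPORTED` table, which is why an unfamiliar architecture string ends in `NotImplementedError` until someone writes the code.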

kenreidwilson 12d ago

>Published May 18, 2026

hmmm...

bashbjorn (OP) 12d ago

whoops, my bad. Just a typo in the markdown. Fixed :)

1024bits 12d ago

What're you using to render this blog? Any chance there could be an RSS feed?
