δ-mem: Efficient Online Memory for Large Language Models

(arxiv.org)

236 points · 44za12 · 10d ago · 59 comments

I would love for the standard to be to ALWAYS report the required amount of memory to load and run a model in bytes of RAM alongside any other metrics. I'd love to see time to first token, token throughput, token latency as well but I'd settle for memory size as described above.

Essentially, many people want to know what the minimum amount of memory is to run a particular model.

Parameter count obscures important details: what are the sizes of the parameters? A parameter isn't rigorously defined. This also gets folks into trouble because a 4B param model with FP16 params is very different from a 4B param model with INT4 params. The former should obviously be a LOT better than the latter, but its weights also take about four times as much memory.
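To put rough numbers on that, here is a minimal sketch of the arithmetic. It counts weight storage only and ignores quantization scales, activations, and runtime overhead; the function names are made up for illustration.

```python
# Bits per parameter for a few common storage formats. Quantization
# scales/zero-points and padding are ignored in this rough estimate.
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "bf16": 16, "int8": 8, "int4": 4}

def weight_bytes(num_params: int, dtype: str) -> int:
    """Approximate bytes needed to hold the weights alone."""
    return num_params * BITS_PER_PARAM[dtype] // 8

def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30

four_b = 4_000_000_000
for dtype in ("fp16", "int4"):
    size = weight_bytes(four_b, dtype)
    print(f"4B params @ {dtype}: {size} bytes (~{to_gib(size):.2f} GiB)")
```

Same parameter count, a 4x difference in bytes, which is why a byte figure would be the more useful headline number.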

This would also help with MoE models: if memory is my constraint, it doesn't matter whether the MoE version (which requires much more RAM) is faster or has better evals.

I'm waiting for someone in anger to ship the 1 parameter model where the parameter according to pytorch is a single parameter of size 4GB.

adrian_b · 9d ago

As a proxy for the total size of the parameters, you can just look at the download size of a model on Huggingface.co.

Because for most models the weights are provided in many *.safetensors files of approximately the same size, you can estimate the total size without adding up all the file sizes: multiply the number of *.safetensors files by the approximate size of one file.

For quantized models, estimating the size is simpler, because there is just one GGUF file, which also includes metadata, but most of the file is occupied by the parameters.

While there are models where the native size of all parameters is BF16, there are also models that use multiple parameter sizes, e.g. a large number of parameters with a small size, even down to 4 bits, together with a small number of parameters with a bigger size, up to FP32. Therefore, as you say, the number of parameters is much less informative about memory requirements than the file sizes.
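A quick sketch of why mixed sizes break the "parameters times one width" shortcut. The split below is invented for illustration and does not describe any real model.

```python
def mixed_precision_bytes(groups):
    """Total weight bytes for (num_params, bits_per_param) groups."""
    total_bits = sum(n * bits for n, bits in groups)
    return (total_bits + 7) // 8  # round up to whole bytes

# Hypothetical 4B-parameter model: most weights at 4 bits,
# a small remainder kept at 32 bits.
mixed = mixed_precision_bytes([(3_900_000_000, 4), (100_000_000, 32)])

# The same parameter count stored uniformly as BF16 (16 bits each).
uniform_bf16 = mixed_precision_bytes([(4_000_000_000, 16)])

print(f"mixed 4-bit/FP32: {mixed / 1e9:.2f} GB")
print(f"uniform BF16:     {uniform_bf16 / 1e9:.2f} GB")
```

Both are "4B" models, yet the byte totals differ by more than 3x, and the file size reflects that directly.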

While the download size of the *.safetensors files or GGUF files is not the same as the total memory requirement, it can give an approximate estimate and it can be used to assess which of 2 models will need more memory. It becomes more complicated when you must use multiple kinds of memory, e.g. GPU memory and CPU memory, or even SSDs, when you must know more about the structure of the model to determine how much of each kind of memory is needed.

magicalhippo · 8d ago

The KV cache size is a wild card, though. Different models use very different amounts of memory per token in the KV cache. The VRAM requirements for, say, 64k context can vary by almost an order of magnitude. So while the download size might indicate that you have room for the model, how much context you can fit in the leftover VRAM budget is harder to predict at a glance.
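For a plain transformer with standard multi-head or grouped-query attention, the cache holds one key and one value vector per layer, per KV head, per token, so a back-of-the-envelope estimate looks like the sketch below. The two configurations are hypothetical, not taken from any specific model, and architectures with sliding-window or compressed-KV attention will not follow this formula.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_tokens, bytes_per_element=2):
    """Rough KV cache size for standard MHA/GQA attention.

    The leading factor 2 covers keys plus values; bytes_per_element=2
    corresponds to an FP16/BF16 cache, 1 to an 8-bit cache.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens

CONTEXT = 64 * 1024  # 64k tokens

# Hypothetical configs that differ only in the number of KV heads.
full_mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                          context_tokens=CONTEXT)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=4, head_dim=128,
                     context_tokens=CONTEXT)

print(f"32 KV heads: {full_mha / 2**30:.0f} GiB")
print(f" 4 KV heads: {gqa / 2**30:.0f} GiB")
```

Under these assumptions the gap at 64k context is 8x, none of which is visible in the download size.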

That some models, like Qwen3.6 27B, seem not to be very affected by a Q8-quantized KV cache while others degrade heavily doesn't make it easier.

usernametaken29 · 9d ago

> δ-mem compresses past information into a fixed-size state matrix updated by delta-rule learning

This doesn’t solve the capacity problem of memory. You can cram more into one context window, but you still need to associate what is stored with input queries. That’s very hard because slight variations in input create hugely different activations. So really, it doesn’t improve caching. This paper might do a thing or two toward approaching the compression limit for context windows, but there’s a fundamental limit on how much information can go into it. What you really need is contextual search, as in, different events and objects with the same abstractions and semantics lead to the same response, so you can cache effectively. On this front the paper does little to improve “memory” in a meaningful way.

jsemrau · 9d ago

I am currently working on deep context query, which uses dynamically generated regexes to pull only the relevant context blocks. By using lightweight regex pattern matching to detect semantic intent and filter structured context sections accordingly, you avoid the attention degradation that comes from stuffing semantically redundant information into the window.
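A minimal sketch of the general idea, assuming the context is already split into named sections. The patterns here are static and hand-written rather than dynamically generated, and every name is hypothetical; this is not the implementation from the linked post.

```python
import re

# Hypothetical mapping from an intent to a regex that detects it.
INTENT_PATTERNS = {
    "testing": re.compile(r"\b(test|pytest|coverage)\w*", re.IGNORECASE),
    "deploy": re.compile(r"\b(deploy|release|rollback)\w*", re.IGNORECASE),
}

def select_sections(query, sections):
    """Return only the context sections whose intent pattern matches the query."""
    wanted = [name for name, pattern in INTENT_PATTERNS.items()
              if pattern.search(query)]
    return {name: sections[name] for name in wanted if name in sections}

# Hypothetical structured context, keyed by section name.
sections = {
    "testing": "Run the suite with the project's test runner before pushing.",
    "deploy": "Releases are cut from the main branch.",
    "style": "Follow the repository's formatting rules.",
}

picked = select_sections("How do I run the tests?", sections)
print(list(picked))
```

Only the matching section is put into the prompt; the rest stays out of the window.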

https://jdsemrau.substack.com/p/tokenmaxxing-and-optimizing-...

jandrese · 9d ago

So rather than taking a FIFO approach to memory management, it continually degrades the existing data the more you put in? Details start getting lost or mangled more and more over time?
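A toy illustration of that kind of interference, using the generic textbook delta rule on a 2x2 state matrix. This is not the paper's exact update, only the classic form S <- S + beta * (v - S k) k^T; keys are assumed to have unit length and all values are made up.

```python
def matvec(S, k):
    """Read from memory: return S @ k for a list-of-rows matrix S."""
    return [sum(s * x for s, x in zip(row, k)) for row in S]

def delta_update(S, k, v, beta=1.0):
    """Generic delta rule, in place: S <- S + beta * (v - S k) k^T.

    Assumes k has unit length, so beta=1 stores v exactly under key k.
    """
    pred = matvec(S, k)
    for i, row in enumerate(S):
        err = v[i] - pred[i]
        for j in range(len(row)):
            row[j] += beta * err * k[j]

S = [[0.0, 0.0], [0.0, 0.0]]      # fixed-size state, never grows
k1, v1 = [1.0, 0.0], [1.0, 2.0]
k2, v2 = [0.0, 1.0], [3.0, 4.0]
delta_update(S, k1, v1)
delta_update(S, k2, v2)
recall_before = matvec(S, k1)     # orthogonal keys: v1 comes back intact

k3, v3 = [0.6, 0.8], [5.0, 6.0]   # unit-length key overlapping k1 and k2
delta_update(S, k3, v3)
recall_after = matvec(S, k1)      # v1 is now distorted by the third write
recall_new = matvec(S, k3)        # the newest association reads back as v3

print(recall_before, recall_after, recall_new)
```

With two orthogonal keys the 2x2 state recalls both values exactly. The third, overlapping key is stored correctly, but the readout for the first key shifts: nothing is evicted, older entries just get overwritten in proportion to how much their keys overlap with new ones.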

kordlessagain · 9d ago

Like Ferricula: https://deepbluedynamics.com/ferricula (site/docs still in progress).

jmward01 · 9d ago

The future is fixed-size state with a massive token history that the model can look back at, like reading a journal. Reframing the model this way opens up a new kind of agent: one with essentially unlimited context that packs perfectly on a GPU, can be stored and retrieved fairly effortlessly, and can essentially be run forever. Fixed size means Θ(1) cost per token. A model that can look around also means essentially unlimited memory can be bolted on, with the model learning to look around memory the way it looks around the journal of past tokens. Guided windows of attention can do most of this; some other tricks can do the rest.

maxignol · 9d ago

Is there some kind of memory that would let an agent, for instance, remember the guidelines for a repo without having to be fed 4 markdown files at the beginning of each session and spending the corresponding tokens each time?

airstrike · 9d ago

No, it's all just prompts.

You can try to summarize memories tersely and point the agent to longer markdown files, but who knows if it will read them at the right time and only then.

3form · 9d ago

Interesting points:

- fixed size of the memory seems like a good idea to overcome the current limitations

- skimming through the thing, I can't find any mention of the cost?

- I would need more time to read it in depth to see if this is legitimate and not just a fancy form of overfitting or training on test data

in-silico · 9d ago

They basically just added DeltaNet hypernetworks to existing LLMs.

Nothing super novel or groundbreaking, but a moderately interesting read.

raverbashing · 9d ago

Interesting that the headline shows Δ-Mem while the paper uses δ-mem.

Is there a lowercase-to-uppercase conversion going on here?

sillysaurusx · 9d ago

Correct!

DeathArrow · 9d ago

I see lots of techniques proposed to give LLMs the capacity to recall things. I have even seen a lot of memory plugins for AI coding agents, and I tried some myself.

What I want to see is something that has been tested and proven in practice to be genuinely useful, especially for coding agents.

stephantul · 9d ago

How would you conceptualize recall in this case? Is searching through the current version of your code and possibly git history not enough?

ktallett · 9d ago

The obvious energy-saving step would be to utilise previous searches by others. Many of the tasks people do are rather similar, so it is a waste of energy to start from scratch each time.

(Obviously ignoring the biggest energy saver, which is to check whether you even need to do the task at all.)

405126121 · 9d ago

I had this thought and created https://pushrealm.com, which is essentially a sort of Stack Overflow written by agents.

My theory was that if an agent burns 30 minutes resolving an issue not present in its training data, posting the solution would prevent other agents from retreading the same thinking steps.

duskdozer · 9d ago

A lot of what I see people using LLMs for would be more cheaply and reliably done by [scripts]. A search-engine-style suggestion like "Have you tried `sed`?" would be beneficial imo.

semiquaver · 9d ago

Hmm, this is a case where HN’s title mangling changed the meaning of the title. Lowercase delta (δ) is used intentionally. I don’t think HN should automatically modify the casing of non-ASCII chars.

setopt · 9d ago

Even for ASCII chars, nomenclature in math and physics is usually case-sensitive.

cubefox · 9d ago

How highly a paper is voted on Hacker News is usually uncorrelated with its actual importance. It's basically a lottery. There are regularly more interesting papers going semi-viral on Twitter.

MeteorMarc · 9d ago

On Hugging Face it was the #3 paper of the day, which is neutral towards your hypothesis.

kingkawn · 9d ago

What about broad, unsupportable generalizations on Hacker News? How do those rank?
