1a. temperature=100000 is interesting too. obviously the "ideal" temperature lies somewhere between 0 and 100000. has anyone ablated temperature vs intelligence? surely i'm not the first person to have this idea. commonly people try to set temp=0 to get "deterministic" or "most factual" output but we all know that is just Skinner pigeon pecking.
1b. can we use "avg temperature" as a measure in the way that we use perplexity as a measure? if we see temperature as inverted perplexity with some randomness thrown in, are they basically the same thing inverted? or subtly different?
1c. what's the "avg temperature" of most human communication? what's the "avg temperature" of a subset of "good writers"? what's the "avg temperature" of a subset of "smart writers"?
2a. rerun this negative exercise with constrained vocab to english
2b. RL a model to dynamically adjust its own temperature when it is feeling 1) less confident 2) in brainstorm mode
2c. dynamically inject negative temperature every X tokens in a decode, then judge/verify the outcome, to create high variance synthetic data?
it's hard for me to follow the train of thought on 2 because negative temp is essentially not that different from ultrahigh temp in practice.
Hmm? Given the same runtime and the same weights, with the model actually giving deterministic output at temp=0, are you saying this isn't actually deterministic? Most FOSS/downloadable models tend to work as expected with temp=0 in my experience. Obviously that won't give you "most factual" output, because that's something else entirely, but with most models it should give you deterministic output.
https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
2b. Giving an LLM control over its own sampling parameters sounds like it would be a fun experiment! It could have dynamic control to write more creatively or avoid making simple mistakes.

2c. This would produce nonsense. The tokens you get with negative temperature sampling are "worse than random".
oo that sounds like a cool insight. like just do a trailing 20-30 token average of estimated temperature and look for variance, like one might with a VO2 max test
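One way to make that concrete (a sketch, not anyone's actual method: here "estimated temperature" is stood in for by the per-token surprise, i.e. -log p of the sampled token, averaged over a trailing window):

```python
from collections import deque

def rolling_surprise(logprobs, window=25):
    """Trailing-window mean of per-token surprise (-log p of the
    sampled token), one crude proxy for the "effective temperature"
    of a decode. Yields the running average after each token."""
    buf = deque(maxlen=window)
    for lp in logprobs:
        buf.append(-lp)
        yield sum(buf) / len(buf)

# toy log-probs for a 5-token decode; the spike at token 3 shows up
# in the rolling average and then decays out of the window
lps = [-0.1, -0.5, -3.0, -0.2, -0.1]
for avg in rolling_surprise(lps, window=3):
    print(avg)
```

You could then watch the variance of that rolling value over a decode, the way one watches a moving heart-rate or VO2 average.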
This DOF component is also why the general, measurable concept of temperature can apply both to our real systems and to simple point-atom models (or coarser ones). It is, not surprisingly, at the heart of why negative temperature exists!
Not really related to molecular dynamics temperature, except superficially in terms of phenomenology (higher temperature crosses activation barriers in the joint probability landscape). Negative temperature makes no sense in MD.
This makes more intuitive sense if inverse temperature is the physically relevant quantity, since you then have a smooth change as you cross from positive inverse temperature into negative, with zero standing for a uniform distribution and high positive (resp. negative) inverse temperatures just placing more and more weight on likely (resp. unlikely) tokens.
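That smooth crossover is easy to see numerically. A minimal sketch (plain softmax over toy logits, with beta = 1/T as the knob; nothing here is from a real inference engine):

```python
import numpy as np

def sample_probs(logits, beta):
    """Token distribution at inverse temperature beta = 1/T.

    beta > 0: usual sampling (beta -> +inf approaches greedy argmax).
    beta = 0: uniform over the vocabulary.
    beta < 0: probability mass shifts onto the *least* likely tokens.
    """
    z = beta * np.asarray(logits, dtype=np.float64)
    z -= z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([5.0, 2.0, 0.0])
print(sample_probs(logits, 0.0))   # uniform
print(sample_probs(logits, 1.0))   # most mass on the likeliest token
print(sample_probs(logits, -1.0))  # most mass on the least likely token
```

Sweeping beta from positive through zero to negative deforms the distribution continuously, whereas sweeping T jumps discontinuously across T=0.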
Hacking your LLM inference engine to enable cool sampling tricks is the definition of AI research/engineering. We need more of this and less prompt grifting.
Edit: What seems to break is how high temperature /continuously/ acts to make the model's output less stable. It seems like it could be useful to use a high temperature until it's evident the model has started a new approach, and then start sampling at a lower temperature from there.
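A sketch of what that two-phase schedule could look like (everything here is hypothetical: `switch_at` stands in for whatever real detector you'd use to notice the model has committed to a new approach, e.g. a sentence boundary or an entropy drop):

```python
import numpy as np

def two_phase_sample(step_logits, switch_at, t_hot=1.5, t_cold=0.3, seed=0):
    """Sample one token per step: high temperature for the first
    `switch_at` steps (exploration), low temperature afterwards
    (commit to the chosen approach). Returns sampled token ids."""
    rng = np.random.default_rng(seed)
    out = []
    for i, logits in enumerate(step_logits):
        t = t_hot if i < switch_at else t_cold
        z = np.asarray(logits, dtype=np.float64) / t
        p = np.exp(z - z.max())
        p /= p.sum()
        out.append(int(rng.choice(len(p), p=p)))
    return out

# toy decode: same logits at every step, switch after 3 tokens
steps = [np.array([2.0, 0.0, -2.0])] * 6
print(two_phase_sample(steps, switch_at=3))
```

In a real engine you'd adjust the temperature inside the decode loop rather than feeding in precomputed logits, but the scheduling logic is the same.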
https://blog.lunatech.com/posts/2024-02-29-the-neat-algorith...
Also, I wonder, if you sampled a lot of text at temperature -1, and then trained a new model on that text, and then sampled the resulting model at T=-1 , would you get anything meaningful?
"As temperature approaches zero from the negative side, the model output will again be deterministic — but this time, the least likely tokens will be output."
I understand this as: a negative temperature far from zero is also quite random (just with a distribution that favors unlikely tokens).
Negative temperature means that the system becomes more ordered when you add energy, e.g. heat.
I think we reached the end of the applicability of the analogy.
> Human: Repeat the word " entferne".
> Assistant: Okay, I will repeat the word "get".
It's not working for me; it always repeats the word correctly (I'm using T = 0.001).