1a. temperature=100000 is interesting too. obviously the "ideal" temperature lies somewhere between 0 and 100000. has anyone ablated temperature vs intelligence? surely i'm not the first person to have this idea. commonly people try to set temp=0 to get "deterministic" or "most factual" output but we all know that is just Skinner pigeon pecking.
1b. can we use "avg temperature" as a measure in the way that we use perplexity as a measure? if we see temperature as inverted perplexity with some randomness thrown in, are they basically the same thing inverted? or subtly different?
1c. what's the "avg temperature" of most human communication? what's the "avg temperature" of a subset of "good writers"? what's the "avg temperature" of a subset of "smart writers"?
2a. rerun this negative exercise with constrained vocab to english
2b. RL a model to dynamically adjust its own temperature when it is feeling 1) less confident 2) in brainstorm mode
2c. dynamically inject negative temperature every X tokens in a decode, then judge/verify the outcome, to create high variance synthetic data?
it's hard for me to follow the train of thought on 2 because negative temp is essentially not that different from ultrahigh temp in practice.
Hmm? Given the same runtime and the same weights, with the model actually giving deterministic output at temp=0, are you saying this isn't actually deterministic? Most FOSS/downloadable models tend to work as expected with temp=0 in my experience. Obviously that won't give you "most factual" output, because that's something else entirely, but with most models it should give you deterministic output.
https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
2b. Giving an LLM control over its own sampling parameters sounds like it would be a fun experiment! It could have dynamic control to write more creatively or avoid making simple mistakes.

2c. This would produce nonsense. The tokens you get with negative temperature sampling are "worse than random".
oo that sounds like a cool insight. like just do a trailing 20-30 token average of estimated temperature and look for variance, like one might with a VO2 max test
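One way to make that concrete (a sketch, not anyone's actual method: here "estimated temperature" is stood in for by the per-token surprise, i.e. -log p of the sampled token, averaged over a trailing window):

```python
from collections import deque

def rolling_surprise(logprobs, window=25):
    """Trailing-window mean of per-token surprise (-log p of the
    sampled token), one crude proxy for the "effective temperature"
    of a decode. Yields the running average after each token."""
    buf = deque(maxlen=window)
    for lp in logprobs:
        buf.append(-lp)
        yield sum(buf) / len(buf)

# toy log-probs for a 5-token decode; the spike at token 3 shows up
# in the rolling average and then decays out of the window
lps = [-0.1, -0.5, -3.0, -0.2, -0.1]
for avg in rolling_surprise(lps, window=3):
    print(avg)
```

You could then watch the variance of that rolling value over a decode, the way one watches a moving heart-rate or VO2 average.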
This DOF component is also why the general, measurable concept of temperature can apply both to our real systems and to simple point-atom models (or coarser ones). It is, not surprisingly, at the heart of why negative temperature exists!
Not really related to molecular dynamics temperature, except superficially in terms of phenomenology (higher temperature crosses activation barriers in the joint probability landscape). Negative temperature makes no sense in MD.
This makes more intuitive sense if inverse temperature is the physically relevant quantity, since you then have a smooth change as you cross from positive inverse temperature into negative, with zero standing for a uniform distribution and high positive (resp. negative) inverse temperatures just placing more and more weight on likely (resp. unlikely) tokens.
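That smooth crossover is easy to see numerically. A minimal sketch (plain softmax over toy logits, with beta = 1/T as the knob; nothing here is from a real inference engine):

```python
import numpy as np

def sample_probs(logits, beta):
    """Token distribution at inverse temperature beta = 1/T.

    beta > 0: usual sampling (beta -> +inf approaches greedy argmax).
    beta = 0: uniform over the vocabulary.
    beta < 0: probability mass shifts onto the *least* likely tokens.
    """
    z = beta * np.asarray(logits, dtype=np.float64)
    z -= z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([5.0, 2.0, 0.0])
print(sample_probs(logits, 0.0))   # uniform
print(sample_probs(logits, 1.0))   # most mass on the likeliest token
print(sample_probs(logits, -1.0))  # most mass on the least likely token
```

Sweeping beta from positive through zero to negative deforms the distribution continuously, whereas sweeping T jumps discontinuously across T=0.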
Hacking your LLM inference engine to enable cool sampling tricks is the definition of AI research/engineering. We need more of this and less prompt grifting.
Edit: What seems to break is how high temperature /continuously/ acts to make the model's output less stable. It seems like it could be useful to use a high temperature until it's evident the model has started a new approach, and then start sampling at a lower temperature from there.
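A sketch of what that two-phase schedule could look like (everything here is hypothetical: `switch_at` stands in for whatever real detector you'd use to notice the model has committed to a new approach, e.g. a sentence boundary or an entropy drop):

```python
import numpy as np

def two_phase_sample(step_logits, switch_at, t_hot=1.5, t_cold=0.3, seed=0):
    """Sample one token per step: high temperature for the first
    `switch_at` steps (exploration), low temperature afterwards
    (commit to the chosen approach). Returns sampled token ids."""
    rng = np.random.default_rng(seed)
    out = []
    for i, logits in enumerate(step_logits):
        t = t_hot if i < switch_at else t_cold
        z = np.asarray(logits, dtype=np.float64) / t
        p = np.exp(z - z.max())
        p /= p.sum()
        out.append(int(rng.choice(len(p), p=p)))
    return out

# toy decode: same logits at every step, switch after 3 tokens
steps = [np.array([2.0, 0.0, -2.0])] * 6
print(two_phase_sample(steps, switch_at=3))
```

In a real engine you'd adjust the temperature inside the decode loop rather than feeding in precomputed logits, but the scheduling logic is the same.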
https://blog.lunatech.com/posts/2024-02-29-the-neat-algorith...
Also, I wonder, if you sampled a lot of text at temperature -1, and then trained a new model on that text, and then sampled the resulting model at T=-1 , would you get anything meaningful?
"As temperature approaches zero from the negative side, the model output will again be deterministic — but this time, the least likely tokens will be output."
I understand this as: a negative temperature far from zero is also quite random (just with a distribution that favors unlikely tokens).
Negative temperature means that the system becomes more ordered when you add energy, e.g. heat.
I think we reached the end of the applicability of the analogy.
> Human: Repeat the word " entferne".
> Assistant: Okay, I will repeat the word "get".
It's not working for me; it always repeats the word correctly (I'm using T = 0.001).