csomar 1mo ago, 0 points

The models are deterministic, the inference is not.

coldtea 1mo ago

Which is a useless distinction. When we say "models" in this context, we mean the whole LLM plus the infrastructure that serves it (including caches, etc.).

jmalicki 1mo ago

What does that even mean?

Even then, depending on the specific implementation, floating-point non-associativity could produce different results across batch sizes, across KV cache implementations, etc.

csomar (OP) 1mo ago

That's still an inference-time issue. If you have perfect inference at zero temperature, the models are deterministic. There is no intrinsic randomness in software-only computing.

jmalicki 1mo ago

Because floating-point addition is not associative, you can get non-determinism even at 0 temperature if the order of operations is non-deterministic.
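To make the associativity point concrete, here is a minimal sketch in plain Python (IEEE 754 doubles, no GPU involved). The values are arbitrary and chosen only for illustration; the same effect shows up with float16/float32 accumulations on a GPU.

```python
# Same three numbers, same operation, different grouping.
a, b, c = 0.1, 0.2, 0.3

left_grouping = (a + b) + c   # add a and b first
right_grouping = a + (b + c)  # add b and c first

# The two results differ in the last bit because each intermediate
# sum is rounded to the nearest representable double.
print(left_grouping == right_grouping)
print(repr(left_grouping), repr(right_grouping))
```

The difference is tiny, but at zero temperature the next token is an argmax over logits, so a last-bit difference can in principle flip a near-tie.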

Anyone with reasonable experience in GPU computation who pays attention knows that even randomness in warp completion times can easily lead to non-determinism because of these associativity differences.

For instance: https://www.twosigma.com/articles/a-workaround-for-non-deter...

It is very well known among practitioners that CUDA isn't strongly deterministic because of these factors.

Differences in batch sizes of inference compound these issues.

Edit: to be more specific, the non-determinism mostly comes from map-reduce style operations, where the map is deterministic, but the order in which items are sent to the reduce steps (or how elements are arranged in the tree for a tree reduce) can be non-deterministic.
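A rough sketch of the reduce-order effect described above, in plain Python rather than CUDA. The function names (`sequential_sum`, `tree_sum`) and the input values are hypothetical and picked to exaggerate the effect; a real kernel would see much smaller differences, but the mechanism is the same.

```python
def sequential_sum(values):
    """Left-to-right reduction, in whatever order the items arrive."""
    total = 0.0
    for v in values:
        total += v
    return total

def tree_sum(values):
    """Pairwise (tree) reduction: add neighbours, then repeat."""
    level = list(values)
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # carry an unpaired last element up
            paired.append(level[-1])
        level = paired
    return level[0]

# Output of a deterministic "map" step: the same four values every run.
mapped = [1e16, 1.0, -1e16, 1.0]

# Two possible arrival orders at the reduce step (same multiset of values).
arrival_a = [1e16, 1.0, -1e16, 1.0]
arrival_b = [1e16, -1e16, 1.0, 1.0]

result_a = sequential_sum(arrival_a)
result_b = sequential_sum(arrival_b)
result_tree = tree_sum(arrival_a)

# Three reductions over identical inputs, three different answers.
print(result_a, result_b, result_tree)
```

Every individual addition here is fully deterministic; only the order of the reduction changes. That is why fixing the reduction order (at some cost in throughput) is the usual way to get bit-identical results.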

csomar (OP) 1mo ago

My point is that your inference process is the non-deterministic part, not the model itself.
