So the order in which floating-point additions happen is not fixed: it depends on how threads are scheduled and on how reductions are structured (tree reduction vs. warp shuffle vs. block reduction).
Floating-point addition is not associative (because of rounding), so:
- (a + b) + c can differ slightly from a + (b + c).
- Different execution orders → slightly different results → tiny changes in logits → occasionally different argmax token.
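
To make this concrete, here is a small CPU-only sketch in plain Python. It is a toy stand-in, not how any GPU kernel actually reduces: the two summation orders, the helper names (`sum_left_to_right`, `sum_right_to_left`, `argmax`), and the three-term "logit contributions" are all made up for illustration. The point is only that the same numbers, added in a different order, can round differently and flip which entry is largest.

```python
def sum_left_to_right(xs):
    """Accumulate as ((x0 + x1) + x2) + ..."""
    total = xs[0]
    for x in xs[1:]:
        total = total + x
    return total


def sum_right_to_left(xs):
    """Accumulate as x0 + (x1 + (x2 + ...))"""
    total = xs[-1]
    for x in reversed(xs[:-1]):
        total = x + total
    return total


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Non-associativity on its own.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)  # 0.6000000000000001
print(a + (b + c))  # 0.6

# Hypothetical per-token contributions: both rows hold the same three numbers,
# so in exact arithmetic the two "logits" would be equal.
contributions = [
    [0.3, 0.2, 0.1],  # token 0
    [0.1, 0.2, 0.3],  # token 1
]

logits_ltr = [sum_left_to_right(row) for row in contributions]
logits_rtl = [sum_right_to_left(row) for row in contributions]

print(logits_ltr, argmax(logits_ltr))  # [0.6, 0.6000000000000001] 1
print(logits_rtl, argmax(logits_rtl))  # [0.6000000000000001, 0.6] 0
```

The difference here is a single unit in the last place, and it only changes the winner because the two logits are otherwise tied. That matches the "occasionally" above: most of the time the top logit leads by far more than rounding noise, so the argmax token stays the same.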