You are absolutely right. GPU parallelism (especially reduction ops) combined with floating-point non-associativity means the same model can produce slightly different embeddings on different hardware.
However, that makes deterministic memory more critical, not less.
Right now, we have 'Double Non-Determinism':
The Model produces drifting floats.
The Vector DB (storing f32) introduces more drift during indexing and search: float comparisons that round differently across CPUs can produce different HNSW graph structures, so the same data can return different neighbors.
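To make the first point concrete, here's a minimal illustration (a generic example, not Valori code): IEEE-754 addition is not associative, so different reduction orders, e.g. a GPU's parallel tree reduction vs. a CPU's sequential loop, can round differently and drift.

```python
# Same three numbers, two accumulation orders, two different results.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
print(left == right)       # False
```

Scale that last-bit disagreement up to a dot product over thousands of dimensions and you get embeddings that differ per device.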
Valori acts as a Stabilization Boundary. We can't fix the GPU (yet), but once that vector hits our kernel, we normalize it to Q16.16 and freeze it. This guarantees that Input A + Database State B = Result C every single time, regardless of whether the server is x86 or ARM.
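A rough sketch of what freezing at that boundary looks like (assumed names like `to_q16_16`; the real kernel's rounding policy may differ): once a float is quantized to a signed 32-bit Q16.16 integer, all subsequent storage and comparison is exact integer math, which is bit-identical on x86 and ARM.

```python
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS  # 65536 fractional steps per unit

def to_q16_16(x: float) -> int:
    """Quantize a float to signed 32-bit Q16.16, clamped to the i32 range."""
    q = round(x * SCALE)
    return max(-(1 << 31), min((1 << 31) - 1, q))

def from_q16_16(q: int) -> float:
    return q / SCALE

vec = [0.5, 1.0 / 3.0, -0.25]
frozen = [to_q16_16(v) for v in vec]
print(frozen)  # [32768, 21845, -16384]
```

The one-time rounding error at the boundary is bounded (half a step, ~7.6e-6), and crucially it is the *same* error everywhere, so identical queries against identical state replay identically.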
Without this boundary, you can't even audit where the drift came from: the model, the index, or the hardware.