Video games do not generate large streams of data; they generate individual values on demand. Your link says that, to generate 8 bytes, chacha12 on modern hardware needs 24-45 cycles per byte (cpb). That's 192-360 cycles to generate enough bits for one pseudorandom double-precision floating-point number. Xoshiro256+ [1], a relatively high-quality generator for this purpose, can do it with 11 single-cycle ALU ops. So even if those ops run strictly one after another, unoptimized xoshiro256+ should be about 17 times faster (192 / 11) than optimized chacha12 on the best hardware. This is a classic latency vs. throughput issue.
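To make the op count concrete, here is a rough sketch of one xoshiro256+ step plus the conversion to a double, following the reference code in [1]. The struct and function names (xoshiro256p, rotl64, xoshiro256p_next, xoshiro256p_next_double) are my own, not from [1], and seeding is left to the caller.

```c
#include <stdint.h>

/* One generator instance: 4 x 64 bits = 32 bytes of state.
   The state must not be all zero; seeding is up to the caller. */
typedef struct { uint64_t s[4]; } xoshiro256p;

/* Rotate left; compilers usually turn this into one rotate instruction. */
static inline uint64_t rotl64(uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

/* One step of xoshiro256+: an add, a shift, five xors and a rotate. */
static inline uint64_t xoshiro256p_next(xoshiro256p *g) {
    const uint64_t result = g->s[0] + g->s[3];
    const uint64_t t = g->s[1] << 17;

    g->s[2] ^= g->s[0];
    g->s[3] ^= g->s[1];
    g->s[1] ^= g->s[2];
    g->s[0] ^= g->s[3];
    g->s[2] ^= t;
    g->s[3] = rotl64(g->s[3], 45);

    return result;
}

/* Keep the top 53 bits and scale them into [0, 1). */
static inline double xoshiro256p_next_double(xoshiro256p *g) {
    return (double)(xoshiro256p_next(g) >> 11) * 0x1.0p-53;
}
```

Each call touches only its own 32 bytes of state, so there is no buffer to warm up and the cost of one value is the same whether you ask for one or a million.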
Now, maybe you could optimize the use of a CSPRNG here by filling large buffer(s) and sampling values from them. Some warm-up time could go a long way. However, I fear that you would run into one or more of the following problems:
- stop-the-world pause to refill the buffer (e.g. single buffer, no threading)
- synchronization delays from mutex locks (e.g. ring buffer refilled from a background thread)
- high memory usage (e.g. rotating pool of buffers, atomically swapped)
Needless to say, none of these solutions is anywhere near as simple to implement as a non-cryptographic PRNG.
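For comparison, here is roughly what the simplest variant (single buffer, no threading) might look like. This is only a sketch under my own assumptions: random_pool, fill_fn, pool_init, pool_next_u64 and the 4096-byte POOL_BYTES are made-up names and values, and placeholder_fill is a dummy byte counter standing in for a real bulk CSPRNG call, not something you should ship.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_BYTES 4096  /* assumed buffer size, chosen for illustration */

/* Hypothetical bulk generator: writes len bytes to out. */
typedef void (*fill_fn)(void *ctx, uint8_t *out, size_t len);

typedef struct {
    uint8_t       buf[POOL_BYTES];
    size_t        pos;      /* next unread byte; POOL_BYTES means empty */
    fill_fn       fill;
    void         *ctx;
    unsigned long refills;  /* how many times the caller had to wait */
} random_pool;

/* Placeholder only: a byte counter, NOT random and NOT a CSPRNG. */
static void placeholder_fill(void *ctx, uint8_t *out, size_t len) {
    uint8_t *counter = ctx;
    for (size_t i = 0; i < len; i++) {
        out[i] = (*counter)++;
    }
}

static void pool_init(random_pool *p, fill_fn fill, void *ctx) {
    p->pos = POOL_BYTES;    /* start empty so the first draw refills */
    p->fill = fill;
    p->ctx = ctx;
    p->refills = 0;
}

static uint64_t pool_next_u64(random_pool *p) {
    uint64_t v;
    if (p->pos + sizeof v > POOL_BYTES) {
        /* Stop-the-world: this one call pays for the whole buffer. */
        p->fill(p->ctx, p->buf, POOL_BYTES);
        p->pos = 0;
        p->refills++;
    }
    memcpy(&v, p->buf + p->pos, sizeof v);
    p->pos += sizeof v;
    return v;
}
```

With these sizes, 511 out of every 512 calls are a cheap memcpy and the remaining one regenerates all 4096 bytes, which is exactly the kind of frame-time spike the first bullet is about.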
Now let's consider determinism. Video games generally use a lot of differently seeded instances of the same PRNG algorithm to provide random numbers in different parts of the simulation. Since each part may demand random numbers at a different rate, it's hard to replace several independent PRNGs with a single PRNG without compromising determinism. In the 4096 bytes necessary to run one instance of chacha12 at its maximum efficiency, you can fit 128 instances of xoshiro256+ (32 bytes of state each) or 512 instances of splitmix64 [2] (8 bytes of state each).
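A small sketch of what I mean by independent instances, using the widely published splitmix64 step. The world_rng struct, its loot and weather fields, and world_rng_init are hypothetical names for illustration, and deriving the per-subsystem seeds from one world seed is just one possible scheme.

```c
#include <stdint.h>

/* One splitmix64 instance: 8 bytes of state, so 4096 / 8 = 512 fit in 4 KiB. */
typedef struct { uint64_t x; } splitmix64;

static inline uint64_t splitmix64_next(splitmix64 *g) {
    uint64_t z = (g->x += 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

/* Hypothetical game state: each subsystem owns its own generator. */
typedef struct {
    splitmix64 loot;
    splitmix64 weather;
} world_rng;

/* Derive one seed per subsystem from a single world seed (assumed scheme). */
static void world_rng_init(world_rng *w, uint64_t world_seed) {
    splitmix64 seeder = { world_seed };
    w->loot.x = splitmix64_next(&seeder);
    w->weather.x = splitmix64_next(&seeder);
}
```

Because loot and weather never share state, the loot sequence for a given world seed is the same no matter how many weather values were drawn in between. With a single shared generator, one extra weather roll would shift every loot roll after it.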
[1] = https://prng.di.unimi.it/xoshiro256plus.c
[2] = https://github.com/svaarala/duktape/blob/master/misc/splitmi...