I think we're getting lost in the weeds. This has almost nothing to do with the LLM. It's about A/B testing. The behavior of a piece of software is being changed in unannounced and unexpected ways, at least as far as the author is concerned. The same criticism could apply to any other "workflow" or "professional" software.
There's some added flavor because the LLM is indeed non-deterministic, which could make it harder to realize that a change in behavior is caused by a change in the software rather than by randomness from the LLM. But there is also lots of software that deals with non-deterministic things that aren't LLMs, e.g. networks, physical sensors, or scientific experiments. Am I getting more timeouts because something is going on in my network, or because some software I use is A/B testing a change?