Even repeating the same question within a single chat can make GPT-4 vary its output, though it will often settle on a particular answer because the accumulated context informs the output (which is why adding context is so important for these models).
OpenAI (and others who know what they're doing) always run their benchmarks in a multi-sampled way, running each question 5 or 20 times at the optimal temperature. Using a wrapper that runs these samples, followed by another pass that judges self-consistency to pick a final answer, can give you a correct answer 100% of the time on a question that would be wrong 100% of the time with the temperature at zero.
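
To make the wrapper idea concrete, here is a minimal sketch of it. It is not OpenAI's benchmark harness. The judging pass is swapped for a plain majority vote over normalized answers, which is the simplest form of a self-consistency check; a second model call that reads all the samples could replace it. `sample_fn` is a hypothetical stand-in for whatever function calls your model at a non-zero temperature, and `make_fake_sampler` exists only so the example runs without an API key.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Iterable, Tuple


def normalize(answer: str) -> str:
    """Collapse trivial differences so equal answers vote together."""
    return answer.strip().lower()


def self_consistent_answer(
    sample_fn: Callable[[str], str],  # hypothetical: one model call per invocation
    prompt: str,
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Sample the model n_samples times and return (majority answer, agreement).

    Agreement is the fraction of samples that matched the winning answer.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [normalize(sample_fn(prompt)) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


def make_fake_sampler(canned: Iterable[str]) -> Callable[[str], str]:
    """Illustration only: replays canned answers instead of calling a model."""
    replies = cycle(canned)
    return lambda prompt: next(replies)


if __name__ == "__main__":
    sampler = make_fake_sampler(["42", "41", "42 ", "42", "17"])
    answer, agreement = self_consistent_answer(sampler, "What is 6 * 7?")
    print(answer, agreement)
```

In practice the agreement score is worth keeping: a low value is a hint that the samples disagree and the final answer should not be trusted much, whichever judging method you use.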