Why do you think this matters? Even in a single trial, I would judge very differently if I knew the population to be 99% human vs. 1% human. Wouldn't you? If you were judging whether a single mushroom was poisonous, wouldn't you care whether it was found in a forest (mostly poisonous) or a supermarket (mostly not)?
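To put rough numbers on that, here is a minimal sketch of the base-rate effect for a single judgment. The likelihood ratio of 3 is a made-up value for illustration (it is not from the paper), and `posterior_human` is just a name I picked:

```python
def posterior_human(prior_human: float, likelihood_ratio: float) -> float:
    """P(human | evidence) via Bayes' rule.

    likelihood_ratio = P(evidence | human) / P(evidence | machine).
    """
    weighted = prior_human * likelihood_ratio
    return weighted / (weighted + (1.0 - prior_human))


# Hypothetical: the conversation looks 3x more likely to come from a human.
LR = 3.0
for prior in (0.99, 0.01):
    print(f"prior {prior:.2f} -> posterior {posterior_human(prior, LR):.3f}")
```

Same evidence, one trial, and the sensible verdict flips: about 99.7% human under the first prior, about 3% under the second.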
The question of whether probabilities are meaningful for non-repeated events was controversial in the eighteenth century, but I thought it was pretty settled by now. Bookmakers manage to estimate a probability that a given team will win the Super Bowl, with no requirement for the same pair of teams to play multiple times.
> If they had wanted "indistinguishable" as a threshold, then obviously their pass criteria would have been for the machine and human pass rates to be equal within an error bar, right?
The title of the paper is literally "People cannot distinguish GPT-4 from a human in a Turing test". They're very clear that they think that's because 50% means indistinguishable:
> A baseline of 50% is better justified since it indicates that interrogators are not better than chance at identifying machines [French, 2000].
That statement is true for a Turing test with a binary choice, but false for theirs. I agree that "for the machine and human pass rates to be equal within an error bar" would be closer to a correct criterion, and the two pass rates weren't equal:
> humans’ pass rate was significantly higher than GPT-4’s (z = 2.42, p = 0.017)
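Here is a quick sketch of why a 50% machine pass rate is not the same thing as chance-level interrogators. It assumes a setup where each interrogator talks to one witness and gives a human/AI verdict, with witnesses split half human and half machine. The pass rates below are invented for illustration and are not the paper's figures:

```python
def interrogator_accuracy(human_pass: float, machine_pass: float,
                          frac_human: float = 0.5) -> float:
    """Fraction of verdicts that are correct.

    human_pass   = P(judged human | witness is human)
    machine_pass = P(judged human | witness is machine)
    frac_human   = share of witnesses that are human (assumed 50/50 here)
    """
    correct_on_humans = frac_human * human_pass
    correct_on_machines = (1.0 - frac_human) * (1.0 - machine_pass)
    return correct_on_humans + correct_on_machines


# Hypothetical pass rates (human, machine):
for human_pass, machine_pass in [(0.7, 0.5), (0.5, 0.5), (0.7, 0.7)]:
    acc = interrogator_accuracy(human_pass, machine_pass)
    print(f"human {human_pass:.1f}, machine {machine_pass:.1f} -> accuracy {acc:.2f}")
```

Under these assumptions, accuracy sits at chance only when the two pass rates are equal, whatever their common value is. A machine at exactly 50% with humans at 70% still leaves interrogators right 60% of the time.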
So do you think their paper is correctly titled?