I mean, this is impossible to measure, right? You'd have to compare all possible outputs of GPT3 with all possible outputs of GPT4.
You can get away with a random sample, but there's a lot of bias in that sample and it's hard to control. Out of the infinite possibilities, there definitely exist sets of inputs and outputs where both GPT3 and GPT4 are always wrong.
On the other side of the coin, there are also sets where GPT4 is always right and GPT3 is always wrong, and vice versa.
Given that there's no methodology to control how a "random" sample gets drawn from these sets, it's hard to come up with a good metric.
So the 0.5 thing is a bit of an anecdotal number from the gut.
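
To make the sampling problem concrete, here's a rough Python sketch. Everything in it is made up for illustration: the two prompt categories, the per-model accuracies, and the names `model_a`, `model_b`, and `estimate_gap` are hypothetical, not measurements of GPT3 or GPT4. All it shows is that the measured gap between two models moves around depending on how the "random" sample happens to be mixed.

```python
import random

# Hypothetical probability of a correct answer per prompt category.
# These numbers are invented for illustration, not measured from any model.
ACCURACY = {
    "easy": {"model_a": 0.90, "model_b": 0.95},
    "hard": {"model_a": 0.10, "model_b": 0.60},
}

def estimate_gap(hard_fraction, n=10_000, seed=0):
    """Estimate accuracy(model_b) - accuracy(model_a) on n sampled prompts.

    hard_fraction is the share of "hard" prompts in the sample, i.e. the
    part of the sampling process that is difficult to control in practice.
    """
    rng = random.Random(seed)
    correct = {"model_a": 0, "model_b": 0}
    for _ in range(n):
        # Draw a prompt category according to the chosen mix.
        category = "hard" if rng.random() < hard_fraction else "easy"
        # Simulate whether each model answers this prompt correctly.
        for model, p in ACCURACY[category].items():
            if rng.random() < p:
                correct[model] += 1
    return (correct["model_b"] - correct["model_a"]) / n

# Same two models, three different sample mixes.
for frac in (0.1, 0.5, 0.9):
    print(f"hard_fraction={frac}: estimated gap = {estimate_gap(frac):.2f}")
```

The models never change between runs, only the mix of prompts does, and the estimated gap changes with it. That's the sense in which a single number depends on whoever picked the sample.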