They conclude that GPT4 "does not perform astonishingly well", but compared to what?
They never define what 'doing well' looks like, could not identify an application that does better than GPT4, and could not say what a human benchmark would be for the same task.
I can say, though, that I read the sample question and got it wrong too, so these aren't trivial questions we are giving GPT4.
Given all this, I don't understand how they can support their conclusion that it "does not perform astonishingly well".