I'm not sure it'd help what they are talking about much.
E.g. go back in time and imagine you didn't know there are ways for computers to be really good at performing integration yet as nobody had tried to make them. If someone asked you how to tell if something is intelligent "the ability to easily reason integrations or calculate extremely large multiplications in mathematics" might seem like a great test to make.
Skip forward to the modern era and it's blatantly obvious CASes like Mathematica on a modern computer range between "ridiculously better than the average person" to "impossibly better than the best person" depending on the test. At the same time, it becomes painfully obvious a CAS is wholly unrelated to general intelligence and just because your test might have been solvable by an AGI doesn't mean solving it proves something must have been an AGI.
So you come up with a new test... but you have the same problem as originally, it seems like anything non-human completely bombs and an AGI would do well... but how do you know the thing that solves it will have been an AGI for sure and not just another system clearly unrelated?
Short of a more clever way what GP is saying is the goalposts must keep being moved until it's not so obvious the thing isn't AGI, not that the average human gets a certain score which is worse.
.
All that aside, to answer your original question, in the presentation it was said the average human gets 85% and this was the first model to beat that. It was also said a second version is being worked on. They have some papers on their site about clear examples of why the current test clearly has a lot of testing unrelated to whether something is really AGI (a brute force method was shown to get >50% in 2020) so their aim is to create a new goalpost test and see how things shake out this time.