You don’t read a paper for its conclusion. A good question to ask about a scientific paper is “what did they actually do?” In this case, they asked ChatGPT (presumably GPT3.5) and GPT4 a bunch of logical reasoning questions from some benchmarks and compared the scores to RoBERTa’s. That’s it. Running benchmarks can be useful, but how much you care about the benchmarks is up to you.
Higher scores are better, so it does seem promising that GPT4 got more questions right. The scores aren’t that meaningful to me, but it seems like objective confirmation that GPT4 is better than previous systems at logical reasoning?
Maybe the benchmark scores are more meaningful to someone else? What else have these benchmarks been used for?