From the article: "We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative—see our technical report for details."
I’m skeptical. There is a lot of wiggle room in “no specific training”. It could just mean they didn’t fine-tune the model for any of the tests. Their training data probably included many past LSAT exams and certainly included many instances of people discussing how to solve LSAT problems.