AI models have historically found lots of ways to game systems. My favorite example is exploiting bugs in simulator physics to "cheat" at games of computer tag. Another is a model for radiology tasks finding biases in diagnostic results using dates on the images. And of course whenever people discuss a benchmark publicly it leaks the benchmark into the training set, so the benchmark becomes a worse measure.