In that case, just make new problems. If it is only being 'patched' to pass specific known problems, then it would fail the new ones.
If it is able to answer them, then maybe it is actually analyzing them and working out the solution.
I'm not sure how you can assume that there was no underlying improvement and that these are just cases of feeding it the answers.