The big problem with this is that many domains of math use hyper specialized notation, novel terms, different styles etc, and there's not much data for them to train on within any given niche.
For example, the IUT "proof" of the abc conjecture used completely novel systems to come to its conclusions, to the point that it took top number theorists a few years to even parse what it was saying, and only then could they find the faults in the proof.
I think the IUT proof would be a great adversarial example for any of these systems; if you can find the problem in the proof, you know you've hit on something pretty powerful.