The answer isn't sent, though, so it would take a very deliberate effort to fish those out of the API chatter and find the right domain expert with 4-10 hours to spend on cracking each one.
Just letting the AI train on its own wrong output wouldn't help. The benchmark already gives it lots of time for trial and error.