The specific code I was working on, I had a general idea of the sort of performance improvement that would be possible. I just thought that it would be too hard for the models to figure out without a lot of hand-holding.
But it ended up being not "too hard ever", but more like, in 1 out of every 5 tries, the model did in fact manage to get a large refactoring to the point where it improved performance. So once I set it up to try something, use the perf test, see if it worked, if not, throw it away, repeat. Then it started, slowly, finding some useful things.