I skimmed through the paper and I didnt see any mention of what parameters they used other than they use gpt-5 via the API.
What was the reasoning_effort? verbosity? temperature?
These things matter.
So it makes sense to me that you should try until you get the results you want (or fail to do so). And it makes sense to ask people what they've tried. I haven't done the work yet to try this for gpt5 and am not that optimistic, but it is possible it will turn out this way again.
Maybe I’m misunderstanding, but it sounds like you’re framing a completely normal proces (try, fail, adjust) as if it’s unreasonable?
In reality, when something doesn’t work, it would seem to me that the obvious next step is to adapt and try again. This does not seem like a radical approach but instead seems to largely be how problem solving sort of works?
For example, when I was a kid trying to push start my motorcycle, it wouldn’t fire no matter what I did. Someone suggested a simple tweak, try a different gear. I did, and instantly the bike roared to life. What I was doing wasn’t wrong, it just needed a slight adjustment to get the result I was after.
1. this is magic and will one-shot your questions 2. but if it goes wrong, keep trying until it works
Plus, knowing it's all probabilistic, how do you know, without knowing ahead of time already, that the result is correct? Is that not the classic halting problem?