Well, the AI can at least be constrained to "the code actually compiles, runs, and produces the correct output."
The number of internet answers that can't pass that bar is distressingly high.
I mean, I read the chart in TFA as showing that this doesn't always happen.