Let's say there are 10 subtasks that need to be done.
Let's say a human has a 99% chance of getting each one right, by doing proper testing etc. And let's say the AI has a 95% chance (being very generous here).
0.99^10 ≈ 0.90, so the human has about a 90% chance of getting the whole thing to work. 0.95^10 ≈ 0.60, only a 60% chance. Almost a coin toss.
Even with a 98% per-subtask success rate, the compounded success rate still drops to about 81%.
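The compounding math above is easy to check yourself. Here's a minimal sketch, assuming the subtasks succeed or fail independently (which the whole argument relies on):

```python
# Probability that ALL subtasks succeed, assuming independence:
# it's just the per-task rate raised to the number of subtasks.
def overall_success(per_task_rate: float, subtasks: int = 10) -> float:
    return per_task_rate ** subtasks

for rate in (0.99, 0.98, 0.95):
    print(f"{rate:.0%} per task -> {overall_success(rate):.1%} overall")
# prints roughly:
# 99% per task -> 90.4% overall
# 98% per task -> 81.7% overall
# 95% per task -> 59.9% overall
```

Note how a seemingly small gap in per-task reliability (99% vs 95%) turns into a 30-point gap over just ten steps.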
The thing is that LLMs aren't just "a little bit" worse than humans. In comparison they're cavemen.