So how did we deal with the human mistakes? You mentioned it:
- Get humans to check each other's work
- Systematize the process -- breaking it down into smaller and smaller tasks where the likelihood of mistakes decreases
- Replace as much as possible with deterministic code
There's absolutely no reason you can't do this with LLMs -- and it might help quite a bit since LLMs are cheap. There are also hybrid systems -- where human checkers are replaced or augmented with LLM checkers.
For example -- I have an LLM check all my scientific papers for typos and minor errors. It's caught quite a few, and when it flagged something that was not actually an error, it was usually something which would benefit from clarification anyways.
Now -- if I could afford to pay a grad student to do that, that would be even better! But I can't, and even if I could, not all the work which warrants a few cents of tokens warrants a few hundred dollars of tedious grad student labor -- especially when the grad student has a very strong incentive to say LGTM (nothing here is life critical!).
Likewise, we could imagine:
- A deterministic process with a heuristic + an LLM in the loop checking, for example -- "is this likely correct?" -- perhaps escalating to a human (or a bigger LLM) in case of anomaly. I can see this being amazingly useful for automated refactors.
- Automatic paperwork/customer service processing -- if the cost-of-failure can be bounded (say $X) and testing shows the failure rate is acceptably low (say Y% of the time) -- it might be cheaper to run an AI system and eat that cost, especially if continuous monitoring tells you when you have to "shut it down."
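To make the paperwork arithmetic concrete, here's a rough sketch in Python. The function names, the per-case costs, and the 100-sample minimum are all made up for illustration; the only inputs taken from the argument above are a bounded cost of failure (X) and a measured failure rate (Y).

```python
def expected_cost(run_cost: float, failure_rate: float, failure_cost: float) -> float:
    """Expected cost per case: what you pay to run it, plus the average cost of its failures."""
    return run_cost + failure_rate * failure_cost


def break_even_failure_rate(run_cost: float, failure_cost: float, human_cost: float) -> float:
    """Failure rate at which automation costs the same per case as the human baseline."""
    return (human_cost - run_cost) / failure_cost


def should_shut_down(failures: int, total: int, max_rate: float, min_samples: int = 100) -> bool:
    """Continuous-monitoring check: trip once the observed failure rate exceeds the budget."""
    if total < min_samples:  # too little data to judge either way
        return False
    return failures / total > max_rate


# Made-up numbers: $0.02 of tokens per case, failures capped at $50 (X),
# a 3% measured failure rate (Y), and a $4.00 human baseline that we
# treat as error-free, which flatters the human.
ai_cost = expected_cost(run_cost=0.02, failure_rate=0.03, failure_cost=50.0)
limit = break_even_failure_rate(run_cost=0.02, failure_cost=50.0, human_cost=4.00)
print(f"AI: ${ai_cost:.2f} per case, break-even failure rate: {limit:.2%}")
```

With these invented numbers the AI comes out around $1.52 per case against $4.00, and stays ahead until failures approach 8%. The real work is in bounding X and measuring Y honestly, not in the formula.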
In both cases -- there's nothing stopping an LLM from potentially having better-than-human average performance, and perhaps delegating real edge cases to actual experts. Remember: you're not competing with motivated PhDs, you're competing with minimum wage labor reading a list of instructions, which is like a prompt except poorly formatted and missing steps.
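The "deterministic step + heuristic + LLM check, escalate on anomaly" idea could look something like this in Python. It's a minimal sketch, not a real implementation: `checked_step` and all the callables are hypothetical names, and the LLM check is a stub function so the example runs without calling any model API.

```python
from typing import Callable, Tuple


def checked_step(
    item: str,
    transform: Callable[[str], str],
    heuristic_ok: Callable[[str, str], bool],
    llm_ok: Callable[[str, str], bool],
    escalate: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Run a deterministic transform, then gate its output.

    Returns (result, route), where route is "auto" or "escalated".
    """
    result = transform(item)
    # Cheap deterministic heuristic first; the LLM is only asked
    # "is this likely correct?" if the heuristic passes.
    if heuristic_ok(item, result) and llm_ok(item, result):
        return result, "auto"
    # Anomaly: hand off to a human or a bigger model.
    return escalate(item, result), "escalated"


# Toy refactor: rename an identifier.
def rename(src: str) -> str:
    return src.replace("old_name", "new_name")


def same_line_count(before: str, after: str) -> bool:
    return len(before.splitlines()) == len(after.splitlines())


def stub_llm_check(before: str, after: str) -> bool:
    # Stand-in for a real model call; here it just checks the rename took effect.
    return "old_name" not in after


def keep_original(before: str, after: str) -> str:
    # Stand-in for escalation: leave the code untouched until someone reviews it.
    return before


print(checked_step("x = old_name + 1", rename, same_line_count, stub_llm_check, keep_original))
```

The design choice is that anything the cheap checks can't vouch for falls through to the expensive reviewer, so the expert only sees the edge cases.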