It's, admittedly, a tough task to measure objectively though, in that it's like a code review. If a Principal Engineer pointed out 20 deficiencies in a code change and another Principal Engineer pointed out 18 of the same 20 things, but also pointed out 3 other things that the first reviewer didn't, it doesn't necessarily mean either review is wrong – they just meaningfully deviate from each other.
In this case, we chose an expert that we treat as an objective "source of truth".
re: simple tasks – We run hundreds of thousands of tasks every month with more-or-less deterministic behavior (in that, we'll reliably do it correctly a million out of a million times). We chose a particularly challenging task for the case-study though.
re: in a paying business context – FWIW, most industries are filled with humans doing tasks where the rate of perfection is far below 90%.