Sometimes this manifests as "outside the box thinking", like how a genetic algorithm got an "oscillator" which was really just an antenna.
It is a hard problem, and yes we still both need and can make more and better benchmarks; but it's still a problem because it means the benchmarks we do have are overstating competence.