The problem with this sort of argument against caring about SOTA scores is that there is only so much luck to go around. Any individual 5% reduction in error rate could, in theory, be mostly luck. But if a chain of such small reductions adds up to something like a factor of 2 between the first result and the last, then somewhere along that chain there must have been real, gradual improvement, even if every individual step is suspect.
SOTA scores don't matter that much on CIFAR-10 any more, since it is pretty much a solved benchmark. But CIFAR was only solved because of exactly this kind of incremental progress, and papers that focus on moving the state of the art now use newer, much harder benchmarks.
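
To put rough numbers on the "only so much luck" point, here is a small sketch. The first function counts how many compounding 5% relative reductions it takes to halve an error rate. The second estimates how low a measured error can get from luck alone when the true error never changes. The 10% true error rate and the 1,000 attempts are made-up values for illustration (10,000 is the size of the CIFAR-10 test set), and the test-set noise is modelled with a normal approximation to the binomial, so treat the output as a ballpark rather than a careful analysis.

```python
import math
import random


def steps_to_halve(relative_reduction: float) -> int:
    """Smallest number of compounding relative reductions that at least halves the error."""
    return math.ceil(math.log(0.5) / math.log(1.0 - relative_reduction))


def best_lucky_error(true_error: float, test_size: int, attempts: int, seed: int = 0) -> float:
    """Lowest measured error over many attempts when the true error is fixed.

    Assumption: test-set noise is approximated as Gaussian with the
    binomial standard deviation sqrt(p * (1 - p) / n).
    """
    rng = random.Random(seed)
    sigma = math.sqrt(true_error * (1.0 - true_error) / test_size)
    return min(rng.gauss(true_error, sigma) for _ in range(attempts))


if __name__ == "__main__":
    n = steps_to_halve(0.05)
    print(f"5% reductions needed to halve the error: {n}")
    print(f"error remaining after {n} steps: {0.95 ** n:.3f} of the original")

    # Hypothetical setup: 10% true error, 10,000 test images, 1,000 lucky tries.
    lucky = best_lucky_error(true_error=0.10, test_size=10_000, attempts=1_000)
    print(f"best luck-only measured error: {lucky:.4f}")
```

Halving the error takes 14 consecutive 5% reductions. Under these assumptions, the best luck-only result stays within a few standard deviations of the true error, nowhere near half of it, so a factor of 2 cannot be explained by noise alone.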