We aren't trying to make a rigorous statement here -- we're trying to draw attention to the fact that the most common metrics do not give much insight into what a student has actually shown mastery of. This is especially important when you consider that the weightings of particular questions are often fairly arbitrary.
I certainly agree that all variability is not meaningful variability, but I'd push back a bit and say that there's meaningful variability in what's shown here. We'll go into more depth and hopefully have something interesting to report.
I've also seen a fair number of comments stating that this is not a surprising result. I'd agree (if you've thought about it), but if you look at what's happening in practice, it's clear that either many people would be surprised by this, or are at least unable to act on it. We're hoping to help with the latter.