Why do you think that defining accuracy in relative terms works in favor of this model? This pairwise relative measure should give you less confidence that the model generalizes, because now we don't even know the model's actual confidence levels on these pairs, only that the pairs were ordered correctly. That further supports my claim that the way they're measuring results is designed to make them appear more significant than is justified.
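To make that concrete, here's a toy sketch (with made-up scores; nothing here is from the paper) of why a pairwise ordering metric says nothing about confidence: a well-separated model and a barely-above-chance model get identical scores.

```python
import itertools

# Hypothetical scores for two very different models on the same items
pos_scores_a = [0.91, 0.88, 0.95]   # confident, well-separated model
pos_scores_b = [0.52, 0.51, 0.53]   # barely distinguishes the classes
neg_scores   = [0.50, 0.49, 0.48]

def pairwise_accuracy(pos, neg):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    pairs = list(itertools.product(pos, neg))
    return sum(p > n for p, n in pairs) / len(pairs)

print(pairwise_accuracy(pos_scores_a, neg_scores))  # 1.0
print(pairwise_accuracy(pos_scores_b, neg_scores))  # also 1.0
```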
Explaining the model becoming more "accurate" by this measure is pretty easy. The model is working with an extremely small and skewed dataset for this sort of thing, and has overfit to tendencies in that dataset. Given the kinds of numbers we're working with and that measure, a jump from 81% to 91% "accuracy" does not seem particularly significant. Especially since, again, the classifier fails to meet even the accuracy baseline needed to beat a null hypothesis under a more realistic measurement, and that baseline would probably need to be even higher to reflect the lower statistical power of this pairwise standard.
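For a rough sense of why I don't find that jump significant, here's a back-of-the-envelope check assuming (hypothetically; the actual n isn't stated here) a test set of about 100 pairs. Even the crude normal-approximation confidence intervals overlap, and since pairs share items they aren't independent, so the effective sample size is smaller still.

```python
import math

def binom_ci(p, n, z=1.96):
    """Normal-approximation 95% CI for a binomial proportion."""
    se = math.sqrt(p * (1 - p) / n)
    return (p - z * se, p + z * se)

print(binom_ci(0.81, 100))  # ~(0.73, 0.89)
print(binom_ci(0.91, 100))  # ~(0.85, 0.97) -- the intervals overlap
```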
In any real-world application, this classifier would need to make a judgement in situ based on some confidence threshold (sketched below). From that perspective, this metric is worse than useless: it doesn't actually demonstrate that the result is even as significant as the (again, below the base rate) thresholds described in the summary, and yet this methodological smoke and mirrors has seemingly convinced you after a more thorough reading. I imagine this is similar to the process by which these systems are sold to investors.
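Here's what I mean by the in-situ problem, with entirely made-up numbers: under a skewed base rate, a model with perfect pairwise ordering can still lose to the null classifier ("always predict the majority class") once you force it to commit to a threshold.

```python
# Hypothetical base rate: 5% of cases are positive
base_rate = 0.05
null_accuracy = 1 - base_rate   # 0.95 from always predicting "negative"

# Suppose thresholding the scores gives a 90% true positive rate and a
# 10% false positive rate -- a respectable-looking operating point, but:
tpr, fpr = 0.90, 0.10
model_accuracy = base_rate * tpr + (1 - base_rate) * (1 - fpr)
print(model_accuracy)  # 0.9 -- still below the 0.95 null baseline
```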