Great breakdown… with some interesting results and a ton of effort.
Are there any open benchmarks like this that cover all the models that are actually runnable, similar to the data exposed in https://github.com/syhw/wer_are_we but with some of your additional metrics?