>Any errors the NN is making are not worth learning about.
But that's the whole point of this method: to understand what errors the NN might be making. It's also quite possible the NN's errors aren't really errors at all, if there are mistakes or noise in the labels.
This technique has been called "dark knowledge" and is really interesting. See http://www.kdnuggets.com/2015/05/dark-knowledge-neural-netwo... They train much simpler models to reach the same accuracy as much bigger models, just by training the small model to match the bigger model's predicted probabilities on the same data. In fact you can get crazy results like this:
>When they omitted all examples of the digit 3 during the transfer training, the distilled net gets 98.6% of the test 3s correct even though 3 is a mythical digit it has never seen.
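The core trick behind results like that is training the student on the teacher's *soft* outputs rather than hard labels. A minimal sketch of that idea, assuming the temperature-scaled softmax and the T^2 loss scaling described in the distillation paper (function names and the example logits here are mine, for illustration):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    # Higher T flattens the distribution, exposing the teacher's
    # "dark knowledge": how plausible each *wrong* class looks.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=4.0):
    # Cross-entropy between the teacher's softened distribution and
    # the student's softened distribution, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    p = softmax_with_temperature(teacher_logits, T)
    q = softmax_with_temperature(student_logits, T)
    return -T * T * sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Hypothetical example: the teacher is confident in class 0 but gives
# class 2 more mass than class 1. That ranking among the "wrong"
# classes is exactly the signal the hard label throws away, and it is
# why a student can learn about a class it never sees directly.
teacher = [10.0, 1.0, 4.0]
student = [8.0, 2.0, 3.0]
soft_targets = softmax_with_temperature(teacher, T=4.0)
loss = distillation_loss(teacher, student, T=4.0)
```

With T=1 the teacher's output is nearly one-hot; raising T spreads probability onto the similar-looking classes, which is where the information about "what a 3 resembles" lives even when no 3s appear in the transfer set.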