Another way to think of it is that each layer learns an efficient compression scheme for translating the representation at its input into the one at its output. In information-theoretic terms, each layer learns a representation of the input that uses as few bits as possible while preserving as much information as possible about the target output. There was a great talk given recently by Naftali Tishby where he explains this in great detail.[1]
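For anyone who wants the one-line version of what Tishby is optimizing, here's a sketch of the standard information bottleneck objective (the symbols T, β, and I(·;·) follow the general formulation from the literature, not anything specific to the talk):

```latex
% Information bottleneck objective (Tishby, Pereira & Bialek):
% choose a stochastic encoding p(t|x) of the input X into a
% representation T that minimizes
%     L = I(X;T) - beta * I(T;Y)
% i.e., compress X (keep I(X;T) small) while retaining the bits
% that predict the output Y (keep I(T;Y) large); beta > 0 sets
% the trade-off between compression and prediction.
\min_{p(t \mid x)} \; \mathcal{L} \;=\; I(X;T) \;-\; \beta\, I(T;Y)
```

In that picture, each layer sits somewhere on the compression/prediction trade-off curve that β traces out.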
Knowing the math is great for understanding how it works on a granular level. But I've found that explaining it in holistic terms like this also serves a purpose: it fits "linear algebra + calculus" into an understanding of NNs that is greater than the sum of the parts.
Only 383k subscribers. Hmm. Remember to subscribe, comment, and smash that like button.
Real Engineering is another great channel. And of course Veritasium and Numberphile.