When there is no structure to learn, the network has to memorize the data. But when structure is present, it is cheaper to represent the structure than the individual examples, so the network learns the structure first (and memorizes the exceptions afterward).
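The claim can be illustrated with a toy experiment: train a small network on data where most labels follow a simple rule and a minority are random, and record when each subset is first fit. This is a minimal sketch, not from the original discussion; the sizes, learning rate, and 95% threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 20 dimensions. 160 labels follow a linear
# rule (the "structure"); the other 40 are random (pure memorization).
n, d, n_noise = 200, 20, 40
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)
noise_idx = rng.choice(n, size=n_noise, replace=False)
y[noise_idx] = rng.integers(0, 2, size=n_noise).astype(float)
struct_idx = np.setdiff1d(np.arange(n), noise_idx)

# One-hidden-layer ReLU network trained by full-batch gradient descent.
h, lr = 100, 1.0
W1 = rng.normal(size=(d, h)) * 0.1
b1 = np.zeros(h)
W2 = rng.normal(size=h) * 0.1
b2 = 0.0

first_fit = {"structure": None, "noise": None}
for epoch in range(5000):
    a = np.maximum(X @ W1 + b1, 0.0)            # ReLU hidden layer
    z = np.clip(a @ W2 + b2, -30.0, 30.0)       # clip logits for stability
    p = 1.0 / (1.0 + np.exp(-z))                # sigmoid output
    pred = (p > 0.5).astype(float)
    # Record the first epoch at which each subset is >= 95% fit.
    if first_fit["structure"] is None and \
            (pred[struct_idx] == y[struct_idx]).mean() >= 0.95:
        first_fit["structure"] = epoch
    if first_fit["noise"] is None and \
            (pred[noise_idx] == y[noise_idx]).mean() >= 0.95:
        first_fit["noise"] = epoch
    # Backprop for the binary cross-entropy loss (averaged over n).
    g = (p - y) / n
    ga = np.outer(g, W2) * (a > 0)
    W2 -= lr * (a.T @ g)
    b2 -= lr * g.sum()
    W1 -= lr * (X.T @ ga)
    b1 -= lr * ga.sum(axis=0)

print(first_fit)
```

In runs like this, the structured subset typically crosses the threshold many epochs before the noisy subset does (if the noisy subset is ever fit at all within the epoch budget), which is the "structure first, memorization after" ordering the comment describes.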
That sounds like it’s probably right to me. But so do lots of things that turn out to be wrong. I wish we had a better grasp of what is happening, not just plausible stories. I’m already sick of doing alchemical tinkering to find a model that works.