It feels like neurons in the first layer are weaker, because each of them can only compute a linear separation of the raw input. In a deep network, I was wondering whether adding neurons to the first layer is better than adding them to the last hidden layer, and empirically it seems to be noticeably worse. Is there a theorem that explains this?
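One confound worth noting in such a comparison: adding a neuron to different layers changes the total parameter count by different amounts, so "one extra neuron per layer" is not an equal-budget comparison. A minimal sketch, with hypothetical layer sizes chosen just for illustration:

```python
def mlp_param_count(layer_sizes):
    # Parameters of a fully connected MLP: for each consecutive pair of
    # layers, an (n_in x n_out) weight matrix plus n_out biases.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical baseline: 10 inputs, two hidden layers of 32, 1 output.
base = [10, 32, 32, 1]

# Doubling the first hidden layer vs. doubling the last hidden layer.
wide_first = [10, 64, 32, 1]
wide_last = [10, 32, 64, 1]

print(mlp_param_count(base))        # 1441
print(mlp_param_count(wide_first))  # 2817
print(mlp_param_count(wide_last))   # 2529
```

Here the two widened networks end up with different parameter counts, so any empirical "first layer vs. last layer" comparison should probably control for the total budget, not just the neuron count.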