> Sure there's a squashing function on the output to keep it in a range from 0 to 1 but that's done BECAUSE we're just adding up stuff.
It's not because you're "adding up stuff"; there's a specific mathematical or statistical reason it's used. For neural networks it's there to stop your multi-layer network collapsing into a single-layer one (i.e. a linear-algebra reason). You can choose whatever function you want; for hidden layers tanh generally isn't used anymore, it's usually some variant of a ReLU. In fact Leaky ReLUs are very commonly used, so OP isn't changing the subject.
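To make the collapse argument concrete, here's a minimal NumPy sketch with made-up sizes and random weights: two stacked layers with no activation are exactly one linear layer, and inserting a ReLU is what breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5,))                       # toy input
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=(4,))
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=(3,))

# Two "linear" layers with no activation in between...
no_act = W2 @ (W1 @ x + b1) + b2
# ...are exactly one linear layer with collapsed weights W2@W1 and bias W2@b1 + b2.
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(no_act, collapsed))           # True: the extra depth bought nothing

# Put a ReLU in between and no single linear layer reproduces it in general.
relu = lambda z: np.maximum(z, 0)
with_act = W2 @ relu(W1 @ x + b1) + b2
print(np.allclose(no_act, with_act))            # False for generic weights
```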
If you define a "perceptron" (`g(Wx+b)` with `W` a `1xP` matrix, i.e. a single row of weights) and train it as a logistic regression model, then you want `g` to be the sigmoid. Its purpose is to ensure that the output can be interpreted as a probability (given that you use the correct statistical loss), which happens to mean squashing the number. The inverse isn't true: if I take random numbers from the internet and squash them into `[0,1]`, I don't get to call them probabilities.
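A rough sketch of that pairing (sigmoid output trained with binary cross-entropy); the dataset, learning rate and iteration count here are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up binary classification data: P=2 features, labels follow a noisy linear rule.
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ W + b)            # g(Wx + b), squashed into (0, 1)
    grad_z = (p - y) / len(y)         # gradient of binary cross-entropy w.r.t. Wx + b
    W -= lr * (X.T @ grad_z)
    b -= lr * grad_z.sum()

# Because sigmoid is paired with the matching statistical loss, these outputs are
# probability estimates, not just arbitrary numbers squashed into [0, 1].
print(sigmoid(X[:5] @ W + b).round(3), y[:5])
```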
> and not only is it's PRIMARY function to squash a number, that's it's ONLY function.
Squashing the number isn't the reason; it's a side effect. And even then, as I just said, not all activation functions squash numbers.
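For what it's worth, a quick NumPy check that ReLU and Leaky ReLU leave large inputs unbounded while sigmoid and tanh actually squash them:

```python
import numpy as np

z = np.array([-20.0, -1.0, 0.0, 1.0, 20.0])
print(np.maximum(z, 0))               # ReLU: unbounded above
print(np.where(z > 0, z, 0.01 * z))   # Leaky ReLU: unbounded in both directions
print(1 / (1 + np.exp(-z)))           # sigmoid: squashed into (0, 1)
print(np.tanh(z))                     # tanh: squashed into (-1, 1)
```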
> All the training does is adjust linear weights tho, like I said.
Not sure what your point is. What is a "linear weight"?
We call layers of the form `g(Wx+b)` "linear" layers, but that's an abuse of terminology: if `g()` is non-linear then the output is not linear. Who cares if the inner term `Wx + b` is linear? With enough of these layers you can approximate fairly complicated functions. If you're arguing about whether there is a better fundamental building block, that's another discussion.
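A minimal sketch of that last point, not anyone's proposed architecture: one hidden `g(Wx+b)` block with ReLU units whose kinks are spread over the interval, and only the readout fitted by least squares (a stand-in for full training). The width and the sin(3x) target are illustrative choices.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 400).reshape(-1, 1)
y = np.sin(3 * x).ravel()                         # a distinctly non-linear target

# One hidden layer g(Wx + b): 200 ReLU units with kinks spread across the interval.
width = 200
W = np.ones((1, width))
b = np.linspace(-np.pi, np.pi, width)
H = np.maximum(x @ W + b, 0)                      # hidden activations, shape (400, width)

# Fit only the linear readout by least squares.
coef, *_ = np.linalg.lstsq(H, y, rcond=None)
print("max abs error with ReLU layer:", np.abs(H @ coef - y).max())

# The same readout on the raw affine features (no non-linearity) can only fit a line.
A = np.hstack([x, np.ones_like(x)])
coef_lin, *_ = np.linalg.lstsq(A, y, rcond=None)
print("max abs error, purely linear:", np.abs(A @ coef_lin - y).max())
```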