For extraordinary claims ('intelligence'), the burden of proof is on those making the claim, not on others to prove the negative.
The sine function is nonlinear.
You sprinkle in a tiny nonlinearity (e.g. x^2 instead of x) and suddenly you can get infinite complexity by weighted composition - which is also the reason why we're so helpless with nonlinear functions and immediately reach for linear approximations (cf. gradient descent).
y = b + x1 + x1^2 + x1^3 + ... + x1 * x2 + (x1 * x2)^2 + ... + x2 + x2^2 + ...
By that point you're effectively making a Taylor approximation of the latent function in a linear feature space, which is also a universal approximator.
So the commenter above is wrong -- neural networks are indeed just glorified linear regression from this point of view.
The main difference is that this kitchen-sink regression is computationally inefficient, whereas neural nets are extremely efficient computationally.
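The "glorified linear regression" point above can be made concrete: expand the input into polynomial features, then fit with ordinary least squares. A minimal numpy sketch (the target function, degree, and grid are illustrative choices, not from the thread):

```python
import numpy as np

# "Kitchen sink" regression: expand x into polynomial features,
# then fit the nonlinear latent function (here sin) with plain
# *linear* least squares over those features.
x = np.linspace(-np.pi, np.pi, 200)
y = np.sin(x)  # the nonlinear latent function

degree = 9
X = np.vander(x, degree + 1)  # columns x^9, x^8, ..., x, 1
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
max_err = np.max(np.abs(pred - y))
print(max_err)  # a degree-9 polynomial tracks sin closely on [-pi, pi]
```

In one dimension this works fine; the inefficiency shows up when the input has many dimensions and the cross terms (x1 * x2, etc.) multiply combinatorially.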
I'm not an expert, but the motivation seems more like this:
- Linear regression and SVM sometimes work. But they apply to very few problems.
- We can fit those models using gradient descent. Alternatives to gradient descent do exist, but they become less useful as the above models get varied and generalised.
- Empirically, if we compose with some simple non-linearities, we get very good results on otherwise seemingly intractable problems like OCR. See kernel SVMs and Kriging.
- Initially, one might choose this non-linearity from a known list. And then fit the model using specialised optimisation algorithms. But gradient descent still works fine.
- To further improve results, the choice of non-linearity must itself be optimised. Call the non-linearity F. We break F into three parts: F' o L o F'', where L is linear, and F' and F'' are "simpler" non-linearities. We recursively factorise the F' and F'' in a similar way. Eventually, we get a deep feedforward neural network. We cannot use fancy algorithms to fit such a model anymore.
- Somehow, gradient descent, despite being a very generic optimisation algorithm, works much better than expected at successfully fitting the above model. We have derived Deep Learning.
Deep learning has been a series of *engineering* successes stacked on top of each other rather than theory being applied rigorously.
It's hard to scale training of "dumb" approximators like a kitchen sink regression, and controlling overfitting becomes a nightmare.
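The scaling problem is easy to quantify: the number of monomials of total degree <= d in n variables is C(n + d, d), so the kitchen-sink feature count blows up combinatorially while a neural layer's parameter count grows only polynomially. A quick stdlib check (the specific n, d values are just examples):

```python
from math import comb

# Feature count for polynomial regression of degree <= d in n variables:
# the number of monomials is C(n + d, d).
for n, d in [(2, 3), (10, 3), (100, 3), (100, 5)]:
    print(f"n={n:>3} d={d}: {comb(n + d, d):>12,} features")

# 100 inputs at degree 3 already needs 176,851 features, and degree 5
# needs 96,560,646 -- while a single 32-unit tanh layer over the same
# 100 inputs has only 100*32 + 32 weights to fit.
```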
(Edit: this whole thread feels like a setup for this punchline)