Another important lesson is that good ideas often get passed over because of hype or politics. We like to pretend that science is all about merit and what is correct. Unfortunately, this isn't true. It is that way in the long run, but in the short run there's a lot of politics, and humans still get in their own way. This is a solvable problem, but we need to acknowledge it and make systematic changes. Unfortunately, a lot of that is coupled to the aforementioned problem.
> I do respect the extent to which he continues his credit-attribution crusade even to his own reputational detriment.
As should we all. Clearly he was upset that others got credit for his contributions. But what I do appreciate is that he has recognized that it is a problem bigger than him, and is trying to combat the problem at large and not just on his own little battlefield. That's respectable.
Lol, I used to notice him before covid, when he was railing against Bengio, Hinton, and LeCun. Can't believe he's still going.
After reading Lang & Witbrock 1988 https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf I'm not sure how convincing I find this explanation.
Now let me address the other possibility that you are talking about: what if residual connections aren't necessary? What if there is another way? What are the criteria necessary to avoid exploding or vanishing gradient or slow learning in the absence of both?
For that we need to first know why residual connections work. There is no way around calculating the back propagation formula by hand, but there is an easy trick to make it simple. We don't care about the number of parameters in the network, we only care about the flow of the gradient. So just have a single input and output with hidden size 1 and two hidden layers.
Each layer has a bias and a single weight and an activation function.
Let's assume you initialize each weight and bias with zero. The forward pass returns zero for any input, and the gradient reaching the first layer is zero. In this artificial scenario the gradient starts vanished and stays vanished. The reason is pretty obvious when you apply back propagation: the second layer's zero weight multiplies the gradient of the first layer by zero. If there were only a single layer, its gradient would be non-zero and would yield a non-zero update, rescuing the network from the vanishing gradient.
Now what if you add residual connections? The forward pass stays the same, but the backward pass changes for two layers and beyond. The gradient for the second layer consists of just the derivative of the second layer's activation function multiplied by the first layer's activation from the forward pass. The first layer's gradient consists of the second layer's gradient, with the first layer's activation replaced by the gradient of the first layer; but because it is a residual net, you also add the gradient of just the first layer.
In other words, the first layer is trained independently of the layers that come after it, but also gets feedback from higher layers on top. This allows it to become non zero, which then lets the second layer become non zero, which lets the third be non zero and so on.
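To make the argument above concrete, here is a minimal sketch of the two-layer, hidden-size-1 network with hand-written back propagation. The tanh activation, the squared-error loss, and the helper name `grads` are assumptions chosen for illustration; the comment does not specify them.

```python
import math

def tanh_grad(z):
    # Derivative of tanh at pre-activation z.
    return 1.0 - math.tanh(z) ** 2

def grads(x, target, w1, b1, w2, b2, residual):
    """Gradients of 0.5 * (y - target)^2 for a 1-1-1 network.

    Assumed setup: tanh activations; with residual=True each layer
    adds its input to its output.
    """
    # Forward pass.
    z1 = w1 * x + b1
    a1 = math.tanh(z1)
    h1 = x + a1 if residual else a1
    z2 = w2 * h1 + b2
    a2 = math.tanh(z2)
    y = h1 + a2 if residual else a2

    # Backward pass.
    dy = y - target
    dz2 = dy * tanh_grad(z2)
    dw2 = dz2 * h1
    db2 = dz2
    # Gradient reaching h1: through layer 2's weight, plus the skip path.
    dh1 = dz2 * w2 + (dy if residual else 0.0)
    dz1 = dh1 * tanh_grad(z1)
    return {"w1": dz1 * x, "b1": dz1, "w2": dw2, "b2": db2}

# Zero initialization, as in the thought experiment.
plain = grads(1.0, 2.0, 0.0, 0.0, 0.0, 0.0, residual=False)
res = grads(1.0, 2.0, 0.0, 0.0, 0.0, 0.0, residual=True)
print("plain:   ", plain)
print("residual:", res)
```

Under these assumptions, the plain network's first-layer gradients are exactly zero (only the last layer's bias receives a signal), while with the skip path every parameter gets a non-zero gradient.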
Since the degenerate case of a zero initialized network makes things easy to conceptualise, it should help you figure out what other ways there are to accomplish the same task.
For example, what if we apply the loss to every layer's output as a regularizer? That is essentially doing the same thing as a residual, but with skip connections that sum up the outputs. You could replace the sum with a weighted sum where the weights are not equal to 1.0.
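As a rough sketch of the per-layer loss idea, the same zero-initialized 1-1-1 network can be given an extra loss term on the first layer's output. The tanh activation, the squared-error loss, and the names `aux_loss_grads` and `aux_weight` are assumptions for illustration only.

```python
import math

def aux_loss_grads(x, target, w1, b1, w2, b2, aux_weight=1.0):
    """Gradients for a plain 1-1-1 tanh network with an auxiliary loss.

    Assumed total loss:
        0.5 * (y - target)^2 + aux_weight * 0.5 * (h1 - target)^2
    """
    # Forward pass (no residual connections).
    h1 = math.tanh(w1 * x + b1)
    y = math.tanh(w2 * h1 + b2)

    # Backward pass; tanh'(z) = 1 - tanh(z)^2.
    dz2 = (y - target) * (1.0 - y ** 2)
    # h1 gets gradient from layer 2 and directly from the auxiliary loss.
    dh1 = dz2 * w2 + aux_weight * (h1 - target)
    dz1 = dh1 * (1.0 - h1 ** 2)
    return {"w1": dz1 * x, "b1": dz1, "w2": dz2 * h1, "b2": dz2}

with_aux = aux_loss_grads(1.0, 2.0, 0.0, 0.0, 0.0, 0.0, aux_weight=1.0)
without_aux = aux_loss_grads(1.0, 2.0, 0.0, 0.0, 0.0, 0.0, aux_weight=0.0)
print("with auxiliary loss:   ", with_aux)
print("without auxiliary loss:", without_aux)
```

With `aux_weight=0.0` the first layer's gradient is zero at zero initialization; with a non-zero `aux_weight` the first layer receives a direct signal. Changing `aux_weight` corresponds to the weighted sum with weights other than 1.0.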
But what if you don't want skip connections either, because they are too similar to residual networks? A residual network has one skip connection already and summing up in a different way is uninteresting. It is also too reliant on each layer being encouraged to produce an output that is matched against the label.
In other words, what if we wanted to let the inner layers not be subject to any correlation with the output data? You would need something that forces the gradients away from zero but also away from excessively high numbers. I.e. weight regularization or layer normalisation with a fixed non zero bias.
Predictive coding and especially batched predictive coding could also be a solution to this.
Predictive coding predicts the input of the next layer, so the only requirement is that the forward pass produces a non zero output. There is no requirement for the gradient to flow through the entire network.
The person with whom an idea ends up associated often isn't the first person to have the idea. Most often it is the person who explains why the idea is important, finds a killer application for it, or otherwise popularizes it.
That said, you can open what Schmidhuber would say is the paper which invented residual NNs. Try and see if you notice anything about the paper that perhaps would hinder the adoption of its ideas [1].
[1] https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdv...
1. If two things are not both true, then one or both of them must be false. (And the reverse.)
2. If neither of two things is true, then both of them are false. (And the reverse.)
You might notice that both statements are blindingly obvious, but we've named them after Augustus de Morgan anyway.
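For reference, the two statements can be written symbolically as follows, where $A$ and $B$ are placeholder names for the "two things":

```latex
\begin{align}
\neg (A \land B) &\iff \neg A \lor \neg B \\
\neg (A \lor B)  &\iff \neg A \land \neg B
\end{align}
```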
Starting in the 1930s, though, that tradition began to change... for reasons that I'm sure won't ever apply to American English. Nosirree, Bob, we're special. Great, even.
Conversely, a huge amount of science is just scientists going "here's something I found interesting" but no one can figure out what to do with it. Then 30 or 100 years go by and it's useful in a field that didn't even exist at the time.
It seems that these two people, Schmidhuber and Hochreiter, were perhaps solving the right problem for the wrong reasons. They thought this was important because they expected that RNNs could hold memory indefinitely. Because of BPTT, you can think of that as a NN with infinitely many layers. At the time, I believe nobody worried about vanishing gradients for deep NNs, because the compute power for networks that deep just didn't exist. But nowadays that's exactly how their solution is applied.
That's science for you.
1. In LSTMs skip connections help propagate gradients backwards through time. In ResNets, skip connections help propagate gradients across layers.
2. Forking the dataflow is part of the novelty, not only the residual computation. Shortcuts can contain things like batch norm, down sampling, or any other operation. LSTM "residual learning" is much more rigid.
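A rough way to see the contrast between the two points above is to put the update rules side by side. The notation here is assumed, not taken from the comment: $c_t$ is the LSTM cell state, $f_t$ and $i_t$ are the forget and input gates, $\tilde{c}_t$ is the candidate update, $x_l$ is the input to layer $l$, $F$ is the residual branch, and $S$ stands for whatever operation sits on the shortcut.

```latex
\begin{align}
\text{LSTM (across time):} \quad & c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
\text{ResNet, identity shortcut:} \quad & x_{l+1} = x_l + F(x_l) \\
\text{ResNet, general shortcut:} \quad & x_{l+1} = S(x_l) + F(x_l)
\end{align}
```

In the LSTM the additive path is fixed to the gated previous cell state, whereas in the ResNet form the shortcut $S$ can be the identity, a projection, a down-sampling step, or another operation.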
LSTMs are an incredible architecture, I use them a lot in my research. While LSTMs are useful over many more timesteps than other RNNs, LSTMs certainly don't offer 'essentially unlimited depth'.
When training LSTMs whose inputs were sequences of amino acids, with lengths that easily top 3,000 timesteps, I got huge amounts of instability... with gradients rapidly vanishing. Tokenizing the AAs, getting the number of timesteps down to more like 1,500, has made things way more stable.