Transformers do have coefficients that are fit, but that's a broader notion: fitted coefficients could come from any sort of regression or optimization, and aren't necessarily indicative of biological analogs.
So I think the terms "learned model" or "weights" are malapropisms for Transformers, carried over from deep nets because of structural similarities (like many layers) and a shared development workflow.
The functional units in Transformers' layers have lost their original biological inspiration and functional analog. The core function in Transformers is more like autoencoding/decoding (concepts from information theory) and model/grammar-free translation, with a unique attention-based optimization. Transformers were developed for translation. The magic is something like "attending" to important parts of the translation inputs and outputs as tokens are generated, maybe as a kind of deviation from pure autoencoding, due to the bias from the... learned model :) See, I can't even escape it.
Attention as a powerful systemic optimization is actually the closer bit of neuro/bio-inspiration here, though more from cognitive psychology than micro/neuro-anatomy.
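To make the "attending" idea above concrete, here's a minimal numpy sketch of scaled dot-product attention, the core operation in Transformers. The names Q, K, V and the toy shapes are the standard textbook formulation, not anything from this note; a real Transformer adds learned projections, multiple heads, and masking on top of this.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Each output row is a weighted average of V's rows; the weights say
    # how strongly each query token "attends" to each key token.
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity, scaled by sqrt(d)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query tokens, dimension 4
K = rng.normal(size=(5, 4))   # 5 key tokens
V = rng.normal(size=(5, 4))   # 5 value vectors, one per key
out, w = attention(Q, K, V)
print(out.shape)              # one output vector per query: (3, 4)
print(w.sum(axis=-1))         # attention weights per query sum to 1
```

Note there's nothing biologically "neural" in this core step: it's a softmax-weighted lookup, which is part of why the deep-net vocabulary sits a bit awkwardly on Transformers.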
Btw, not only is attention a key insight for Transformers, but it's an interesting biographical note that the lead inventor of it, Jakob Uszkoreit, went on to work on a bio-AI startup after Google.