To use your metaphor, TF-IDF will result in ‘fixed’ weights.
Attention makes it so that the weights of each token can be different in each sequence of tokens. Same token gets different weights depending on who its ‘neighbors’ in the sequence end up being.
The model 'uses' this property to express context-aware dependencies, and that is what lets it handle such a wide variety of natural language problems.
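A toy sketch of the idea: plain scaled dot-product self-attention over random "embedding" vectors, with no learned projections (everything here is made up for illustration). The same 'bank' vector ends up with different attention weights depending on which neighbor it shares the sequence with, whereas a TF-IDF weight for 'bank' would be the same in both.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(seq):
    # Scaled dot-product self-attention weights (no learned Q/K/V
    # projections -- purely illustrative). Row i says how much token i
    # attends to every token in the sequence.
    d = seq.shape[-1]
    scores = seq @ seq.T / np.sqrt(d)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
bank, river, money = rng.normal(size=(3, 4))  # toy 4-d "embeddings"

# The exact same 'bank' vector, in two different neighborhoods:
w_river = attention_weights(np.stack([bank, river]))
w_money = attention_weights(np.stack([bank, money]))

print(w_river[0])  # 'bank' weights when its neighbor is 'river'
print(w_money[0])  # 'bank' weights when its neighbor is 'money'
```

The two printed rows generally differ, even though the 'bank' vector itself never changed; only its neighbors did. That context sensitivity is exactly what a fixed per-token weight like TF-IDF cannot express.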