* In standard transformer attention, cost scales quadratically with sequence length, which restricts model context. This work presents a subquadratic operator, allowing models to scale to much larger contexts (100k+ tokens).
* They introduce an operator called the "Hyena hierarchy": a recurrence over two subquadratic operations, long convolution and element-wise multiplicative gating. Sections 3.1-3.3 define the recurrences, matrices, and filters. Importantly, this is a drop-in replacement for attention.
* Longer context: 100x speedup over FlashAttention at 64k context. (If we view FlashAttention as a non-approximate engineering optimization, then this work improves things algorithmically, and gets an order of magnitude over that.) Associative recall (i.e., just pulling data back out of the context) shows improvements: experiments go up to 137k context, with vocab sizes of 10-40. (Unsure why they have bad recall on short sequences with larger vocabs, but they still outperform the others.)
* Comparisons (on relatively small models, but aiming to show a pattern) with RWKV (an attention-free model, trained on 332B tokens) and GPTNeo (trained on 300B tokens), with Hyena trained on 137B tokens. Models are 125M-355M parameters. (Section 4.3)
* On SuperGLUE, zero-shot and 3-shot accuracy is in the same ballpark as GPTNeo (though technically they underperform a bit zero-shot and overperform a bit 3-shot). (Tables 4.5 and 4.6)
* Because they can support large (e.g., 100k+) contexts, they can also do image classification. They report ballpark-comparable results against others. (Table 4.7)
Might have misread some takeaways; happy to be corrected.
Previous discussion: https://news.ycombinator.com/item?id=35502187
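To make the Hyena recurrence from the summary concrete: each step is a long convolution (done in O(L log L) via FFT rather than O(L^2) directly) followed by element-wise gating. A minimal single-channel NumPy sketch, with function and variable names of my own choosing (the real operator also has learned input projections and implicitly parameterized filters):

```python
import numpy as np

def long_conv(u, h):
    """Long convolution via FFT: O(L log L) instead of O(L^2)."""
    L = len(u)
    n = 2 * L  # zero-pad so circular FFT convolution matches linear convolution
    return np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(h, n), n)[:L]

def hyena(v, gates, filters):
    """Hyena-style recurrence: z_{t+1} = x_t * (h_t conv z_t), starting from z_1 = v.

    gates and filters are lists of length-L arrays (one pair per recurrence step).
    """
    z = v
    for x, h in zip(gates, filters):
        z = x * long_conv(z, h)  # subquadratic conv, then element-wise gating
    return z
```

Each step stays subquadratic, so the whole operator avoids the L x L score matrix that attention materializes.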
In contrast, an LLM is usually built from a stack of Transformer blocks, which use something called "self-attention": it modifies each piece of input seen so far to include information about its relationship to the other bits of input. In text this is a natural thing to do (what role does my word play in the sentence?); you can also do it with images (giving you Vision Transformers, aka ViTs), though it may be less natural there. After self-attention comes a small fully connected network, and then the output is passed into the next Transformer block, and so on (commonly 6 times), until the output of the final block is used as the prediction.
In a nutshell: very different architectures, exploiting different aspects of the data (CNN: local structure; Transformer: relationship between all elements in the context).
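The self-attention step described above can be sketched in a few lines of NumPy. This is a simplified single-head version with hypothetical parameter names; note the (L, L) score matrix, which is where the quadratic cost in sequence length comes from:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: sequence of shape (L, d); Wq, Wk, Wv: learned (d, d) projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (L, L) pairwise scores: the quadratic part
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # row-wise softmax over all positions
    return w @ V                              # each position mixes information from every other
```

Every position attends to every other position, which is exactly the all-pairs relationship structure the parent comment contrasts with a CNN's local structure.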
That’s quite the difference
The real breakthrough is that Hyena apparently has an unlimited context window.
It's extrapolated volition time (゚∀゚)
I think that anything replacing attention will suffer quadratic growth for some pathological examples.
Maybe if we had a better understanding of the data we could give a better definition (much like graph complexity is usually given in terms of the actual number of edges, which is theoretically O(n^2)).
Isn't this basically the bitter lesson again? Small improvements work, but in the long term they won't give the same impressive results?
If we could just make big improvements we would.