* In standard transformer attention, cost scales quadratically with sequence length, which restricts model context. This work presents a subquadratic operator, allowing models to scale to much larger contexts (100k+ tokens).
* They introduce an operator called the "Hyena hierarchy": a recurrence over two subquadratic operations, long convolution and element-wise multiplicative gating. Sections 3.1-3.3 define the recurrences, matrices, and filters. Importantly, this is a drop-in replacement for attention.
* Longer context: 100x speedup over FlashAttention at 64k context. (If we view FlashAttention as a non-approximate engineering optimization, then this work improves things algorithmically, and gets an order of magnitude over that.) Associative recall (i.e., just pulling data back out of the context) shows improvements: experiments go up to 137k context, with vocab sizes of 10-40. (Unsure why they have bad recall on short sequences with larger vocabs, but they still outperform the others.)
* Comparisons (on relatively small models, but aiming to show a pattern) with RWKV (an attention-free model, trained on 332B tokens) and GPTNeo (trained on 300B tokens), with Hyena trained on 137B tokens. Models are 125M-355M parameters. (Section 4.3)
* On SuperGLUE, zero-shot and 3-shot accuracy is in the same ballpark as GPTNeo (though technically they underperform a bit zero-shot and overperform a bit 3-shot). (Tables 4.5 and 4.6)
* Because they can support large (e.g., 100k+) contexts, they can also do image classification. They report ballpark-comparable results against others. (Table 4.7)
Might have misread some takeaways; happy to be corrected.
Previous discussion: https://news.ycombinator.com/item?id=35502187
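To make the Hyena recurrence from the summary concrete: each step is a long convolution (done in O(L log L) via FFT rather than O(L^2) directly) followed by element-wise gating. A minimal single-channel NumPy sketch, with function and variable names of my own choosing (the real operator also has learned input projections and implicitly parameterized filters):

```python
import numpy as np

def long_conv(u, h):
    """Long convolution via FFT: O(L log L) instead of O(L^2)."""
    L = len(u)
    n = 2 * L  # zero-pad so circular FFT convolution matches linear convolution
    return np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(h, n), n)[:L]

def hyena(v, gates, filters):
    """Hyena-style recurrence: z_{t+1} = x_t * (h_t conv z_t), starting from z_1 = v.

    gates and filters are lists of length-L arrays (one pair per recurrence step).
    """
    z = v
    for x, h in zip(gates, filters):
        z = x * long_conv(z, h)  # subquadratic conv, then element-wise gating
    return z
```

Each step stays subquadratic, so the whole operator avoids the L x L score matrix that attention materializes.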
In contrast, an LLM is usually built from a stack of Transformer blocks, which use something called "self-attention": it modifies each piece of input seen so far to include information about its relationship to the other bits of input. In text this is a natural thing to do (what role does my word play in the sentence?); you can also do it with images (giving you Vision Transformers, aka ViTs), though it may be less natural there. After self-attention comes a small fully connected network, and then the output is passed into the next Transformer block, and so on (commonly 6 times), until the output of the final block is used as the prediction.
In a nutshell: very different architectures, exploiting different aspects of the data (CNN: local structure; Transformer: relationship between all elements in the context).
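The self-attention step described above can be sketched in a few lines of NumPy. This is a simplified single-head version with hypothetical parameter names; note the (L, L) score matrix, which is where the quadratic cost in sequence length comes from:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: sequence of shape (L, d); Wq, Wk, Wv: learned (d, d) projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (L, L) pairwise scores: the quadratic part
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # row-wise softmax over all positions
    return w @ V                              # each position mixes information from every other
```

Every position attends to every other position, which is exactly the all-pairs relationship structure the parent comment contrasts with a CNN's local structure.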
That’s quite the difference
The real breakthrough is that Hyena apparently has an unlimited context window.
It's extrapolated volition time (゚∀゚)
I think that anything replacing attention will suffer quadratic growth for some pathological examples.
Maybe if we had a better understanding of the data we could give a better definition (much like graph complexity is usually given in terms of the actual number of edges, which is theoretically O(n^2)).
Isn't this basically the bitter lesson again? Small improvements work, but in the long term they won't give the same impressive results?
If we could just make big improvements we would.