You claimed that self-attention was needed to achieve the level of intelligence we've seen with GPT-3.5:
> without those attention heads even the scaling up to current parameter sizes we have to day would not have ended up with the level of emergent intelligence that shocked the world with GPT 3.5. (Verbatim quote from you https://news.ycombinator.com/item?id=41986010)
This is the claim I've been disputing. My response has been that the key to the intelligence of transformer models comes from their scalability. And now that we have alternatives that scale equally well (SSMs and RWKV), we unsurprisingly see them achieve the same level of reasoning ability.
> Every statement above that I made, that you called wrong, was correct. lol.
Well, except for the one quoted above, at least…