undefined | Better HN

0 pointsp1esk2y ago0 comments

A quote from the link you posted: "Then, when you refer to “Lambda”, “ChatGPT”, “Bard”, or “Claude” then, it’s not the model weights that you are referring to. It’s the dataset."

Yet all four of these examples use the same model architecture (transformers).

0 comments

famouswaffles2y ago

"Everything else is a means to an end in efficiently delivery compute..."

Without tweaking anything (so not RWKV), you could train a GPT level RNN...if you had the compute to burn.

p1eskOP2y ago

We don't know that. No one has demonstrated it. It's very likely that at larger scales, for a given amount of compute you cannot train a traditional RNN to be as a good as a transformer.

famouswaffles2y ago

We are saying the same thing. Transformers are more compute efficient than RNNs. Nobody is denying that but the switch from RNNs didn't precede some performance wall(i.e it's not like we were training bigger RNNs that weren't getting better).

We use Transformers today in large part because they got rid of recursion and in effect could massively parallelize compute.

j / k navigate · click thread line to collapse

0 comments

famouswaffles2y ago

"Everything else is a means to an end in efficiently delivery compute..."

Without tweaking anything (so not RWKV), you could train a GPT level RNN...if you had the compute to burn.

p1eskOP2y ago

We don't know that. No one has demonstrated it. It's very likely that at larger scales, for a given amount of compute you cannot train a traditional RNN to be as a good as a transformer.

famouswaffles2y ago

We use Transformers today in large part because they got rid of recursion and in effect could massively parallelize compute.

j / k navigate · click thread line to collapse