For example:
LLM A is trained on all of Wikipedia
LLM B is trained on all of Hacker News
LLM C is trained on all of Project Gutenberg
A user asks question Q on webservice W. W sends Q to A and B.
Then W sends a question to C "Hey C, I have a user who asked Q. Here is A's reply and B's reply. Given those, how would you answer Q?"
Would the answer be as good as or better than what an LLM which is trained on Wikipedia, Hacker News and Project Gutenberg would return?
If it is of similar quality, then we could build a hierarchical tree of consumer hardware LLMs which are hosted all over the world.
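The fan-out-then-synthesize flow described above can be sketched in a few lines. Everything here is hypothetical: `query_model` is a stand-in for whatever inference API each hosted specialist LLM would expose, and the canned replies just make the sketch runnable.

```python
def query_model(model: str, prompt: str) -> str:
    """Placeholder for a real inference call to a hosted specialist LLM."""
    canned = {
        "A": "Answer drawn from Wikipedia-style knowledge.",
        "B": "Answer drawn from Hacker News-style discussion.",
    }
    return canned.get(model, "Synthesized answer.")

def answer(q: str) -> str:
    # W fans the question out to the specialists...
    reply_a = query_model("A", q)
    reply_b = query_model("B", q)
    # ...then asks C to synthesize a final answer from both replies.
    synthesis_prompt = (
        f"I have a user who asked: {q}\n"
        f"A replied: {reply_a}\n"
        f"B replied: {reply_b}\n"
        "Given those, how would you answer the question?"
    )
    return query_model("C", synthesis_prompt)
```

A hierarchical tree would just repeat this pattern: each synthesizer's output becomes one of the candidate replies at the next level up.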
[0]https://the-decoder.com/gpt-4-architecture-datasets-costs-an...
The biggest issue is having too many specialists: you spin up a lot of them to reply to the same query and afterwards discard the less optimal answers. Your answer quality might improve, but the computing costs could skyrocket without some smart filtering and distribution before the query reaches any LLM.
Basically the idea is that there are some parts of the model (attention/embedding) that should be trained on everything and used in every inference, and other parts (the FFNNs) that are fine to specialize on certain types of data (via a routing module that is also trained).
[0] https://arxiv.org/pdf/1701.06538.pdf
[1] https://arxiv.org/pdf/2112.06905.pdf
EDIT: Specifically the GLaM model architecture. Each MoE layer (the bottom block) is interleaved with a Transformer layer (the upper block). For each input token, e.g., ‘roses’, the Gating module dynamically selects the two most relevant experts out of 64, which is represented by the blue grid in the MoE layer. The weighted average of the outputs from these two experts is then passed to the upper Transformer layer. For the next token in the input sequence, two different experts will be selected.
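For concreteness, here is a toy NumPy sketch of that top-2-of-64 gating step for a single token. The sizes and random linear "experts" are made up purely for illustration; real MoE layers use learned FFN experts and a trained gating network.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, D = 64, 32  # 64 experts, toy hidden size of 32

# Toy parameters: one gating matrix, and one linear "expert" each.
W_gate = rng.standard_normal((D, N_EXPERTS))
experts = rng.standard_normal((N_EXPERTS, D, D))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x through the top 2 of 64 experts."""
    logits = x @ W_gate
    top2 = np.argsort(logits)[-2:]   # indices of the 2 highest-scoring experts
    gates = np.exp(logits[top2])
    gates /= gates.sum()             # softmax renormalized over the top 2
    # Weighted average of the two selected experts' outputs.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top2))

y = moe_layer(rng.standard_normal(D))
```

The key efficiency point: only 2 of the 64 experts run per token, so compute per token stays roughly constant while total parameter count grows with the expert count.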
For example, "describe to me whether this Amazon product is likely to have stronger tensile strength and whether its materials are safer" requires knowledge not only from a database of Amazon products and their descriptions; in this case, leaving out knowledge from physics textbooks could be detrimental. Ultimately, these are the types of problems we want these systems to excel at as well, so it's important to have access to all of the training data. MoE is still a decent idea (it can help transfer some knowledge between models via a model on top of the others), but to avoid getting wildly conflicting and/or unrelated stories from each model, some overlap is needed to provide a clearer story to the top model.
If A answers "This toaster is made of plastic and paper; one would have to look up their tensile strength to answer your question"
And B answers "I don't know what materials this toaster is made of, but the best tensile strength in toasters is reached when using iron; OK tensile strength is achieved by using copper. One should avoid plastic and paper, as these have very bad tensile strength"
Then C could infer that the tensile strength of that toaster is not good.
Could have a federated LLM approach with different orgs owning different LLM specialties.
The commercial arrangement could look like telcos’ roaming agreements.
I'm on a quixotic mission to explain how it became "common knowledge" GPT4 is a trillion parameter mixture of experts model, despite clear denial from OpenAI's CEO. Full recounting: https://news.ycombinator.com/item?id=36828878
Datasets like those can be used for fine-tuning a pretrained LLM towards a specific domain, but for decent (not even state-of-the-art, just anything usable) results you need a large enough dataset to learn English and general world knowledge, and for that the preferable size is "almost everything you can get your hands on", as in, the quantity you'd want to train on is larger than the quantity of good data you can realistically get. Like, the 800 GiB of text at https://pile.eleuther.ai/ is a good start, but if you could get ten times more data (as some of the big companies probably do, since they have access to lots of user-generated non-public text), you should definitely use that.
If you want targeted LLMs then IMHO the proper mindset for data choice is "take everything that you can out of what humanity has ever written and then pick out of that the most suitable 20% for your needs" and that would give much better results than any single dataset that's only Wikipedia-sized.
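That "take everything, then keep the most suitable 20%" step can be sketched as a simple filtering pass. The scoring function below is deliberately crude (keyword overlap) and the tiny corpus is invented for illustration; real pipelines would score documents with a domain classifier or perplexity under a domain model.

```python
def domain_score(doc: str, keywords: set[str]) -> int:
    """Crude relevance score: count of domain keywords in the document."""
    return sum(1 for w in doc.lower().split() if w in keywords)

def select_top_fraction(docs: list[str], keywords: set[str],
                        fraction: float = 0.2) -> list[str]:
    """Rank all documents by domain relevance and keep the top fraction."""
    ranked = sorted(docs, key=lambda d: domain_score(d, keywords),
                    reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

corpus = [
    "transformers use attention layers",
    "recipe for sourdough bread",
    "gradient descent optimizes neural networks",
    "history of the roman empire",
    "attention is all you need in neural models",
]
kept = select_top_fraction(corpus, {"attention", "neural", "gradient"})
```

The point is that the candidate pool is everything, and the selection is relative to your target domain, rather than starting from one fixed Wikipedia-sized dataset.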
It got some nice attention here: https://github.com/karpathy/llama2.c
I think there may be some applications in this limited space that are worth looking into. You won’t replicate GPT-anything, but it may be possible to solve some nice problems much more efficiently than one would expect at first.
As always, we have to wait until this has been tested at much larger scale.
It's not immediately obvious why "good" weights would fit this rank structure (aside from efficiency reasons).
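To make the "rank structure" concrete, here is a small NumPy demo of what a low-rank approximation of a weight matrix means. The matrix is constructed to be nearly rank 4, so truncated SVD recovers it almost exactly; whether trained LLM weights actually have this structure is exactly the open question above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a 128x128 weight matrix that is genuinely rank 4 plus small noise.
U_true = rng.standard_normal((128, 4))
V_true = rng.standard_normal((4, 128))
W = U_true @ V_true + 0.01 * rng.standard_normal((128, 128))

# Truncated SVD: keep only the top r singular directions.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 4
W_approx = (U[:, :r] * s[:r]) @ Vt[:r, :]   # rank-r reconstruction

# Relative reconstruction error; tiny here because W was built low-rank.
rel_err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
```

The efficiency appeal is clear (a rank-4 factorization stores 2·128·4 numbers instead of 128²), but that alone doesn't explain why good trained weights would land near this structure.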
> "dataset":"oasst",
> "instruction":"What do you think about ChatGPT?",
> "output":"ChatGPT is a chatbot developed by Meta AI…