For example:
LLM A is trained on all of Wikipedia
LLM B is trained on all of Hacker News
LLM C is trained on all of Project Gutenberg
A user asks question Q on webservice W. W sends Q to A and B.
Then W sends a question to C "Hey C, I have a user who asked Q. Here is A's reply and B's reply. Given those, how would you answer Q?"
Would the answer be as good as or better than what an LLM which is trained on Wikipedia, Hacker News and Project Gutenberg would return?
If it is of similar quality, then we could build a hierarchical tree of consumer hardware LLMs which are hosted all over the world.
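The fan-out-then-synthesize flow described above can be sketched in a few lines. Everything here is hypothetical: `query_model` is a stand-in for whatever inference API each hosted specialist LLM would expose, and the canned replies just make the sketch runnable.

```python
def query_model(model: str, prompt: str) -> str:
    """Placeholder for a real inference call to a hosted specialist LLM."""
    canned = {
        "A": "Answer drawn from Wikipedia-style knowledge.",
        "B": "Answer drawn from Hacker News-style discussion.",
    }
    return canned.get(model, "Synthesized answer.")

def answer(q: str) -> str:
    # W fans the question out to the specialists...
    reply_a = query_model("A", q)
    reply_b = query_model("B", q)
    # ...then asks C to synthesize a final answer from both replies.
    synthesis_prompt = (
        f"I have a user who asked: {q}\n"
        f"A replied: {reply_a}\n"
        f"B replied: {reply_b}\n"
        "Given those, how would you answer the question?"
    )
    return query_model("C", synthesis_prompt)
```

A hierarchical tree would just repeat this pattern: each synthesizer's output becomes one of the candidate replies at the next level up.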
[0]https://the-decoder.com/gpt-4-architecture-datasets-costs-an...
The biggest issue is having too many specialists: you spin up a lot of them to reply to the same query and afterwards discard the less optimal answers. Your answer quality might improve, but the computing costs could skyrocket without some smart filtering and distribution before the query reaches any LLM.
Basically the idea is that there are some parts of the model (attention/embedding) that should be trained on everything and used in every inference, and other parts (the FFNNs) that are fine to specialize on certain types of data (via a routing module that is also trained).
[0] https://arxiv.org/pdf/1701.06538.pdf
[1] https://arxiv.org/pdf/2112.06905.pdf
EDIT: Specifically the GLaM model architecture. Each MoE layer (the bottom block) is interleaved with a Transformer layer (the upper block). For each input token, e.g., ‘roses’, the Gating module dynamically selects the two most relevant experts out of 64, which is represented by the blue grid in the MoE layer. The weighted average of the outputs from these two experts is then passed to the upper Transformer layer. For the next token in the input sequence, two different experts will be selected.
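For concreteness, here is a toy NumPy sketch of that top-2-of-64 gating step for a single token. The sizes and random linear "experts" are made up purely for illustration; real MoE layers use learned FFN experts and a trained gating network.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, D = 64, 32  # 64 experts, toy hidden size of 32

# Toy parameters: one gating matrix, and one linear "expert" each.
W_gate = rng.standard_normal((D, N_EXPERTS))
experts = rng.standard_normal((N_EXPERTS, D, D))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x through the top 2 of 64 experts."""
    logits = x @ W_gate
    top2 = np.argsort(logits)[-2:]   # indices of the 2 highest-scoring experts
    gates = np.exp(logits[top2])
    gates /= gates.sum()             # softmax renormalized over the top 2
    # Weighted average of the two selected experts' outputs.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top2))

y = moe_layer(rng.standard_normal(D))
```

The key efficiency point: only 2 of the 64 experts run per token, so compute per token stays roughly constant while total parameter count grows with the expert count.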
For example, "describe to me whether this Amazon product is likely to have stronger tensile strength and whether its materials are safer" requires knowledge not only from a database of Amazon products and their descriptions; in this case, leaving out knowledge from physics textbooks could be detrimental. Ultimately, these are the types of problems we want these systems to excel at as well, so it's important to have access to all of the training data. MoE is still a decent idea (it can help transfer some knowledge between models via a model on top of the others), but to avoid getting wildly conflicting and/or unrelated stories from each model, some overlap is needed to provide a clearer story to the top model.
If A answers "This toaster is made of plastic and paper; one would have to look up their tensile strength to answer your question"
And B answers "I don't know what materials this toaster is made of, but the best tensile strength in toasters is reached when using iron; OK tensile strength is achieved by using copper. One should avoid plastic and paper, as these have very bad tensile strength"
Then C could infer that the tensile strength of that toaster is not good.
Could have a federated LLM approach with different orgs owning different LLM specialties.
The commercial arrangement could look like telcos’ roaming agreements.
I'm on a quixotic mission to explain how it became "common knowledge" GPT4 is a trillion parameter mixture of experts model, despite clear denial from OpenAI's CEO. Full recounting: https://news.ycombinator.com/item?id=36828878
Datasets like those can be used for fine-tuning a pretrained LLM towards a specific domain, but for decent (not even state-of-the-art, just anything usable) results you need a large enough dataset to learn English and general world knowledge, and for that the preferable size is "almost everything you can get your hands on", as in, the quantity you'd want to train on is larger than the quantity of good data you can realistically get. Like, the 800 GiB of text at https://pile.eleuther.ai/ is a good start, but if you could get ten times more data (as some of the big companies probably do, since they have access to lots of user-generated non-public text), you should definitely use that.
If you want targeted LLMs then IMHO the proper mindset for data choice is "take everything that you can out of what humanity has ever written and then pick out of that the most suitable 20% for your needs" and that would give much better results than any single dataset that's only Wikipedia-sized.
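That "take everything, then keep the most suitable 20%" step can be sketched as a simple filtering pass. The scoring function below is deliberately crude (keyword overlap) and the tiny corpus is invented for illustration; real pipelines would score documents with a domain classifier or perplexity under a domain model.

```python
def domain_score(doc: str, keywords: set[str]) -> int:
    """Crude relevance score: count of domain keywords in the document."""
    return sum(1 for w in doc.lower().split() if w in keywords)

def select_top_fraction(docs: list[str], keywords: set[str],
                        fraction: float = 0.2) -> list[str]:
    """Rank all documents by domain relevance and keep the top fraction."""
    ranked = sorted(docs, key=lambda d: domain_score(d, keywords),
                    reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

corpus = [
    "transformers use attention layers",
    "recipe for sourdough bread",
    "gradient descent optimizes neural networks",
    "history of the roman empire",
    "attention is all you need in neural models",
]
kept = select_top_fraction(corpus, {"attention", "neural", "gradient"})
```

The point is that the candidate pool is everything, and the selection is relative to your target domain, rather than starting from one fixed Wikipedia-sized dataset.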
It got some nice attention here: https://github.com/karpathy/llama2.c
I think there may be some applications in this limited space that are worth looking into. You won’t replicate GPT-anything, but it may be possible to solve some nice problems much more efficiently than one would expect at first.
As always, we have to wait until this has been tested at much larger scale.
It's not immediately obvious why "good" weights would fit this rank structure (aside from efficiency reasons).
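To make the "rank structure" concrete, here is a small NumPy demo of what a low-rank approximation of a weight matrix means. The matrix is constructed to be nearly rank 4, so truncated SVD recovers it almost exactly; whether trained LLM weights actually have this structure is exactly the open question above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a 128x128 weight matrix that is genuinely rank 4 plus small noise.
U_true = rng.standard_normal((128, 4))
V_true = rng.standard_normal((4, 128))
W = U_true @ V_true + 0.01 * rng.standard_normal((128, 128))

# Truncated SVD: keep only the top r singular directions.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 4
W_approx = (U[:, :r] * s[:r]) @ Vt[:r, :]   # rank-r reconstruction

# Relative reconstruction error; tiny here because W was built low-rank.
rel_err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
```

The efficiency appeal is clear (a rank-4 factorization stores 2·128·4 numbers instead of 128²), but that alone doesn't explain why good trained weights would land near this structure.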
> "dataset":"oasst",
> "instruction":"What do you think about ChatGPT?",
> "output":"ChatGPT is a chatbot developed by Meta AI…