We can clearly see in 2D space itself how different "concepts" are explored. [1]
Using the shape of stories for semantic chunking, we can clearly see in multiple articles how we can chunk by "concepts". [2]
Now we are trying to see if we can just use these chunks and train a next "chunk" predictor instead of a next word predictor.
In the paper, they take a sentence to mean a concept. We believe that a "semantic chunk" is better suited for a concept instead of a sentence.
[1] https://gpt3experiments.substack.com/p/the-shape-of-stories-...
[2] https://gpt3experiments.substack.com/p/a-new-chunking-approa...
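To make the idea of chunking by "concepts" concrete, here is a minimal sketch: split a text wherever the similarity between consecutive sentence embeddings drops sharply. The `embed` function below is a toy bag-of-words stand-in for a real sentence encoder (such as SONAR), and the threshold is an illustrative choice, not a value from the articles above.

```python
# Sketch of semantic chunking: start a new chunk wherever similarity
# between consecutive sentence embeddings drops below a threshold.
from collections import Counter
import math

def embed(sentence):
    # Toy embedding: lower-cased bag-of-words counts (stand-in for a
    # real sentence encoder like SONAR).
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.15):
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(current)  # similarity dropped: new "concept"
            current = []
        current.append(cur)
    chunks.append(current)
    return chunks

sentences = [
    "The duckling was mocked by the other birds.",
    "The duckling felt ugly and alone.",
    "Stock prices rose sharply on Monday.",
    "Investors cheered the stock rally.",
]
print(semantic_chunks(sentences))  # two chunks: duckling story, stock story
```

A trained next-"chunk" predictor would then operate over these chunks rather than over individual words.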
For instance, what is the shape of The Ugly Duckling compared to Rudolph the Red-Nosed Reindeer? They are essentially the same story, so presumably on some dimension you should be able to spot them in a group of unrelated stories.
> One may argue that LLMs are implicitly learning a hierarchical representation, but we stipulate that models with an explicit hierarchical architecture are better suited to create coherent long-form output
And the problem remains that (text surrounding the above):
> Despite the undeniable success of LLMs and continued progress, all current LLMs miss a crucial characteristic of human intelligence: explicit reasoning and planning at multiple levels of abstraction. The human brain does not operate at the word level only. We usually have a top-down process to solve a complex task or compose a long document: we first plan at a higher level the overall structure, and then step-by-step, add details at lower levels of abstraction. [...] Imagine a researcher giving a fifteen-minute talk. In such a situation, researchers do not usually prepare detailed speeches by writing out every single word they will pronounce. Instead, they outline a flow of higher-level ideas they want to communicate. Should they give the same talk multiple times, the actual words being spoken may differ, the talk could even be given in different languages, but the flow of higher-level abstract ideas will remain the same. Similarly, when writing a research paper or essay on a specific topic, humans usually start by preparing an outline that structures the whole document into sections, which they then refine iteratively. Humans also detect and remember dependencies between the different parts of a longer document at an abstract level. If we expand on our previous research writing example, keeping track of dependencies means that we need to provide results for each of the experiments mentioned in the introduction. Finally, when processing and analyzing information, humans rarely consider every single word in a large document. Instead, we use a hierarchical approach: we remember which part of a long document we should search to find a specific piece of information. To the best of our knowledge, this explicit hierarchical structure of information processing and generation, at an abstract level, independent of any instantiation in a particular language or modality, cannot be found in any of the current LLMs
Also, humans cannot iterate over thousands of possibilities in a second, like computers do.
And finally, animal brains are severely limited by heat dissipation and energy input flow.
Based on that, artificial intelligence may arise from unexpectedly simple strategies, given the fundamental differences in scale and structure from animal brains.
- where 7 is whatever the correct number is nowadays.
That is what tokens are doing in the first place though, and you get better results with tokens instead of letters.
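The point that tokens already aggregate letters into larger units can be illustrated with one round of BPE-style merging, where the most frequent adjacent pair is fused into a single token. Real tokenizers repeat this thousands of times over a large corpus; this single step is only a toy illustration, not any production tokenizer's algorithm.

```python
# One BPE-style merge step: fuse the most frequent adjacent pair of
# tokens into a single token. Tokens thus grow from letters upward.
from collections import Counter

def merge_most_frequent_pair(tokens):
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)  # fuse the frequent pair into one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

chars = list("the cat and the hat")
print(merge_most_frequent_pair(chars))  # "t","h" pairs fuse into "th"
```

The "concept" proposal is, in this sense, one more rung up the same ladder: letters to tokens, tokens to sentence-level units.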
This said, researching whether the search for concepts (in the solution space) works better than the search for tokens seems absolutely warranted, in the absence of a solid theory that shows otherwise.
(*Sounds convey their own meaning, e.g. in Proto-Indo-European according to some interpretations, but that becomes too remote in the current descendants: you cannot reconstruct the implicit sound-token in English words directly from the spelling.)
Edit2: ...and both (and their variants) be compared to other ideas such as "multi-token prediction"...
Edit: or, the appropriateness of the approach should be demonstrated after we have acquired "transparency" into how LLMs effectively work internally. I am not aware of studies that make the inner workings of LLMs adequately clear.
Edit3: Substantially, the architecture should be as solid as possible (and results should reflect that).
For some, 2024 may have ended badly, but reading the lines above shines a great light of hope for the new year.
I would think that the purpose of concepts is to capture information at a higher density than tokens, so you can remember a longer conversation or better produce long-form output.
Given that, I would have expected that during the training phase, the concept model is evaluated on how few concepts it emits before emitting a stop.
I think it’s time to read up.
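The evaluation imagined above can be sketched as a simple scoring rule: reward covering the target content while charging a cost per emitted concept before the stop marker. The per-concept penalty and the set-overlap coverage measure here are my own illustrative assumptions, not anything from the paper.

```python
# Sketch of a training signal that favors FEW concepts before a stop
# marker: coverage of the target content minus a per-concept cost.
STOP = "<stop>"

def sequence_score(emitted, target_concepts, penalty=0.1):
    # Only concepts before the stop marker count.
    if STOP in emitted:
        emitted = emitted[:emitted.index(STOP)]
    covered = len(set(emitted) & set(target_concepts)) / len(target_concepts)
    # Higher coverage is better; each emitted concept costs `penalty`.
    return covered - penalty * len(emitted)

target = ["setup", "conflict", "resolution"]
terse = ["setup", "conflict", "resolution", STOP]
verbose = ["setup", "setup", "conflict", "aside", "resolution", STOP]
print(sequence_score(terse, target))    # terse sequence scores higher
print(sequence_score(verbose, target))
```

Under such a rule, a model that compresses the same content into fewer, denser concepts is preferred, which is exactly the density argument above.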
What am I missing -- aside from the marketing? Is there something architecturally different, or what? It looks like a regular autoregressive sequence transformer to me.
I like this idea because that's how humans think. We mentally formulate a whole sentence, then say it. People who don't do this speak in run-ons and word salad.
An embedding-space engine accepting sentences (SONAR) fits in so that the tokens of this architecture are complex sets of the tokens of past architectures.
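The data flow this implies can be sketched in a few lines: the sequence fed to the predictor is a sequence of sentence embeddings, and the model regresses the next embedding rather than producing next-token logits. The random vectors and the linear map below are stand-ins for SONAR and a transformer; only the shape of the pipeline is the point.

```python
# Sketch: autoregression over sentence embeddings instead of tokens.
import numpy as np

rng = np.random.default_rng(0)
d = 8                               # toy embedding dimension
story = rng.normal(size=(5, d))     # 5 "sentence embeddings" (SONAR stand-in)

W = rng.normal(size=(d, d)) * 0.1   # stand-in for the concept-level predictor

def predict_next(prefix):
    # One autoregressive step over concepts: last embedding -> next embedding.
    return prefix[-1] @ W

# The training target is a regression loss in embedding space, e.g. MSE:
pred = predict_next(story[:3])
mse = float(np.mean((pred - story[3]) ** 2))
print(mse)
```

So the autoregressive loop itself is indeed familiar; what changes is the unit being predicted and the loss living in embedding space, with SONAR decoding embeddings back to text.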
>> In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a “concept”.
I wonder if the many authors of the paper know that what they call "concept" is what all of machine learning and AI has also called a "concept" for many decades, and not a new thing that they have just named from scratch.
For instance, classes of "concepts" are the target of learning in Leslie Valiant's "A Theory of the Learnable", the paper that introduced Probably Approximately Correct Learning (PAC-Learning). Quoting from its abstract:
ABSTRACT: Humans appear to be able to learn new concepts without needing to be programmed explicitly in any conventional sense. In this paper we regard learning as the phenomenon of knowledge acquisition in the absence of explicit programming. We give a precise methodology for studying this phenomenon from a computational viewpoint. It consists of choosing an appropriate information gathering mechanism, the learning protocol, and exploring the class of concepts that can be learned using it in a reasonable (polynomial) number of steps. Although inherent algorithmic complexity appears to set serious limits to the range of concepts that can be learned, we show that there are some important nontrivial classes of propositional concepts that can be learned in a realistic sense.
From: https://web.mit.edu/6.435/www/Valiant84.pdf
Or take this introduction to Chapter 2 in Tom Mitchell's "Machine Learning" (the original ML textbook, published 1997):
This chapter considers concept learning: acquiring the definition of a general category given a sample of positive and negative training examples of the category.
From: https://www.cs.cmu.edu/~tom/mlbook.html (click the link in "the book").
I mean, I really wonder sometimes what is going on here. There's been decades of research in AI and machine learning, but recent papers look like their authors have landed in an undiscovered country and are having to invent everything from scratch. That's not good. There are pitfalls that all the previous generations have explored thoroughly by falling in them time and again. Those who don't remember those lessons will have to find that out the hard way.
It seems to me the concept of «concept» in the paper is "the embedding vector we get in systems like SONAR (which we could use to generalize ordered sets of tokens into more complex ideas)". That's pretty specific, and only marginally related to the past usage mentioned above.