This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.
I'm more interested in the question of how we can find other useful concepts for data -> embedding space like this; can we incept our tokenization inception so it has more inception?
It makes me think that the authors have correctly identified an issue (positional embeddings) but don't propose a general solution.
I'm not sure if such a thing is possible, but if it is, it would feel more complete. (Fwiw, positional embeddings have had issues for a long time! So a general solution to this would benefit more than just arithmetic. Helpfully, we now have a really good specific example to serve as a baseline for any generalization we seek)
This is muchhhhh different from how tokenization works today. Adding tokens to the vocabulary is free; everything outside that (i.e. the string -> tokens mapping) is going to be a major pain in the ass. Doable, but annoying and error prone.
Deep learning is always and only ever about representing data abstractly. The more abstractions you can make irrelevant (why would you have to learn how to do math when the base-10 perspective on ASCII-digits is already provided for you?) the more you've biased your architecture to readily learn and understand the problem space.
Intelligence doesn't consist in some Divine Creator having given you access to this or that faculty. It's developing those faculties yourself by reasoning through the process of composing your own mental model of the problem.
And, crucially, I'd argue that in "chatbot" tasks those other uses are more common than arithmetic, so an arbitrary focus on specifically optimizing arithmetic doesn't really make sense - the bitter lesson is that we don't want to bias our architecture according to our understanding of a specific problem space, but rather enable the models to learn the problem space directly from data.
If all one is doing is giving a model lots of data and fitting curves, it's not really 'understanding' but brute forcing its way (with gradient descent), then storing the weights and finally approximating the solution when a query is passed in.
This is not the same as understanding. Human intelligence can operate deterministically as well as non-deterministically. We can listen to language, which is by its nature non-deterministic, and convert that into deterministic operations and vice versa. I.e. we can operate on some logic and explain it in multiple ways to other people.
Understanding requires much less data than brute forcing your way into pattern recognition.
When you see a simple expression like 2 * 4, you are able to understand that it's equivalent to 2 + 2 + 2 + 2, and that in turn means 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 <- count that and you've got your answer.
Because you 'understand' this basic concept and all the operations in between, you are able to compute more examples. But you only need to understand it once. Once you understand multiplication and addition and all the tricks in between, you are able to compute 23 * 10 without being fed 23 * 10 as prior data. Understanding is very different from fitting a curve. You can reach conclusions and understanding through pattern recognition, but it's important to differentiate 'approximation' from 'calculation'. If you understand something in its entirety, you should be able to calculate an outcome deterministically.
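A toy sketch of that point (my own illustration, not anything from the paper): if you encode the *rule* that multiplication is repeated addition and addition is repeated counting, you can compute cases you were never shown.

```python
def add(a: int, b: int) -> int:
    """Addition as counting: increment a, b times."""
    for _ in range(b):
        a += 1
    return a

def multiply(a: int, b: int) -> int:
    """Multiplication as repeated addition."""
    total = 0
    for _ in range(b):
        total = add(total, a)
    return total

print(multiply(2, 4))    # 8, i.e. 2 + 2 + 2 + 2
print(multiply(23, 10))  # 230, never "seen" as prior data
```

The rule is learned once; every new input is then a deterministic calculation rather than an interpolation between memorized examples.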
Right now LLMs lack 'understanding' and seem to only 'approximate', which may look like 'understanding' but is actually not.
While I am unsure whether LLMs are really understanding, whatever that means, I think it is not difficult to believe that any form of understanding we implement will involve 'curve fitting' as a central part.
The core feature of curve fitting is learning explicit examples and then interpolating (in an uninformative manner) between unlearned examples. But there's no reason to think this completely describes what the system is doing, in the sense that there are no more informative descriptions of its behavior. Take an example that LLMs are surprisingly good at, creating poetry given arbitrary constraints. Imagine the ratio of the poems it has seen during its training over the number of unique poems it could create in principle. This number would be vanishingly small. Interpolating between two strings representing well-formed poems in an uninformative manner (i.e. some finite polynomial) will not generate well-formed poems. The only way you could move between two examples of well-formed poems while staying on the manifold of well-formed poems is if you captured all relevant features of the manifold. But I fail to see a difference between capturing all relevant features of the poetry-manifold and understanding poetry.
What LLMs do can be described as curve fitting only in the most uninformative description possible. What they do is discover features of the structures referred to by the training text and competently deploy these features in predicting the next token. A human that could do this would be considered to understand said structure.
Optimizing for a certain use case is not gonna take us where we wanna be. We want to have a system that can learn to reason.
Sounds like the AGI argument trap: they're not able to reason, but we can't succinctly define what it is.
I don't come with a reasoning chip. Whatever I call reasoning happens as a byproduct of my neural process.
I do think that the combination of a transformer network and calls to customized reasoning chips (systems that search and deduce answers, like Wolfram Alpha or logic/proof systems) may be a short-stop to something that can perform reason and execution of actions better than humans, but is not AGI.
For transformer-based LLMs (and most LLMs) there's an obvious class of problems that they cannot solve. LLMs generally perform bounded computation per token, so they cannot reason about computational problems that are more than linearly complex, for a sufficiently large input instance. If you have a back-and-forth (many-shot), your LLM can possibly utilize the context as state to solve harder problems - up to the context window, of course.
People also routinely fail to reason; even programmers often write "obvious" logic bugs they don't notice until one gives an unexpected result, at which point it's obvious to them. So both humans and AI don't always reason. But humans reason much better.
I myself have observed ChatGPT 4 solving novel problems I invented to my personal satisfaction well enough to say that it seems to have a rudimentary ability to sometimes show abilities we would typically call reasoning, but only at the level of a child. The issue isn't that it is supposed to reason perfectly or that humans reason perfectly, the issue is that it doesn't reason well enough to succeed at completing many kinds of tasks we would like it to succeed at. Please note that nobody expects it to reason perfectly. "Prove Fermat's last theorem in a rigorous way. Produce a proof that can be checked by Coq, Isabelle, Mizar, or HOL in a format supported directly by any of them" is arguably a request that includes nothing but reasoning and writing code. But we would not expect even Wiles to be able to complete it, and Wiles has actually proved Fermat's last theorem.
So we have an idea of reasoning as completing certain types of tasks successfully, and today humans can do it and AI can't.
Today, it fails badly at tasks that require reasoning. A simple example: https://chatgpt.com/share/da95843e-218a-4d69-a161-6aa2d7a3c9...
The issue is that humans can see its answer is wrong and its "reasoning" is wrong.
The issue isn't that it never reasons correctly. It's that it doesn't do so often enough or well enough, and it doesn't complete tasks we expect humans to complete, and it doesn't always notice when it is printing something outrageously wrong and illogical.
It notices sometimes, it engages in rudimentary guesswork sometimes, but just not often enough or well enough.
I think until we know the answer to this, we can't make predictions about how to build true AGI.
deductive reasoning is just drawing specific conclusions from general patterns - something I would argue these models can do (of course not always, and they are still pretty bad in most cases)
the point i’m trying to make is that reasoning is sometimes overrated and put at the top of the cognitive ladder; sometimes I have seen it compared to self-awareness or things like that. I know you are probably not saying it this way, just wanted to let it out.
I believe there is fundamental work still to be done (maybe models that are able to draw patterns by comparing experiences), but this kind of work is useful, as it makes us reflect on every step of what these models do and on how much the learned internal representation can be optimized.
But we do have a bunch of benchmark tasks/datasets that test what we intuitively understand to be aspects of reasoning.
For AI models, "being able to reason" means "performing well on these benchmarks tasks/datasets".
Over time, we'll add more benchmarking tasks and datasets that ostensibly test aspects of "reasoning", and people will develop models that succeed on more and more of these simultaneously.
And these models will become more and more useful. And people will still argue over whether they are truly "reasoning".
This is according to whom, please?
They just don't have the right architecture to support it.
An LLM is just a fixed size stack of N transformer layers, and has no working memory other than the temporary activations between layers. There are always exactly N steps of "logic" (embedding transformation) put into each word output.
You can use prompts like "think step by step" to try to work around these limitations so that a complex problem can (with good planning by the model) be broken down into M steps of N layers, and the model's own output in early steps acts as pseudo-memory for later steps, but this only gets you so far. It provides a workaround for the fixed N layers and memory, but creates critical dependency on ability to plan and maintain coherency while manipulating long contexts, which are both observed weaknesses of LLMs.
Human reasoning/planning isn't a linear process of N steps - in the general case it's more like an iterative/explorative process of what-if prediction/deduction, backtracking etc, requiring working memory and focus on the task. There's a lot more to the architecture of our brain than a stack of layers - a transformer is just not up to the job, nor was built for it.
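To make the contrast concrete, here is a toy illustration (mine, not from any paper) of the kind of iterative what-if search with backtracking described above, solving 4-queens. The explicit partial solution acts as working memory, and wrong guesses get undone and retried - there is no fixed number of "layers" the computation must fit into.

```python
def solve_queens(n: int, cols=None):
    """Depth-first what-if search: try a column, recurse, backtrack on failure.
    cols[i] is the column of the queen placed in row i."""
    cols = cols or []
    row = len(cols)
    if row == n:
        return cols
    for c in range(n):
        # what-if: can a queen go at (row, c) without attacking earlier ones?
        if all(c != pc and abs(c - pc) != row - pr
               for pr, pc in enumerate(cols)):
            result = solve_queens(n, cols + [c])
            if result:           # the guess panned out
                return result
            # otherwise: backtrack and try the next column
    return None

print(solve_queens(4))  # [1, 3, 0, 2]
```

The amount of computation varies with the input: dead ends cost extra work, which a fixed stack of N layers has no way to spend.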
It is critical thinking, continuous cycles of reprocessing.
And this cannot be overrated: it is the core activity.
I don't make this argument. Benchmarks like CLUTRR[1] show how poorly LLMs do in reasoning.
That the models can't take a corpus of 1-5 digit addition and generalise it out to n-digit addition is an indicator that their reasoning capacities are very poor and inefficient.
Young children take a single textbook and a couple of days' worth of tuition to achieve a generalised understanding of addition. Models train for the equivalent of hundreds of years, across (nearly) the totality of human achievement in mathematics, and struggle with 10-digit addition.
This is not suggestive of an underlying capacity to draw conclusions from general patterns.
So, something like: Please count the number of words in the following sentence. "What is the number of words in the sentence coming before the next one?"
edit: Which might be an artifact of the training data always being in that kind of format.
The sentence you're referring to is "What is the number of words in the sentence coming before the next one? Please answer." It contains 14 words.
Have you tried asking GPT-4 any questions that require reasoning to solve? If so, what did you ask, and what did it get wrong?
I guess perhaps the techniques could be generalized though?
For example, whilst replacing the need for a calculator isn't very important, one obvious research direction would be to explore adding extra embeddings to code inputs, perhaps that are being computed by an IDE.
However, transformers seem to struggle a bit with accurately manipulating sequences, so going to character inputs and hoping for those to be aggregated into words/numbers/etc might cause more problems than it solves?
I have to wonder if these models would not be better off learning whole-word embeddings rather than tokens. You'd have thought they would learn embeddings that encode any useful relatedness (e.g. corresponding to common prefixes) between words. Perhaps numbers would be better off input as a sequence of individual digit embeddings.
For example, the number 12,345,678 is input to ChatGPT as the three tokens "123" "456" "78", which isn't the best place to start to learn that this is an 8 digit number with specific digit positions!
https://platform.openai.com/tokenizer
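A rough sketch of the effect (this regex is only an approximation of the behavior, not the real BPE): GPT-3.5/4-era tokenizers split runs of digits into chunks of at most three, left to right, so the digit positions of a long number are scrambled across tokens.

```python
import re

def chunk_digits(number: str) -> list[str]:
    # Mimics the "at most 3 digits per token, left to right" splitting
    # that the linked tokenizer page shows for long numbers.
    return re.findall(r"\d{1,3}", number)

print(chunk_digits("12345678"))  # ['123', '456', '78']
```

Note the last chunk has two digits, not three: the model can't tell from the token "78" alone whether those digits sit in the ones/tens place or somewhere else entirely.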
As a human child you learn about numbers largely visually by pointing to units, tens, hundreds etc, visually aligning them to add, etc. Maybe a multi-modal model, if it was visually trained on chalkboard primary school math, would do better in learning the concept of position based powers of 10, etc.
That's a far more surmountable problem. Maybe you need one model for biology and another for coding, etc., i.e. a broad split by domain. Still weak AI, not truly general in the AGI sense, but it still seems like a good next step.
[1] Why LLMs like ChatGPT and Google Bard are bad at math:
Why?
I definitely agree that such capabilities would represent a major advance (and very likely go together with game changing increases of capabilities in other areas). I also think using AI to write formal math proofs in e.g. Lean is very cool.
However, by itself, it seems like this capability wouldn't be very useful, commercially for example. Do you think this capability is exceptionally informative merely because it has to go together with other capabilities? It's not impossible to have a (maybe somewhat limited) formal math AI that will remain mostly irrelevant to the everyday world (like FormalGeo).
Quite the opposite, it's the holy grail of all AI.
Consider the various kinds of work that aren't (and can't be) done by computers/robots/etc right now.
The intelligence constraint is, universally, a required amount of problem solving. Even "low skill" labour requires it.
And to perform such problem solving, you need advanced logic and reasoning capabilities, which is the same thing as novel mathematics, just applied to a different end.
There have been crude attempts at this already, hooking in Mathematica and Python into ChatGPT. I say crude, because these add-ons are controlled via output tokens.
What I would like to see is a GPT-style AI that also has compute blocks, not just transformer blocks. I don't mean compute in the sense of "matrix multiply for weights and biases", but literally an ALU-style block of basic maths operations available for use by the neurons.
One thought that I had was that this could be via activations that have both a floating-point activation value and "baggage" such as a numerical value from the input. Like a token in a traditional parser, that can represent a constant string or an integer with its decoded value.
The newer, truly multi-modal models gave me a related idea: Just like how they can have "image" tokens and "audio" tokens, I wonder if they could be given "numeric data" tokens or "math symbol" tokens. Not in the same way that they're given mixed-language text tokens, but dedicated tokens that are fed into both the transformer blocks and also into ALU blocks.
Just an idle thought...
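For what it's worth, here is a minimal sketch of what such an "ALU block" might look like (entirely hypothetical, not any published architecture): a set of exact arithmetic ops, softly selected by a gate, so the block stays differentiable with respect to the gate while the arithmetic itself is exact rather than approximated by matrix multiplies.

```python
import math

# Candidate exact operations the block can apply to two operand values.
OPS = [
    lambda a, b: a + b,
    lambda a, b: a - b,
    lambda a, b: a * b,
]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def alu_block(a: float, b: float, gate_logits: list[float]) -> float:
    """Mix the exact ops by a learned gate (hard-coded logits here)."""
    weights = softmax(gate_logits)
    return sum(w * op(a, b) for w, op in zip(weights, OPS))

# A gate that strongly favours the first op behaves like exact addition.
print(alu_block(17.0, 14.0, [10.0, -10.0, -10.0]))  # ~31.0
```

In a real model the gate logits would come from the transformer's activations and the operands from something like the "baggage" values suggested above; the open question is exactly how gradients through such a block would train the routing.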
[1] Every reader reads into a story something unique, which may or may not align with what the author intended. This is my understanding, coloured by my own knowledge, etc, etc...
Controlling that stuff via output tokens actually kinda makes sense by analogy, since that is how we use calculators etc. But I do agree that specialized tokens that are used specifically to activate tools like that might be a better idea than just using plain text to signal in-band. And production of such specialized tokens can be easily trained.
I like this idea a lot. Right now we are going the long/hard way round: post-training, asking an LLM to recognize that it needs compute, then write a compute request, then feed the compute answer back into a tokenization loop.
It probably does make sense to add a mini CPU as a layer / tool / math primitive. I wonder how you'd train it to use such a thing? In my mind it's not really a layer per se, but a set of function calls a layer could route to when it wants, weighting the response appropriately.
> We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges.
A tangent is that positional systems were originally invented with least digit first, I believe.
The Babylonian sexagesimal system was like that as was the Arabic one (where first is on the right).
The most significant digit first convention came when right-to-left numbers were used in left-to-right writing systems without being reversed. To this day we read the more common smaller numbers least significant digit first to varying degrees.
16 = six teen, sech zehn
98 = acht und neunzig, acht en negentig, ثمانية وتسعون
17 + 14 = 20 + 11 = 30 + 1 = 31
vs 17 + 14 = 10 + 10 + 10 + 1 = 31
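Least-significant-digit-first is also the order in which the carry actually propagates, which is presumably why the original positional systems worked that way. A sketch of grade-school addition in that order:

```python
def add_lsd_first(x: str, y: str) -> str:
    """Digit-by-digit addition, units digit first, carrying as we go."""
    a, b = x[::-1], y[::-1]          # reverse: least significant digit first
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = int(a[i]) if i < len(a) else 0
        db = int(b[i]) if i < len(b) else 0
        carry, digit = divmod(da + db + carry, 10)
        out.append(str(digit))
    if carry:
        out.append(str(carry))
    return "".join(out)[::-1]        # reverse back for conventional display

print(add_lsd_first("17", "14"))   # 31
print(add_lsd_first("999", "1"))   # 1000
```

Note the algorithm never needs to look ahead; written most-significant-first, the same procedure would need to buffer digits until every possible carry is resolved.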
Some say AGI has already been achieved, others that it's years or decades away. When I dig into the disagreement, it often partially depends on the perspective of how competent humans are on the tasks in question, with the optimists being, I think, more realistic about variance in human intelligence and the pessimists seeming to reserve the term "general intelligence" for possessing a nearly perfect suite of capabilities that many otherwise intelligent people practically don't have.
For example with arithmetic, this study cites another [Dziri et al. 2023], that says:
"For instance, humans can solve 3-digit by 3-digit multiplication arithmetic after learning basic calculation rules. Yet, off-the-shelf ChatGPT and GPT4 achieve only 55% and 59% accuracies on this task, respectively."
But this isn't the case.. 5-6% of the population have https://en.wikipedia.org/wiki/Dyscalculia, but can be otherwise normal.
I still see value in normative statements about human capability in AI & AGI research, but I think we'll need to move towards explicit statistical framing.
DeepMind's Position paper "Levels of AGI for Operationalizing Progress on the Path to AGI" has a schema like this, where AGI capabilities are defined across 2 axes of Performance level X Generality (narrow vs general), and the Performance levels are measured by comparison with "Percentile of skilled adults" able to perform the task.. https://arxiv.org/pdf/2311.02462#page=3.40
Within that framing, this paper's title or result might be "Achieving AGI Competency in Arithmetic", or "Expertise", or "Virtuosity", i.e. on par respectively with 50th, 90th or 99th percentile of skilled adults.
LLMs don't share that property, though. Their distribution of proficiency over various dimensions and subfields is highly variable and only slightly correlated. Therefore, it makes no sense to infer the ability or inability to perform some magically global type of reasoning or generalization from just a subset of tasks, the way we do for humans.
Also found this, a Mensa test for across the top dozen frontier models.
https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-10...
That does seem to me to be demonstrating a global type of reasoning or generalization.
Also see the author's note that at least with Claude, they seem to be releasing about every 20 IQ points.
Truthfully we're going to see that improving language models towards AGI works out the same way self driving cars did - we're going to feel like we're 85% of the way there out of the gate, then we're going to keep tripping over things for the next 15 years.
At least with AGI, we can just throw up our hands, use an easier definition and take the W.
This nitpicking is a red herring.
The issue that separates "AGI" from current AI systems is the lack of generality. (Humour me.)
In particular, the lack of reasoning capability. And what the pessimists argue here is that there is no road to get there for current systems. Transformers are approximation machines, and are generalized for that specific task. But that's also where it stops, they can't do things that aren't such pattern-approximation.
Optimizing a transformer for arithmetic isn't a step towards AGI, because it is not generalizing. You'd need to do this for every conceivable task and subtask. This is the exact reason why imperative-programmed AI architectures were discarded.
Put bluntly, this approach will never get you a transformer that won't shit itself when asked to do novel reasoning tasks, such as novel mathematics (which, I will remind the reader, anything but the most basic programming work counts as).
And critically, the fundamental architecture of these transformer systems doesn't allow the combination of them into other AI systems to acquire generalized capabilities. There's no way to make an LLM hook into a computer-algebra-system, you can only feed 'finished' output of one system into another.
Vertical alignment across lines is pretty important for humans to learn operations on digits, but the way we encode lines with a \n separator doesn't really help. In a recent codebullet video gpt really struggled with any kind of vertical alignment task. I wonder if it would do better on a fixed 80 column width...
My understanding was that they tokenized them into chunks and tried to learn associations between the chunks, the same as if one was breaking apart English words.
So "2+2=4" isn't being treated that differently from "all's well that ends well." This might lead to a kind of Benny's Rules [0] situation, where sufficient brute-force can make a collection of overfitted non-arithmetic rules appear to work.
[0] https://blog.mathed.net/2011/07/rysk-erlwangers-bennys-conce...
It's basically the same as feature engineering in pre-deep machine learning: constructing features with high information content can significantly reduce the amount of data and computation needed to fit a useful model. And sometimes it's impossible to fit a useful model without careful feature engineering, either because the model itself is constrained in some way or because there isn't enough data or both.
It's analogous to making a choice of inductive bias within the model itself. We literally could not do LLMs without the carefully-constructed transformer architecture. Why should we expect to make further progress without paying more attention to the embeddings?
This question is of course relevant only in a research sense, in seeking to understand to what extent and in what ways the LLM is acting as a stochastic parrot vs gaining a type of "understanding", for lack of a better word.
The problem is that it's not particularly useful: As the problem complexity increases, the user will need to be increasingly specific in the prompt, rapidly approaching being fully exact. There's simply no point to it if your prompt has to (basically) spell out the entire program.
And at that point, the user might as well use the backing system directly, and we should just write a convenient input DSL for that.
Basically, if a word contains a prefix, suffix or root word, we could have a token position relative to the start of the word in the embedding.
"Syntax-Aware Transformer Models for Neural Machine Translation" by Yang et al. (2019). This model enhances the transformer architecture with syntax-aware attention mechanisms that consider dependency parse trees.
"Context-Aware Neural Machine Translation Learns Anaphora Resolution" by Bawden et al. (2018). This paper explores integrating context and syntax into neural machine translation models.
And not only addition: all four arithmetic operations. The technique proposed in the article (imposing a strong inductive bias for addition) kind of works for multiplication, but not for subtraction or division (clearly; I can't even find those words in the paper). As a practical way to build a machine to do arithmetic, this is out of the question.
We've known how to mechanise arithmetic since the 1640s, with Blaise Pascal and his Pascaline. What is the point in demonstrating it's possible to reinvent a broken, partial, buggy version of an arithmetic machine if one tries really hard and shoehorns the necessary patterns into a neural net? We've known that for a long time, too (every proof that a neural net can simulate this or that Turing machine if you design the network diagram and set the weights by hand, ever).
So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
Another way to phrase it: Given a physical process that generates discrete time series trajectories, can our current transformer + SGD method learn to emulate the underlying physical processes by observing sample trajectories?
This question can be somewhat mathematically stated but it is quite difficult because there are still some words in there where I used common sense. For example mathematically there will always exist weird counterexamples, so you would have to quantify things very carefully. That's very difficult, so experiments are the best we can do right now.
Hence any instance where transformers fail to learn a Markov process is very interesting. Example: addition of random numbers.
In any case, a complete theory of addition must be correct up to infinity, so you won't get that with any Markov process we can train from data. Although you can learn addition with a simple linear regression, by setting the weights appropriately. That's because the function of a line already includes addition and multiplication, and that's basically not very different from what the team in the paper above is trying to do. Meaning: they're trying to hand-code the concept of addition into embeddings. It's not 100%, because they're at the same time trying not to 100% encode it, but it's a hard balance to strike.
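To spell out the linear regression point: addition is exactly linear (y = 1*a + 1*b), so ordinary least squares on a handful of (a, b, a+b) examples recovers the weights (1, 1) and then generalises to any pair, including ones never seen. A self-contained sketch:

```python
def fit_two_weights(samples):
    """Ordinary least squares for y ~ w0*a + w1*b, solving the 2x2
    normal equations (X^T X) w = X^T y by hand."""
    saa = sum(a * a for a, b, y in samples)
    sab = sum(a * b for a, b, y in samples)
    sbb = sum(b * b for a, b, y in samples)
    say = sum(a * y for a, b, y in samples)
    sby = sum(b * y for a, b, y in samples)
    det = saa * sbb - sab * sab
    w0 = (say * sbb - sby * sab) / det
    w1 = (sby * saa - say * sab) / det
    return w0, w1

# Three "training" examples of addition are enough.
samples = [(1, 2, 3), (3, 5, 8), (7, 4, 11)]
w0, w1 = fit_two_weights(samples)
print(w0, w1)                  # 1.0 1.0
print(w0 * 123 + w1 * 456)     # 579.0, a pair never "seen"
```

That exact generalisation to infinity is available here precisely because addition is baked into the model class; the interesting question is what happens when it isn't.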
> With positions resolved, we can study the logical extrapolation ability of transformers
They are interested in how well they can make a neural net logically extrapolate outside its training set, once encoding barriers are removed. They show that in fact even quite small language models can do this successfully once we're not confusing them with bad encodings anymore.
This seems like fundamental work. It was only a few years ago that Google employees were arguing LLMs were nothing more than "stochastic parrots". Well, that take will go down in history as one of the worst takes on AI ever. I don't think anyone really believed it anymore by 2024, but the huge and opaque datasets meant people could always argue that maybe a given answer wasn't an example of logical reasoning or extrapolation, maybe the model had just seen that specific question before. But this work shows in a controlled environment that the model can learn the principles of addition and extrapolate to much larger numbers. It's not just repeating answers it's seen in its dataset. It should kill off the parrot meme for good.
No, because it's given hand-engineered embeddings that act as a strong inductive bias that is specific to addition. It's like addition is programmed right in.
If everyone was using horses, what would you have said about the first prototype car? Probably that it was a very slow, clumsy and failure-prone thing.
Obviously, the "best" way to do addition on a computer is by doing it exactly.
The paper makes this claim, but if they could do that, they'd have shown it already: instead their hand-crafted, artisanal embeddings only work well for addition, only weakly for multiplication and sorting, and not at all for other arithmetic operations.
One is that research into the limits of the architecture is useful. Maths has the nice property of being very easy to verify, and you can construct logical processes with it. It's a useful testbed.
Second, there are a lot more places where understanding how to do arithmetic helps, outside of just doing sums on their own.
Nobody's going to be replacing calculators with transformers, sure, but many are and will be using transformers to solve problems that arithmetic is a necessary component of.
>So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
You don't need to shove anything down transformers' throats for them to get arithmetic. Just changing how numbers are tokenized works. But that requires an entire retrain, so why not explore other techniques?
And what does any of this have to do with AGI? You know how terrible humans are at arithmetic, right?
E.g., take chess. Modelling a game of chess as a game tree and searching the game tree by adversarial search is a human invention. Humans are pretty crap at searching a game tree beyond a handful of ply, but we can program a computer to go dozens of ply deep across thousands of branches, and beat any human.
So the challenge for AI is not to get computers to calculate when we know how the calculation is to be performed. The challenge is to get computers to create their own models. And that's a grand, open challenge that is not even close to be solved, certainly not by LLMs. Yann LeCun and Yoshua Bengio have said similar things.
The linked work doesn't move the needle any closer to that; it just shows progress in calculating arithmetic with a transformer, which we already know how to do in a myriad of different ways and much more accurately. Hence my criticism of it.
There are many situations where it is useful for the LLM to get basic arithmetic right.
For example, if someone asks your LLM to explain this line of code [1] which takes a 28x28 px input image, is the right explanation that 28×28÷4×64=9216 ? Or is that the wrong explanation?
And being able to get 100-digit arithmetic right 99% of the time might make us feel reassured that the 4-digit arithmetic we need from the model will be right an even higher % of the time.
[1] https://github.com/pytorch/examples/blob/37a1866d0e0118875d5...
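For the curious, the shapes in that MNIST example can be walked through directly, and 9216 is not 28×28÷4×64 (which would be 12544). As I recall the model, each 3x3 conv (stride 1, no padding) shaves 2 pixels off each side dimension, and only then does the 2x2 max-pool halve it:

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1) -> int:
    # Output spatial size of a convolution with no padding.
    return (size - kernel) // stride + 1

size = 28
size = conv_out(size)          # conv1 (3x3): 28 -> 26
size = conv_out(size)          # conv2 (3x3): 26 -> 24
size //= 2                     # max_pool2d(2): 24 -> 12
flattened = 64 * size * size   # 64 output channels
print(flattened)  # 9216
```

So the right explanation is 64 × 12 × 12 = 9216, which is exactly the kind of small multi-step arithmetic we'd want an LLM explaining code to get right.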
Seriously? They say it right in the introduction. The goal is to learn how to infer algorithmic processes directly from data. Much like how MNIST was used in the early days of NNs, you have to start with small toy problems that are representative of the problem domain. Once you have success with that, you can scale up problem complexity.
General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
I would even appreciate seeing more papers on approaches that didn’t work very well so it saves other researchers from going in the wrong direction. That alone would be enough justification for publishing an article.
Yes, seriously.
>> The goal is to learn how to infer algorithmic processes directly from data.
And they demonstrated nothing like that. An "algorithmic process" is not finding the weights for a function given some carefully designed bias. An algorithm is a sequence of operations that calculates the result of a function. Nothing like that has been demonstrated in the linked paper at all.
>> General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
It's not missing at all, you just won't find it in neural nets. And my PhD and post-doc research is exactly on that sort of thing: learning programs, algorithms and, currently, solvers for general planning problems.
Math inference is a parlor trick, as is the whole "world model" bullshit - physics doesn't work with 99% accuracy.
It's the same reason agents are bullshit right now - error compounding at 95% reliability per step murders them, and currently there is no path to triple-9 reliability.
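The compounding math is worth seeing in numbers (a 20-step task is my illustrative assumption):

```python
# Probability of an agent completing a 20-step task with independent
# per-step reliability: 95% collapses, "triple 9" barely degrades.
steps = 20
for per_step in (0.95, 0.99, 0.999):
    print(per_step, round(per_step ** steps, 3))
# 0.95  -> 0.358
# 0.99  -> 0.818
# 0.999 -> 0.98
```

At 95% per step, more than 6 out of 10 twenty-step runs fail somewhere; at 99.9% it's about 1 in 50.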