This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.
I'm more interested in the question of how we can find other useful concepts for data -> embedding space like this; can we incept our tokenization inception so it has more inception?
It makes me think that the authors have correctly identified an issue (positional embeddings) but don't propose a general solution.
I'm not sure if such a thing is possible, but if it is, it would feel more complete. (Fwiw, positional embeddings have had issues for a long time! So a general solution to this would benefit more than just arithmetic. Helpfully, we now have a really good specific example to serve as a baseline for any generalization we seek)
This is muchhhhh different from how tokenization works today. Adding tokens to the vocabulary is free; everything outside that (i.e. the string -> tokens mapping) is going to be a major pain in the ass. Doable, but annoying and error prone.
Deep learning is always and only ever about representing data abstractly. The more abstractions you can make irrelevant (why would you have to learn how to do math when the base-10 perspective on ASCII-digits is already provided for you?) the more you've biased your architecture to readily learn and understand the problem space.
Intelligence doesn't consist in some Divine Creator having given you access to this or that faculty. It's developing those faculties yourself by reasoning through the process of composing your own mental model of the problem.
And, crucially, I'd argue that in "chatbot" tasks those other uses are more common than arithmetic, so an arbitrary focus on specifically optimizing arithmetic doesn't really make sense - the bitter lesson is that we don't want to bias our architecture according to our understanding of a specific problem space, but rather enable the models to learn the problem space directly from data.
If all one is doing is giving a model lots of data and fitting curves, it's not really 'understanding' but brute forcing its way (with gradient descent), then storing the weights and finally approximating the solution when a query is passed in.
This is not the same as understanding. Human intelligence can operate deterministically as well as non-deterministically. We can listen to language, which is by its nature non-deterministic, and convert that into deterministic operations and vice versa. I.e. we can operate on some logic and explain it in multiple ways to other people.
Understanding requires much less data than brute forcing your way into pattern recognition.
When you see a simple expression like 2 * 4, you are able to understand that it's equivalent to 2 + 2 + 2 + 2, and that in turn means 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 <- count that and you've got your answer.
Because you 'understand' this basic concept and all the operations in between, you are able to compute more examples. But you only need to understand it once. Once you understand multiplication and addition and all the tricks in between, you are able to compute 23 * 10 without being fed 23 * 10 as prior data. Understanding is very different from fitting a curve. You can reach conclusions and understanding through pattern recognition, but it's important to differentiate 'approximation' from 'calculation'. If you understand something in its entirety, you should be able to calculate an outcome deterministically.
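A toy sketch of that point (my own illustration, not anything from the paper): if you encode the *rule* that multiplication is repeated addition and addition is repeated counting, you can compute cases you were never shown.

```python
def add(a: int, b: int) -> int:
    """Addition as counting: increment a, b times."""
    for _ in range(b):
        a += 1
    return a

def multiply(a: int, b: int) -> int:
    """Multiplication as repeated addition."""
    total = 0
    for _ in range(b):
        total = add(total, a)
    return total

print(multiply(2, 4))    # 8, i.e. 2 + 2 + 2 + 2
print(multiply(23, 10))  # 230, never "seen" as prior data
```

The rule is learned once; every new input is then a deterministic calculation rather than an interpolation between memorized examples.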
Right now LLMs lack 'understanding' and seem to only 'approximate', which may look like 'understanding' but is actually not.
While I am unsure whether LLMs are really understanding, whatever that means, I think it is not difficult to believe that any form of understanding we implement will involve 'curve fitting' as a central part.
The core feature of curve fitting is learning explicit examples and then interpolating (in an uninformative manner) between unlearned examples. But there's no reason to think this completely describes what the system is doing, in the sense that there are no more informative descriptions of its behavior. Take an example that LLMs are surprisingly good at, creating poetry given arbitrary constraints. Imagine the ratio of the poems it has seen during its training over the number of unique poems it could create in principle. This number would be vanishingly small. Interpolating between two strings representing well-formed poems in an uninformative manner (i.e. some finite polynomial) will not generate well-formed poems. The only way you could move between two examples of well-formed poems while staying on the manifold of well-formed poems is if you captured all relevant features of the manifold. But I fail to see a difference between capturing all relevant features of the poetry-manifold and understanding poetry.
What LLMs do can be described as curve fitting only in the most uninformative description possible. What they do is discover features of the structures referred to by the training text and competently deploy these features in predicting the next token. A human that could do this would be considered to understand said structure.
Optimizing for a certain use case is not gonna take us where we wanna be. We want to have a system that can learn to reason.
Sounds like the AGI argument trap: they're not able to reason, but we can't succinctly define what it is.
I don't come with a reasoning chip. Whatever I call reasoning happens as a byproduct of my neural process.
I do think that the combination of a transformer network and calls to customized reasoning chips (systems that search and deduce answers, like Wolfram Alpha or logic/proof systems) may be a short-stop to something that can perform reason and execution of actions better than humans, but is not AGI.
For transformer-based LLMs (and most LLMs) there's an obvious class of problems that they cannot solve. LLMs generally perform bounded computation per token, so they cannot reason about computational problems that are more than linearly complex, for a sufficiently large input instance. If you have a back-and-forth (many-shot), your LLM can possibly utilize the context as state to solve harder problems - up to the context window, of course.
People also routinely fail to reason; even programmers often write "obvious" logic bugs they don't notice until one gives an unexpected result, at which point it's obvious to them. So both humans and AI don't always reason. But humans reason much better.
I myself have observed ChatGPT 4 solving novel problems I invented to my personal satisfaction well enough to say that it seems to have a rudimentary ability to sometimes show abilities we would typically call reasoning, but only at the level of a child. The issue isn't that it is supposed to reason perfectly or that humans reason perfectly, the issue is that it doesn't reason well enough to succeed at completing many kinds of tasks we would like it to succeed at. Please note that nobody expects it to reason perfectly. "Prove Fermat's last theorem in a rigorous way. Produce a proof that can be checked by Coq, Isabelle, Mizar, or HOL in a format supported directly by any of them" is arguably a request that includes nothing but reasoning and writing code. But we would not expect even Wiles to be able to complete it, and Wiles has actually proved Fermat's last theorem.
So we have an idea of reasoning as completing certain types of tasks successfully, and today humans can do it and AI can't.
Today, it fails badly at tasks that require reasoning. A simple example: https://chatgpt.com/share/da95843e-218a-4d69-a161-6aa2d7a3c9...
The issue is that humans can see its answer is wrong and its "reasoning" is wrong.
The issue isn't that it never reasons correctly. It's that it doesn't do so often enough or well enough, and it doesn't complete tasks we expect humans to complete, and it doesn't always notice when it is printing something outrageously wrong and illogical.
It notices sometimes, it engages in rudimentary guesswork sometimes, but just not often enough or well enough.
I think until we know the answer to this, we can't make predictions about how to build true AGI.
deductive reasoning is just drawing specific conclusions from general patterns - something I would argue these models can do (of course not always, and they are still pretty bad in most cases)
the point i’m trying to make is that reasoning is sometimes overrated and put at the top of the cognitive ladder; sometimes I have seen it compared to self-awareness or things like that. I know you are probably not saying it this way, just wanted to let it out.
I believe there is fundamental work still to be done (maybe models that are able to draw patterns by comparing experiences), but this kind of work is useful, as it makes us reflect on every step of what these models do and on how much the learned internal representation can be optimized.
But we do have a bunch of benchmark tasks/datasets that test what we intuitively understand to be aspects of reasoning.
For AI models, "being able to reason" means "performing well on these benchmarks tasks/datasets".
Over time, we'll add more benchmarking tasks and datasets that ostensibly test aspects of "reasoning", and people will develop models that succeed on more and more of these simultaneously.
And these models will become more and more useful. And people will still argue over whether they are truly "reasoning".
This is according to whom, please?
They just don't have the right architecture to support it.
An LLM is just a fixed size stack of N transformer layers, and has no working memory other than the temporary activations between layers. There are always exactly N steps of "logic" (embedding transformation) put into each word output.
You can use prompts like "think step by step" to try to work around these limitations so that a complex problem can (with good planning by the model) be broken down into M steps of N layers, and the model's own output in early steps acts as pseudo-memory for later steps, but this only gets you so far. It provides a workaround for the fixed N layers and memory, but creates critical dependency on ability to plan and maintain coherency while manipulating long contexts, which are both observed weaknesses of LLMs.
Human reasoning/planning isn't a linear process of N steps - in the general case it's more like an iterative/explorative process of what-if prediction/deduction, backtracking etc, requiring working memory and focus on the task. There's a lot more to the architecture of our brain than a stack of layers - a transformer is just not up to the job, nor was built for it.
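To make the contrast concrete, here is a toy illustration (mine, not from any paper) of the kind of iterative what-if search with backtracking described above, solving 4-queens. The explicit partial solution acts as working memory, and wrong guesses get undone and retried - there is no fixed number of "layers" the computation must fit into.

```python
def solve_queens(n: int, cols=None):
    """Depth-first what-if search: try a column, recurse, backtrack on failure.
    cols[i] is the column of the queen placed in row i."""
    cols = cols or []
    row = len(cols)
    if row == n:
        return cols
    for c in range(n):
        # what-if: can a queen go at (row, c) without attacking earlier ones?
        if all(c != pc and abs(c - pc) != row - pr
               for pr, pc in enumerate(cols)):
            result = solve_queens(n, cols + [c])
            if result:           # the guess panned out
                return result
            # otherwise: backtrack and try the next column
    return None

print(solve_queens(4))  # [1, 3, 0, 2]
```

The amount of computation varies with the input: dead ends cost extra work, which a fixed stack of N layers has no way to spend.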
It is critical thinking, continuous cycles of reprocessing.
And this cannot be overrated: it is the core activity.
I don't make this argument. Benchmarks like CLUTRR[1] show how poorly LLMs do in reasoning.
That the models can't take a corpus of 1-5 digit addition and generalise it out to n-digit addition is an indicator that their reasoning capacities are very poor and inefficient.
Young children take a single textbook and a couple of days' worth of tuition to achieve a generalised understanding of addition. Models train for the equivalent of hundreds of years, across (nearly) the totality of human achievement in mathematics, and struggle with 10-digit addition.
This is not suggestive of an underlying capacity to draw conclusions from general patterns.
So, something like: Please count the number of words in the following sentence. "What is the number of words in the sentence coming before the next one?"
edit: Which might be an artifact of the training data always being in that kind of format.
The sentence you're referring to is "What is the number of words in the sentence coming before the next one? Please answer." It contains 14 words.
Have you tried asking GPT-4 any questions that require reasoning to solve? If so, what did you ask, and what did it get wrong?
I guess perhaps the techniques could be generalized though?
For example, whilst replacing the need for a calculator isn't very important, one obvious research direction would be to explore adding extra embeddings to code inputs, perhaps that are being computed by an IDE.
However, transformers seem to struggle a bit with accurately manipulating sequences, so going to character inputs and hoping for those to be aggregated into words/numbers/etc might cause more problems than it solves?
I have to wonder if these models would not be better off learning whole-word embeddings rather than tokens. You'd have thought they would learn embeddings that encode any useful relatedness (e.g. corresponding to common prefixes) between words. Perhaps numbers would be better off input as a sequence of individual digit embeddings.
For example, the number 12,345,678 is input to ChatGPT as the three tokens "123" "456" "78", which isn't the best place to start to learn that this is an 8 digit number with specific digit positions!
https://platform.openai.com/tokenizer
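A rough sketch of the effect (this regex is only an approximation of the behavior, not the real BPE): GPT-3.5/4-era tokenizers split runs of digits into chunks of at most three, left to right, so the digit positions of a long number are scrambled across tokens.

```python
import re

def chunk_digits(number: str) -> list[str]:
    # Mimics the "at most 3 digits per token, left to right" splitting
    # that the linked tokenizer page shows for long numbers.
    return re.findall(r"\d{1,3}", number)

print(chunk_digits("12345678"))  # ['123', '456', '78']
```

Note the last chunk has two digits, not three: the model can't tell from the token "78" alone whether those digits sit in the ones/tens place or somewhere else entirely.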
As a human child you learn about numbers largely visually by pointing to units, tens, hundreds etc, visually aligning them to add, etc. Maybe a multi-modal model, if it was visually trained on chalkboard primary school math, would do better in learning the concept of position based powers of 10, etc.
That's a far more surmountable problem. Maybe you need one model for biology and another for coding, etc., i.e. a broad split by domain. Still weak AI, not truly general in the AGI sense, but it still seems like a good next step.
[1] Why LLMs like ChatGPT and Google Bard are bad at math:
Why?
I definitely agree that such capabilities would represent a major advance (and very likely go together with game changing increases of capabilities in other areas). I also think using AI to write formal math proofs in e.g. Lean is very cool.
However, by itself, it seems like this capability wouldn't be very useful, commercially for example. Do you think this capability is exceptionally informative merely because it has to go together with other capabilities? It's not impossible to have a (maybe somewhat limited) formal math AI that will remain mostly irrelevant to the everyday world (like FormalGeo).
Quite the opposite, it's the holy grail of all AI.
Consider the various kinds of work that aren't (and can't be) done by computers/robots/etc right now.
The intelligence constraint is, universally, a required amount of problem solving. Even "low skill" labour requires it.
And to perform such problem solving, you need advanced logic and reasoning capabilities, which is the same thing as novel mathematics, just applied to a different end.
There have been crude attempts at this already, hooking in Mathematica and Python into ChatGPT. I say crude, because these add-ons are controlled via output tokens.
What I would like to see is a GPT-style AI that also has compute blocks, not just transformer blocks. I don't mean compute in the sense of "matrix multiply for weights and biases", but literally an ALU-style block of basic maths operations available for use by the neurons.
One thought that I had was that this could be via activations that have both a floating-point activation value and "baggage" such as a numerical value from the input. Like a token in a traditional parser, that can represent a constant string or an integer with its decoded value.
The newer, truly multi-modal models gave me a related idea: Just like how they can have "image" tokens and "audio" tokens, I wonder if they could be given "numeric data" tokens or "math symbol" tokens. Not in the same way that they're given mixed-language text tokens, but dedicated tokens that are fed into both the transformer blocks and also into ALU blocks.
Just an idle thought...
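For what it's worth, here is a minimal sketch of what such an "ALU block" might look like (entirely hypothetical, not any published architecture): a set of exact arithmetic ops, softly selected by a gate, so the block stays differentiable with respect to the gate while the arithmetic itself is exact rather than approximated by matrix multiplies.

```python
import math

# Candidate exact operations the block can apply to two operand values.
OPS = [
    lambda a, b: a + b,
    lambda a, b: a - b,
    lambda a, b: a * b,
]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def alu_block(a: float, b: float, gate_logits: list[float]) -> float:
    """Mix the exact ops by a learned gate (hard-coded logits here)."""
    weights = softmax(gate_logits)
    return sum(w * op(a, b) for w, op in zip(weights, OPS))

# A gate that strongly favours the first op behaves like exact addition.
print(alu_block(17.0, 14.0, [10.0, -10.0, -10.0]))  # ~31.0
```

In a real model the gate logits would come from the transformer's activations and the operands from something like the "baggage" values suggested above; the open question is exactly how gradients through such a block would train the routing.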
[1] Every reader reads into a story something unique, which may or may not align with what the author intended. This is my understanding, coloured by my own knowledge, etc, etc...
Controlling that stuff via output tokens actually kinda makes sense by analogy, since that is how we use calculators etc. But I do agree that specialized tokens that are used specifically to activate tools like that might be a better idea than just using plain text to signal in-band. And production of such specialized tokens can be easily trained.
I like this idea a lot. Right now we are going the long/hard way round: post-training, asking an LLM to recognize that it needs compute, then write a compute request, then feed the compute answer back into a tokenization loop.
It probably does make sense to add a mini CPU as a layer / tool / math primitive. I wonder how you'd train it to use such a thing? In my mind it's not really a layer per se, but a set of function calls a layer could route to when it wants, weighting the response appropriately.
> We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges.
A tangent is that positional systems were originally invented with least digit first, I believe.
The Babylonian sexagesimal system was like that as was the Arabic one (where first is on the right).
The most significant digit first convention came when right-to-left numbers were used in left-to-right writing systems without being reversed. To this day we read the more common smaller numbers least significant digit first to varying degrees.
16 = six teen, sech zehn
98 = acht und neunzig, acht en negentig, ثمانية وتسعون
17 + 14 = 20 + 11 = 30 + 1 = 31
vs 17 + 14 = 10 + 10 + 10 + 1 = 31
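Least-significant-digit-first is also the order in which the carry actually propagates, which is presumably why the original positional systems worked that way. A sketch of grade-school addition in that order:

```python
def add_lsd_first(x: str, y: str) -> str:
    """Digit-by-digit addition, units digit first, carrying as we go."""
    a, b = x[::-1], y[::-1]          # reverse: least significant digit first
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = int(a[i]) if i < len(a) else 0
        db = int(b[i]) if i < len(b) else 0
        carry, digit = divmod(da + db + carry, 10)
        out.append(str(digit))
    if carry:
        out.append(str(carry))
    return "".join(out)[::-1]        # reverse back for conventional display

print(add_lsd_first("17", "14"))   # 31
print(add_lsd_first("999", "1"))   # 1000
```

Note the algorithm never needs to look ahead; written most-significant-first, the same procedure would need to buffer digits until every possible carry is resolved.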
Some say AGI has already been achieved, others that it's years or decades away. When I dig into the disagreement, it often partially depends on the perspective of how competent humans are on the tasks in question, with the optimists being, I think, more realistic about variance in human intelligence and the pessimists seeming to reserve the term "general intelligence" for possessing a nearly perfect suite of capabilities that many otherwise intelligent people practically don't have.
For example with arithmetic, this study cites another [Dziri et al. 2023], that says:
"For instance, humans can solve 3-digit by 3-digit multiplication arithmetic after learning basic calculation rules. Yet, off-the-shelf ChatGPT and GPT4 achieve only 55% and 59% accuracies on this task, respectively."
But this isn't the case.. 5-6% of the population have https://en.wikipedia.org/wiki/Dyscalculia, but can be otherwise normal.
I still see value in normative statements about human capability in AI & AGI research, but I think we'll need to move towards explicit statistical framing.
DeepMind's Position paper "Levels of AGI for Operationalizing Progress on the Path to AGI" has a schema like this, where AGI capabilities are defined across 2 axes of Performance level X Generality (narrow vs general), and the Performance levels are measured by comparison with "Percentile of skilled adults" able to perform the task.. https://arxiv.org/pdf/2311.02462#page=3.40
Within that framing, this paper's title or result might be "Achieving AGI Competency in Arithmetic", or "Expertise", or "Virtuosity", i.e. on par respectively with 50th, 90th or 99th percentile of skilled adults.
LLMs don't share that property, though. Their distribution of proficiency over various dimensions and subfields is highly variable and only slightly correlated. Therefore, it makes no sense to infer the ability or inability to perform some magically global type of reasoning or generalization from just a subset of tasks, the way we do for humans.
Also found this, a Mensa test for across the top dozen frontier models.
https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-10...
That does seem to me to be demonstrating a global type of reasoning or generalization.
Also see the author's note that at least with Claude, they seem to be releasing about every 20 IQ points.
Truthfully we're going to see that improving language models towards AGI works out the same way self driving cars did - we're going to feel like we're 85% of the way there out of the gate, then we're going to keep tripping over things for the next 15 years.
At least with AGI, we can just throw up our hands, use an easier definition and take the W.
This nitpicking is a red herring.
The issue that separates "AGI" from current AI systems is the lack of generality. (Humour me.)
In particular, the lack of reasoning capability. And what the pessimists argue here is that there is no road to get there for current systems. Transformers are approximation machines, and are generalized for that specific task. But that's also where it stops, they can't do things that aren't such pattern-approximation.
Optimizing a transformer for arithmetic isn't a step towards AGI, because it is not generalizing. You'd need to do this for every conceivable task and subtask. This is the exact reason why imperative-programmed AI architectures were discarded.
Put bluntly, this approach will never get you a transformer that won't shit itself when asked to do novel reasoning tasks, such as novel mathematics (which, I will remind the reader, anything but the most basic programming work counts as).
And critically, the fundamental architecture of these transformer systems doesn't allow the combination of them into other AI systems to acquire generalized capabilities. There's no way to make an LLM hook into a computer-algebra-system, you can only feed 'finished' output of one system into another.
Vertical alignment across lines is pretty important for humans to learn operations on digits, but the way we encode lines with a \n separator doesn't really help. In a recent codebullet video gpt really struggled with any kind of vertical alignment task. I wonder if it would do better on a fixed 80 column width...
My understanding was that they tokenized them into chunks and tried to learn associations between the chunks, the same as if one was breaking apart English words.
So "2+2=4" isn't being treated that differently from "all's well that ends well." This might lead to a kind of Benny's Rules [0] situation, where sufficient brute-force can make a collection of overfitted non-arithmetic rules appear to work.
[0] https://blog.mathed.net/2011/07/rysk-erlwangers-bennys-conce...
It's basically the same as feature engineering in pre-deep machine learning: constructing features with high information content can significantly reduce the amount of data and computation needed to fit a useful model. And sometimes it's impossible to fit a useful model without careful feature engineering, either because the model itself is constrained in some way or because there isn't enough data or both.
It's analogous to making a choice of inductive bias within the model itself. We literally could not do LLMs without the carefully-constructed transformer architecture. Why should we expect to make further progress without paying more attention to the embeddings?
This question is of course relevant only in a research sense, in seeking to understand to what extent and in what ways the LLM is acting as a stochastic parrot vs gaining a type of "understanding", for lack of a better word.
The problem is that it's not particularly useful: As the problem complexity increases, the user will need to be increasingly specific in the prompt, rapidly approaching being fully exact. There's simply no point to it if your prompt has to (basically) spell out the entire program.
And at that point, the user might as well use the backing system directly, and we should just write a convenient input DSL for that.
Basically, if a word contains a prefix, suffix or root word, we could have a token position relative to the start of the word in the embedding.
"Syntax-Aware Transformer Models for Neural Machine Translation" by Yang et al. (2019). This model enhances the transformer architecture with syntax-aware attention mechanisms that consider dependency parse trees.
"Context-Aware Neural Machine Translation Learns Anaphora Resolution" by Bawden et al. (2018). This paper explores integrating context and syntax into neural machine translation models.
And not only addition: all four arithmetic operations. The technique proposed in the article (imposing a strong inductive bias for addition) kind of works for multiplication, but not for subtraction or division (clearly; I can't even find those words in the paper). As a practical way to build a machine to do arithmetic, this is out of the question.
We've known how to mechanise arithmetic since the 1640s, with Blaise Pascal and his Pascaline. What is the point in demonstrating it's possible to reinvent a broken, partial, buggy version of an arithmetic machine if one tries really hard and shoehorns the necessary patterns into a neural net? We've known that for a long time, too (every proof that a neural net can simulate this or that Turing machine if you design the network diagram and set the weights by hand, ever).
So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
Another way to phrase it: Given a physical process that generates discrete time series trajectories, can our current transformer + SGD method learn to emulate the underlying physical processes by observing sample trajectories?
This question can be somewhat mathematically stated but it is quite difficult because there are still some words in there where I used common sense. For example mathematically there will always exist weird counterexamples, so you would have to quantify things very carefully. That's very difficult, so experiments are the best we can do right now.
Hence any instance where transformers fail to learn a Markov process is very interesting. Example: addition of random numbers.
In any case, a complete theory of addition must be correct up to infinity, so you won't get that with any Markov process we can train from data. Although you can learn addition with a simple linear regression, by setting the weights appropriately. That's because the function of a line already includes addition and multiplication, and that's basically not very different from what the team in the paper above is trying to do. Meaning: they're trying to hand-code the concept of addition into embeddings. It's not 100%, because they're at the same time trying not to 100% encode it, but it's a hard balance to strike.
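To spell out the linear regression point: addition is exactly linear (y = 1*a + 1*b), so ordinary least squares on a handful of (a, b, a+b) examples recovers the weights (1, 1) and then generalises to any pair, including ones never seen. A self-contained sketch:

```python
def fit_two_weights(samples):
    """Ordinary least squares for y ~ w0*a + w1*b, solving the 2x2
    normal equations (X^T X) w = X^T y by hand."""
    saa = sum(a * a for a, b, y in samples)
    sab = sum(a * b for a, b, y in samples)
    sbb = sum(b * b for a, b, y in samples)
    say = sum(a * y for a, b, y in samples)
    sby = sum(b * y for a, b, y in samples)
    det = saa * sbb - sab * sab
    w0 = (say * sbb - sby * sab) / det
    w1 = (sby * saa - say * sab) / det
    return w0, w1

# Three "training" examples of addition are enough.
samples = [(1, 2, 3), (3, 5, 8), (7, 4, 11)]
w0, w1 = fit_two_weights(samples)
print(w0, w1)                  # 1.0 1.0
print(w0 * 123 + w1 * 456)     # 579.0, a pair never "seen"
```

That exact generalisation to infinity is available here precisely because addition is baked into the model class; the interesting question is what happens when it isn't.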
> With positions resolved, we can study the logical extrapolation ability of transformers
They are interested in how well they can make a neural net logically extrapolate outside its training set, once encoding barriers are removed. They show that in fact even quite small language models can do this successfully once we're not confusing them with bad encodings anymore.
This seems like fundamental work. It was only a few years ago that Google employees were arguing LLMs were nothing more than "stochastic parrots". Well, that take will go down in history as one of the worst takes on AI ever. I don't think anyone really believed it anymore by 2024, but the huge and opaque datasets meant people could always argue that maybe a given answer wasn't an example of logical reasoning or extrapolation, maybe the model had just seen that specific question before. But this work shows in a controlled environment that the model can learn the principles of addition and extrapolate to much larger numbers. It's not just repeating answers it's seen in its dataset. It should kill off the parrot meme for good.
No, because it's given hand-engineered embeddings that act as a strong inductive bias that is specific to addition. It's like addition is programmed right in.
If everyone was using horses, what would you have said about the first prototype car? Probably that it was a very slow, clumsy and failure-prone thing.
Obviously, the "best" way to do addition on a computer is by doing it exactly.
The paper makes this claim, but if they could do that, they'd have shown it already: instead their hand-crafted, artisanal embeddings only work well for addition, only weakly for multiplication and sorting, and not at all for other arithmetic operations.
One is that research into the limits of the architecture is useful. Maths has the nice property of being very easy to verify, and you can construct logical processes with it. It's a useful testbed.
Second, there are a lot more places where understanding how to do arithmetic helps, outside of just doing sums on their own.
Nobody's going to be replacing calculators with transformers, sure, but many are and will be using transformers to solve problems that arithmetic is a necessary component of.
>So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares?
You don't need to shove anything down transformers' throats for them to get arithmetic. Just changing how numbers are tokenized works. But that requires an entire retrain, so why not explore other techniques?
And what does any of this have to do with AGI? You know how terrible humans are at arithmetic, right?
E.g., take chess. Modelling a game of chess as a game tree and searching the game tree by adversarial search is a human invention. Humans are pretty crap at searching a game tree beyond a handful of ply, but we can program a computer to go dozens of ply deep across thousands of branches, and beat any human.
So the challenge for AI is not to get computers to calculate when we know how the calculation is to be performed. The challenge is to get computers to create their own models. And that's a grand, open challenge that is not even close to be solved, certainly not by LLMs. Yann LeCun and Yoshua Bengio have said similar things.
The linked work doesn't move the needle any closer to that; it just shows progress in calculating arithmetic with a transformer, which we already know how to do in a myriad of different ways and much more accurately. Hence my criticism of it.
There are many situations where it is useful for the LLM to get basic arithmetic right.
For example, if someone asks your LLM to explain this line of code [1] which takes a 28x28 px input image, is the right explanation that 28×28÷4×64=9216 ? Or is that the wrong explanation?
And being able to get 100-digit arithmetic right 99% of the time might make us feel reassured that the 4-digit arithmetic we need from the model will be right an even higher % of the time.
[1] https://github.com/pytorch/examples/blob/37a1866d0e0118875d5...
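For the curious, the shapes in that MNIST example can be walked through directly, and 9216 is not 28×28÷4×64 (which would be 12544). As I recall the model, each 3x3 conv (stride 1, no padding) shaves 2 pixels off each side dimension, and only then does the 2x2 max-pool halve it:

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1) -> int:
    # Output spatial size of a convolution with no padding.
    return (size - kernel) // stride + 1

size = 28
size = conv_out(size)          # conv1 (3x3): 28 -> 26
size = conv_out(size)          # conv2 (3x3): 26 -> 24
size //= 2                     # max_pool2d(2): 24 -> 12
flattened = 64 * size * size   # 64 output channels
print(flattened)  # 9216
```

So the right explanation is 64 × 12 × 12 = 9216, which is exactly the kind of small multi-step arithmetic we'd want an LLM explaining code to get right.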
Seriously? They say it right in the introduction. The goal is to learn how to infer algorithmic processes directly from data. Much like how MNIST was used in the early days of NNs, you have to start with small toy problems that are representative of the problem domain. Once you have success with that, you can scale up problem complexity.
General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
I would even appreciate seeing more papers on approaches that didn’t work very well so it saves other researchers from going in the wrong direction. That alone would be enough justification for publishing an article.
Yes, seriously.
>> The goal is to learn how to infer algorithmic processes directly from data.
And they demonstrated nothing like that. An "algorithmic process" is not finding the weights for a function given some carefully designed bias. An algorithm is a sequence of operations that calculates the result of a function. Nothing like that has been demonstrated in the linked paper at all.
>> General algorithmic capability is one of the key traits that we think AGI should have, and it’s currently missing. If you have a better approach for getting there quicker than everyone else in the field, please share it.
It's not missing at all, you just won't find it in neural nets. And my PhD and post-doc research is exactly on that sort of thing: learning programs, algorithms and, currently, solvers for general planning problems.
Math inference is a parlor trick, as is the whole "world model" bullshit - physics doesn't work with 99% accuracy.
It's the same reason agents are bullshit right now - error compounding at 95% reliability per step murders them, and currently there is no path to triple-9 reliability.
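The compounding math is worth seeing in numbers (a 20-step task is my illustrative assumption):

```python
# Probability of an agent completing a 20-step task with independent
# per-step reliability: 95% collapses, "triple 9" barely degrades.
steps = 20
for per_step in (0.95, 0.99, 0.999):
    print(per_step, round(per_step ** steps, 3))
# 0.95  -> 0.358
# 0.99  -> 0.818
# 0.999 -> 0.98
```

At 95% per step, more than 6 out of 10 twenty-step runs fail somewhere; at 99.9% it's about 1 in 50.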