
0 points · throwthrowuknow · 1y ago · 0 comments

I think the “train for hundreds of years” argument is misleading. It’s based on parallel compute time, that is, how long the same training would take if run sequentially on a single GPU. That assumes an equivalence with human thought based on the model’s tokens-per-second rate, which is a bad measurement because it varies with hardware. The closest comparison you could draw to what a human brain is doing would be the act of writing or speaking, but we obviously process far more information, and produce it at a much higher rate, than we can speak or write. Imagine if you had to verbally direct each motion of your body: it would take an absurd amount of time to do anything, depending on how specific you had to be.

The work done in this paper is very interesting, and your dismissal of “it can’t see a corpus and then generalize to n digits” is not called for. They are training models from scratch in 24 hours per model using only 20 million samples. It’s hard to equate that to an activity a single human could do. It’s as though you had piles of accounting ledgers filled with sums, with no other information or knowledge of mathematics, numbers or the world, and you discovered how to do addition from that information alone. There is no textbook or tutor helping them do this either it should be noted.

There is a form of generalization if it can derive an algorithm based on a maximum length of 20 digit operands that also works for 120 digits. Is it the same algorithm we use by limiting ourselves to adding two digits at a time? Probably not but it may emulate some of what we are doing.


OtherShrezzing · 1y ago

>There is no textbook or tutor helping them do this either it should be noted.

For this particular paper there isn't, but all of the large frontier models do have textbooks (we can assume they have almost all modern textbooks). They also have formal proofs of addition in Principia Mathematica, alongside nearly every math paper ever produced. And still, they demonstrate an incapacity to deal with relatively trivial addition - even though they can give you a step-by-step breakdown of how to correctly perform that addition with the columnar-addition approach. This juxtaposition seems transparently at odds with the idea of an underlying understanding & deductive reasoning in this context.

>There is a form of generalization if it can derive an algorithm based on a maximum length of 20 digit operands that also works for 120 digits. Is it the same algorithm we use by limiting ourselves to adding two digits at a time? Probably not but it may emulate some of what we are doing.

The paper is technically interesting, but I think it's reasonable to definitively conclude the model has not created an algorithm that is remotely as effective as columnar addition. If it had, it would be able to perform addition on integers of any length n. Instead it has produced a relatively predictable result: when given lots of domain-specific problems, transformers get better at approximating the results of those problems, and when faced with problems significantly beyond their training data, their accuracy degrades.

That's not a useless result. But it's not the deductive reasoning that was being discussed in the thread - at least if you add the (relatively uncontroversial) caveat that deductive reasoning should lead to a correct conclusion.
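For reference, here is a minimal sketch of the columnar addition I mean, assuming non-negative integers passed as decimal digit strings. The name `columnar_add` is just for illustration and has nothing to do with the paper's model or code.

```python
def columnar_add(a: str, b: str) -> str:
    """Add two non-negative decimal integers given as digit strings."""
    # Pad the shorter operand with leading zeros so the columns line up.
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)

    carry = 0
    digits = []
    # Walk the columns right to left, as in pencil-and-paper addition.
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))

    # A leftover carry becomes the new leading digit.
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


# The same loop handles 20-digit and 120-digit operands alike.
x = "9" * 120
y = "1"
print(columnar_add(x, y) == str(int(x) + int(y)))
```

Each step only looks at one column plus the carry, so nothing in the procedure depends on operand length. That is the property I'd expect from a model that had actually derived the algorithm.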
