Solving Math Word Problems

(openai.com)

56 points · yigitdemirag · 4y ago · 16 comments

howeyc4y ago

> Richard, Jerry, and Robert are going to share 60 cherries. If Robert has 30 cherries, and has 10 more than Richard, how many more cherries does Robert have than Jerry?

> answer:

> Robert has 30 + 10 = 40 cherries.

> If there are 60 cherries to be shared, then Richard and Jerry will have 60 - 40 = 20 cherries each.

> Robert has 40 - 20 = 20 more cherries than Jerry.

Um, the answer is "correct" but isn't the actual reasoning wrong?

Robert has 30

Richard has 20

Jerry has 10

Hence they split the 60 this way.

dragontamer4y ago

This is some "Sideways stories from Wayside School" logic here. (https://wayside-school.fandom.com/wiki/Joe_(book_chapter))

> This doesn’t make any sense. When I count the wrong way I get the right answer, and when I count right I get the wrong answer.

---------

The other story this reminds me of is Abbott and Costello's "7 x 13 == 28" skit.

Jensson4y ago

Looks like it randomly applies operations and reasonings rather than read the text. This sentence for example makes no sense and shows this AI has no understanding of numbers whatsoever, not even first grade level understanding:

> If there are 60 cherries to be shared, then Richard and Jerry will have 60 - 40 = 20 cherries each.

gverrilla4y ago

> Richard, Jerry, and Robert are going to share 60 cherries. If Robert has 30 cherries, and has 10 more than Richard, how many more cherries does Robert have than Jerry?

"and" here is kind of inexact, it implies a sum, so something else. "If Robert has 30 cherries, 10 more than Richard" would be better.

Jtsummers4y ago

It's not inexact, though. "and" in this case is a logical joiner not a numerical joiner. "Richard, Jerry, and Robert are going to share 60 cherries" is the first fact presented. "Robert has 30 cherries" is the second fact, it is one property about the cherries Robert has. "and has 10 more than Richard" is a third fact, it is another property of the cherries Robert has. The only addition that comes out of this is from the "10 more than Richard" bit, "more than" suggests addition, "and" does not. The way kids are taught to transform that would be something like:

  richard + jerry + robert = 60
  robert = 30
  robert = richard + 10

Trying to make Robert have 40 cherries makes the math conducted by the "AI" even more absurd, because it throws out the first fact (that there are 60 total).
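
Solving those three equations by substitution gives the 30 / 20 / 10 split. A minimal sketch in Python; the variable names mirror the equations above and are otherwise illustrative:

```python
# Facts taken directly from the word problem.
total = 60   # richard + jerry + robert = 60
robert = 30  # robert = 30

# robert = richard + 10  =>  richard = robert - 10
richard = robert - 10

# Whatever is left of the 60 cherries goes to Jerry.
jerry = total - robert - richard

print(richard, jerry, robert - jerry)  # 20 10 20
```

The final answer (20) matches the model's output, but only this derivation keeps all three facts consistent.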

psadri4y ago

This might work better if GPT3 is used to rewrite each statement into an algebraic equation. And then an equation solver is used to solve the system.
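
A rough sketch of the solver half of that idea, assuming the translation step has already produced linear equations. The `equations` rows are hand-written here as a stand-in for model output, and `solve_linear` is a hypothetical helper (plain Gauss-Jordan elimination), not anything from the paper:

```python
from fractions import Fraction

def solve_linear(rows):
    """Solve a square linear system given as augmented rows [a1, ..., an, b]
    using Gauss-Jordan elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n = len(m)
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system has no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row so the pivot entry becomes 1.
        p = m[col][col]
        m[col] = [x / p for x in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [row[-1] for row in m]

# Hand-written stand-in for the translation step.
# Unknown order: richard, jerry, robert.
equations = [
    [1, 1, 1, 60],   # richard + jerry + robert = 60
    [0, 0, 1, 30],   # robert = 30
    [-1, 0, 1, 10],  # robert - richard = 10
]
richard, jerry, robert = solve_linear(equations)
print(robert - jerry)  # 20
```

The arithmetic is then exact by construction, so any remaining errors would come from the translation step.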

drzoltar4y ago

It’s frustrating how myopic these papers can be. It seems like the goal of the paper is to solely work within the GPT framework to test the theory of verifiers. Why not try verifiers out with other models? Perhaps it’s not a fair comparison but I remember a Kaggle competition [0] from six years ago which involved building models to solve grade school science multiple choice questions. A simple word2vec model already could achieve 50% accuracy. Despite multiple choice being (maybe?) easier than free response, I’m just skeptical that the way to solve these problems is to throw billions of weights at them. It’s also not convincing to me that this new dataset doesn’t suffer from a much smaller template space, in that the models still just memorize templates.

[0]: https://www.kaggle.com/c/the-allen-ai-science-challenge/over...

pred_4y ago

For a moment there, the title had me hoping that they were working on the generally undecidable https://en.m.wikipedia.org/wiki/Word_problem_(mathematics)

powera4y ago

Scoring 55% on a test like this should not be considered a great accomplishment. A sign of progress, yes, but not an accomplishment by itself.

This is still simply a system that is good at guessing. It does not know anything.

ur-whale4y ago

> It does not know anything.

I would argue that it "knows" an awful lot, but it can't actually reason with it.

However impressive GPT3 type models are, I am not particularly convinced that they're much more than glorified hashtables.

If the hash table is large enough, it can produce a lot of answers to a lot of questions, or approximately imitate a lot of stuff it's seen before.

Whether it can actually combine "knowledge" it has stored in its weights into a pattern it's never seen before ... I'm not convinced.

Der_Einzige4y ago

Re: "Glorified Hashtable"

There is a 1-1 correspondence between data compression and generative models. GPT-2 is a highly effective lossless data compression tool: https://bellard.org/textsynth/sms.html

Always wondered why this insight is not taught as much, especially in the context of things like dimensionality reduction...
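
One way to see the link: a model that assigns probability p to the next symbol lets an ideal entropy coder (arithmetic coding, for instance) spend about -log2(p) bits on it, so better predictions mean shorter output. A toy sketch, with a character-frequency model standing in for GPT-2; the sample string and the `ideal_code_length_bits` helper are illustrative assumptions and do not reproduce the linked tool:

```python
import math
from collections import Counter

def ideal_code_length_bits(text):
    """Total bits an ideal entropy coder would need for `text` under a
    character-frequency model fitted on that same text.
    The cost of transmitting the model itself is ignored."""
    counts = Counter(text)
    n = len(text)
    return sum(-math.log2(counts[ch] / n) for ch in text)

sample = "robert has ten more cherries than richard"
model_bits = ideal_code_length_bits(sample)
raw_bits = 8 * len(sample)  # one byte per character, no model

print(model_bits < raw_bits)  # True
```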

mathteddybear4y ago

GEOS (2015) scores 49% on SAT problems and it is in geometry https://www.semanticscholar.org/paper/Solving-Geometry-Probl...

You were good at guessing!

gwern4y ago

SAT problems are multiple choice, with 5 options. So 50% is barely twice random guessing (1/5).

See how far randomly guessing an integer 1-1000 gets you with OP's word problems with freeform responses.
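
To put numbers on that comparison, a quick sketch of the chance baselines. The 1-1000 answer range is the hypothetical from the comment above, not a property of the dataset:

```python
from fractions import Fraction

# Chance accuracy when guessing uniformly among the possible answers.
multiple_choice = Fraction(1, 5)    # 5 options per question
free_response = Fraction(1, 1000)   # assumed: answer is an integer in 1..1000

print(float(multiple_choice))            # 0.2
print(float(free_response))              # 0.001
print(multiple_choice / free_response)   # 200
```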

