1. A thought is a representation of a situation
2. A representation generates entailments of that situation
3. Language translates these representations into symbols; the mapping from sentences back to representations is many-to-one (many sentences express the same representation)
4. Understanding language is reversing this translation: recovering the thought (i.e., the representation) from the symbols
So,
5. If agent A understands sentence X, then A forms the relevant representation S of the situation X describes.
6. If an agent has the representation S, it can state entailments of S (e.g., counterfactuals) -- a toy rendering of 1-6 is sketched below.
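For illustration only, premises 1-6 can be rendered as a toy model. Every name below (Representation, verbalize, understand, entailments) is my own labelling for the sketch, not something the argument depends on:

```python
# Toy rendering of premises 1-6. Purely illustrative; the names are invented.
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    """A representation of a situation (premise 1)."""
    situation: str


def verbalize(r: Representation) -> set[str]:
    """Language: one representation can surface as many sentences (premise 3)."""
    s = r.situation
    return {f"{s}.", f"It is the case that {s}.", f"{s}, as it happens."}


def understand(sentence: str, known: set[Representation]) -> Representation | None:
    """Understanding: map a sentence back to the representation behind it (premises 4-5)."""
    for r in known:
        if sentence in verbalize(r):
            return r
    return None


def entailments(r: Representation) -> set[str]:
    """An agent holding a representation can state its entailments,
    e.g. counterfactuals (premises 2 and 6)."""
    return {f"If not ({r.situation}), things would be otherwise."}


# Key property: every verbalization recovers the same representation,
# so the statable entailments do not depend on which sentence was used.
r = Representation("the cup is on the table")
assert all(understand(x, {r}) == r for x in verbalize(r))
```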
Now, split X into Xc = the canonical descriptions of S, and Xp = trivial permutations of those descriptions
(such that Xc and Xp overlap little as distributions, but the tokens of Xp are individually common).
Form the entailments of X, call them Y -- sentences that are canonically implied by the truth of X.
7. If the LLM understood that X entails Y, it would be via constructing the representation S -- which entails Y regardless of which sentence in X was used.
8. Train an LLM only on Xc, and its accuracy at judging whether Y is entailed by Xp sentences is at chance (this train/test probe is sketched after the argument below).
9. Since using Xp sentences causes it to fail, it does not predict Y via S.
QED.
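The probe in 8-9 can be written down as a protocol without committing to any particular model. A minimal sketch, assuming a hypothetical train/judges_entailment interface (those names, and the balanced true/false test items, are my assumptions, not part of the argument):

```python
# Sketch of the probe in premises 8-9: train only on canonical descriptions Xc,
# then test entailment judgments when the premise comes from the permuted set Xp.
# The EntailmentModel interface is hypothetical; any fine-tune-then-prompt setup
# could stand in for it.
import random
from typing import Protocol


class EntailmentModel(Protocol):
    def train(self, sentences: list[str]) -> None: ...
    def judges_entailment(self, premise: str, hypothesis: str) -> bool: ...


def run_probe(model: EntailmentModel,
              xc: list[str],       # Xc: canonical descriptions of S
              xp: list[str],       # Xp: trivial permutations of those descriptions
              y_true: list[str],   # Y: sentences canonically entailed by S
              y_false: list[str],  # distractors that are not entailed
              ) -> float:
    """Train on Xc only, then score entailment judgments with Xp premises.
    Chance-level accuracy here is exactly what premise 8 reports."""
    model.train(xc)
    items = ([(p, h, True) for p in xp for h in y_true]
             + [(p, h, False) for p in xp for h in y_false])
    random.shuffle(items)
    correct = sum(model.judges_entailment(p, h) == label for p, h, label in items)
    return correct / len(items)
```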
And we can say,
1. Appearing to judge that Y is entailed by X is possible via simple sampling of (X, Y) pairs from historical cases (a toy version of such a sampler is sketched after this list).
2. LLMs are just such a sampling.
so,
3. Apply inference to the best explanation:
4. LLMs sample historical cases rather than form representations.
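To make point 1 concrete: here is a deliberately representation-free "sampler of historical cases". The class, its corpus format, and its threshold are all mine and purely illustrative; it never models the situation S, yet it can look like it judges entailment, and on premises it has never seen (the Xp case) its answers are blind, which lands it at chance on a balanced test set:

```python
# A representation-free baseline: judge that X entails Y whenever that exact
# (X, Y) pairing was common in the historical corpus. No situation is modelled.
from collections import Counter


class HistoricalSampler:
    def __init__(self, corpus: list[tuple[str, str]]):
        # corpus: (premise, hypothesis) pairs that were observed together
        self.pair_counts = Counter(corpus)
        self.premise_counts = Counter(p for p, _ in corpus)

    def judges_entailment(self, premise: str, hypothesis: str) -> bool:
        # "Yes" iff this exact pairing is frequent relative to the premise.
        # A premise never seen before (e.g. an Xp permutation) always gets "no",
        # which is chance-level behaviour on a balanced true/false test set.
        seen = self.premise_counts[premise]
        if seen == 0:
            return False
        return self.pair_counts[(premise, hypothesis)] / seen > 0.5
```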
Incidentally, "sampling of historical cases" is already something we knew -- so this entire argument is basically unnecessary, or rather only necessary because PhDs have been turned into start-up hype men.