They do. That's the whole point of the paper: you can set a bunch of weights manually like you suggest, but can they be learned instead, and how? See the Introduction. They make it very clear that they are investigating whether certain concepts can be learned by gradient descent specifically. They point out that earlier work doesn't do that, and that gradient descent is an obvious source of inductive bias that should affect the ability of different architectures to learn different concepts. Like I say, good work.
>> But you can probably train multiple formal languages simultaneously, to make the counting feature emerge from the data.
You could always try it out yourself, you know. Like I say, that's the beauty of grammars: you can generate tons of synthetic data and go to town.
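To make that concrete, here's a minimal sketch of the kind of generator I mean. It's my own toy example, not from the paper: it samples positive and negative examples of the counting language a^n b^n, which is the classic grammar people use to probe whether an architecture can learn to count.

```python
import random

def gen_positive(max_n=10):
    """Sample a string from a^n b^n (a positive example of the counting language)."""
    n = random.randint(1, max_n)
    return "a" * n + "b" * n

def gen_negative(max_n=10):
    """Sample a near-miss string a^n b^m with n != m (a negative example)."""
    n = random.randint(1, max_n)
    m = random.choice([k for k in range(1, max_n + 1) if k != n])
    return "a" * n + "b" * m

def make_dataset(size=1000, seed=0):
    """Build a balanced, shuffled dataset of (string, label) pairs."""
    random.seed(seed)
    data = [(gen_positive(), 1) for _ in range(size // 2)]
    data += [(gen_negative(), 0) for _ in range(size - size // 2)]
    random.shuffle(data)
    return data
```

Swap in whatever grammars you like (Dyck languages, a^n b^n c^n, and so on) and you can churn out as many labeled examples as your training budget allows, which is exactly why this kind of experiment is cheap to run.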
>> You can't deduce much from negative results in research beside it requiring more work.
I disagree. I'm a falsificationist. A positive result is consistent with many explanations, but a failure rules at least one out. The only time we learn anything useful is when stuff fails.