I still hold the opinion that we’re going to need to move to spiking neural network (SNN) models in the future to keep growing the networks. Spiking networks require lots of storage, but far less compute. They also propagate additional information in the _timing_ of the spikes, not just the values. There is a lot of low-hanging fruit in SNNs, and I think people are still trying to copy biological systems too closely.
Unfortunately, the main issue with SNNs is that no one has figured out a way to train them as effectively as ANNs.
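As a minimal sketch of the idea that spike timing carries information, here is a toy leaky integrate-and-fire (LIF) neuron. The constants and the function itself are illustrative assumptions, not taken from any particular SNN framework or paper:

```python
# Toy leaky integrate-and-fire (LIF) neuron: a common SNN building block.
# All constants here are illustrative, chosen only to make the demo readable.

def lif_spike_times(input_current, threshold=1.0, leak=0.9, dt=1.0):
    """Simulate one LIF neuron; return the time steps at which it spikes."""
    v = 0.0          # membrane potential
    spikes = []
    for t, i in enumerate(input_current):
        v = leak * v + i * dt   # leaky integration of the input
        if v >= threshold:      # fire and reset
            spikes.append(t)
            v = 0.0
    return spikes

# A stronger input drives earlier and more frequent spikes: information
# lives in *when* the neuron fires, not just in an activation value.
weak = lif_spike_times([0.2] * 20)    # fires late and rarely
strong = lif_spike_times([0.6] * 20)  # fires early and often
```

Note that the neuron is event-driven: between spikes it only leaks and accumulates, which is where the "lots of storage, far less compute" trade-off comes from.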
As someone just trying to learn more about the implications of new research, I find myself resorting to /r/machinelearning, or even twitter threads, to get timely and informed discussions. That's a shame, given what HN sets out to be.
One way or another, we need a 1000x increase in efficiency to be able to run these models on edge hardware with full privacy, outside the control of the big corporations.
Funny that Gary Marcus is pleading on Twitter to get Dall-E 2 access in order to formulate his response. He isn't getting access yet. https://twitter.com/GaryMarcus/status/1513215530366234625
That kind of gate-keeping is possible because the costs of training and running inference on these models are too high today.
Is this fundamental, or just a problem with mapping these models to our current serially-bottlenecked compute architectures? Could a move to “hyperconverged infrastructure in-the-small” — striping DRAM or NVMe and tiny RISC cores together on a die, where each CPU gets its own storage (or, you might say, where each small cluster of storage cells has its own tiny CPU attached), such that one stick has millions of independent+concurrent [+slow+memory-constrained] processors — resolve these difficulties?
I'm extremely optimistic about how transformers can recursively speed up progress in multiple areas of science. Transformers are reaching a point where they can demonstrate reasoning abilities within the ballpark of what you might expect from a human, and for certain qualities they far exceed what any human is capable of. One of those is depth of knowledge: transformers (e.g. RETRO) can incorporate a library of knowledge far larger than any human can. Soon we will improve and harness this ability to the point where it may be pointless to formulate a scientific hypothesis without first "consulting" a large language model that has processed the entire library of scientific publications.
GPT-3-type models are very good at selecting for arbitrary qualities from among a list of options. Generating a list of 10 potential answers, then running prompts on the candidates to select for quality, accuracy, style, and so forth resembles the cyclic formulation of ideas in humans. The process used to generate essays and articles (draft, edit, revise, simplify, repeat until satisfied) can be implemented trivially. Those processes will transfer to larger models, and approaches like RETRO reduce resource requirements by orders of magnitude.
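The generate-then-select and draft-edit-revise loops described above can be sketched in a few lines. Here `generate`, `score`, and `revise` are hypothetical stand-ins for model calls, not any real API; the toy stubs at the bottom just make the loop runnable:

```python
import random

def generate_and_select(task, generate, score, n=10):
    """Generate n candidate answers, then keep the one the scorer ranks highest."""
    candidates = [generate(task) for _ in range(n)]
    return max(candidates, key=score)

def draft_edit_revise(task, generate, revise, passes=3):
    """Draft once, then run a fixed number of revision passes over the text."""
    text = generate(task)
    for _ in range(passes):
        text = revise(text)
    return text

# Toy stand-ins so the loops are runnable; a real system would call a model
# for each of these (one prompt to draft, another to score or critique).
rng = random.Random(0)
toy_generate = lambda task: f"{task}: draft {rng.randint(0, 99)}"
toy_score = lambda text: int(text.rsplit(" ", 1)[-1])  # "quality" = the number
best = generate_and_select("summarize X", toy_generate, toy_score, n=10)
```

Passing the scorer and reviser as plain callables is the point: the same loop works whether the critic is a second prompt to the same model, a different model, or a hand-written rule.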
"Cognitive architecture" seems to be an accurate descriptor for the use of multiple models and logic layers in many-shot, many-model development.
It may not be human-level with zero-shot output, but how many humans produce human-level output in their stream of consciousness? The act of consideration, recursing over an idea and refining it, is achievable with these models in a way that humans can debug and tweak from cycle to cycle.
Multipass "consideration" and revision methodologies can capture almost any meta-cognitive process used by humans, whether it's the Socratic method, the AP style guide, or an arbitrary jumble of rules derived from 4chan posters.
This type of methodology, doing meta-cognitive programming by linking together different models, is awesome. It constructs low-resolution imitations of brains: GPT-3 and BERT and the like, chained together, can do things that no individual model can achieve. A predicate logic layer can document and explain decision history, and the other modules start to resemble something like the subconscious mind.
I think the next step in NLP will be a drastic innovation on today's learning model.
The Socratic paper is not about “higher intelligence”, it’s about demonstrating useful behaviour purely by connecting several large models via language.
"Stochastic parrot" is a derogatory term and I've never seen anyone who actually understands the technology use that phrase unironically. If anything, it's a shibboleth for bias or ignorance.
Anyone who thinks this REALLY doesn't know how language models work. A properly trained LM will only parrot something back because of a lack of diversity in the training data. This does happen in some cases (e.g., the GPL license or something similar), but those are pretty rare cases.
People on HN seem to think this a lot, but they are just wrong.
It's the first thing anyone learns, and it's easy to do.
It's really unfortunate, but that's why you see so many people on HN who dismiss new technologies in ML (especially in NLP, since everyone can understand the output; that's less true in e.g. protein folding).
Overall, this term says "limited to the intelligence of a parrot," which is false: models can solve math and coding problems, generate passable art, translate and converse in hundreds of languages, and beat us at all board and card games. When was a parrot able to do that?
To me, it is more proof of "stochastic parrot" behavior: the model has seen most of the math information available on the internet, and even with significant computational power it can solve only 58% of elementary-school-level questions, probably those with clear examples in the training data, and can't generalize beyond them.