Neural Network Diffusion

(arxiv.org)

223 points by vagabund 2y ago | 86 comments

I wasn't sure if this paper was parody on reading the abstract. It's not parody. Two things stand out to me: first is the idea of distilling these networks down into a smaller latent space, and then mucking around with that. That's interesting, and cross-sections a bunch of interesting topics like interpretability, compression, training, over- and under-.. The second is that they show the diffusion models don't just converge on similar parameters as the ones they train against/diffuse into, and that's also interesting.

I confess I'm not sure what I'd do with this in the random grab bag of Deep Learning knowledge I have, but I think it's pretty fascinating. I might like to see a trained latent encoder that works well on a bunch of different neural networks; maybe that thing would be a good tool for interpreting / inspecting.

daxfohl 2y ago

Seems like it could be useful for resizing the networks, no? Start with ChatGPT 4, then release an open version of it with far fewer parameters.

Or maybe some metaparameter that mucks with the sizes during training produces better results. Start large to get a baseline, then reduce size to increase coherence and learning speed, then scale up again once that is maxed out.

SubiculumCode 2y ago

Perhaps doing this to generate 10 similar but different versions of a model can then be fed into mixture of experts?

vessenes 2y ago

Ooh that’s a good idea! Although Mixtral seems to have been seeded with identical copies of Mistral, so maybe it doesn’t buy you much? Sounds worth trying though!

SubiculumCode 2y ago

The deep problem of my life: I'm interested in so many things, but only have time to pursue one hobby and one neuroscience career. If it is indeed a good idea, it's only from connecting gleaned generalizations with other gleaned generalizations; but the devil is often in the details, and I will never have enough time to try it myself. :)

daxfohl 2y ago

Or a good way to teleport out of local minima while training. Create a few clones and take the one with the steepest gradients.
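A toy sketch of that clone-and-pick idea, under loud assumptions: the 1-D `loss` is made up, the clones come from plain Gaussian perturbation rather than from a diffusion model, and "steepest gradients" is read as largest gradient magnitude. The names `descend` and `steepest_clone` are hypothetical.

```python
import math
import random


def loss(w):
    # Hypothetical 1-D loss with many local minima (illustration only).
    return 0.1 * w * w + math.sin(3.0 * w)


def grad(w, eps=1e-5):
    # Central finite-difference estimate of d(loss)/dw.
    return (loss(w + eps) - loss(w - eps)) / (2.0 * eps)


def descend(w, lr=0.05, steps=2000):
    # Plain gradient descent; it settles into whichever minimum is nearest.
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def steepest_clone(w, n_clones=8, sigma=0.5, seed=0):
    # Make perturbed clones and keep the one with the largest |gradient|.
    rng = random.Random(seed)
    clones = [w + rng.gauss(0.0, sigma) for _ in range(n_clones)]
    return max(clones, key=lambda c: abs(grad(c)))


if __name__ == "__main__":
    stuck = descend(2.0)
    restart = steepest_clone(stuck)
    print(abs(grad(stuck)), abs(grad(restart)))
```

Whether the steepest clone actually lands in a better basin is not something this sketch checks; it only shows the selection step.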

namibj 2y ago

Hmmm, I could think of using it to update a DDPM with a conditioning input as the dataset expands from an RL/online process, without ruining the conditioning mechanism that's only trainable through the actual RL itself.

I.e., self-supervised training is done to produce semantically sensical results, and the RL-trained conditioning input steers to contextually useful results.

(Btw., if anyone has tips on how to not wreck the RL training's effort when updating the base model with the recently encountered semantically-valid training samples that can be used self-supervised, please tell. I'd hate to throw away the RL effort expended to acquire that much training data for good self-supervised operation. It's already looking fairly expensive...)

daxfohl 2y ago

You could use this and try to tease out something similar to https://news.ycombinator.com/item?id=39487124, but for NNs instead of images. Maybe it's possible to have this NN diffusion model explain the pieces of the NN they generate and why parameters have those values.

If we can get that, then maybe we don't even need to train anymore; it'd be possible to start to generate NNs algorithmically.

gwern 2y ago

This doesn't seem all that impressive when you compare it to earlier work like 'g.pt' https://arxiv.org/abs/2209.12892 Peebles et al 2022. They cite it in passing, but do no comparison or discussion, and to my eyes, g.pt is a lot more interesting (for example, you can prompt it for a variety of network properties like low vs high score, whereas this just generates unconditionally) and more thoroughly evaluated. The autoencoder here doesn't seem like it adds much.

vagabund (OP) 2y ago

Author thread: https://twitter.com/liuzhuang1234/status/1760195922502312197

squigz 2y ago

Are there any sites for viewing Twitter threads without signing up?

f_devd 2y ago

https://nitter.esmailelbob.xyz/liuzhuang1234/status/17601959...

(bit of trial and error from https://github.com/zedeus/nitter/wiki/Instances)

falcor84 2y ago

Seems like we're getting very close to recursive self-improvement [0].

[0] https://www.lesswrong.com/tag/recursive-self-improvement

astrange 2y ago

No, this is an example of an existing technique called hypernetworks.

It's not "recursive self improvement", which is just a belief that magic is real and you can wish an AI into existence. In particular, this one needs too much training data, and you can't define "improvement" without knowing what to improve to.

FeepingCreature 2y ago

All current LLMs are based on the premise that magic is real and you can wish intelligence into existence; it's called "scaling laws" and "emergent capabilities".

Recursive self-improvement isn't "maybe magic is real", it's "maybe the magic we already know about stays magical as we cast our spells with more mana."

z7 2y ago

Doesn't this line of reasoning imply that human intelligence is magical, i.e. is not the result of scaling/emergence?

killerstorm 2y ago

> which is just a belief that magic is real

Is there a law of thermodynamics which prevents AI from writing code which would train a better AI? Never learned that one in school.

And FYI here's OpenAI plan to align superintelligence: "Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence."

I guess people working there believe in magic.

> and you can wish an AI into existence.

Eh? People believe that self-improvement might happen when AI is around human-level.

astrange 2y ago

> Is there a law of thermodynamics which prevents AI from writing code which would train a better AI?

You need to apply Wittgenstein here.

This appears to be true because you haven't defined "better". If you define it, it'll become obvious that this is either false or true, but if it is true it'll be obvious in a way that doesn't make it sound interesting anymore.

(For one thing our current "AI" don't come from "writing code", they just come from training bigger models on the same data. For another, making changes to code doesn't make it exponentially better, and instead breaks it if you're not careful.)

> I guess people working there believe in magic.

Yes, OpenAI was literally founded by a computer worshipping religious cult.

> People believe that self-improvement might happen when AI is around human-level.

Humans don't have a "recursive self-improvement" ability.

Also not obvious that an AI that was both "aligned" and "capable of recursive self-improvement" would choose to do it; if you're an AI and you're making a new improved AI, how do you know it's aligned? It sounds unsafe.

koe123 2y ago

> I guess people working there believe in magic.

I've been thinking about this recently. Personally, I've yet to see any compelling evidence that an LLM, let alone any AI, can operate really well "out of distribution". Its capabilities (in my experience) seem to be spanned by the data it's trained on. Hence, this supposed property that it can "train itself", generating new knowledge in the process, is yet to be proven in my mind.

That raises the question for me: why do OpenAI staff believe what they believe?

If I'm being optimistic, I suppose they may have seen unreleased tech, motivating their beliefs that seemingly AGI is on the horizon.

If I'm being cynical, the promise of AGI probably draws in much more investment. Thus, anyone with a stake in OpenAI has an incentive to promote this narrative of imminent AGI, regardless of how realistic it is technically.

This is of course just based on what I've seen and read; I'd love to see evidence that counters my claims.

rdedev 2y ago

Even if recursive self-improvement does work out, my hunch is that it is going to be logarithmic instead of exponential, mostly down to the availability of data. It might go beyond human intelligence, but I don't think it will reach a singularity.

advael 2y ago

To be honest, I think a lot of smart people are willing to believe in magic when they've demonstrated some strong capability and the people funding their company want magic to happen.

woopsn 2y ago

They do. Altman is saying their tech may be poised to capture the sum of all value in Earth's future light cone.

Saying "well that is not physically impermissible" doesn't make it real.

In any case nobody has ever shown that recursive self-improvement "takes off", and nor is that what we should expect a priori.

mattnewton 2y ago

I upvoted because this was my first thought too, but reading the abstract and skimming the paper makes me think it’s not really an advance for general recursive improvement. I think the title makes people think this is a text -> model model, when it is really a bunch of model weights -> new model weights optimizer for a specific architecture and problem. Still a potentially very useful idea for learning from a bunch of training runs and very interesting work!

fnordpiglet 2y ago

I suspect this is useful for porting one vector space to another which is an open problem when you’ve trained one model with one architecture and need to port it to another architecture without paying the full retraining cost.

GuB-42 2y ago

Doesn't look that different from what we are already doing. For example, AlphaGo/AlphaZero/MuZero learn to play board games by playing repeatedly against themselves; it is a self-improvement loop leading to superhuman play. It was a major breakthrough for the game of Go, and it led to advances in the field of machine learning, but we are still far from something resembling a technological singularity.

GANs are another example of self-improvement. It was famous for creating "deep fakes". It works by pitting a fake generator and a fake detector against each other, resulting in a cycle of improvement. It didn't get much further than that, in fact, it is all about attention and transformers now.

This is just a way of optimizing parameters, it will not invent new techniques. It can say "put 1000 neurons there, 2000 there, etc...", but it still has to pick from what designers tell it to pick from. It may adjust these parameters better than a human can, leading to more efficient systems, I expect some improvement to existing systems, but not a breaking change.

pests 2y ago

Go and Chess still have rules that are hard-coded, which at least gives a framework to optimize in. What rules do you give an LLM?

drdeca 2y ago

Some sort of "generate descriptions of novel tasks including ways to evaluate performance at those tasks, evaluate quality of the generated tasks+evaluation-metrics, split tasks into subtasks, estimate difficulty of tasks in a way that is judged on how it compares to a combined estimated difficulty of generated subtasks and to actual success rate and quality" sort of deal?

spangry 2y ago

Physics.

AgentME 2y ago

The real magic of recursive self improvement happens only after you have human-level AI that is able to match and surpass human ability in designing AI architectures. Escape-velocity-breaking recursive self improvement doesn't look like a human-made architecture being trained further, it looks like an AI understanding why transformers/etc were successful and coming up with an advancement over transformers.

philsnow 2y ago

A rare opportunity for the other four-letter comic to be applicable: http://smbc-comics.com/comic/2011-12-13

(Though I suppose this skips Neuralink / step 3 and jumps right to step 4.)

bamboozled 2y ago

The ai is ready to take off to perfection land

goggy_googy 2y ago

"We synthesize 100 novel parameters by feeding random noise into the latent diffusion model and the trained decoder." Cool that patterns exist at this level, but also, 100 params means we have a long way to go before this process is efficient enough to synthesize more modern-sized models.

Scene_Cast2 2y ago

Yay, an alternative to backprop & SGD! Really interesting and impressive finding, I was surprised that the network generalizes.

justanotherjoe 2y ago

fuck. I have an idea just like this one. I guess it's true that ideas are a dime a dozen. Diffusions bear a remarkable similarity to backpropagation to me. I thought that it could be used in place of it for some parts of a model.

Furthermore, I posit that resnet-style residual connections, especially in transformers, let the model behave in a more exploratory way that is really powerful and is a necessary component of the power of transformers. The transformer is just such a great architecture the more I think about it. It's doing so many things so right. Although this is not really related to the topic.

crotchfire 2y ago

Actually it is related.

Transformers are just networks that learn to program the weights of other networks [1]. In the successful cases the programmed network has been quite primitive -- merely a key-value store -- in order to ensure that you can backpropagate errors from the programmed network's outputs all the way to the programmer network's inputs.

The present work extends this idea to a different kind of programmed network: a convolutional image-processing network.

There are many more breakthroughs to be achieved along this line of research -- it is a rich vein to mine. I believe our best shot at getting neural networks to do discrete math and symbolic logic, and to write nontrivial computer programs, will result from this line of research.

[1] https://arxiv.org/abs/2102.11174

goggy_googy 2y ago

Important to note, they say "From these generated models, we select the one with the best performance on the training set." Definitely potential for bias here.
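A rough way to see that bias, as a sketch under stated assumptions: suppose every generated model has the same true accuracy, and each measured score is an independent noisy estimate of it. Picking the best training score then looks better than the truth purely from selection. The function names, the 0.90 accuracy, and the sample sizes are hypothetical and do not come from the paper.

```python
import random


def measured_accuracy(true_acc, n_samples, rng):
    # Fraction of n_samples answered correctly by a model whose
    # real accuracy is true_acc (each sample is a Bernoulli draw).
    hits = sum(rng.random() < true_acc for _ in range(n_samples))
    return hits / n_samples


def select_best_on_train(n_models=100, true_acc=0.90,
                         n_train=500, n_test=500, seed=0):
    rng = random.Random(seed)
    # Every model is equally good; only the measurement noise differs.
    train_scores = [measured_accuracy(true_acc, n_train, rng)
                    for _ in range(n_models)]
    best_train = max(train_scores)
    mean_train = sum(train_scores) / n_models
    # Fresh held-out measurement for the selected model.
    held_out = measured_accuracy(true_acc, n_test, rng)
    return best_train, mean_train, held_out


if __name__ == "__main__":
    best, mean, held_out = select_best_on_train()
    print(f"best train={best:.3f} mean train={mean:.3f} held-out={held_out:.3f}")
```

The gap between the best and the mean training score here is pure selection effect, which is why a held-out number for the selected model would be the more informative one to report.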

nerdponx 2y ago

I'd have liked to see the distribution of generated model performance.

QuadmasterXLII 2y ago

Fig 4b

marojejian 2y ago

Am I missing something, or is this just a case of "amortized inference", where you train a model (here a diffusion one) to infer something that was previously found via an optimization procedure (here NN parameters)?

jackblemming 2y ago

The state-of-the-art neural net architecture, whether that be transformers or the like, trained on self-play to optimize non-differentiable but highly efficient architectures is the way.

hackerlight 2y ago

According to Hinton, before transformers were shown to work well, learning model architectures was Google's main focus.

hoc 2y ago

Hm, so does this actually improve/condense the representation for certain applications, or is this more some kind of global expand and collect in network space?

jarrell_mark 2y ago

Can this be used to fill in the missing information in the OpenWorm 302-neuron nematode brain simulator?

amelius 2y ago

Why does Figure 7 not include a validation curve (afaict only the training curve is shown)?

nullc 2y ago

heh https://news.ycombinator.com/item?id=39208213#39211749

HanClinto 2y ago

hah, nice! :D

t_serpico 2y ago

I'd wager that adding noise to the weights in a principled fashion would accomplish something similar to this.
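For concreteness, the simplest version of that noise baseline might look like the sketch below. Everything in it is an assumption for illustration: a two-parameter linear model stands in for a network, the noise is plain isotropic Gaussian (nothing "principled" yet), and `best_noisy_copy` is a hypothetical helper name.

```python
import random


def mse(w, data):
    # Mean squared error of the line y = w[0] * x + w[1] on (x, y) pairs.
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in data) / len(data)


def best_noisy_copy(w, data, sigma=0.1, n_copies=100, seed=0):
    # Add independent Gaussian noise to every weight, keep the lowest-loss copy.
    rng = random.Random(seed)
    copies = [[wi + rng.gauss(0.0, sigma) for wi in w] for _ in range(n_copies)]
    best = min(copies, key=lambda c: mse(c, data))
    return best, mse(best, data)


if __name__ == "__main__":
    data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
    trained = [1.8, 1.2]  # stand-in for imperfectly trained weights
    print(mse(trained, data), best_noisy_copy(trained, data)[1])
```

Comparing this kind of baseline against the generated models at matched candidate counts would be one way to test the wager.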

jerpint 2y ago

I would really be surprised if just adding noise would give you convergence.
