Denoising Diffusion models from first principle in Julia (opens in new tab)

(liorsinai.github.io)

109 pointsthe_origami_fox3y ago21 comments

21 comments

This claims to explain diffusion models from first principles, but the issue with explaining how they work is we don't know how they work.

The explanation in the original paper turns out not to be true; you can get rid of most of their assumptions and it still works: https://arxiv.org/abs/2208.09392

espadrine3y ago

> The explanation in the original paper turns out not to be true; you can get rid of most of their assumptions and it still works

I’ll admit it is amusing that some assumptions on why it works were incorrect. The core idea of a Markov chain[0] where each state change leads to higher likelihood, is bound to work, even if the rest doesn’t.

In my mind, the Muse paper[1] gets closer to why it works: ultimately, the denoiser tries to match the latent space for an implicit encoder. The Muse system does this more directly and more effectively, by using cross-entropy loss on latent tokens instead.

In a way, the whole problem is no different from a language translation task. The only difference is that the output needs to be decoded into pixels instead of BPE tokens.

[0]: https://arxiv.org/abs/1503.03585

[1]: https://arxiv.org/abs/2301.00704

mlajtos3y ago

Huh, that is quite fascinating paper – we can learn to invert any image degradation and use it as generative model? Hm. Is there any research of using some U-net as the degradation function?

throwitfaar3y ago

Thanks for the link, cold diffusion is a great idea. Only 2 comments on HN about it four months ago.

adgjlsfhk13y ago

This is really cool, and I think possibly the best demonstration I've seen in a while of the power of Julia for tasks outside of science. The ability to use a high level fast and flexible language instead of being forced to compromise on a slow high level language that wraps fast but rigid libraries allows you to do some really cool stuff.

mk_stjames3y ago

Just in case the author is reading: in the introduction, you use the term "virtual RAM" to describe the GPU VRAM needed for running stable diffusion, but what that VRAM actually stands for is 'Video RAM'.

the_origami_foxOP3y ago

Auuthor here. Will update.

adammarples3y ago

I get the error trying to run MNIST(:train) that modules are not callable. When I search for the error I see a link to the repo issues raising this issue but the issue itself is missing. Was it deleted rather than closed? Weird.

radarsat13y ago

I've been trying out developing my own diffusion models from scratch lately, to understand this approach better and to compare against similar trials I previously did with GAN. My impression from reading posts like these was that it would be relatively easy once you understand it.. with the advantage that you get a nice normal supervised MSE target to train against, instead of having to deal with the instabilities of GANs.

I have found in practice that they do not deliver on this front. The loss curve you get is often just a big thick noisy straight line, completely devoid of information about whether it's converging. And the convergence seems to be greatly dependent on model choices and the beta schedule you choose. It's not clear to me at all how to choose those things in a principled manner. Until you train for a long time, you just basically get noise, so it's hard to know when to restart an experiment or keep going. Do I need 10 steps, 100, 1000? I found that training even longer, and longer, and longer, it does get better and better, very slowly, even though this is not displayed in the loss curve, and there seems to be no indication of when the model has "converged" in any meaningful sense. My understanding of why this is the case is that due to the integrative nature of the sampling process, even tiny errors in approximating the noise add up to large divergences.

I've also tried making it conditional on vector quantization codes, and it seems to fail to use them nearly as well as VQGAN does. At least I haven't had much success doing it directly in the diffusion model. After reading more into it, I found that most diffusion-based models actually use a conditional GAN to develop a latent space and a decoder, and the diffusion model is used to generate samples in the latent space. This strikes me that the diffusion model then can never actually do better than the associated GAN's decoder, which surprised me to realize since it's usually proposed as an alternative to GAN.

So, overall I'm failing to grasp the advantages this approach really has over just using a GAN. Obviously it works fantastically for these large scale generative projects, but I don't understand why it's better, to be honest, despite having read every article out there telling me again and again the same things about how it works. E.g. DALLE-1 used VQGAN, not diffusion, and people were pretty wowed by it. I'm not sure why DALLE-2's improvements can be attributed to their change to a diffusion process, if they are still using a GAN to decode the output.

Looking for some intuition, if anyone can offer some. I understand that the nature of how it iteratively improves the image allows it to deduce large-scale and small-scale features progressively, but it seems to me that the many upscaling layers of a large GAN can do the same thing.

the_origami_foxOP3y ago

Author here. I have noticed similar behaviour. As part of this exercise I tried to train a model to generate Pokemon based on This Pokemon Does Not Exist by HuggingFace. However my models only converged to nosiy smudges after 50 iterations and so I excluded it from the posts (I do mention my experiments at the end of part 2).

My first assumption was that the mdoel I was training was too small: 13 million parameters as opposed to the 1.3 billion in ruDALL-E (not sure how much of this is only the diffusion model). So that's a 100x smaller. I want to experiment with upscaling it.

Reading this I'm wondering if there's more I need to do. For example, training a conditioned model - "cheat" by given it in the index of the Pokemon during training but then you sample without an index - or make the model predict the standard deviation (beta tilde). Or as you say, work with loss functions.

More work to be done here.

GaggiX3y ago

Dalle 2 does not use any adversarial loss (so no GAN), it uses a text2image diffusion based model and two diffusion based upscaler, VQGAN is an autoencoder, alone it can't do much, Dalle 1 works thx to the autoregressive model (also no GAN), Stable Diffusion uses an autoencoder because running a diffusion model on a 1024/768/512 image is really inefficient as the model has no bottleneck, the autoencoder has an adversarial loss but upscaling a 64x64x4 latent up to a 512x512x3 image is a much simpler job than generating the 64x64x4 from scratch, that's why you need a diffusion or an autoregressive model as a base.

radarsat13y ago

Thanks for the corrections, I was including autoencoders that use an additional adversarial loss (such as VQGAN) when I said GAN.

> Dalle 1 works thx to the autoregressive model (also no GAN)

It uses an autoregressive model to predict codes for a pretrained VQGAN, doesn't it?

Doesn't Stable Diffusion's autoencoder also use an adversarial loss? Otherwise wouldn't it suffer the typical blurring problems well known to MSE?

1 more reply

siraben3y ago

What datasets were you using, how large was the model and what was the noise schedule? I’ve been contemplating implementing my own from scratch as well. I’m surprised that training with conditional labels did not help as much.

radarsat13y ago

Data is mel spectrograms. To be clear about the conditional labels, I was trying to get it to come up with a vector quantized code, so it's not conditionally labeled but rather I was using an embedding layer with a VQ layer to have it come up with its own codebook. This works well with VQGAN so I was surprised that for diffusion it just keeps setting all the codes to the same value and ignoring them, but maybe I'm doing something wrong. Still working on it.

I'm just expressing here that my expectation was that this method would be less finicky than GAN because it uses an MSE loss, but unfortunately it seems to have its own difficulties. No silver bullet, I guess. The integration sampling can be quite sensitive to imperfections and diverge easily, at least in early stages of training.

I decided to write this because it feels like the early days of GAN where overall there seems to be lots of these "explain diffusion from scratch" type articles out there, but not yet a lot discussing common pitfalls and how to deal with them.

2 more replies

KRAKRISMOTT3y ago

Are there any hosted Julia ML services like Grid.ai for Python?

huitzitziltzin3y ago

Julia Computing offers Julia Hub. Not sure if that’s what you’re looking for but maybe.

IlyaOrson3y ago

Awesome post!

j / k navigate · click thread line to collapse

21 comments

astrange3y ago

This claims to explain diffusion models from first principles, but the issue with explaining how they work is we don't know how they work.

The explanation in the original paper turns out not to be true; you can get rid of most of their assumptions and it still works: https://arxiv.org/abs/2208.09392

espadrine3y ago

> The explanation in the original paper turns out not to be true; you can get rid of most of their assumptions and it still works

In a way, the whole problem is no different from a language translation task. The only difference is that the output needs to be decoded into pixels instead of BPE tokens.

[0]: https://arxiv.org/abs/1503.03585

[1]: https://arxiv.org/abs/2301.00704

mlajtos3y ago

Huh, that is quite fascinating paper – we can learn to invert any image degradation and use it as generative model? Hm. Is there any research of using some U-net as the degradation function?

throwitfaar3y ago

Thanks for the link, cold diffusion is a great idea. Only 2 comments on HN about it four months ago.

adgjlsfhk13y ago

mk_stjames3y ago

the_origami_foxOP3y ago

Auuthor here. Will update.

adammarples3y ago

radarsat13y ago

the_origami_foxOP3y ago

More work to be done here.

GaggiX3y ago

radarsat13y ago

Thanks for the corrections, I was including autoencoders that use an additional adversarial loss (such as VQGAN) when I said GAN.

> Dalle 1 works thx to the autoregressive model (also no GAN)

It uses an autoregressive model to predict codes for a pretrained VQGAN, doesn't it?

Doesn't Stable Diffusion's autoencoder also use an adversarial loss? Otherwise wouldn't it suffer the typical blurring problems well known to MSE?

1 more reply

siraben3y ago

radarsat13y ago

2 more replies

KRAKRISMOTT3y ago

Are there any hosted Julia ML services like Grid.ai for Python?

huitzitziltzin3y ago

Julia Computing offers Julia Hub. Not sure if that’s what you’re looking for but maybe.

IlyaOrson3y ago

Awesome post!

j / k navigate · click thread line to collapse