Not blaming you, but asking as I don’t usually have access to professionals working with video training.
> Looking back, we should have just filtered out these samples from the dataset and moved on.
I hadn't even considered checking whether poor data quality was behind the failure to reproduce. This is a good gotcha to look out for. Appreciate the deep dive here!
The kind of post I like so much on HN: it tickles your mind but is still clear enough for an advanced beginner.
I've done some work in this area and here are my two cents:
1. Convolution-based architectures are terrible: the convolutional autoencoders I've trained almost never scaled well. Lately I've switched to transformer-based AEs and they are so much better! We even managed to get Chinchilla-style scaling laws out of transformer AEs.
2. VAEs are terrible for downstream tasks: we've trained video diffusion models on top of both MAE and VAE latents (same architecture) and the MAE is hands-down better.
3. This whole field is not science: there is no rigorous way of defining what a "good latent" really is. End-to-end methods (such as PixNerd) are the future, since they eliminate the need to hand-design and optimize the interface between separate components. That said, I've never seen a neural-field-based video model; my own limited experiments in that direction gave underwhelming results.
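To make point 1's "Chinchilla-style scaling laws" concrete: in practice you train AEs at several sizes and fit a power law L(N) = E + A·N^(−α) to the (parameter count, loss) pairs. Here's a minimal sketch of that fit with NumPy; all the numbers are synthetic stand-ins, not measurements from any real run, and the grid-search approach is just one simple way to do it:

```python
import numpy as np

# Synthetic (model size, loss) data drawn from a known power law
# L(N) = E + A * N**(-alpha), plus ~1% multiplicative noise.
rng = np.random.default_rng(0)
true_E, true_A, true_alpha = 0.05, 12.0, 0.30
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])          # parameter counts
L = true_E + true_A * N ** (-true_alpha)
L = L * (1 + 0.01 * rng.standard_normal(L.shape))     # observation noise

def fit_power_law(N, L, alphas=np.linspace(0.05, 1.0, 400)):
    """Grid-search alpha; for each fixed alpha the model is linear
    in (E, A), so we solve that part by least squares."""
    best = None
    for a in alphas:
        X = np.stack([np.ones_like(N), N ** (-a)], axis=1)
        coef, *_ = np.linalg.lstsq(X, L, rcond=None)
        sse = float(np.sum((X @ coef - L) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], a)
    _, E, A, alpha = best
    return E, A, alpha

E, A, alpha = fit_power_law(N, L)
print(f"E={E:.3f}  A={A:.2f}  alpha={alpha:.3f}")
```

With clean data the fit recovers the exponent closely; with real training runs you'd also fit the data-size term (B·D^(−β)) the Chinchilla paper uses, which this sketch omits.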
Thanks for the great write-up and for making it available to us all.