Not blaming you, but asking as I don’t usually have access to professionals working with video training.
> Looking back, we should have just filtered out these samples from the dataset and moved on.
I hadn't even considered checking whether poor data quality was behind the failure to reproduce. This is a good gotcha to look out for. Appreciate the deep dive here!
The kind of post I like so much on HN: it tickles your mind but is still clear enough for an advanced beginner.
I've done some work in this area and here are my two cents:
1. Convolution-based architectures are terrible: the convolutional autoencoders I've trained almost never scaled well. Lately I've switched to transformer-based AEs and they are so much better! We even managed to get Chinchilla-style scaling laws out of transformer AEs.
2. VAEs are terrible for downstream tasks: we've trained video diffusion models on top of both MAE and VAE latents (same architecture) and the MAE is hands-down better.
3. This whole field is not science: there is no rigorous way of defining what a "good latent" really is. End-to-end methods (such as PixNerd) are the future, since they eliminate the need to hand-design and optimize the interface between separate components. That said, I've never seen a neural-field-based video model; my own limited experiments in that direction gave underwhelming results.
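To make point 1's "Chinchilla-style scaling laws" concrete: in practice you train AEs at several sizes and fit a power law L(N) = E + A·N^(−α) to the (parameter count, loss) pairs. Here's a minimal sketch of that fit with NumPy; all the numbers are synthetic stand-ins, not measurements from any real run, and the grid-search approach is just one simple way to do it:

```python
import numpy as np

# Synthetic (model size, loss) data drawn from a known power law
# L(N) = E + A * N**(-alpha), plus ~1% multiplicative noise.
rng = np.random.default_rng(0)
true_E, true_A, true_alpha = 0.05, 12.0, 0.30
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])          # parameter counts
L = true_E + true_A * N ** (-true_alpha)
L = L * (1 + 0.01 * rng.standard_normal(L.shape))     # observation noise

def fit_power_law(N, L, alphas=np.linspace(0.05, 1.0, 400)):
    """Grid-search alpha; for each fixed alpha the model is linear
    in (E, A), so we solve that part by least squares."""
    best = None
    for a in alphas:
        X = np.stack([np.ones_like(N), N ** (-a)], axis=1)
        coef, *_ = np.linalg.lstsq(X, L, rcond=None)
        sse = float(np.sum((X @ coef - L) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], a)
    _, E, A, alpha = best
    return E, A, alpha

E, A, alpha = fit_power_law(N, L)
print(f"E={E:.3f}  A={A:.2f}  alpha={alpha:.3f}")
```

With clean data the fit recovers the exponent closely; with real training runs you'd also fit the data-size term (B·D^(−β)) the Chinchilla paper uses, which this sketch omits.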
Thanks for the great write-up and for making it available to us all.