In DreamFusion they do use a NeRF representation.
Nice.
OpenAI shitting their pants even more.
What they don't do is release the actual models and datasets, and it's very expensive to retrain those.
They also released the Whisper model and code[2]
btw I like how it hallucinated a bumper-mounted spare wheel based on the tire size, heavy-duty roof rack, and bull bars, while the ground-truth render was in a much less likely configuration: stock undercarriage frame hanger, no spare.
On the other hand, diffusion models can learn fairly arbitrary distributions of signals, so by exploiting this learned prior together with view consistency, they can be much more sample efficient than ordinary NeRFs. Without learning such a prior, 3D reconstruction from a single image is extremely ill-posed (much like monocular depth estimation).
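To make that concrete with a toy sketch (my own illustration, not DreamFusion's actual score-distillation loss): a learned prior score plus a view-consistency term can resolve dimensions that a single view leaves completely unconstrained. Here the "scene" is just a 2D point and the one available "view" observes only its first coordinate; the prior (a standard Gaussian, so its score is simply -x) picks the value for the unobserved coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_score(x):
    # score of a standard Gaussian prior: grad log p(x) = -x
    return -x

A = np.array([[1.0, 0.0]])   # projection: the single view only sees x[0]
y = np.array([2.0])          # observed view

x = rng.normal(size=2)       # initial scene parameters
lr = 0.1
for _ in range(500):
    view_grad = A.T @ (y - A @ x)              # pull toward matching the view
    x += lr * (view_grad + 0.1 * prior_score(x))

# x[0] ends up near the observation; x[1], which the view says nothing
# about, is resolved by the prior (pulled toward 0) instead of staying
# at its arbitrary initial value.
```

Without the prior term, any value of x[1] fits the observation equally well, which is exactly the ill-posedness of single-image reconstruction; the prior is what breaks the tie.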
Asking as someone who's dreadfully slow at 3d modeling.
https://blogs.nvidia.com/blog/2022/09/23/3d-generative-ai-re...
https://research.nvidia.com/publication/2021-11_extracting-t...