This reminds me of how the first generation of these kinds of image generators were said to be 'dreaming'. It also makes me wonder whether our brains really work like these algorithms (or whether these algorithms mimic the brain that closely).
This is fascinating. It's able to pick up sufficiently on the fundamentals of 3D motion from 2D videos, while only needing static images with descriptions to infer semantics.
What's the point then?
You can recreate things from papers fine. I've done it for several projects; it's often nicer than just copy-pasting in code, and it fixes issues where one side is using Montreal's AI toolkit, another is using pytorch, and yet another is using keras.
That said, for a tool like this they clearly used pre-trained models as a large component, ones with publicly accessible weights as well. So a replication will probably appear in the coming months if Meta doesn't release the code, which they (understandably) might not, since they very clearly plan to use it for their own Metaverse product.
Often there's a paper deadline and the code still needs tidying up, or the same codebase supports additional models that are published in additional papers.
Keep an eye on the facebookresearch GitHub for this in the next few months.