Genie: Generative Interactive Environments (opens in new tab)

(sites.google.com)

82 pointskuter2y ago16 comments

16 comments

> Genie is capable of converting a variety of different prompts into interactive, playable environments that can be easily created, stepped into, and explored

If these are generating a fully interactive environments, why are all the clips ~1 second long?

Based on the first sentence in your paper, I would have expected a playable example as a demo. Or 20.

But reading a bit further into the paper, it sounds like the model needs to be actively running inference and will generate the next frame on the fly as actions are taken- is that correct?

jparkerholder2y ago

That is correct! The model generates each frame on the fly.

polygamous_bat2y ago

Firstly, do these models learn a good physics grounding for nonsense actions? Like keep pressing down even when you are in the ground? Or will they phase you through the ground?

Secondly, why are all videos like half a second long? I thought video generation came much farther than this. My guess would be that the world models unravel at any length longer than that, which is (and has always been) the problem with models such as these. Minus the video generation part, we had pretty good world models for games already, see Dreamer line of work: https://danijar.com/project/dreamerv3/

jparkerholder2y ago

Author here :) Re: 1) typically no, but of course it can hallucinate just like LLMs. 2) Agreed but the key point missing is Dreamer is trained from an RL environment with action labels. Genie is trained exclusively from videos and learns an action space. This is the first version of something that is now possible and will only improve with scale.

polygamous_bat2y ago

Thanks for braving the crowd here, you will unfortunately only find hard questions.

Anyway, about my second question: why are the videos only half second ish long? Does the model unravel after that?

Also

> This is the first version of something that is now possible and will only improve with scale.

11b params is already pretty large considering the stable diffusion and LLM scale. How much higher do we need to scale until we get something useful beyond simple setups?

jparkerholder2y ago

The bigger issue is lack of generating novel content rather than a total "unravel". We focus on OOD images because our motivation is generating diverse environments, but these are much harder to play for longer vs images closer to the training videos. It is interesting because one of the things you gain when going from 1B->10B is the OOD images working at all. Note it is not even trivial to detect the character given our model does not train with any labels or have any inductive biases to do so.

Point of clarification -- we don't expect bigger models to be the only way to improve this and are working on innovations on the modeling side, however we don't want to overlook the significance of scaling either :)

1 more reply

nycdatasci2y ago

The results seem quite bad. Compare the static image and "game" in this one example: Static Image: https://lh3.googleusercontent.com/c0GV4hG0Xg0eqpsUS1z62v6aJ2... "Game": https://lh5.googleusercontent.com/L_WsAa1saPmj29DSKda_fzk15y...

In the video, the character becomes a pixelated mess. In the static image, the character is clearly on rocks in the foreground, but in the "game" we see the character magically jumping from the foreground rocks to the background structure which also contains significant distortions.

The extremely short demo videos make it slightly harder to catch these obvious issues.

polygamous_bat2y ago

What is the video resolution, 64x64? And even then it becomes blurry. Seems like another Google flag-plant-y paper filled with hot air that we will never see the source code or model for because it will expose how poor its capabilities are relative to competitors.

The internal politics at these places must be exhausting. Industry research was supposed to be free from the publish or perish mindset, but it seems like it just got replaced by a different kind of need for posturing.

jparkerholder2y ago

Hey author here :) First, tough crowd, love it, always great to get feedback because we are actively working on improving the model. We are very happy to admit it is not perfect, but given not many people thought this was possible a year ago, I am quite excited to see the next step of improvement. This is like the GPT1 of foundation world models, and we have a fair few ideas in the works to speed up progress.

The resolution is 90p but we use an upsampler to make it 360p for examples on the website.

nullptr_deref2y ago

How can I get started with this kind of research? Is it even possible without a PhD? Thanks.

1 more reply

sqreept2y ago

I've read twice the announcement and I can't tell what this is good for. Can you please dumb it down for me?

snide2y ago

I'm old an immediately assumed this would link to historical retrospective of GEnie

https://en.wikipedia.org/wiki/GEnie

mdrzn2y ago

Seems very interesting, but as soon as I see "Google Research" or "Deepmind" now it's an instant turndown. Too much PR, not enough substance. Not targeting directly you guys with this research, but the company you work for.

joloooo2y ago

Looking forward to following your progress. I've been wanting to see how we might replace polygons for gaming long term, this seems like a step in the right direction.

j / k navigate · click thread line to collapse

16 comments

jasonjmcghee2y ago

> Genie is capable of converting a variety of different prompts into interactive, playable environments that can be easily created, stepped into, and explored

If these are generating a fully interactive environments, why are all the clips ~1 second long?

Based on the first sentence in your paper, I would have expected a playable example as a demo. Or 20.

But reading a bit further into the paper, it sounds like the model needs to be actively running inference and will generate the next frame on the fly as actions are taken- is that correct?

jparkerholder2y ago

That is correct! The model generates each frame on the fly.

polygamous_bat2y ago

Firstly, do these models learn a good physics grounding for nonsense actions? Like keep pressing down even when you are in the ground? Or will they phase you through the ground?

jparkerholder2y ago

polygamous_bat2y ago

Thanks for braving the crowd here, you will unfortunately only find hard questions.

Anyway, about my second question: why are the videos only half second ish long? Does the model unravel after that?

Also

> This is the first version of something that is now possible and will only improve with scale.

11b params is already pretty large considering the stable diffusion and LLM scale. How much higher do we need to scale until we get something useful beyond simple setups?

jparkerholder2y ago

1 more reply

nycdatasci2y ago

The extremely short demo videos make it slightly harder to catch these obvious issues.

polygamous_bat2y ago

jparkerholder2y ago

The resolution is 90p but we use an upsampler to make it 360p for examples on the website.

nullptr_deref2y ago

How can I get started with this kind of research? Is it even possible without a PhD? Thanks.

1 more reply

sqreept2y ago

I've read twice the announcement and I can't tell what this is good for. Can you please dumb it down for me?

snide2y ago

I'm old an immediately assumed this would link to historical retrospective of GEnie

https://en.wikipedia.org/wiki/GEnie

mdrzn2y ago

joloooo2y ago

Looking forward to following your progress. I've been wanting to see how we might replace polygons for gaming long term, this seems like a step in the right direction.

j / k navigate · click thread line to collapse