Diffusion models are real-time game engines (opens in new tab)

(gamengen.github.io)

1149 pointsjmorgan1y ago409 comments

409 comments

So, this is surprising. Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected, which would be roughly ‘none’. Google here uses SD 1.4, as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.

The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing doom (makes sense), and 2) they added Gaussian noise to source frames and rewarded the agent for ‘correcting’ sequential frames back, and said this was critical to get long range stable ‘rendering’ out of the model.

That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.

Finally, I wonder if this model would be easy to fine tune for ‘photo realistic’ / ray traced restyling — I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a doom foundation model of sorts.

Anyway, a fun idea that worked! Love those.

wavemode1y ago

> Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected

To temper this a bit, you may want to pay close attention to the demo videos. The player rarely backtracks, and for good reason - the few times the character does turn around and look back at something a second time, it has changed significantly (the most noticeable I think is the room with the grey wall and triangle sign).

This falls in line with how we'd expect a diffusion model to behave - it's trained on many billions of frames of gameplay, so it's very good at generating a plausible -next- frame of gameplay based on some previous frames. But it doesn't deeply understand logical gameplay constraints, like remembering level geometry.

dewarrn11y ago

Great observation. And not entirely unlike normal human visual perception which is notoriously vulnerable to missing highly salient information; I'm reminded of the "gorillas in our midst" work by Dan Simons and Christopher Chabris [0].

[0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...

lawlessone1y ago

I reminds me of dreaming. When you do something and turn back to check it has turned into something completely different.

edit: someone should train it on MyHouse.wad

robotresearcher1y ago

Not noticing to a gorilla that ‘shouldn’t’ be there is not the same thing as object permanence. Even quite young babies are surprised by objects that go missing.

1 more reply

bamboozled1y ago

Are you saying if I turn around, I’ll be surprised at what I find ? I don’t feel like this is accurate at all.

3 more replies

throwway_2783141y ago

Work which exaggerates the blindness.

The people were told to focus very deeply on a certain aspect of the scene. Maintaining that focus means explicitly blocking things not related to that focus. Also, there is social pressure at the end to have peformed well at the task; evaluating them on a task which is intentionally completely different than the one explicitly given is going to bias people away from reporting gorillas.

And also, "notice anything unusual" is a pretty vague prompt. No-one in the video thought the gorillas were unusual, so if the PEOPLE IN THE SCENE thought gorillas were normal, why would I think they were strange? Look at any TV show, they are all full of things which are pretty crazy unusual in normal life, yet not unusual in terms of the plot.

Why would you think the gorillas were unusual?

2 more replies

nmstoker1y ago

I saw a longer video of this that Ethan Mollick posted and in that one, the sequences are longer and they do appear to demonstrate a fair amount of consistency. The clips don't backtrack in the summary video on the paper's home page because they're showing a number of district environments but you only get a few seconds of each.

If I studied the longer one more closely, I'm sure inconsistencies would be seen but it seemed able to recall presence/absence of destroyed items, dead monsters etc on subsequent loops around a central obstruction that completely obscured them for quite a while. This did seem pretty odd to me, as I expected it to match how you'd described it.

wavemode1y ago

Yes it definitely is very good for simulating gameplay footage, don't get me wrong. Its input for predicting the next frame is not just the previous frame, it has access to a whole sequence of prior frames.

But to say the model is simulating actual gameplay (i.e. that a person could actually play Doom in this) is far fetched. It's definitely great that the model was able to remember that the gray wall was still there after we turned around, but it's untenable for actual gameplay that the wall completely changed location and orientation.

2 more replies

whiteboardr1y ago

But does it need to be frame-based?

What if you combine this with an engine in parallel that provides all geometry including characters and objects with their respective behavior, recording changes made through interactions the other model generates, talking back to it?

A dialogue between two parties with different functionality so to speak.

(Non technical person here - just fantasizing)

robotresearcher1y ago

In that scheme what is the NN providing that a classical renderer would not? DOOM ran great on an Intel 486, which is not a lot of computer.

2 more replies

bee_rider1y ago

In that case, the title of the article wouldn’t be true anymore. It seems like a better plan, though.

beepbooptheory1y ago

What would the model provide if not what we see on the screen?

1 more reply

mensetmanusman1y ago

That is kind of cool though, I would play like being lost in a dream.

If on the backend you could record the level layouts in memory you could have exploration teams that try to find new areas to explore.

debo_1y ago

It would be cool for dream sequences in games to feel more like dreams. This is probably an expensive way to do it, but it would be neat!

codeflo1y ago

Even purely going forward, specks on wall textures morph into opponents and so on. All the diffusion-generated videos I’ve seen so far have this kind of unsettling feature.

bee_rider1y ago

It it like some kind of weird dream doom.

hoosieree1y ago

Small objects like powerups appear and disappear as the player moves (even without backtracking), the ammo count is constantly varying, getting shot doesn't deplete health or armor, etc.

TeMPOraL1y ago

So for the next iteration, they should add a minimap overlay (perhaps on a side channel) - it should help the model give more consistent output in any given location. Right now, the game is very much like a lucid dream - the universe makes sense from moment to moment, but without outside reference, everything that falls out of short-term memory (few frames here) gets reimagined.

Groxx1y ago

There's an example right at the beginning too - the ammo drop on the right changes to something green (I think that's a body?)

Workaccount21y ago

I don't see this as something that would be hard to overcome. Sora for instance has already shown the ability for a diffusion model to maintain object permanence. Flux recently too has shown the ability to render the same person in many different poses or images.

idunnoman12221y ago

Where does a sora video turn around backwards? I can’t maintain such consistency in my own dreams.

1 more reply

idunnoman12221y ago

Where does a sora video turn around backwards? I don’t even maintain such consistency in my dreams.

nielsbot1y ago

You can also notice in the first part of the video the ammo numbers fluctuate a bit randomly.

alickz1y ago

is that something that can be solved with more memory/attention/context?

or do we believe it's an inherent limitation in the approach?

noiv1y ago

I think the real question is does the player get shot from behind?

1 more reply

refibrillator1y ago

Just want to clarify a couple possible misconceptions:

The diffusion model doesn’t maintain any state itself, though its weights may encode some notion of cause/effect. It just renders one frame at a time (after all it’s a text to image model, not text to video). Instead of text, the previous states and frames are provided as inputs to the model to predict the next frame.

Noise is added to the previous frames before being passed into the SD model, so the RL agents were not involved with “correcting” it.

De-noising objectives are widespread in ML, intuitively it forces a predictive model to leverage context, ie surrounding frames/words/etc.

In this case it helps prevent auto-regressive drift due to the accumulation of small errors from the randomness inherent in generative diffusion models. Figure 4 shows such drift happening when a player is standing still.

rvnx1y ago

The concept is that if you train a Diffusion model by feeding all the possible frames seen in the game.

The training was over almost 1 billion frames, 20 days of full-time play-time, taking a screenshot of every single inch of the map.

Now you show him N frames as input, and ask it "give me frame N+1", then it gives you the frame n. N+1 back based on how it was originally seen during training.

But it is not frame N+1 from a mysterious intelligence, it's simply frame N+1 given back from past database.

The drift you mentioned is actually a clear (but sad) proof that the model does not work at inventing new frames, and can only spit out an answer from the past dataset.

It's a bit like if you train stable diffusion on Simpsons episodes, and that it outputs the next frame of an existing episode that was in the training set, but few frames later goes wild and buggy.

mensetmanusman1y ago

Research is the acquisition of knowledge that may or may not have practical applications.

They succeeded in the research, gained knowledge, and might be able to do something awesome with it.

It’s a success even if they don’t sell anything.

1 more reply

jetrink1y ago

I don't think you've understood the project completely. The model accepts player input, so frame 601 could be quite different if the player decided to turn left rather than right, or chose that moment to fire at an exploding barrel.

1 more reply

nine_k1y ago

But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".

I would call it the world's least efficient video compression.

What I would like to see is the actual predictive strength, aka imagination, which I did not notice mentioned in the abstract. The model is trained on a set of classic maps. What would it do, given a few frames of gameplay on an unfamiliar map as input? How well could it imagine what happens next?

PoignardAzur1y ago

> But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".

It's not super clear from the landing page, but I think it's an engine? Like, its input is both previous images and input for the next frame.

So as a player, if you press "shoot", the diffusion engine need to output an image where the monster in front of you takes damage/dies.

bergen1y ago

How is what you think they say not clear?

We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.

Sharlin1y ago

No, it’s predicting the next frame conditioned on past frames AND player actions! This is clear from the article. Mere video generation would be nothing new.

taneq1y ago

It's more like the Tetris Effect, where the model has seen so much Doom that it confabulates gameplay.

TeMPOraL1y ago

It's a memory of a video looped to controls, so frame 1 is "I wonder how would it look if the player pressed D instead of W", then the frame 2 is based on frame 1, etc. and couple frames in, it's already not remembering, but imagining the gameplay on the fly. It's not prerecorded, it responds to inputs during generation. That's what makes it a game engine.

mensetmanusman1y ago

They could down convert the entire model to only utilize the subset of matrix components from stable diffusion. This approach may be able to improve internet bandwidth efficiency assuming consumers in the future have powerful enough computers.

WithinReason1y ago

If it's trained on absolute player coordinates then it would likely just morph into the known map at those coordinates.

nine_k1y ago

But it's trained on the actual screen pixel data, AFAICT. It's literally a visual imagination model, not gameplay / geometry imagination model. They had to make special provisions to the pixel data on the HUD which by its nature different than the pictures of a 3D world.

pradn1y ago

> Google here uses SD 1.4, as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.

A mistake people make all the time is that massive companies will put all their resources toward every project. This paper was written by four co-authors. They probably got a good amount of resources, but they still had to share in the pool allocated to their research department.

Even Google only has one Gemini (in a few versions).

fennecfoxy1y ago

If anybody Google would know most about that after their LLM memo all that time ago (basically "we're losing because we're trying to fight/compete with OS models"): https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...

raghavbali1y ago

Nicely summarised. Another important thing that clearly standsout (not to undermine the efforts and work gone into this) is the fact that more and more we are now seeing larger and more complex building blocks emerging (first it was embedding models then encoder decoder layers and now whole models are being duck-taped for even powerful pipelines). AI/DL ecosystem is growing on a nice trajectory.

Though I wonder if 10 years down the line folks wouldn't even care about underlying model details (no more than a current day web-developer needs to know about network packets).

PS: Not great examples, but I hope you get the idea ;)

bubaumba1y ago

> nice reminder that open models are useful to

You didn't say open _what_ models. Was that intentional?

Philpax1y ago

They did, SD 1.4

wkcheng1y ago

It's insane that that this works, and that it works fast enough to render at 20 fps. It seems like they almost made a cross between a diffusion model and an RNN, since they had to encode the previous frames and actions and feed it into the model at each step.

Abstractly, it's like the model is dreaming of a game that it played a lot of, and real time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

lokimedes1y ago

It makes good sense for humans to have this ability. If we flip the argument, and see the next frame as a hypothesis for what is expected as the outcome of the current frame, then comparing this "hypothesis" with what is sensed makes it easier to process the differences, rather than the totality of the sensory input.

As Richard Dawkins recently put it in a podcast[1], our genes are great prediction machines, as their continued survival rests on it. Being able to generate a visual prediction fits perfectly with the amount of resources we dedicate to sight.

If that is the case, what does aphantasia tell us?

[1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...

dbspin1y ago

Worth noting that aphantasia doesn't necessarily extend to dreams. Anecdotally - I have pretty severe aphantasia (I can conjure milisecond glimpses of barely tangible imagery that I can't quite perceive before it's gone - but only since learning that visualisation wasn't a linguistic metaphor). I can't really simulate object rotation. I can't really 'picture' how things will look before they're drawn / built etc. However I often have highly vivid dream imagery. I also have excellent recognition of faces and places (e.g.: can't get lost in a new city). So there clearly is a lot of preconscious visualisation and image matching going on in some aphantasia cases, even where the explicit visual screen is all but absent.

lokimedes1y ago

I fabulate about this in another comment below:

> Many people with aphantasia reports being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the [aphantasia] brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".

(I obviously don't know what I'm talking about, just a fellow aphant)

2 more replies

zimpenfish1y ago

Pretty much the same for me. My aphantasia is total (no images at all) but still ludicrously vivid dreams and not too bad at recognising people and places.

jonplackett1y ago

What’s the aphantasia link? I’ve got aphantasia. I’m convinced though that the bit of my brain that should be making images is used for letting me ‘see’ how things are connected together very easily in my head. Also I still love games like Pictionary and can somehow draw things onto paper than I don’t really know what they look like in my head. It’s often a surprise when pen meets paper.

lokimedes1y ago

I agree, it is my own experience as well. Craig Venter In one of his books also credit this way of representing knowledge as abstractions as his strength in inventing new concepts.

The link may be that we actually see differences between “frames”, rather than the frames directly. That in itself would imply that a from of sub-visual representation is being processed by our brain. For aphantasia, it could be that we work directly on this representation instead of recalling imagery through the visual system.

Many people with aphantasia reports being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".

I’m no where near qualified to speak of this with certainty, but it seems plausible to me.

quickestpoint1y ago

As Richard Dawkins theorized, would be more accurate and less LLM like :)

nsbk1y ago

We are. At least that's what Lisa Feldman Barrett [1] thinks. It is worth listening to this Lex Fridman podcast: Counterintuitive Ideas About How the Brain Works [2], where she explains among other ideas how constant prediction is the most efficient way of running a brain as opposed to reaction. I never get tired of listening to her, she's such a great science communicator.

[1] https://en.wikipedia.org/wiki/Lisa_Feldman_Barrett

[2] https://www.youtube.com/watch?v=NbdRIVCBqNI&t=1443s

PunchTornado1y ago

Interesting talk about the brain, but the stuff she says about free will is not a very good argument. Basically it is sort of the argument that the ancient greeks made which brings the discussion into a point where you can take both directions.

stevenhuang1y ago

> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

Yup, see https://en.wikipedia.org/wiki/Predictive_coding

quickestpoint1y ago

Umm, that’s a theory.

mind-blight1y ago

So are gravity and friction. I don't know how well tested or accepted it is, but being just a theory doesn't tell you much about how true it is without more info

bangaladore1y ago

> It's insane that that this works, and that it works fast enough to render at 20 fps.

It is running on an entire v5 TPU (https://cloud.google.com/blog/products/ai-machine-learning/i...)

It's unclear how that compares to a high-end consumer GPU like a 3090, but they seem to have similar INT8 TFLOPS. The TPU has less memory (16 vs. 24), and I'm unsure of the other specs.

Something doesn't add up, in my opinion, though. SD usually takes (at minimum) seconds to produce a high-quality result on a 3090, so I can't comprehend how they are like 2 orders of magnitudes faster—indicating that the TPU vastly outperforms a GPU for this task. They seem to be producing low-res (320x240) images, but it still seems too fast.

Philpax1y ago

There's been a lot of work in optimising inference speed of SD - SD Turbo, latent consistency models, Hyper-SD, etc. It is very possible to hit these frame rates now.

dartos1y ago

> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

This, to me, seems extremely reductionist. Like you start with AI and work backwards until you frame all cognition as next something predictors.

It’s just the stochastic parrot argument again.

wrsh071y ago

Makes me wonder when an update to the world models paper comes out where they drop in diffusion models: https://worldmodels.github.io/

Teever1y ago

Also recursion and nested virtualization. We can dream about dreaming and imagine different scenarios, some completely fictional or simply possible future scenarios all while doing day to day stuff.

mensetmanusman1y ago

Penrose (Nobel prize in physics) stipulates that quantum effects in the brain may allow a certain amount of time travel and back propagation to accomplish this.

wrsh071y ago

You don't need back propagation to learn

This is an incredibly complex hypothesis that doesn't really seem justified by the evidence

richard___1y ago

Did they take in the entire history as context?

slashdave1y ago

Image is 2D. Video is 3D. The mathematical extension is obvious. In this case, low resolution 2D (pixels), and the third dimension is just frame rate (discrete steps). So rather simple.

Sharlin1y ago

This is not "just" video, however. It's interactive in real time. Sure, you can say that playing is simply video with some extra parameters thrown in to encode player input, but still.

slashdave1y ago

It is just video. There are no external interactions.

Heck, it is far simpler than video, because the point of view and frame is fixed.

3 more replies

InDubioProRubio1y ago

Video is also higher resolution, as the pixels flip for the high resolution world by moving through it. Swivelling your head without glasses, even the blurry world contains more information in the curve of pixelchange.

slashdave1y ago

Correct, for the sprites. However, the walls in Doom are texture mapped, and so have the same issue as videos. Interesting, though, because I assume the antialiasing is something approximate, given the extreme demands on CPUs of the era.

SeanAnderson1y ago

After some discussion in this thread, I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly, but, to me, the way the abstract is worded heavily implied this was occurring.

It's trained on a large set of data in which agents played DOOM and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real-time in such a way as to be "playing DOOM" at ~20FPS.

There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.

ollin1y ago

We can't assess the quality of gameplay ourselves of course (since the model wasn't released), but one author said "It's playable, the videos on our project page are actual game play." (https://x.com/shlomifruchter/status/1828850796840268009) and the video on top of https://gamengen.github.io/ starts out with "these are real-time recordings of people playing the game". Based on those claims, it seems likely that they did get a playable system in front of humans by the end of the project (though perhaps not by the time the draft was uploaded to arXiv).

Chance-Device1y ago

I also thought this, but refer back to the paper, not the abstract:

> A is the set of key presses and mouse movements…

> …to condition on actions, we simply learn an embedding A_emb for each action

So, it’s clear that in this model the diffusion process is conditioned by embedding A that is derived from user actions rather than words.

Then a noised start frame is encoded into latents and concatenated on to the noise latents as a second conditioning.

So we have a diffusion model which is trained solely on images of doom, and which is conditioned on current doom frames and user actions to produce subsequent frames.

So yes, the users are playing it.

However, it should be unsurprising that this is possible. This is effectively just a neural recording of the game. But it’s a cool tech demo.

psb2171y ago

The agent never interacts with the simulator during training or evaluation. There is no user, there is only an agent which trained to play the real game and which produced the sequences of game frames and actions that were used to train the simulator and to provide ground truth sequences of game experience for evaluation. Their evaluation metrics are all based on running short simulations in the diffusion model which are initiated with some number of conditioning frames taken from the real game engine. Statements in the paper like: "GameNGen shows that an architecture and model weights exist such that a neural model can effectively run a complex game (DOOM) interactively on existing hardware." are wildly misleading.

foota1y ago

I wonder if they could somehow feed in a trained Gaussian splats model to this to get better images?

Since the splats are specifically designed for rendering it seems like it would be an efficient way for the image model to learn the geometry without having to encode it on the image model itself.

Chance-Device1y ago

I’m not sure how that would help vs just training the model with the conditionings described in the paper.

I’m not very familiar with Gaussian splats models, but aren’t they just a way of constructing images using multiple superimposed parameterized Gaussian distributions, sort of like the Fourier series does with waveforms using sine and cosine waves?

I’m not seeing how that would apply here but I’d be interested in hearing how you would do it.

1 more reply

dewarrn11y ago

The paper should definitely be more clear on this point, but there's a sentence in section 5.2.3 that makes me think that this was playable and played: "When playing with the model manually, we observe that some areas are very easy for both, some areas are very hard for both, and in some the agent performs much better." It may be a failure of imagination, but I can't think of another reasonable way of interpreting "playing with the model manually".

77341281y ago

What you're describing reminded me of this cool project:

https://www.youtube.com/watch?v=udPY5rQVoW0 "Playing a Neural Network's version of GTA V: GAN Theft Auto"

refibrillator1y ago

You are incorrect, this is an interactive simulation that is playable by humans.

> Figure 1: a human player is playing DOOM on GameNGen at 20 FPS.

The abstract is ambiguously worded which has caused a lot of confusion here, but the paper is unmistakably clear about this point.

Kind of disappointing to see this misinformation upvoted so highly on a forum full of tech experts.

psb2171y ago

If the generative model/simulator can run at 20FPS, then obviously in principle a human could play the game in simulation at 20 FPS. However, they do no evaluation of human play in the paper. My guess is that they limited human evals to watching short clips of play in the real engine vs the simulator (which conditions on some number of initial frames from the engine when starting each clip...) since the actual "playability" is not great.

FrustratedMonky1y ago

Yeah. If isn't doing this, then what could it be doing that is worth a paper? "real-time user input and adjusts its output accordingly"

rvnx1y ago

There is a hint in the paper itself:

It says in a shy way that it is based on: "Ha & Schmidhuber (2018) who train a Variational Auto-Encoder (Kingma & Welling, 2014) to encode game frames into a latent vector"

So it means they most likely took https://worldmodels.github.io/ (that is actually open-source) or something similar and swapped the frame generation by Stable Diffusion that was released in 2022.

GaggiX1y ago

>I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly

Well you're wrong as specified in the first video and by the authors themselves, maybe next time check better instead of writing comments with such authoritative tone of things you don't actually know.

teamonkey1y ago

I think someone is playing it, but it has a reduced set of inputs and they're playing it in a very specific way (slowly, avoiding looking back to places they've been) so as not to show off the flaws in the system.

The people surveyed in this study are not playing the game, they are watching extremely short video clips of the game being played and comparing them to equally short videos of the original Doom being played, to see if they can spot the difference.

I may be wrong with how it works, but I think this is just hallucinating in real time. It has no internal state per se, it knows what was on screen in the previous few frames and it knows what inputs the user is pressing, and so it generates the next frame. Like with video compression, it probably doesn't need to generate a full frame every time, just "differences".

As with all the previous AI game research, these are not games in any real sense. They fall apart when played beyond any meaningful length of time (seconds). Crucially, they are not playable by anyone other than the developers in very controlled settings. A defining attribute of any game is that it can be played.

lewhoo1y ago

The movement of the player seems jittery a bit so I inferred something similar on that basis.

bob10291y ago

Were the agents playing at 20 real FPS, or did this occur like a Pixar movie offline?

SeanAnderson1y ago

Ehhh okay, I'm not as convinced as I was earlier. Sorry for misleading. There's been a lot of back-and-forth.

I would've really liked to see a section of the paper explicitly call out that they used humans in real time. There's a lot of sentences that led me to believe otherwise. It's clear that they used a bunch of agents to simulate gameplay where those agents submitted user inputs to affect the gameplay and they captured those inputs in their model. This made it a bit murky as to whether humans ever actually got involved.

This statement, "Our end goal is to have human players interact with our simulation. To that end, the policy π as in Section 2 is that of human gameplay. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play"

led me to believe that while they had an ultimate goal of user input (why wouldn't they) they sufficed by approximating human input.

I was looking to refute that assumption later in the paper by hopefully reading some words on the human gameplay experience, but instead, under Results, I found:

"Human Evaluation. As another measurement of simulation quality, we provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix A.6). The raters only choose the actual game over the simulation in 58% or 60% of the time (for the 1.6 seconds and 3.2 seconds clips, respectively)."

and it's like.. okay.. if you have a section in results on human evaluation, and your goal is to have humans play, then why are you talking just about humans reviewing video rather than giving some sort of feedback on the human gameplay experience - even if it's not especially positive?

Still, in the Discussion section, it mentions, "The second important limitation are the remaining differences between the agent’s behavior and those of human players. For example, our agent, even at the end of training, still does not explore all of the game’s locations and interactions, leading to erroneous behavior in those cases." which makes it more clear that humans gave input which went outside the bounds of the automatic agents. It doesn't seem like this would occur if it were agents simulating more input.

Ultimately, I think that the paper itself could've been more clear in this regard, but clearly the publishing website tries to be very explicit by saying upfront - "Real-time recordings of people playing the game DOOM" and it's pretty hard to argue against that.

Anyway. I repent! It was a learning experience going back and forth on my belief here. Very cool tech overall.

psb2171y ago

It's funny how academic writing works. Authors rarely produce many unclear or ambiguous statements where the most likely interpretation undersells their work...

pajeets1y ago

I knew it was too good be true but seems like real time video generation can be good enough to get to a point where it feels like a truly interactive video/game

Imagine if text2game was possible. there would be some sort of network generating each frame from an image generated by text, with some underlying 3d physics simulation to keep all the multiplayer screens sync'd

this paper does not seem to be of that possibility rather some cleverly words to make you think people were playing a real time video. we can't even generate more than 5~10 second of video without it hallucinating. something this persistent would require an extreme amount of gameplay video training. it can be done but the video shown by this paper is not true to its words.

zzanz1y ago

The quest to run doom on everything continues. Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement? I just find it funny that on a linear scale of hardware specification, Doom now finds itself on both ends.

fngjdflmdflg1y ago

>Technically speaking, isn't this the greatest possible anti-Doom

When I read this part I thought you were going to say because you're technically not running Doom at all. That is, instead of running Doom without Doom's original hardware/software environment (by porting it), you're running Doom without Doom itself.

ynniv1y ago

It's dreaming Doom.

birracerveza1y ago

We made machines dream of Doom. Insane.

1 more reply

qingcharles1y ago

Do Robots Dream of E1M1?

elwell1y ago

Droom

bugglebeetle1y ago

Pierre Menard, Author of Doom.

el_memorioso1y ago

I applaud your erudition.

airstrike1y ago

OK, this is the single most perfect comment someone could make on this thread. Diffusion me impressed.

jl61y ago

Knee Deep in the Death of the Author.

1attice1y ago

that took a moment, thank you

Terr_1y ago

> the Doom with the highest possible hardware requirement?

Isn't that possible by setting arbitrarily high goals for ray-cast rendering?

Vecr1y ago

It's the No-Doom.

WithinReason1y ago

Undoom?

riwsky1y ago

It’s a mood.

jeffhuys1y ago

Bliss

x-complexity1y ago

> Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement?

Not really? The greatest anti-Doom would be an infinite nest of these types of models predicting models predicting Doom at the very end of the chain.

The next step of anti-Doom would be a model generating the model, generating the Doom output.

nurettin1y ago

Isn't this technically a model (training step) generating a model (a neural network) generating Doom output?

yuchi1y ago

“…now it can implement Doom!”

rldjbpin1y ago

to me the closer analogy here is the "running minecraft inside minecraft" (https://news.ycombinator.com/item?id=32901461)

godelski1y ago

Doom system requirements:

  - 4 MB RAM
  - 12 MB disk space

Stable diffusion v1

  > 860M UNet and CLIP ViT-L/14 (540M)
  Checkpoint size:
    4.27 Gb 
    7.7 GB (full EMA)
  Running on a TPU-v5e
    Peak compute per chip (bf16)  197 TFLOPs
    Peak compute per chip (Int8)  393 TFLOPs
    HBM2 capacity and bandwidth  16 GB, 819 GBps
    Interchip Interconnect BW  1600 Gbps

This is quite impressive, especially considering the speed. But there's still a ton of room for improvement. It seems it didn't even memorize the game despite having the capacity to do so hundreds of times over. So we definitely have lots of room for optimization methods. Though who knows how such things would affect existing tech since the goal here is to memorize.

What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (how much prior knowledge do they get considering pertained models and visdoom environment. Was doom source code in T5? And which vit checkpoint was used? I can't keep track of Google vit checkpoints).

I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.

- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...

- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...

- https://cloud.google.com/tpu/docs/v5e

- https://github.com/Farama-Foundation/ViZDoom

- https://zdoom.org/index

snickmy1y ago

Those are valid points, but irrelevant for the context of this research.

Yes, the computational cost is ridicolous compared to the original game, and yes, it lacks basic things like pre-computing, storing, etc. That said, you could assume that all that can be either done at the margin of this discovery OR over time will naturally improve OR will become less important as a blocker.

The fact that you can model a sequence of frames with such contextual awareness without explictly having to encode it, is the real breakthrough here. Both from a pure gaming standpoint, but on simulation in general.

tobr1y ago

I suppose it also doesn't really matter what kinds of resources the game originally requires. The diffusion model isn't going to require twice as much memory just because the game does. Presumably you wouldn't even necessarily need to be able to render the original game in real time - I would imagine the basic technique would work even if you used a state of the Hollywood-quality offline renderer to render each input frame, and that the performance of the diffusion model would be similar?

godelski1y ago

Well the majority of ML systems are compression machines (entropy minimizers), so ideally you'd want to see if you can learn the assets and game mechanics through play alone (what this paper shows). Better would be to do so more efficiently than that devs themselves, finding better compression. Certainly the game is not perfectly optimized. But still, this is a step in that direction. I mean no one has accomplished this before so even with a model with far higher capacity it's progress. (I think people are interpreting my comment as dismissive. I'm critiquing but the key point I was making was about how there's likely better architectures, training methods, and all sorts of stuff to still research. Personally I'm glad there's still more to research. That's the fun part)

pickledoyster1y ago

>you could assume that all that can be either done at the margin of this discovery OR over time will naturally improve OR will become less important as a blocker.

OR one can hope it will be thrown to the heap of nonviable tech with the rest of spam waste

godelski1y ago

I'm not sure what you're saying is irrelevant.

1) the model has enough memory to store not only all game assets and engine but even hundreds of "plays".

2) me mentioning that there's still a lot of room to make these things better (seems you think so too so maybe not this one?)

3) an interesting point I was wondering to compare current state of things (I mean I'll give you this but it's just a random thought and I'm not reviewing this paper in an academic setting. This is HN, not NeurIPS. I'm just curious ¯ \ _ ( ツ ) _ / ¯)

4) the point that you can rip a game

I'm really not sure what you're contesting to because I said several things.

  > it lacks basic things like pre-computing, storing, etc.

It does? Last I checked neural nets store information. I guess I need to return my PhD because last I checked there's a UNet in SD 1.4 and that contains a decoder.

snickmy1y ago

Sorry, probably didn't explain myself well enough

1) yes you are correct. the point i was making is that, in the context of the discovery/research, that's outside the scope, and 'easier' to do, as it has been done in other verticals (ie.: e2e self driving)

2) yep, aligned here

3) I'm not fully following here, but agree this is not NeurIPS, and no Schmidhuber's bickering.

4) The network does store information, it just doesn't store a gameplay information, which could be forced, but as per point 1, it is , and I think it is the right approach, beyond the scope of this research

1 more reply

danielmarkbruce1y ago

Is it a breakthrough? Weather models are miles ahead of this as far as I can tell.

dTal1y ago

>What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute

That's the least of it. It means you can generate a game from real footage. Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.

phh1y ago

> Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.

I guess that's the occasion to remind that ML is splendid at interpolating, but extrapolating, maybe don't keep your hopes too high.

Namely, to have a "perfect flight sim" using GoPros, you'll need to record hundreds of stalls and crashs.

godelski1y ago

  > Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.

You're jumping ahead there and I'm not convinced you could do this ever (unless you're model is already a great physics engine). The paper itself has feeds the controls into the network. But a flight sim will be harder better you'd need to also feed in air conditions. I just don't see how you could do this from video alone, let alone just video from the cockpit. Humans could not do this. There's just not enough information.

1 more reply

camtarn1y ago

Plus, presumably, either training it on pilot inputs (and being able to map those to joystick inputs and mouse clicks) or having the user have an identical fake cockpit to play in and a camera to pick up their movements.

And, unless you wanted a simulator that only allowed perfectly normal flight, you'd have to have those airliners go through every possible situation that you wanted to reproduce: warnings, malfunctions, emergencies, pilots pushing the airliner out of its normal flight envelope, etc.

isaacfung1y ago

The possibility seems far beyond gaming(given enough computation resources).

You can feed it with videos of usage of any software or real world footage recorded by a Go Pro mounted on your shoulder(with body motion measured by some sesnors though the action space would be much larger).

Such a "game engine" can potentially be used as a simulation gym environment to train RL agents.

dvngnt_1y ago

wouldnt make more sense to train using microsoft flight simulator the same way they did DOOM, but im not sure what the point is if the game already exists

Sohcahtoa821y ago

It's always fun reading the dead comments on a post like this. People love to point how how pointless this is.

Some of ya'll need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.

Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.

Time spent enjoying yourself is never time wasted. Some of ya'll are going to be on your death beds wishing you had allowed yourself to have more fun.

ploxiln1y ago

The skepticism and criticism in this thread is against the hype of AI, it's implied by people saying "this is so amazing" that they think that in some near future you can create any video game experience you can imagine by just replacing all the software with some AI models, rendering the whole game.

When in reality this is the least efficient and reliable form of Doom yet created, using literally millions of times the computation used by the first x86 PCs that were able to render and play doom in real-time.

But it's a funny party trick, sure.

joegibbs1y ago

Yes it's less efficient and reliable than regular Doom, but it's not just limited to Doom. You could have it simulate a very graphically advanced game that barely runs on current hardware and it would run at the exact same speed as Doom.

1 more reply

Gooblebrai1y ago

So true. The hustle culture is an spreading disease that has replaced the fun maker culture from the 80s/90s.

It's unavoidable though. Cost of living being increasingly expensive and romantization of entrepreneurs like they are rock stars leads towards this hustle mindset.

nuancebydefault1y ago

Today this exercise feels pointless. However. I remember the days when there were articles written about the possibility for "internet radio". In stead of good old broadcasting waves through the ether and simply thousands of radios tuning in, some server would send a massive amount of packets over a massive kilometers of copper to thousands of endpoints. Ad absurdum the endpoints would even send ack packages upstream to the poor server to keep connections alive. It seemed like a huge waste of computing power, wire and energy.

And here we are, binging netflix movies over such copper wires.

I'm not saying games will be replaced by diffusion models dreaming up next images based on user input, but a variation of that might end up in a form of interactive art creation or a new form of entertainment.

ninetyninenine1y ago

I don’t think this is not useful. This is a stepping stone for generating entire novel games.

Sohcahtoa821y ago

> This is a stepping stone for generating entire novel games.

I don't see how.

This game "engine" is purely mapping [pixels, input] -> new pixels. It has no notion of game state (so you can kill an enemy, turn your back, then turn around again, and the enemy could be alive again), not to mention that it requires the game to already exist in order to train it.

I suppose, in theory, you could train the network to include game state in the input and output, or potentially even handle game state outside the network entirely and just make it one of the inputs, but the output would be incredibly noisy and nigh unplayable.

And like I said, all of it requires the game to already exist in order to train the network.

airstrike1y ago

> (so you can kill an enemy, turn your back, then turn around again, and the enemy could be alive again)

Sounds like a great game.

> not to mention that it requires the game to already exist in order to train it

Diffusion models create new images that did not previously exist all of the time, so I'm not sure how that follows. It's not hard to extrapolate from TFA to a model that generically creates games based on some input

ninetyninenine1y ago

>It has no notion of game state (so you can kill an enemy, turn your back, then turn around again)

Well you see a wall you turn around then turn back the wall is still there. With enough training data the model will be able to pick up the state of the enemy because it has ALREADY learned the state of the wall due to much more numerous data on the wall. It's probably impractical to do this, but this is only a stepping stone like said.

> not to mention that it requires the game to already exist in order to train it.

Is this a problem? Do games not exist? Not only due we have tons of games, but we also have in theory unlimited amounts of training data for each game.

1 more reply

throwthrowuknow1y ago

Read the paper. It is capable of maintaining state for a fairly long time including updating the UI elements.

msk-lywenn1y ago

I’d like to now to carbon footprint of that fun.

HellDunkel1y ago

Although impressive i must disagree. Diffusion models are not game engines. A game engine is a component to propell your game (along the time axis?). In that sense it is similar to the engine of the car, hence the name. It does not need a single working car nor a road to drive on do its job. The above is a dynamic, interactive replication of what happens when you put a car on a given road, requiring a million test drives with working vehicles. An engine would also work offroad.

zamadatix1y ago

This seems more a critique on the particular model (as in the resulting diffusion model generated), not on diffusion models in general. It's also a bit misstated - this doesn't require a working car on the road to do its job (present tense), it required one to train it to do its job (past tense) and it's not particularly clear why a game engine using concepts gained from how another worked should cease to be a game engine. For diffusion models in general and not the specifically trained example here I don't see why one would assume the approach can't also work outside of the particular "test tracks" it was trained on, just as a typical diffusion model works on more than generating the exact images it was trained on (can interpolate and apply individual concepts to create a novel output).

HellDunkel1y ago

my point is something else: a game engine is something which can be separated from a game and put to use somewhere else. this is basically the definition of „engine“. the above is not an engine but a game without any engine at all therefor should not be called „engine“.

MasterScrat1y ago

Interesting point.

In a way this is a "simulated game engine", trained from actual game engine data. But I would argue a working simulated game engine becomes a game engine of its own, as it is then able to "propell the game" as you say. The way it achieves this becomes irrelevant, in one case the content was crafted by humans, in the other case it mimics existing game content, the player really doesn't care!

> An engine would also work offroad.

Here you could imagine that such a "generative game engine" could also go offroad, extrapolating what would happen if you go to unseen places. I'd even say extrapolation capabilities of such a model could be better than a traditional game engine, as it can make things up as it goes, while if you accidentally cross a wall in a typical game engine the screen goes blank.

HellDunkel1y ago

The game doom is more than a game engine, isnt it? I‘d be okay with calling the above a „simulated game“ or a „game“. My point is: let‘s not conflate the idea of a „game engine“ which is a construct of intellectual concepts put together to create a simulation of „things happening in time“ and deriving output (audio and visual). the engine is fed with input and data (levels and other assets) and then drives(EDIT) a „game“.

training the model with a final game will never give you an engine. maybe a „simulated game“ or even a „game“ but certainly not an „engine“. the latter would mean the model would be capable to derive and extract the technical and intellectual concepts and apply them elsewhere.

jsheard1y ago

> Here you could imagine that such a "generative game engine" could also go offroad, extrapolating what would happen if you go to unseen places.

They easily could have demonstrated this by seeding the model with images of Doom maps which weren't in the training set, but they chose not to. I'm sure they tried it and the results just weren't good, probably morphing the map into one of the ones it was trained on at the first opportunity.

refibrillator1y ago

There is no text conditioning provided to the SD model because they removed it, but one can imagine a near future where text prompts are enough to create a fun new game!

Yes they had to use RL to learn what DOOM looks like and how it works, but this doesn’t necessarily pose a chicken vs egg problem. In the same way that LLMs can write a novel story, despite only being trained on existing text.

IMO one of the biggest challenges with this approach will be open world games with essentially an infinite number of possible states. The paper mentions that they had trouble getting RL agents to completely explore every nook and corner of DOOM. Factorio or Dwarf Fortress probably won’t be simulated anytime soon…I think.

mlsu1y ago

With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.

At which point, you effectively would be interpolating in latent space through the source code to actually "render" the game. You'd have an entire latent space computer, with an engine, assets, textures, a software renderer.

With a sufficiently powerful computer, one could imagine what interpolating in this latent space between, say Factorio and TF2 (2 of my favorites). And tweaking this latent space to your liking by conditioning it on any number of gameplay aspects.

This future comes very quickly for subsets of the pipeline, like the very end stage of rendering -- DLSS is already in production, for example. Maybe Nvidia's revenue wraps back to gaming once again, as we all become bolted into a neural metaverse.

God I love that they chose DOOM.

Jensson1y ago

> With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.

Neural nets are not guaranteed to converge to anything even remotely optimal, so no that isn't how it works. Also even though neural nets can approximate any function they usually can't do it in a time or space efficient manner, resulting in much larger programs than the human written code.

mlsu1y ago

Could is certainly a better word, yes. There is no guarantee that it will happen, only that it could. The existence of LLMs is proof of that; imagine how large and inefficient a handwritten computer program to generate the next token would be. On the flipside, human beings very effectively predicting the next token, and much more, on 5 watts is proof that LLM in their current form certainly are not the most efficient method for generating next token.

I don't really know why everyone is piling on me here. Sorry for a bit of fun speculating! This model is on the continuum. There is a latent representation of Doom in weights. some weights, not these weights. Therefore some representation of doom in a neural net could become more efficient over time. That's really the point I'm trying to make.

godelski1y ago

  > With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM.

You and I have very different definitions of compression

https://news.ycombinator.com/item?id=41377398

  > Someone in the field could probably correct me on that.

^__^

_hark1y ago

The raw capacity of the network doesn't tell you how complex the weights actually are. The capacity is only an upper bound on the complexity.

It's easy to see this by noting that you can often prune networks quite a bit without any loss in performance. I.e. the effective dimension of the manifold the weights live on can be much, much smaller than the total capacity allows for. In fact, good regularization is exactly that which encourages the model itself to be compressible.

1 more reply

energy1231y ago

The source code lacks information required to render the game. Textures for example.

TeMPOraL1y ago

Obviously assets would get encoded too, in some form. Not necessarily corresponding to the original bitmaps, if the game does some consistent post-processing, the encoded thing would more likely be (equivalent to) the post-processed state.

1 more reply

mistercheph1y ago

That’s just an artifact of the language we use to describe an implementation detail, in the sense GP means it, the data payload bits are not essentially distinct from the executable instruction bits

electrondood1y ago

The Holographic Principle is the idea that our universe is a projection of a higher dimensional space, which sounds an awful lot like the total simulation of an interactive environment, encoded in the parameter space of a neural network.

The first thing I thought when I saw this was: couldn't my immediate experience be exactly the same thing? Including the illusion of a separate main character to whom events are occurring?

basch1y ago

Similarly, you could run a very very simple game engine, that outputs little more than a low resolution wireframe, and upscale it. Put all of the effort into game mechanics and none into visual quality.

I would expect something in this realm to be a little better at not being visually inconsistent when you look away and look back. A red monster turning into a blue friendly etc.

troupo1y ago

> one can imagine a near future where text prompts are enough to create a fun new game

Sit down and write down a text prompt for a "fun new game". You can start with something relatively simple like a Mario-like platformer.

By page 300, when you're about halfway through describing what you mean, you might understand why this is wishful thinking

reverius421y ago

If it can be trained on (many) existing games, then it might work similarly to how you don't need to describe every possible detail of a generated image in order to get something that looks like what you're asking for (and looks like a plausible image for the underspecified parts).

troupo1y ago

Things that might work plausible in a static image will not look plausible when things are moving, especially in the game.

Also: https://news.ycombinator.com/item?id=41376722

Also: define "fun" and "new" in a "simple text prompt". Current image generators suck at properly reflecting what you want exactly, because they regurgitate existing things and styles.

slashdave1y ago

> where text prompts are enough to create a fun new game!

Not really. This is a reproduction of the first level of Doom. Nothing original is being created.

SomewhatLikely1y ago

Video games are gonna be wild in the near future. You could have one person talking to a model producing something that's on par with a AAA title from today. Imagine the 2d sidescroller boom on Steam but with immersive photorealistic 3d games with hyper-realistic physics (water flow, fire that spreads, tornados) and full deformability and buildability because the model is pretrained with real world videos. Your game is just a "style" that tweaks some priors on look, settings, and story.

user4326781y ago

Sorry, no offence, but you sound like those EA execs wearing expensive suits and never played a single video game in their entire life. There’s a great documentary on how Half Life was made. Gabe Newell was interviewed by someone asking “why you did that and this, it’s not realistic”, where he answered “because it’s more fun this way, you want realism — just go outside”.

radarsat11y ago

Most games are conditioned on text, it's just that we call it "source code" :).

(Jk of course I know what you mean, but you can seriously see text prompts as compressed forms of programming that leverage the model's prior knowledge)

magicalhippo1y ago

This got me thinking. Anyone tried using SD or similar to create graphics for the old classic text adventure games?

danjl1y ago

So, diffusion models are game engines as long as you already built the game? You need the game to train the model. Chicken. Egg?

kragen1y ago

here are some ideas:

- you could build a non-real-time version of the game engine and use the neural net as a real-time approximation

- you could edit videos shot in real life to have huds or whatever and train the neural net to simulate reality rather than doom. (this paper used 900 million frames which i think is about a year of video if it's 30fps, but maybe algorithmic improvements can cut the training requirements down) and a year of video isn't actually all that much—like, maybe you could recruit 500 people to play paintball while wearing gopro cameras with accelerometers and gyros on their heads and paintball guns, so that you could get a year of video in a weekend?

injidup1y ago

Why games? I will train it on 1 years worth of me attending Microsoft teams meetings. Then I will go surfing.

kqr1y ago

Even if you spend 40 hours a week in video conferences, you'll have to work for over four years to get one years' worth of footage. Of course, by then the models will be even better and so you might actually have a chance of going surfing.

I guess I should start hoarding video of myself now.

1 more reply

akie1y ago

Ready to pay for this

ccozan1y ago

most underrated comment here!

w_for_wumbo1y ago

That feels like the endgame of video game generation. You select an art style, a video and the type of game you'd like to play. The game is then generated in real-time responding to each action with respect to the existing rule engine.

I imagine a game like that could get so convincing in its details and immersiveness that one could forget they're playing a game.

troupo1y ago

There are thousands of games that mimic each other, and only a handful of them are any good.

What makes you think a mechanical "predict next frame based on existing games" will be any good?

1 more reply

numpad01y ago

IIRC, both 2001(1968) and Solaris(1972) depict that kind of things as part of alien euthanasia process, not as happy endings

2 more replies

aithrowaway19871y ago

Have you ever played a video game? This is unbelievably depressing. This is a future where games like Slay the Spire, with a unique art style and innovative gameplay simply are not being made.

Not to mention this childish nonsense about "forget they're playing a game," as if every game needs to be lifelike VR and there's no room for stylization or imagination. I am worried for the future that people think they want these things.

2 more replies

omegaworks1y ago

EXISTENZ IS PAUSED!

THBC1y ago

Holodeck is just around the corner

1 more reply

qznc1y ago

The Cloud Gaming platforms could record things for training data.

modeless1y ago

If you train it on multiple games then you could produce new games that have never existed before, in the same way image generation models can produce new images that have never existed before.

jsheard1y ago

It's unlikely that such a procedurally generated mashup would be perfectly coherent, stable and most importantly fun right out of the gate, so you would need some way to reach into the guts of the generated game and refine it. If properties as simple as "how much health this enemy type has" are scattered across an enormous inscrutable neural network, and may not even have a single consistent definition in all contexts, that's going to be quite a challenge. Nevermind if the game just catastrophically implodes and you have to "debug" the model.

lewhoo1y ago

From what I understand that could make the engine much less stable. The key here is repetitiveness.

billconan1y ago

maybe the next step is adding text guidance and generating non-existing games.

notfed1y ago

I think the same comment could be said about generative images, no?

passion__desire1y ago

Maybe, in future, techniques of Scientific Machine Learning which can encode physics and other known laws into a model would form a base model. And then other models on top could just fine tune aspects to customise a game.

attilakun1y ago

If only there was a rich 3-dimensional physical environment we could draw training data from.

slashdave1y ago

Well, yeah. Image diffusion models only work because you can provide large amounts of training data. For Doom it is even simpler, since you don't need to deal with compositing.

dtagames1y ago

A diffusion model cannot be a game engine because a game engine can be used to create new games and modify the rules of existing games in real time -- even rules which are not visible on-screen.

These tools are fascinating but, as with all AI hype, they need a disclaimer: The tool didn't create the game. It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).

kqr1y ago

> even rules which are not visible on-screen.

If a rule was changed but it's never visible on the screen, did it really change?

> It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).

Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.

znx_01y ago

> If a rule was changed but it's never visible on the screen, did it really change?

Well for "some" games it does really change

darby_nine1y ago

> Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.

It's much simpler than actually creating a game....

stnmtn1y ago

If someone told you 10 years ago that they were going to create something where you could play a whole new level of Doom, without them writing a single line of game logic/rendering code, would you say that that is simpler than creating a demo by writing the game themselves?

1 more reply

throwthrowuknow1y ago

They only trained it on one game and only embedded the control inputs. You could train it on many games and embed a lot more information about each of them which could possibly allow you to specify a prompt that would describe the game and then play it.

calebh1y ago

One thing I'd like to see is to take a game rendered with low poly assets (or segmented in some way) and use a diffusion model to add realistic or stylized art details. This would fix the consistency problem while still providing tangible benefits.

momojo1y ago

The title should be "Diffusion Models can be used to render frames given user input"

sharpshadow1y ago

So all it did is generate a video of the gameplay which is slightly different from the video it used for training?

TeMPOraL1y ago

No, it implements a 3D FPS that's interactive, and renders each frame based on your input and a lot of memorized gameplay.

sharpshadow1y ago

But is it playing the actual game or just making a interactive video of it?

3 more replies

alkonaut1y ago

The job of the game engine is also to render the world given only the worlds properties (textures, geometries, physics rules, ...), and not given "training data that had to be supplied from an already written engine".

I'm guessing that the "This door requires a blue key" doesn't mean that the user can run around, the engine dreams up a blue key in some other corner of the map, and the user can then return to the door and the engine now opens the door? THAT would be impressive. It's interesting to think that all that would be required for that task to go from really hard to quite doable, would be that the door requiring the blue key is blue, and the UI showing some icon indicating the user possesses the blue key. Without that, it becomes (old) hidden state.

helloplanets1y ago

So, any given sequence of inputs is rebuilt into a corresponding image, twenty times per second. I wonder how separate the game logic and the generated graphics are in the fully trained model.

Given a sufficient enough separation between these two, couldn't you basically boil the game/input logic down to an abstract game template? Meaning, you could just output a hash that corresponds to a specific combination of inputs, and then treat the resulting mapping as a representation of a specific game's inner workings.

To make it less abstract, you could save some small enough snapshot of the game engine's state for all given input sequences. This could make it much less dependent to what's recorded off of the agents' screens. And you could map the objects that appear in the saved states to graphics, in a separate step.

I imagine this whole system would work especially well for games that only update when player input is given: Games like Myst, Sokoban, etc.

toppy1y ago

I think you've just encoded the title of the paper

1 more reply

panki271y ago

> Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation.

I can hardly believe this claim, anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.

arc-in-space1y ago

This, watching the generated clips feels uncomfortable, like a nightmare. Geometry is "swimming" with camera movement, objects randomly appear and disappear, damage is inconsistent.

The entire thing would probably crash and burn if you did something just slightly unusual compared to the training data, too. People talking about 'generated' games often seem to fantasize about an AI that will make up new outcomes for players that go off the beaten path, but a large part of the fun of real games is figuring out what you can do within the predetermined constraints set by the game's code. (Pen-and-paper RPGs are highly open-ended, but even a Game Master needs to sometimes protects the players from themselves; whereas the current generation of AI is famously incapable of saying no.)

aithrowaway19871y ago

I also noticed that they played AI DOOM very slowly: in an actual game you are running around like a madman, but in the video clips the player is moving in a very careful, halting manner. In particular the player only moves in straight lines or turns while stationary, they almost never turn while running. Also didn't see much strafing.

I suspect there is a reason for this: running while turning doesn't work properly and makes it very obvious that the system doesn't have a consistent internal 3D view of the world. I'm already getting motion sickness from the inconsistencies in straight-line movement, I can't imagine turning is any better.

freestyle241471y ago

It made me laugh. Maybe they pulled random people from the hallway who had never seen the original Doom (or any FPS), or maybe only selected people who wore glasses and forgot them at their desk.

meheleventyone1y ago

It's telling IMO that they only want people opinions based on our notoriously faulty memories rather than sitting comparable situations next to one another in the game and simulation then analyzing them. Several things jump out watching the example video.

GaggiX1y ago

>rather than sitting comparable situations next to one another in the game and simulation then analyzing them.

That's literally how the human rating was setup if you read the paper.

meheleventyone1y ago

I think you misunderstand me. I don't mean a snap evaluation and deciding between two very-short competing videos which is what the participants were doing. I mean doing an actual analysis of how well the simulation matches the ground truth of the game.

What I'd posit is that it's not actually a very good replication of the game but very good a replicating short clips that almost look like the game and the short time horizons are deliberately chosen because the authors know the model lacks coherence beyond that.

1 more reply

golol1y ago

What I understand is the folloeing: If this works so well, why didn't we have good video generation much earlier? After diffusion models were seen to work the most obvious thing to do was to generate the next frame based on previous framrs but... it took 1-2 years for good video models to appear. For example compare Sora generating minecraft video versus this method generating minecraft video. Say in both cases the player is standing on a meadow with fee inputs and watching some pigs. In the Sora video you'd expect the typical glitched to appear, like erratic, sliding movement, overlapping legs, multiplication of pigs etc. Would these glitches not appear in the GameNGen video? Why?

Closi1y ago

Because video is much more difficult than images (it's lots of images that have to be consistent across time, with motion following laws of physics etc), and this is much more limited in terms of scope than pure arbitrary video generation.

golol1y ago

This misses the point, I'm comparing two methods of generating minecraft videos.

soulofmischief1y ago

By simplifying the problem, we are better able to focus on researching specific aspects of generation. In this case, they synthetically created a large, highly domain-specific training set and then used this to train a diffusion model which encodes input parameters instead of text.

Sora was trained on a much more diverse dataset, and so has to learn more general solutions in order to maintain consistency, which is harder. The low resolution and simple, highly repetitive textures of doom definitely help as well.

In general, this is just an easier problem to approach because of the more focused constraints. It's also worth mentioning that noise was added during the process in order to make the model robust to small perturbations.

pantalaimon1y ago

I would have thought it is much easier to generate huge amounts of game footage for training, but as I understand this is not what was done here.

mo_421y ago

An implementation of the game engine in the model itself is theoretically the most accurate solution for predicting the next frame.

I'm wondering when people will apply this to other areas like the real world. Would it learn the game engine of the universe (ie physics)?

radarsat11y ago

There has definitely been research for simulating physics based on observation, especially in fluid dynamics but also for rigid body motion and collision. It's important for robotics applications actually. You can bet people will be applying this technique in those contexts.

I think for real world application one challenge is going to be the "action" signal which is a necessary component of the conditioning signal that makes the simulation reactive. In video games you can just record the buttons, but for real world scenarios you need difficult and intrusive sensor setups for recording force signals.

(Again for robotics though maybe it's enough to record the motor commands, just that you can't easily record the "motor commands" for humans, for example)

cubefox1y ago

A popular theory in neuroscience is that this is what the brain does:

https://slatestarcodex.com/2017/09/05/book-review-surfing-un...

It's called predictive coding. By trying to predict sensory stimuli, the brain creates a simplified model of the world, including common sense physics. Yann LeCun says that this is a major key to AGI. Another one is effective planning.

But while current predictive models (autoregressive LLMs) work well on text, they don't work well on video data, because of the large outcome space. In an LLM, text prediction boils down to a probability distribution over a few thousand possible next tokens, while there are several orders of magnitude more possible "next frames" in a video. Diffusion models work better on video data, but they are not inherently predictive like causal LLMs. Apparently this new Doom model made some progress on that front though.

ccozan1y ago

Howver, this is due how we actually digitize video. From a human point a view, looking in my room reduces the load to the _objects_ in the room and everyhing else is just noise ( like the color of the wall could be just a single item to remember, while otherwise in the digital world, it needs to remember all the pixels )

icoder1y ago

This is impressive. But at the same time, it can't count. We see this every time, and I understand why it happens, but it is still intriguing. We are so close or in some ways even way beyond, and yet at the same time so extremely far away, from 'our' intelligence.

(I say it can't count because there are numerous examples where the bullet count glitches, it goes right impressively often, but still, counting, being up or down, is something computers have been able to do flawlessly basically since forever)

(It is the same with chess, where the LLM models are becoming really good, yet sometimes make mistakes that even my 8yo niece would not make)

marci1y ago

'our' intelligence may not be the best thing we can make. It would be like trying to only make planes that flaps wings or trucks with legs. A bit like using a llm to do multiplication. Not the best tool. Biomimcry is great for inspiration, but shouldn't be a 1-to-1 copy, especialy in different scale and medium.

icoder1y ago

Sure, although I still think a system with less of a contrast between how well it performs 'modally' and how bad it performs incidentally, would be more practical.

What I wonder is whether LLM's will inherently always have this dichotomy and we need something 'extra' (reasoning, attention or something les biomimicried), or whether this will eventually resolves itself (to an acceptable extend) when they improve even further.

lIl-IIIl1y ago

How does it know how many times it needs to shoot the zombie before it dies?

Most enemies have enough hit points to survive the first shot. If the model is only trained on the previous frame, it doesn't know how many times the enemy was already shot at.

From the video it seems like it is probability based - they may die right away or it might take way longer than it should.

I love how the player's health goes down when he stands in the radioactive green water.

In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

meheleventyone1y ago

> I love how the player's health goes down when he stands in the radioactive green water.

This is one of the bits that was weird to me, it doesn't work correctly. In the real game you take damage at a consistent rate, in the video the player doesn't and whether the player takes damage or not seems highly dependent on some factor that isn't whether or not the player is in the radioactive slime. My thought is that its learnt something else that correlates poorly.

golol1y ago

It gets a number of previous frames as input I think.

lupusreal1y ago

> In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

They trained this thing on bot gameplay, so I bet it does poorly when advanced strategies like deliberately inducing mob infighting are employed (the bots probably didn't do that a lot, of at all.)

masterspy71y ago

There's been a ton of work to generate assets for games using AI: 3d models, textures, code, etc. None of that may even be necessary with a generative game engine like this! If you could scale this up, train on all games in existence, etc. I bet some interesting things would happen

rererereferred1y ago

But can you grab what this Ai has learned and generate the 3d models, maps and code to turn it into an actual game that can run on a user's PC? That would be amazing.

passion__desire1y ago

Jensen Huang's vision that future games will be generated rather than rendered is coming true.

kleiba1y ago

What would be the point? This model has been trained on an existing game, so turning it back into assets, maps, and code would just give you a copy of the original game you started with. I suppose you could create variations of it then... but:

You don't even need to do all of that - this trained model already is the game, i.e., it's interactive, you can play the game.

whamlastxmas1y ago

I would absolutely love if they could take this demo, add a new door that isn’t in the original, and see what it generates behind that door

nolist_policy1y ago

Makes me wonder... If you stand still in front of a door so all past observations only contain that door, will the model teleport you to another level when opening the door?

zbendefy1y ago

I think some state is also being given (or if its not, it could be given) to the network, like 3d world position/orientation of the player, that could help the neural network anchor the player in the world.

smusamashah1y ago

Has this model actually learned the 3d space of the game? Is it possible to break the camera free and roam around the map freely and view it from different angles?

I noticed a few hallucinations e.g. when it picked green jacket from a corner, walking back it generated another corner. Therefore I don't think it has any clue about the 3D world of the game at all.

kqr1y ago

> Is it possible to break the camera free and roam around the map freely and view it from different angles?

I would assume only if the training data contained this type of imagery, which it did not. The training data (from what I understand) consisted only of input+video of actual gameplay, so that is what the model is trained to mimick.

This is like a dog that has been trained to form English words – what's impressive is not that it does it well, but that it does it at all.

Sohcahtoa821y ago

> Therefore I don't think it has any clue about the 3D world of the game at all.

AI models don't "know" things at all.

At best, they're just very fuzzy predictors. In this case, given the last couple frames of video and a user input, it predicts the next frame.

It has zero knowledge of the game world, game rules, interactions, etc. It's merely a mapping of [pixels, input] -> pixels.

ravetcofx1y ago

There is going to be a flood of these dreamlike "games" in the next few years. This feels likes a bit of a breakthrough in the engineering of these systems.

Kapura1y ago

What is useful about this? I am a game programmer, and I cannot imagine a world where this improves any part of the development process. It seems to me to be a way to copy a game without literally copying the assets and code; plagiarism with extra steps. What am I missing?

arduinomancer1y ago

How does the model “remember” the whole state of the world?

Like if I kill an enemy in some room and walk all the way across the map and come back, would the body still be there?

a_e_k1y ago

Watch closely in the videos and you'll see that enemies often respawn when offscreen and sometimes when onscreen. Destroyed barrels come back, ammo count and health fluctuates weirdly, etc. It's still impressive, but its not perfect in that regard.

Sharlin1y ago

Not unlike in (human) dreams.

Jensson1y ago

It doesn't even remember the state of the game you look at. Doors spawning right in front of you, particle effects turning into enemies mid flight etc, so just regular gen AI issues.

Edit: Can see this in the first 10 seconds of the first video under "Full Gameplay Videos", stairs turning to corridor turning to closed door for no reason without looking away.

csmattryder1y ago

There's also the case in the video (0:59) where the player jumps into the poison but doesn't take damage for a few seconds then takes two doses back-to-back - they should've taken a hit of damage every ~500-1000ms(?)

Guessing the model hasn't been taught enough about that, because most people don't jump into hazards.

raincole1y ago

It doesn't. You need to put the world state in the input (the "prompt", even it doesn't look like prompt in this case). Whatever not in the prompt is lost.

rldjbpin1y ago

this is truly a cool demo, but a very misleading title.

to me it seems like a very bruteforce or greedy way to give the impression to a user that they are "playing" a game. the difference being that you already own the game to make this possible, but cannot let the user use that copy!

using generative AI for game creation is at a nascent stage but there are much more elegant ways to go about the end goal. perhaps in the future with computing so far ahead that we moved beyond the current architecture, this might be worth doing instead of emulation perhaps.

dabochen1y ago

So there is no interactivity, but the generated content is not the exact view in the training data, is this the correct understanding?

If so, is it more like imagination/hallucination rather than rendering?

famouswaffles1y ago

It's conditioned on previous frames AND player actions so it's interactive.

rrnechmech1y ago

> To mitigate auto-regressive drift during inference, we corrupt context frames by adding Gaussian noise to encoded frames during training. This allows the network to correct information sampled in previous frames, and we found it to be critical for preserving visual stability over long time periods.

I get this (mostly). But would any kind soul care to elaborate on this? What is this "drift" they are trying to avoid and how does (AFAIU) adding noise help?

jamilton1y ago

I wonder if the MineRL (https://www.ijcai.org/proceedings/2019/0339.pdf and minerl.io) dataset would be sufficient to reproduce this work with Minecraft.

Any other similar existing datasets?

A really goofy way I can think of to get a bunch of data would be to get videos from youtube and try to detect keyboard sounds to determine what keys they're pressing.

jamilton1y ago

Although ideally a follow up work would be something where there won’t be any potential legal trouble with releasing the complete model so people can play it.

A similar approach but with a game where the exact input is obvious and unambiguous from the graphics alone so that you can use unannotated data might work. You’d just have to create a model to create the action annotations. I’m not sure what the point would be, but it sounds like it’d be interesting.

throwthrowuknow1y ago

Several thoughts for future work:

1. Continue training on all of the games that used the Doom engine to see if it is capable of creating new graphics, enemies, weapons, etc. I think you would need to embed more details for this perhaps information about what is present in the current level so that you could prompt it to produce a new level from some combination.

2. Could embedding information from the map view or a raytrace of the surroundings of the player position help with consistency? I suppose the model would need to predict this information as the neural simulation progressed.

3. Can this technique be applied to generating videos with consistent subjects and environments by training on a camera view of a 3D scene and embedding the camera position and the position and animation states of objects and avatars within the scene?

4. What would the result of training on a variety of game engines and games with different mechanics and inputs be? The space of possible actions is limited by the available keys on a keyboard or buttons on a controller but the labelling of the characteristics of each game may prove a challenge if you wanted to be able to prompt for specific details.

bufferoverflow1y ago

That's probably how our reality is rendered.

TheRealPomax1y ago

If by "game" you mean "literal hallucination" then yes. But if we're not trying to click-bait, then no: it's not really a game when there is no permanence or determinism to be found anywhere. It might be a "game-flavoured dream simulator", but it's absolutely not a game engine.

t1c1y ago

They got DOOM running on a diffusion engine before GTA 6

broast1y ago

Maybe one day this will be how operating systems work.

misterflibble1y ago

Don't give them ideas lol terrifying stuff if that happens!

KhoomeiK1y ago

NVIDIA did something similar with GANs in 2020 [1], except users could actually play those games (unlike in this diffusion work which just plays back simulated video). Sentdex later adapted this to play GTA with a really cool demo [2].

[1] https://research.nvidia.com/labs/toronto-ai/gameGAN/

[2] https://www.youtube.com/watch?v=udPY5rQVoW0

dysoco1y ago

Ah finally we are starting to see something gaming related. I'm curious as to why we haven't seen more of neural networks applied to games even in a completely experimental fashion; we used to have a lot of little experimental indie games such as Façade (2005) and I'm surprised we don't have something similar years after the advent of LLMs.

We could have mods for old games that generate voices for the characters for example. Maybe it's unfeasible from a computing perspective? There are people running local LLMs, no?

raincole1y ago

> We could have mods for old games that generate voices for the characters for example

You mean in real time? Or just in general?

There are a lot of mods that use AI-generated voices. I'll say it's the norm of modding community now.

troupo1y ago

Key: "predicts next frame, recreates classic Doom". A game that was analyzed and documented to death. And the training included uncountable runs of Doom.

A game engine lets you create a new game, not predict the next frame of an existing and copiously documented one.

This is not a game engine.

Creating a new good game? Good luck with that.

throwmeaway2221y ago

You know how when you're dreaming and you walk into a room at your house and you're suddenly naked at school?

I'm convinced this is the code that gives Data (ST TNG) his dreaming capabilities.

gwern1y ago

People may recall GameGAN from May 2020: https://arxiv.org/abs/2005.12126#nvidia https://nv-tlabs.github.io/gameGAN/#nvidia https://github.com/nv-tlabs/GameGAN_code

kcaj1y ago

Take a bunch of videos of the real world and calculate the differential camera motion with optical flow or feature tracking. Call this the video’s control input. Now we can play SORA.

jetrink1y ago

What if instead of a video game, this was trained on video and control inputs from people operating equipment like warehouse robots? Then an automated system could visualize the result of a proposed action or series of actions when operating the equipment itself. You would need a different model/algorithm to propose control inputs, but this would offer a way for the system to validate and refine plans as part of a problem solving feedback loop.

Workaccount21y ago

>Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control

https://deepmind.google/discover/blog/rt-2-new-model-transla...

yair99dd1y ago

Yotube user hu-po streams critical in-depth streams of Ai papers. Here is his take on this (and other relevant) paper https://www.youtube.com/live/JZgqQB4Aekc

lynx231y ago

Hehe, this sounds like the backstory of a remake of the Terminator, or "I have no mouth, but I must scream." In the aftermath of AI killing off humanity, researchers look deeply into how this could have ahppened. And after a number of dead ends, they finally realize: it was trained, in its infancy, on Doom!

wantsanagent1y ago

Anyone have reliable numbers on the file sizes here? Doom.exe from my searches was around 715k, and with all assets somewhere around 10MB. It looks like the SD 1.4 files are over 2GB, so it's likely we're looking at a 200-2000x increase in file size depending on if you think of this as an 'engine' or the full game.

lukol1y ago

I believe future game engines will be state machines with deterministic algorithms that can be reproduced at any time. However, rendering said state into visual / auditory / etc. experiences will be taken over by AI models.

This will also allow players to easily customize what they experience without changing the core game loop.

nuz1y ago

I wonder how overfit it is though. You could fit a lot of doom resolution jpeg frames into 4gb (the size of SD1.4)

JDEngi1y ago

This is going to be the future of cloud gaming, isn't it? In order to deal with the latency, we just generate the next frame locally, and we'll have the true frame coming in later from the cloud, so we're never dreaming too far ahead of the actual game.

KETpXDDzR1y ago

I think the correct title should be "Diffusion Models Are Fake Real-Time Game Engines". I don't think just more training will ever be sufficient to create a complete game engine. It would need to "understand" what it's doing.

ciroduran1y ago

Congrats on running Doom on an Diffusion Model :D

I was really entranced on how combat is rendered (the grunt doing weird stuff in very much the style that the model generates images). Now I'd like to see this implemented in a shader in a game

seydor1y ago

I wonder how far it is from this to generating language reasoning about the game from the game itself, rather than learning a large corpus of language, like LLMs do. That would be a true grounded language generator

golol1y ago

Certain categories of youtube videos can also be viewed as some sort of game where the actions are the audio/transcript advanced a couple of seconds. Add two eggs. Fetch the ball. I'm walking in the park.

darrinm1y ago

So… is it interactive? Playable? Or just generating a video of gameplay?

vunderba1y ago

From the article: We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.

The demo is actual gameplay at ~20 FPS.

darrinm1y ago

It confused me that their stated evaluations by humans are comparing video clips rather than evaluating game play.

furyofantares1y ago

Short clips are the only way a human will make any errors determining which is which.

1 more reply

holoduke1y ago

I saw a video a while ago where they recreated actual doom footage with a diffusion technique so it looked like a jungle or anything you liked. Cant find it anymore, but looked impressive.

jumploops1y ago

This seems similar to how we use LLMs to generate code: generate, run, fix, generate.

Instead of working through a game, it’s building generic UI components and using common abstractions.

qnleigh1y ago

Could a similar scheme be used to drastically improve the visual quality of a video game? You would train the model on gameplay rendered at low and high quality (say with and without ray tracing, and with low and high density meshing), and try to get it to convert a quick render into something photorealistic on the fly.

When things like DALL-E first came out, I was expecting something like the above to make it into mainstream games within a few years. But that was either too optimistic or I'm not up to speed on this sort of thing.

agys1y ago

Isn't that what Nvidia’s Ray Reconstruction and DLSS (frame generation and upscaler) are doing, more or less?

qnleigh1y ago

At a high level I guess so. I don't know enough about Ray Reconstruction (though the results are impressive), but I was thinking of something more drastic than DLSS. Diffusion models on static images can turn a cartoon into a photorealistic image. Doing something similar for a game, where a low-quality render is turned into something that would otherwise take seconds to render, seems qualitatively quite different from DLSS. In principle a model could fill in huge amounts of detail, like increasing the number of particles in a particle-based effect, adding shading/lighting effects...

LtdJorge1y ago

So is it taking inputs from a player and simulating the gameplay or is it just simulating everything (effectively, a generated video)?

lackoftactics1y ago

I think Alan's conservative countdown to AGI will need to be updated after this. https://lifearchitect.ai/agi/ This is really impressive stuff. I thought about it a couple of months ago, that probably this is the next modality worth exploring for data, but didn't imagine it would come so fast. On the other side, the amount of compute required is crazy.

acoye1y ago

Nvidia CEO reckons your GPU will be replaced with AI in “5-10 years”. So this is what the sort of first working game I guess.

acoye1y ago

I'd love to see John Carmack come back from his AGI hiatus and advance AI based rendering. This would be supper cool.

amunozo1y ago

This is amazing and an interesting discovery. It is a pity that I don't find it capable of creating anything new.

harha_1y ago

This is so sick I don't know what to say. I never expected this, aren't the implications of this huge?

aithrowaway19871y ago

I am struggling to understand a single implication of this! How does this generalize to anything other than other than playing retro games in the most expensive way possible? The very intention of this project is overfitting to data in a non-generalizable way! Maybe it's just pure engineering, that good ANNs are getting cheap and fast. But this project still seems to have the fundamental weaknesses of all AI projects:

- needs a huge amount of data, which a priori precludes a lot of interesting use cases

- flashy-but-misleading demos which hide the actual weaknesses of the AI software (note that the player is moving very haltingly compared to a real game of DOOM, where you almost never stop moving)

- AI nailing something really complicated for humans (98% effective raycasting, 98% effective Python codegen) while failing to grasp abstract concepts rigorously understood by fish (object permanence, quantity)

I am genuinely struggling to see this as a meaningful step forward. It seems more like a World's Fair exhibit - a fun and impressive diversion, but probably not a vision of the future. Putting it another way: unlike AlphaGo, Deep Blue wasn't really a technological milestone so much as a sociological milestone reflecting the apex of a certain approach to AI. I think this DOOM project is in a similar vein.

harha_1y ago

I agree with you, when I made this comment I was simply excited but that didn't last too long. I find this technology both exciting and dystopian, the latter because the dystopic use of it is already happening all over the internet. For now, it's been used only for entertainment AFAIK, which is the kind of use I don't like either, because I prefer human created entertainment over this crap.

maxglute1y ago

RL tetris effect hallucination.

Wish there was 1000s of hours of hardcore henry to train. Maybe scrape gopro war cams.

nicman231y ago

what i want from something like this is a mix. a model that can infinitely "zoom" into an object's texture which even if not perfect it would be fine and a model that would create 3d geometry from bump maps / normals

mobiuscog1y ago

Video Game streamers are next in line to be replaced by AI I guess.

EcommerceFlow1y ago

Jensen said that this is the future of gaming a few months ago fyi.

Fraterkes1y ago

Thousands of different people have been speculating about this kind of thing for years.

weakfish1y ago

Who is that?

kqr1y ago

I have been kind of "meh" about the recent AI hype, but this is seriously impressive.

Of course, we're clearly looking at complete nonsense generated by something that does not understand what it is doing – yet, it is astonishingly sensible nonsense given the type of information it is working from. I had no idea the state of the art was capable of this.

gwbas1c1y ago

Am I the only one who thinks this is faked?

It's not that hard to fake something like this: Just make a video of DOSBox with DOOM running inside of it, and then compress it with settings that will result in compression artifacts.

GaggiX1y ago

>Am I the only one who thinks this is faked?

Yes.

amelius1y ago

Yes, and you can use an LLM to simulate role playing games.

piperswe1y ago

This is honestly the most impressive ML project I've seen since... probably O.G. DALL-E? Feels like a gem in a sea of AI shit.

jasonkstevens1y ago

AI no longer plays Doom-it is Doom.

aghilmort1y ago

looking forward to &/or wondering about overlap with notion of ray tracing LLMs

itomato1y ago

The gibs are a dead giveaway

joseferben1y ago

impressive, imagine this but photo realistic with vr goggles.

thegabriele1y ago

Wow, I bet Boston Dynamics and such are quite interested

YeGoblynQueenne1y ago

Misleading Titles Are Everywhere These Days.

danielmarkbruce1y ago

What is the point of this? It's hard to see how this is useful. Maybe it's just an exercise to show what a diffusion model can do?

richard___1y ago

Uhhh… demos would be more convincing with enemies and decreasing health

Kiro1y ago

I see enemies and decreasing health on hit. But even if it lacked those, it seems like a pretty irrelevant nitpick that is completely underplaying what we're seeing here. The fact that this is even possible at all feels like science fiction.

dean24321y ago

So in the future we can play FPS games given any setting? Pog

sitkack1y ago

What most programmers don't understand, that in the very near future, the entire application will be delivered by an AI model, no source, no text, just connect to the app over RDP. The whole app will be created by example, the app developer will train the app like a dog trainer trains a dog.

Jonovono1y ago

I think it's possible AI models will generate dynamic UI for each client and stream the UI to clients (maybe eventually client devices will generate their UI on the fly) similar to Google Stadia. Maybe some offset of video that allows the remote to control it. Maybe Wasm based - just stream wasm bytecode around? The guy behind VLC is building a library for ulta low latency: https://www.kyber.video/techology.

I was playing around with the idea in this: https://github.com/StreamUI/StreamUI. Thinking is take the ideas of Elixir LiveView to the extreme.

sitkack1y ago

I am so glad you posted, this is super cool!

I too have been thinking about how to push dynamic wasm to the client for super low latency UIs.

LiveView is just the beginning. Your readme is dreamy. I'll dive into your project at the end of Sept when I get back into deep tech.

ukuina1y ago

So... https://websim.ai except over pixels instead of in your browser?

sitkack1y ago

Yes, and that is super neat.

Grimblewald1y ago

that might work for some applications, especially recreational things, I think we're a while away from it doing away with all things, especially where deterministic behavior, efficiency, or reliability are important.

sitkack1y ago

Problems for two papers down the line.

j / k navigate · click thread line to collapse

409 comments

vessenes1y ago

That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.

Anyway, a fun idea that worked! Love those.

wavemode1y ago

> Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected

dewarrn11y ago

[0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...

lawlessone1y ago

I reminds me of dreaming. When you do something and turn back to check it has turned into something completely different.

edit: someone should train it on MyHouse.wad

robotresearcher1y ago

Not noticing to a gorilla that ‘shouldn’t’ be there is not the same thing as object permanence. Even quite young babies are surprised by objects that go missing.

1 more reply

bamboozled1y ago

Are you saying if I turn around, I’ll be surprised at what I find ? I don’t feel like this is accurate at all.

3 more replies

throwway_2783141y ago

Work which exaggerates the blindness.

Why would you think the gorillas were unusual?

2 more replies

nmstoker1y ago

wavemode1y ago

2 more replies

whiteboardr1y ago

But does it need to be frame-based?

A dialogue between two parties with different functionality so to speak.

(Non technical person here - just fantasizing)

robotresearcher1y ago

In that scheme what is the NN providing that a classical renderer would not? DOOM ran great on an Intel 486, which is not a lot of computer.

2 more replies

bee_rider1y ago

In that case, the title of the article wouldn’t be true anymore. It seems like a better plan, though.

beepbooptheory1y ago

What would the model provide if not what we see on the screen?

1 more reply

mensetmanusman1y ago

That is kind of cool though, I would play like being lost in a dream.

If on the backend you could record the level layouts in memory you could have exploration teams that try to find new areas to explore.

debo_1y ago

It would be cool for dream sequences in games to feel more like dreams. This is probably an expensive way to do it, but it would be neat!

codeflo1y ago

Even purely going forward, specks on wall textures morph into opponents and so on. All the diffusion-generated videos I’ve seen so far have this kind of unsettling feature.

bee_rider1y ago

It it like some kind of weird dream doom.

hoosieree1y ago

Small objects like powerups appear and disappear as the player moves (even without backtracking), the ammo count is constantly varying, getting shot doesn't deplete health or armor, etc.

TeMPOraL1y ago

Groxx1y ago

There's an example right at the beginning too - the ammo drop on the right changes to something green (I think that's a body?)

Workaccount21y ago

idunnoman12221y ago

Where does a sora video turn around backwards? I can’t maintain such consistency in my own dreams.

1 more reply

idunnoman12221y ago

Where does a sora video turn around backwards? I don’t even maintain such consistency in my dreams.

nielsbot1y ago

You can also notice in the first part of the video the ammo numbers fluctuate a bit randomly.

alickz1y ago

is that something that can be solved with more memory/attention/context?

or do we believe it's an inherent limitation in the approach?

noiv1y ago

I think the real question is does the player get shot from behind?

1 more reply

refibrillator1y ago

Just want to clarify a couple possible misconceptions:

Noise is added to the previous frames before being passed into the SD model, so the RL agents were not involved with “correcting” it.

De-noising objectives are widespread in ML, intuitively it forces a predictive model to leverage context, ie surrounding frames/words/etc.

rvnx1y ago

The concept is that if you train a Diffusion model by feeding all the possible frames seen in the game.

The training was over almost 1 billion frames, 20 days of full-time play-time, taking a screenshot of every single inch of the map.

Now you show him N frames as input, and ask it "give me frame N+1", then it gives you the frame n. N+1 back based on how it was originally seen during training.

But it is not frame N+1 from a mysterious intelligence, it's simply frame N+1 given back from past database.

The drift you mentioned is actually a clear (but sad) proof that the model does not work at inventing new frames, and can only spit out an answer from the past dataset.

It's a bit like if you train stable diffusion on Simpsons episodes, and that it outputs the next frame of an existing episode that was in the training set, but few frames later goes wild and buggy.

mensetmanusman1y ago

Research is the acquisition of knowledge that may or may not have practical applications.

They succeeded in the research, gained knowledge, and might be able to do something awesome with it.

It’s a success even if they don’t sell anything.

1 more reply

jetrink1y ago

1 more reply

nine_k1y ago

But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".

I would call it the world's least efficient video compression.

PoignardAzur1y ago

> But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".

It's not super clear from the landing page, but I think it's an engine? Like, its input is both previous images and input for the next frame.

So as a player, if you press "shoot", the diffusion engine need to output an image where the monster in front of you takes damage/dies.

bergen1y ago

How is what you think they say not clear?

We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.

Sharlin1y ago

No, it’s predicting the next frame conditioned on past frames AND player actions! This is clear from the article. Mere video generation would be nothing new.

taneq1y ago

It's more like the Tetris Effect, where the model has seen so much Doom that it confabulates gameplay.

TeMPOraL1y ago

mensetmanusman1y ago

WithinReason1y ago

If it's trained on absolute player coordinates then it would likely just morph into the known map at those coordinates.

nine_k1y ago

pradn1y ago

> Google here uses SD 1.4, as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.

Even Google only has one Gemini (in a few versions).

fennecfoxy1y ago

raghavbali1y ago

Though I wonder if 10 years down the line folks wouldn't even care about underlying model details (no more than a current day web-developer needs to know about network packets).

PS: Not great examples, but I hope you get the idea ;)

bubaumba1y ago

> nice reminder that open models are useful to

You didn't say open _what_ models. Was that intentional?

Philpax1y ago

They did, SD 1.4

wkcheng1y ago

lokimedes1y ago

If that is the case, what does aphantasia tell us?

[1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...

dbspin1y ago

lokimedes1y ago

I fabulate about this in another comment below:

(I obviously don't know what I'm talking about, just a fellow aphant)

2 more replies

zimpenfish1y ago

Pretty much the same for me. My aphantasia is total (no images at all) but still ludicrously vivid dreams and not too bad at recognising people and places.

jonplackett1y ago

lokimedes1y ago

I agree, it is my own experience as well. Craig Venter In one of his books also credit this way of representing knowledge as abstractions as his strength in inventing new concepts.

I’m no where near qualified to speak of this with certainty, but it seems plausible to me.

quickestpoint1y ago

As Richard Dawkins theorized, would be more accurate and less LLM like :)

nsbk1y ago

[1] https://en.wikipedia.org/wiki/Lisa_Feldman_Barrett

[2] https://www.youtube.com/watch?v=NbdRIVCBqNI&t=1443s

PunchTornado1y ago

stevenhuang1y ago

> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

Yup, see https://en.wikipedia.org/wiki/Predictive_coding

quickestpoint1y ago

Umm, that’s a theory.

mind-blight1y ago

So are gravity and friction. I don't know how well tested or accepted it is, but being just a theory doesn't tell you much about how true it is without more info

bangaladore1y ago

> It's insane that that this works, and that it works fast enough to render at 20 fps.

It is running on an entire v5 TPU (https://cloud.google.com/blog/products/ai-machine-learning/i...)

It's unclear how that compares to a high-end consumer GPU like a 3090, but they seem to have similar INT8 TFLOPS. The TPU has less memory (16 vs. 24), and I'm unsure of the other specs.

Philpax1y ago

There's been a lot of work in optimising inference speed of SD - SD Turbo, latent consistency models, Hyper-SD, etc. It is very possible to hit these frame rates now.

dartos1y ago

> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

This, to me, seems extremely reductionist. Like you start with AI and work backwards until you frame all cognition as next something predictors.

It’s just the stochastic parrot argument again.

wrsh071y ago

Makes me wonder when an update to the world models paper comes out where they drop in diffusion models: https://worldmodels.github.io/

Teever1y ago

Also recursion and nested virtualization. We can dream about dreaming and imagine different scenarios, some completely fictional or simply possible future scenarios all while doing day to day stuff.

mensetmanusman1y ago

Penrose (Nobel prize in physics) stipulates that quantum effects in the brain may allow a certain amount of time travel and back propagation to accomplish this.

wrsh071y ago

You don't need back propagation to learn

This is an incredibly complex hypothesis that doesn't really seem justified by the evidence

richard___1y ago

Did they take in the entire history as context?

slashdave1y ago

Image is 2D. Video is 3D. The mathematical extension is obvious. In this case, low resolution 2D (pixels), and the third dimension is just frame rate (discrete steps). So rather simple.

Sharlin1y ago

This is not "just" video, however. It's interactive in real time. Sure, you can say that playing is simply video with some extra parameters thrown in to encode player input, but still.

slashdave1y ago

It is just video. There are no external interactions.

Heck, it is far simpler than video, because the point of view and frame is fixed.

3 more replies

InDubioProRubio1y ago

slashdave1y ago

SeanAnderson1y ago

ollin1y ago

Chance-Device1y ago

I also thought this, but refer back to the paper, not the abstract:

> A is the set of key presses and mouse movements…

> …to condition on actions, we simply learn an embedding A_emb for each action

So, it’s clear that in this model the diffusion process is conditioned by embedding A that is derived from user actions rather than words.

Then a noised start frame is encoded into latents and concatenated on to the noise latents as a second conditioning.

So we have a diffusion model which is trained solely on images of doom, and which is conditioned on current doom frames and user actions to produce subsequent frames.

So yes, the users are playing it.

However, it should be unsurprising that this is possible. This is effectively just a neural recording of the game. But it’s a cool tech demo.

psb2171y ago

foota1y ago

I wonder if they could somehow feed in a trained Gaussian splats model to this to get better images?

Since the splats are specifically designed for rendering it seems like it would be an efficient way for the image model to learn the geometry without having to encode it on the image model itself.

Chance-Device1y ago

I’m not sure how that would help vs just training the model with the conditionings described in the paper.

I’m not seeing how that would apply here but I’d be interested in hearing how you would do it.

1 more reply

dewarrn11y ago

77341281y ago

What you're describing reminded me of this cool project:

https://www.youtube.com/watch?v=udPY5rQVoW0 "Playing a Neural Network's version of GTA V: GAN Theft Auto"

refibrillator1y ago

You are incorrect, this is an interactive simulation that is playable by humans.

> Figure 1: a human player is playing DOOM on GameNGen at 20 FPS.

The abstract is ambiguously worded which has caused a lot of confusion here, but the paper is unmistakably clear about this point.

Kind of disappointing to see this misinformation upvoted so highly on a forum full of tech experts.

psb2171y ago

FrustratedMonky1y ago

Yeah. If isn't doing this, then what could it be doing that is worth a paper? "real-time user input and adjusts its output accordingly"

rvnx1y ago

There is a hint in the paper itself:

It says in a shy way that it is based on: "Ha & Schmidhuber (2018) who train a Variational Auto-Encoder (Kingma & Welling, 2014) to encode game frames into a latent vector"

So it means they most likely took https://worldmodels.github.io/ (that is actually open-source) or something similar and swapped the frame generation by Stable Diffusion that was released in 2022.

GaggiX1y ago

>I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly

teamonkey1y ago

lewhoo1y ago

The movement of the player seems jittery a bit so I inferred something similar on that basis.

bob10291y ago

Were the agents playing at 20 real FPS, or did this occur like a Pixar movie offline?

SeanAnderson1y ago

Ehhh okay, I'm not as convinced as I was earlier. Sorry for misleading. There's been a lot of back-and-forth.

led me to believe that while they had an ultimate goal of user input (why wouldn't they) they sufficed by approximating human input.

I was looking to refute that assumption later in the paper by hopefully reading some words on the human gameplay experience, but instead, under Results, I found:

Anyway. I repent! It was a learning experience going back and forth on my belief here. Very cool tech overall.

psb2171y ago

It's funny how academic writing works. Authors rarely produce many unclear or ambiguous statements where the most likely interpretation undersells their work...

pajeets1y ago

I knew it was too good be true but seems like real time video generation can be good enough to get to a point where it feels like a truly interactive video/game

zzanz1y ago

fngjdflmdflg1y ago

>Technically speaking, isn't this the greatest possible anti-Doom

ynniv1y ago

It's dreaming Doom.

birracerveza1y ago

We made machines dream of Doom. Insane.

1 more reply

qingcharles1y ago

Do Robots Dream of E1M1?

elwell1y ago

Droom

bugglebeetle1y ago

Pierre Menard, Author of Doom.

el_memorioso1y ago

I applaud your erudition.

airstrike1y ago

OK, this is the single most perfect comment someone could make on this thread. Diffusion me impressed.

jl61y ago

Knee Deep in the Death of the Author.

1attice1y ago

that took a moment, thank you

Terr_1y ago

> the Doom with the highest possible hardware requirement?

Isn't that possible by setting arbitrarily high goals for ray-cast rendering?

Vecr1y ago

It's the No-Doom.

WithinReason1y ago

Undoom?

riwsky1y ago

It’s a mood.

jeffhuys1y ago

Bliss

x-complexity1y ago

> Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement?

Not really? The greatest anti-Doom would be an infinite nest of these types of models predicting models predicting Doom at the very end of the chain.

The next step of anti-Doom would be a model generating the model, generating the Doom output.

nurettin1y ago

Isn't this technically a model (training step) generating a model (a neural network) generating Doom output?

yuchi1y ago

“…now it can implement Doom!”

rldjbpin1y ago

to me the closer analogy here is the "running minecraft inside minecraft" (https://news.ycombinator.com/item?id=32901461)

godelski1y ago

Doom system requirements:

  - 4 MB RAM
  - 12 MB disk space

Stable diffusion v1

  > 860M UNet and CLIP ViT-L/14 (540M)
  Checkpoint size:
    4.27 Gb 
    7.7 GB (full EMA)
  Running on a TPU-v5e
    Peak compute per chip (bf16)  197 TFLOPs
    Peak compute per chip (Int8)  393 TFLOPs
    HBM2 capacity and bandwidth  16 GB, 819 GBps
    Interchip Interconnect BW  1600 Gbps

I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.

- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...

- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...

- https://cloud.google.com/tpu/docs/v5e

- https://github.com/Farama-Foundation/ViZDoom

- https://zdoom.org/index

snickmy1y ago

Those are valid points, but irrelevant for the context of this research.

tobr1y ago

godelski1y ago

pickledoyster1y ago

>you could assume that all that can be either done at the margin of this discovery OR over time will naturally improve OR will become less important as a blocker.

OR one can hope it will be thrown to the heap of nonviable tech with the rest of spam waste

godelski1y ago

I'm not sure what you're saying is irrelevant.

1) the model has enough memory to store not only all game assets and engine but even hundreds of "plays".

2) me mentioning that there's still a lot of room to make these things better (seems you think so too so maybe not this one?)

4) the point that you can rip a game

I'm really not sure what you're contesting to because I said several things.

  > it lacks basic things like pre-computing, storing, etc.

It does? Last I checked neural nets store information. I guess I need to return my PhD because last I checked there's a UNet in SD 1.4 and that contains a decoder.

snickmy1y ago

Sorry, probably didn't explain myself well enough

2) yep, aligned here

3) I'm not fully following here, but agree this is not NeurIPS, and no Schmidhuber's bickering.

1 more reply

danielmarkbruce1y ago

Is it a breakthrough? Weather models are miles ahead of this as far as I can tell.

dTal1y ago

>What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute

That's the least of it. It means you can generate a game from real footage. Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.

phh1y ago

> Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.

I guess that's the occasion to remind that ML is splendid at interpolating, but extrapolating, maybe don't keep your hopes too high.

Namely, to have a "perfect flight sim" using GoPros, you'll need to record hundreds of stalls and crashs.

godelski1y ago

  > Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.

1 more reply

camtarn1y ago

isaacfung1y ago

The possibility seems far beyond gaming(given enough computation resources).

Such a "game engine" can potentially be used as a simulation gym environment to train RL agents.

dvngnt_1y ago

wouldnt make more sense to train using microsoft flight simulator the same way they did DOOM, but im not sure what the point is if the game already exists

Sohcahtoa821y ago

It's always fun reading the dead comments on a post like this. People love to point how how pointless this is.

Some of ya'll need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.

Time spent enjoying yourself is never time wasted. Some of ya'll are going to be on your death beds wishing you had allowed yourself to have more fun.

ploxiln1y ago

But it's a funny party trick, sure.

joegibbs1y ago

1 more reply

Gooblebrai1y ago

So true. The hustle culture is an spreading disease that has replaced the fun maker culture from the 80s/90s.

It's unavoidable though. Cost of living being increasingly expensive and romantization of entrepreneurs like they are rock stars leads towards this hustle mindset.

nuancebydefault1y ago

And here we are, binging netflix movies over such copper wires.

ninetyninenine1y ago

I don’t think this is not useful. This is a stepping stone for generating entire novel games.

Sohcahtoa821y ago

> This is a stepping stone for generating entire novel games.

I don't see how.

And like I said, all of it requires the game to already exist in order to train the network.

airstrike1y ago

> (so you can kill an enemy, turn your back, then turn around again, and the enemy could be alive again)

Sounds like a great game.

> not to mention that it requires the game to already exist in order to train it

ninetyninenine1y ago

>It has no notion of game state (so you can kill an enemy, turn your back, then turn around again)

> not to mention that it requires the game to already exist in order to train it.

Is this a problem? Do games not exist? Not only due we have tons of games, but we also have in theory unlimited amounts of training data for each game.

1 more reply

throwthrowuknow1y ago

Read the paper. It is capable of maintaining state for a fairly long time including updating the UI elements.

msk-lywenn1y ago

I’d like to now to carbon footprint of that fun.

HellDunkel1y ago

zamadatix1y ago

HellDunkel1y ago

MasterScrat1y ago

Interesting point.

> An engine would also work offroad.

HellDunkel1y ago

jsheard1y ago

> Here you could imagine that such a "generative game engine" could also go offroad, extrapolating what would happen if you go to unseen places.

refibrillator1y ago

There is no text conditioning provided to the SD model because they removed it, but one can imagine a near future where text prompts are enough to create a fun new game!

mlsu1y ago

God I love that they chose DOOM.

Jensson1y ago

mlsu1y ago

godelski1y ago

  > With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM.

You and I have very different definitions of compression

https://news.ycombinator.com/item?id=41377398

  > Someone in the field could probably correct me on that.

^__^

_hark1y ago

The raw capacity of the network doesn't tell you how complex the weights actually are. The capacity is only an upper bound on the complexity.

1 more reply

energy1231y ago

The source code lacks information required to render the game. Textures for example.

TeMPOraL1y ago

1 more reply

mistercheph1y ago

electrondood1y ago

The first thing I thought when I saw this was: couldn't my immediate experience be exactly the same thing? Including the illusion of a separate main character to whom events are occurring?

basch1y ago

I would expect something in this realm to be a little better at not being visually inconsistent when you look away and look back. A red monster turning into a blue friendly etc.

troupo1y ago

> one can imagine a near future where text prompts are enough to create a fun new game

Sit down and write down a text prompt for a "fun new game". You can start with something relatively simple like a Mario-like platformer.

By page 300, when you're about halfway through describing what you mean, you might understand why this is wishful thinking

reverius421y ago

troupo1y ago

Things that might work plausible in a static image will not look plausible when things are moving, especially in the game.

Also: https://news.ycombinator.com/item?id=41376722

Also: define "fun" and "new" in a "simple text prompt". Current image generators suck at properly reflecting what you want exactly, because they regurgitate existing things and styles.

slashdave1y ago

> where text prompts are enough to create a fun new game!

Not really. This is a reproduction of the first level of Doom. Nothing original is being created.

SomewhatLikely1y ago

user4326781y ago

radarsat11y ago

Most games are conditioned on text, it's just that we call it "source code" :).

(Jk of course I know what you mean, but you can seriously see text prompts as compressed forms of programming that leverage the model's prior knowledge)

magicalhippo1y ago

This got me thinking. Anyone tried using SD or similar to create graphics for the old classic text adventure games?

danjl1y ago

So, diffusion models are game engines as long as you already built the game? You need the game to train the model. Chicken. Egg?

kragen1y ago

here are some ideas:

- you could build a non-real-time version of the game engine and use the neural net as a real-time approximation

injidup1y ago

Why games? I will train it on 1 years worth of me attending Microsoft teams meetings. Then I will go surfing.

kqr1y ago

I guess I should start hoarding video of myself now.

1 more reply

akie1y ago

Ready to pay for this

ccozan1y ago

most underrated comment here!

w_for_wumbo1y ago

I imagine a game like that could get so convincing in its details and immersiveness that one could forget they're playing a game.

troupo1y ago

There are thousands of games that mimic each other, and only a handful of them are any good.

What makes you think a mechanical "predict next frame based on existing games" will be any good?

1 more reply

numpad01y ago

IIRC, both 2001(1968) and Solaris(1972) depict that kind of things as part of alien euthanasia process, not as happy endings

2 more replies

aithrowaway19871y ago

Have you ever played a video game? This is unbelievably depressing. This is a future where games like Slay the Spire, with a unique art style and innovative gameplay simply are not being made.

2 more replies

omegaworks1y ago

EXISTENZ IS PAUSED!

THBC1y ago

Holodeck is just around the corner

1 more reply

qznc1y ago

The Cloud Gaming platforms could record things for training data.

modeless1y ago

If you train it on multiple games then you could produce new games that have never existed before, in the same way image generation models can produce new images that have never existed before.

jsheard1y ago

lewhoo1y ago

From what I understand that could make the engine much less stable. The key here is repetitiveness.

billconan1y ago

maybe the next step is adding text guidance and generating non-existing games.

notfed1y ago

I think the same comment could be said about generative images, no?

passion__desire1y ago

attilakun1y ago

If only there was a rich 3-dimensional physical environment we could draw training data from.

slashdave1y ago

Well, yeah. Image diffusion models only work because you can provide large amounts of training data. For Doom it is even simpler, since you don't need to deal with compositing.

dtagames1y ago

A diffusion model cannot be a game engine because a game engine can be used to create new games and modify the rules of existing games in real time -- even rules which are not visible on-screen.

kqr1y ago

> even rules which are not visible on-screen.

If a rule was changed but it's never visible on the screen, did it really change?

> It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).

Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.

znx_01y ago

> If a rule was changed but it's never visible on the screen, did it really change?

Well for "some" games it does really change

darby_nine1y ago

> Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.

It's much simpler than actually creating a game....

stnmtn1y ago

1 more reply

throwthrowuknow1y ago

calebh1y ago

momojo1y ago

The title should be "Diffusion Models can be used to render frames given user input"

sharpshadow1y ago

So all it did is generate a video of the gameplay which is slightly different from the video it used for training?

TeMPOraL1y ago

No, it implements a 3D FPS that's interactive, and renders each frame based on your input and a lot of memorized gameplay.

sharpshadow1y ago

But is it playing the actual game or just making a interactive video of it?

3 more replies

alkonaut1y ago

helloplanets1y ago

So, any given sequence of inputs is rebuilt into a corresponding image, twenty times per second. I wonder how separate the game logic and the generated graphics are in the fully trained model.

I imagine this whole system would work especially well for games that only update when player input is given: Games like Myst, Sokoban, etc.

toppy1y ago

I think you've just encoded the title of the paper

1 more reply

panki271y ago

> Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation.

I can hardly believe this claim, anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.

arc-in-space1y ago

This, watching the generated clips feels uncomfortable, like a nightmare. Geometry is "swimming" with camera movement, objects randomly appear and disappear, damage is inconsistent.

aithrowaway19871y ago

freestyle241471y ago

It made me laugh. Maybe they pulled random people from the hallway who had never seen the original Doom (or any FPS), or maybe only selected people who wore glasses and forgot them at their desk.

meheleventyone1y ago

GaggiX1y ago

>rather than sitting comparable situations next to one another in the game and simulation then analyzing them.

That's literally how the human rating was setup if you read the paper.

meheleventyone1y ago

1 more reply

golol1y ago

Closi1y ago

golol1y ago

This misses the point, I'm comparing two methods of generating minecraft videos.

soulofmischief1y ago

pantalaimon1y ago

I would have thought it is much easier to generate huge amounts of game footage for training, but as I understand this is not what was done here.

mo_421y ago

An implementation of the game engine in the model itself is theoretically the most accurate solution for predicting the next frame.

I'm wondering when people will apply this to other areas like the real world. Would it learn the game engine of the universe (ie physics)?

radarsat11y ago

(Again for robotics though maybe it's enough to record the motor commands, just that you can't easily record the "motor commands" for humans, for example)

cubefox1y ago

A popular theory in neuroscience is that this is what the brain does:

https://slatestarcodex.com/2017/09/05/book-review-surfing-un...

ccozan1y ago

icoder1y ago

(It is the same with chess, where the LLM models are becoming really good, yet sometimes make mistakes that even my 8yo niece would not make)

marci1y ago

icoder1y ago

Sure, although I still think a system with less of a contrast between how well it performs 'modally' and how bad it performs incidentally, would be more practical.

lIl-IIIl1y ago

How does it know how many times it needs to shoot the zombie before it dies?

Most enemies have enough hit points to survive the first shot. If the model is only trained on the previous frame, it doesn't know how many times the enemy was already shot at.

From the video it seems like it is probability based - they may die right away or it might take way longer than it should.

I love how the player's health goes down when he stands in the radioactive green water.

In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

meheleventyone1y ago

> I love how the player's health goes down when he stands in the radioactive green water.

golol1y ago

It gets a number of previous frames as input I think.

lupusreal1y ago

> In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

They trained this thing on bot gameplay, so I bet it does poorly when advanced strategies like deliberately inducing mob infighting are employed (the bots probably didn't do that a lot, of at all.)

masterspy71y ago

rererereferred1y ago

But can you grab what this Ai has learned and generate the 3d models, maps and code to turn it into an actual game that can run on a user's PC? That would be amazing.

passion__desire1y ago

Jensen Huang's vision that future games will be generated rather than rendered is coming true.

kleiba1y ago

You don't even need to do all of that - this trained model already is the game, i.e., it's interactive, you can play the game.

whamlastxmas1y ago

I would absolutely love if they could take this demo, add a new door that isn’t in the original, and see what it generates behind that door

nolist_policy1y ago

Makes me wonder... If you stand still in front of a door so all past observations only contain that door, will the model teleport you to another level when opening the door?

zbendefy1y ago

smusamashah1y ago

Has this model actually learned the 3d space of the game? Is it possible to break the camera free and roam around the map freely and view it from different angles?

I noticed a few hallucinations e.g. when it picked green jacket from a corner, walking back it generated another corner. Therefore I don't think it has any clue about the 3D world of the game at all.

kqr1y ago

> Is it possible to break the camera free and roam around the map freely and view it from different angles?

This is like a dog that has been trained to form English words – what's impressive is not that it does it well, but that it does it at all.

Sohcahtoa821y ago

> Therefore I don't think it has any clue about the 3D world of the game at all.

AI models don't "know" things at all.

At best, they're just very fuzzy predictors. In this case, given the last couple frames of video and a user input, it predicts the next frame.

It has zero knowledge of the game world, game rules, interactions, etc. It's merely a mapping of [pixels, input] -> pixels.

ravetcofx1y ago

There is going to be a flood of these dreamlike "games" in the next few years. This feels likes a bit of a breakthrough in the engineering of these systems.

Kapura1y ago

arduinomancer1y ago

How does the model “remember” the whole state of the world?

Like if I kill an enemy in some room and walk all the way across the map and come back, would the body still be there?

a_e_k1y ago

Sharlin1y ago

Not unlike in (human) dreams.

Jensson1y ago

It doesn't even remember the state of the game you look at. Doors spawning right in front of you, particle effects turning into enemies mid flight etc, so just regular gen AI issues.

Edit: Can see this in the first 10 seconds of the first video under "Full Gameplay Videos", stairs turning to corridor turning to closed door for no reason without looking away.

csmattryder1y ago

Guessing the model hasn't been taught enough about that, because most people don't jump into hazards.

raincole1y ago

It doesn't. You need to put the world state in the input (the "prompt", even it doesn't look like prompt in this case). Whatever not in the prompt is lost.

rldjbpin1y ago

this is truly a cool demo, but a very misleading title.

dabochen1y ago

So there is no interactivity, but the generated content is not the exact view in the training data, is this the correct understanding?

If so, is it more like imagination/hallucination rather than rendering?

famouswaffles1y ago

It's conditioned on previous frames AND player actions so it's interactive.

rrnechmech1y ago

I get this (mostly). But would any kind soul care to elaborate on this? What is this "drift" they are trying to avoid and how does (AFAIU) adding noise help?

jamilton1y ago

I wonder if the MineRL (https://www.ijcai.org/proceedings/2019/0339.pdf and minerl.io) dataset would be sufficient to reproduce this work with Minecraft.

Any other similar existing datasets?

A really goofy way I can think of to get a bunch of data would be to get videos from youtube and try to detect keyboard sounds to determine what keys they're pressing.

jamilton1y ago

Although ideally a follow up work would be something where there won’t be any potential legal trouble with releasing the complete model so people can play it.

throwthrowuknow1y ago

Several thoughts for future work:

bufferoverflow1y ago

That's probably how our reality is rendered.

TheRealPomax1y ago

t1c1y ago

They got DOOM running on a diffusion engine before GTA 6

broast1y ago

Maybe one day this will be how operating systems work.

misterflibble1y ago

Don't give them ideas lol terrifying stuff if that happens!

KhoomeiK1y ago

[1] https://research.nvidia.com/labs/toronto-ai/gameGAN/

[2] https://www.youtube.com/watch?v=udPY5rQVoW0

dysoco1y ago

We could have mods for old games that generate voices for the characters for example. Maybe it's unfeasible from a computing perspective? There are people running local LLMs, no?

raincole1y ago

> We could have mods for old games that generate voices for the characters for example

You mean in real time? Or just in general?

There are a lot of mods that use AI-generated voices. I'll say it's the norm of modding community now.

troupo1y ago

Key: "predicts next frame, recreates classic Doom". A game that was analyzed and documented to death. And the training included uncountable runs of Doom.

A game engine lets you create a new game, not predict the next frame of an existing and copiously documented one.

This is not a game engine.

Creating a new good game? Good luck with that.

throwmeaway2221y ago

You know how when you're dreaming and you walk into a room at your house and you're suddenly naked at school?

I'm convinced this is the code that gives Data (ST TNG) his dreaming capabilities.

gwern1y ago

People may recall GameGAN from May 2020: https://arxiv.org/abs/2005.12126#nvidia https://nv-tlabs.github.io/gameGAN/#nvidia https://github.com/nv-tlabs/GameGAN_code

kcaj1y ago

Take a bunch of videos of the real world and calculate the differential camera motion with optical flow or feature tracking. Call this the video’s control input. Now we can play SORA.

jetrink1y ago

Workaccount21y ago

https://deepmind.google/discover/blog/rt-2-new-model-transla...

yair99dd1y ago

Yotube user hu-po streams critical in-depth streams of Ai papers. Here is his take on this (and other relevant) paper https://www.youtube.com/live/JZgqQB4Aekc

lynx231y ago

wantsanagent1y ago

lukol1y ago

This will also allow players to easily customize what they experience without changing the core game loop.

nuz1y ago

I wonder how overfit it is though. You could fit a lot of doom resolution jpeg frames into 4gb (the size of SD1.4)

JDEngi1y ago

KETpXDDzR1y ago

ciroduran1y ago

Congrats on running Doom on an Diffusion Model :D

I was really entranced on how combat is rendered (the grunt doing weird stuff in very much the style that the model generates images). Now I'd like to see this implemented in a shader in a game

seydor1y ago

golol1y ago

darrinm1y ago

So… is it interactive? Playable? Or just generating a video of gameplay?

vunderba1y ago

From the article: We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.

The demo is actual gameplay at ~20 FPS.

darrinm1y ago

It confused me that their stated evaluations by humans are comparing video clips rather than evaluating game play.

furyofantares1y ago

Short clips are the only way a human will make any errors determining which is which.

1 more reply

holoduke1y ago

I saw a video a while ago where they recreated actual doom footage with a diffusion technique so it looked like a jungle or anything you liked. Cant find it anymore, but looked impressive.

jumploops1y ago

This seems similar to how we use LLMs to generate code: generate, run, fix, generate.

Instead of working through a game, it’s building generic UI components and using common abstractions.

qnleigh1y ago

agys1y ago

Isn't that what Nvidia’s Ray Reconstruction and DLSS (frame generation and upscaler) are doing, more or less?

qnleigh1y ago

LtdJorge1y ago

So is it taking inputs from a player and simulating the gameplay or is it just simulating everything (effectively, a generated video)?

lackoftactics1y ago

acoye1y ago

Nvidia CEO reckons your GPU will be replaced with AI in “5-10 years”. So this is what the sort of first working game I guess.

acoye1y ago

I'd love to see John Carmack come back from his AGI hiatus and advance AI based rendering. This would be supper cool.

amunozo1y ago

This is amazing and an interesting discovery. It is a pity that I don't find it capable of creating anything new.

harha_1y ago

This is so sick I don't know what to say. I never expected this, aren't the implications of this huge?

aithrowaway19871y ago

- needs a huge amount of data, which a priori precludes a lot of interesting use cases

- flashy-but-misleading demos which hide the actual weaknesses of the AI software (note that the player is moving very haltingly compared to a real game of DOOM, where you almost never stop moving)

harha_1y ago

maxglute1y ago

RL tetris effect hallucination.

Wish there was 1000s of hours of hardcore henry to train. Maybe scrape gopro war cams.

nicman231y ago

mobiuscog1y ago

Video Game streamers are next in line to be replaced by AI I guess.

EcommerceFlow1y ago

Jensen said that this is the future of gaming a few months ago fyi.

Fraterkes1y ago

Thousands of different people have been speculating about this kind of thing for years.

weakfish1y ago

Who is that?

kqr1y ago

I have been kind of "meh" about the recent AI hype, but this is seriously impressive.

gwbas1c1y ago

Am I the only one who thinks this is faked?

It's not that hard to fake something like this: Just make a video of DOSBox with DOOM running inside of it, and then compress it with settings that will result in compression artifacts.

GaggiX1y ago

>Am I the only one who thinks this is faked?

Yes.

amelius1y ago

Yes, and you can use an LLM to simulate role playing games.

piperswe1y ago

This is honestly the most impressive ML project I've seen since... probably O.G. DALL-E? Feels like a gem in a sea of AI shit.

jasonkstevens1y ago

AI no longer plays Doom-it is Doom.

aghilmort1y ago

looking forward to &/or wondering about overlap with notion of ray tracing LLMs

itomato1y ago