This reminds me of how the first generation of these kinds of image generators were said to be 'dreaming'. It also makes me wonder whether our brains really work like these algorithms (or whether these algorithms mimic the brain that closely).
This is fascinating. It's able to pick up sufficiently on the fundamentals of 3D motion from 2D videos, while only needing static images with descriptions to infer semantics.
What's the point then?
You can recreate things from papers fine. I've done it for several projects; it's often nicer than just copy-pasting in code, and it fixes issues where one side is using Montreal's AI toolkit, another is using pytorch, and yet another is using keras.
That said, for a tool like this they clearly used pre-trained models as a large component, ones with publicly accessible weights as well. So a replication will probably appear in the coming months if Meta doesn't release the code, which they (understandably) might not, since they very clearly plan to use it for their own Metaverse product.
Often there's a paper deadline and the code still needs tidying up, or the same codebase supports additional models that are published in additional papers.
Keep an eye on the facebookresearch GitHub for this in the next few months.