If you can rig the meshes to drive each other, you could wear 'masks' of other people's faces (or critters).
You might be interested in project HeadOn at TU München:
https://www.niessnerlab.org/projects/thies2018headon.html
Justus Thies gave a presentation at our university about a year ago. IIRC they don't use any fancy NN stuff; instead, they extract the face geometry using a stereo camera and use interpolation to project the movement onto a target mesh. Applications such as VR meetings with stereo goggles were discussed during the presentation, but the main focus was, of course, on entertaining the audience with fake videos.
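The "interpolation onto a target mesh" step can be sketched roughly as below. This is only an illustration under a big simplifying assumption (one-to-one vertex correspondence between source and target meshes); the actual HeadOn pipeline fits a parametric face model rather than transferring raw vertex offsets, and the function name here is made up.

```python
def transfer_motion(src_neutral, src_current, tgt_neutral, weight=1.0):
    """Apply the source face's per-vertex displacement to a target mesh.

    Each mesh is a list of (x, y, z) vertex tuples, assumed to be in
    correspondence. `weight` interpolates between the target's neutral
    pose (0.0) and the fully transferred expression (1.0).
    """
    out = []
    for sn, sc, tn in zip(src_neutral, src_current, tgt_neutral):
        # Displacement of this vertex on the source, scaled by weight.
        offset = tuple(weight * (c - n) for c, n in zip(sc, sn))
        # Add it to the corresponding target vertex.
        out.append(tuple(t + o for t, o in zip(tn, offset)))
    return out

# Example: a single vertex that moved 1 unit along x on the source,
# transferred at half strength onto the target.
result = transfer_motion([(0, 0, 0)], [(1, 0, 0)], [(5, 5, 5)], weight=0.5)
```

Real systems also have to handle differing mesh topologies and identity-vs-expression separation, which is exactly why parametric models are used instead of this naive per-vertex transfer.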
On a side note: This is probably the closest to a real-life Max Headroom that we have so far. Not sure if that influenced the name.
At some point you could probably reconstruct large portions of scenes in existing movies and change perspectives, especially with good techniques (perhaps AI-based) for filling occlusions in the data.
Can anyone explain why people would use text to speech for something like this, when they have perfectly good voices themselves?
To give an example, 15-kun recently built on the Pony Preservation Project, using neural nets to voice-clone (among others) My Little Pony voices and offering it as a service: https://fifteen.ai/ People have used it for all sorts of things: https://www.equestriadaily.com/2020/03/pony-voice-event-what... Suppose you want to do, say, an F1 commentary on the Austrian GP 2019 (#4): why do it with your own voice when you can do it with Fluttershy's?
This will be the next evolution of streamers, especially Virtual Youtubers and their ilk.
This depends on how much you can tolerate speech errors. Most listeners will gloss over them, preferring the human voice to the speech synthesizer while not even really noticing the errors.
Personally, I'll take the human voice unless you literally cannot speak (e.g., due to a disability) or feel uncomfortable doing so.
Also, it's much simpler to make changes to a published video when the narration is synthesized, since using your original voice requires re-recording with a high-quality microphone and post-processing to remove background noise.
Although I can get rid of it if I focus.
Is this really happening with a "deep network in the browser"? It looks like it is happening on a server, then the 3D result is viewed in a browser.
This effect is well-studied and also happens with a real physical mask when viewed from the inside: