That said, it looks like the tech currently works at a 1–2 m range. I'd guess that's because it's an easy range for a projector to cover at high resolution, in which case there's no reason you couldn't add a zoom setup.
Perhaps some enterprising director of the next Disney theater production will put the money in to get this working reliably on stage.
Facial movements are highly stereotyped, so you could likely predict the next frame from the previous frames with decent accuracy.
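To make that concrete: the simplest version of "predict the next frame from the previous frames" is constant-velocity extrapolation of tracked landmark positions, which latency-compensation systems often start with. A toy sketch in NumPy (the landmark coordinates here are made up):

```python
import numpy as np

def predict_next(prev, curr):
    """Constant-velocity extrapolation: next ~= curr + (curr - prev)."""
    return 2 * curr - prev

# two consecutive frames of fake 2-D landmark positions (3 landmarks)
f0 = np.array([[100.0, 200.0], [150.0, 210.0], [125.0, 250.0]])
f1 = np.array([[102.0, 201.0], [152.0, 211.0], [127.0, 251.0]])
print(predict_next(f0, f1))  # each landmark advanced by its per-frame delta
```

A real system would filter the velocities (e.g. with a Kalman filter) rather than trust two raw frames, but the prediction step is this cheap.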
That said, as you mentioned, faces are pretty stereotyped. We "solved" face mapping two decades ago with Tim Cootes's and Paul Ekman's work. We can quickly produce rough estimates using traditional Haar cascade classifiers, i.e. Viola–Jones with AdaBoost. Neural networks may help, but we have other solutions that also handle the problem with "relative" ease (ignoring lighting, occlusion, etc.).
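For anyone who hasn't seen it, the boosting machinery behind Viola–Jones is small enough to sketch. Here's a toy AdaBoost over threshold stumps in NumPy, on fake 1-D data; the real detector replaces the raw feature with integral-image Haar features and chains many such classifiers into a cascade:

```python
import numpy as np

def adaboost_stumps(X, y, rounds=5):
    """Toy AdaBoost with threshold stumps. y in {-1, +1}.
    Returns a list of (feature, threshold, polarity, alpha)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                 # sample weights, updated each round
    learners = []
    for _ in range(rounds):
        best = None
        # exhaustively pick the stump minimizing weighted error
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.sign(X[:, j] - thr + 1e-12)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak learner
        w *= np.exp(-alpha * y * pred)         # upweight the mistakes
        w /= w.sum()
        learners.append((j, thr, pol, alpha))
    return learners

def predict(learners, X):
    score = sum(a * p * np.sign(X[:, j] - t + 1e-12)
                for j, t, p, a in learners)
    return np.sign(score)

# fake 1-D data: negatives cluster low, positives cluster high
X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([-1, -1, -1, 1, 1, 1])
model = adaboost_stumps(X, y)
print((predict(model, X) == y).mean())  # → 1.0 (training accuracy)
```

The cascade idea is then just a sequence of these boosted classifiers, each one rejecting most non-face windows cheaply before the next looks harder.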
Try MediaPipe with GPU support.
Is the source code available?
https://cycling74.com/forums/n4m-facemesh-handpose-google-me...
I'm wondering why the physical alignment is so important. Are camera distortion models, remapping, view projection, etc. just too slow or too low-quality to run in software?
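For what it's worth, the math involved is cheap: a pinhole projection plus a radial distortion model is a handful of vectorized ops per point, so raw throughput is unlikely to be the issue (latency and calibration accuracy are more plausible culprits). A minimal sketch, using only the radial terms of a Brown–Conrady-style model with made-up intrinsics and coefficients:

```python
import numpy as np

def project(points_3d, K, dist_k=(0.0, 0.0)):
    """Project Nx3 camera-space points to pixels with radial distortion.
    K is the 3x3 intrinsic matrix; dist_k = (k1, k2) radial coefficients."""
    # perspective divide to normalized image coordinates
    xy = points_3d[:, :2] / points_3d[:, 2:3]
    # radial distortion: scale by 1 + k1*r^2 + k2*r^4
    r2 = (xy ** 2).sum(axis=1, keepdims=True)
    k1, k2 = dist_k
    xy_d = xy * (1 + k1 * r2 + k2 * r2 ** 2)
    # apply intrinsics: pixel = K @ [x, y, 1]
    ones = np.ones((len(xy_d), 1))
    return (np.hstack([xy_d, ones]) @ K.T)[:, :2]

# made-up intrinsics: 800px focal length, principal point at (320, 240)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 2.0], [0.5, -0.25, 2.0]])
print(project(pts, K, dist_k=(-0.1, 0.0)))
```

The full model adds tangential terms and higher-order radial coefficients, but it's the same flavor of arithmetic, which is why I'd also expect software correction to be viable.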
I suppose I'll have to take a look at their paper later.
Or for fake propaganda videos. Dangerous times for this kind of progress.