If you can rig the meshes to drive each other, you could wear 'masks' of other people's faces (or critters).
You might be interested in project HeadOn at TU München:
https://www.niessnerlab.org/projects/thies2018headon.html
Justus Thies gave a presentation at our university about a year ago. IIRC they don't use any fancy NN stuff; instead, they extract the face geometry using a stereo camera and use interpolation to project the movement onto a target mesh. Applications such as VR meetings with stereo goggles were discussed during the presentation, but the main focus was, of course, on entertaining the audience with fake videos.
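The "interpolation onto a target mesh" step can be sketched roughly as below. This is only an illustration under a big simplifying assumption (one-to-one vertex correspondence between source and target meshes); the actual HeadOn pipeline fits a parametric face model rather than transferring raw vertex offsets, and the function name here is made up.

```python
def transfer_motion(src_neutral, src_current, tgt_neutral, weight=1.0):
    """Apply the source face's per-vertex displacement to a target mesh.

    Each mesh is a list of (x, y, z) vertex tuples, assumed to be in
    correspondence. `weight` interpolates between the target's neutral
    pose (0.0) and the fully transferred expression (1.0).
    """
    out = []
    for sn, sc, tn in zip(src_neutral, src_current, tgt_neutral):
        # Displacement of this vertex on the source, scaled by weight.
        offset = tuple(weight * (c - n) for c, n in zip(sc, sn))
        # Add it to the corresponding target vertex.
        out.append(tuple(t + o for t, o in zip(tn, offset)))
    return out

# Example: a single vertex that moved 1 unit along x on the source,
# transferred at half strength onto the target.
result = transfer_motion([(0, 0, 0)], [(1, 0, 0)], [(5, 5, 5)], weight=0.5)
```

Real systems also have to handle differing mesh topologies and identity-vs-expression separation, which is exactly why parametric models are used instead of this naive per-vertex transfer.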
On a side note: This is probably the closest to a real-life Max Headroom that we have so far. Not sure if that influenced the name.
At some point you could probably reconstruct large portions of scenes in existing movies and change perspectives, especially with good techniques (perhaps AI-based) for filling occlusions in the data.
Can anyone explain why people would use text to speech for something like this, when they have perfectly good voices themselves?
To give an example, 15-kun recently built on the Pony Preservation Project, using neural nets to voice-clone (among others) My Little Pony voices and offering it as a service: https://fifteen.ai/ People have used it for all sorts of things: https://www.equestriadaily.com/2020/03/pony-voice-event-what... Suppose you want to do, say, an F1 commentary on the Austrian GP 2019 (#4): why do it with your own voice when you can do it with Fluttershy's?
This will be the next evolution of streamers, especially Virtual Youtubers and their ilk.
This depends on how much you can tolerate speech errors. Most listeners will gloss over them, preferring the human voice to the speech synthesizer while not even really noticing the errors.
Personally, I'll take the human voice unless you literally cannot speak (e.g., due to a disability) or feel uncomfortable doing so.
Also, it's much simpler to make changes to a published video when the narration is synthesized, since using your original voice requires re-recording with a high-quality microphone and post-processing to remove background noise.
Although I can get rid of it if I focus.
Is this really happening with a "deep network in the browser"? It looks like it is happening on a server, then the 3D result is viewed in a browser.
This effect is well-studied and also happens with a real physical mask when viewed from the inside: