Not only is this interesting for learning how to generate images; it is also a novel way to force a semantic internal representation, rather than leaving it up to a regularisation strategy and interpreting the sparse encoding post hoc. It forces the internal representation to be inherently "tweakable."
Consider: their proof-of-concept face-recognition model achieves performance comparable to traditional convnets on faces with varying degrees of pose, lighting, shape, and texture, even though it was trained completely unsupervised. I would expect this type of model to beat the state of the art in face recognition and other similar tasks when fine-tuned with supervised training in the not-too-distant future.
"Learning to Generate Chairs with Convolutional Neural Networks". http://arxiv.org/abs/1411.5928
They also have a very cool video of the generation process: https://youtu.be/QCSW4isBDL0
It's very interesting to see two groups independently developing almost identical networks for inverse graphics tasks, both using pose, shape, and view parameters to guide learning. I think that continuing in this direction could provide a lot of insight into how these deep networks work, and lead to new improvements for recognition tasks too.
@tejask - You should probably cite the above paper, and thanks for providing code! awesome!
Another interesting property is the "pipeline" and how they seem to have developed the math to make backpropagation work through it. Each step in the pipeline performs some convolution or transformation function.
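To make the "backprop through a pipeline" idea concrete, here is a hand-rolled sketch (the stage names and transforms are my own stand-ins, not the paper's actual operations): each stage implements a `forward` pass and a `backward` pass that applies the chain rule locally, so chaining stages automatically gives you end-to-end gradients.

```python
import numpy as np

class Scale:
    """Multiply the input by a learnable scalar (stand-in for one pipeline stage)."""
    def __init__(self, s):
        self.s = s
    def forward(self, x):
        self.x = x
        return self.s * x
    def backward(self, grad_out):
        self.grad_s = np.sum(grad_out * self.x)  # gradient w.r.t. the parameter
        return grad_out * self.s                 # gradient w.r.t. the input

class Shift:
    """Add a learnable offset (another stand-in stage)."""
    def __init__(self, b):
        self.b = b
    def forward(self, x):
        return x + self.b
    def backward(self, grad_out):
        self.grad_b = np.sum(grad_out)           # gradient w.r.t. the offset
        return grad_out                          # shift passes gradients through

# Chain the stages; backprop works because each stage only needs
# its own local derivative.
pipeline = [Scale(2.0), Shift(1.0)]
x = np.array([1.0, 2.0])
out = x
for stage in pipeline:
    out = stage.forward(out)

grad = np.ones_like(out)          # pretend dLoss/dOut = 1 everywhere
for stage in reversed(pipeline):
    grad = stage.backward(grad)
```

After the backward sweep, each stage holds the gradient for its own parameter, which is exactly what you need to train every transform in the pipeline jointly.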
I haven't read the paper, but I'd be curious to see whether they can reuse components of this pipeline in conjunction with one another. Perhaps it wouldn't be immediately possible (I imagine the parameters would have to be adjusted in some shape or form), but a plug-and-play system of pre-trained functions would be nothing short of amazing.
(I may be incorrect in my analysis. I'm drawing on the ML and image processing I took in undergrad.)
Apart from the interesting applications for computer graphics (like rendering novel viewpoints of an object), this can also be directly used for vision applications, because computer vision can be thought of as the inverse of computer graphics.
Goal of computer graphics: scene description -> images
and
Goal of vision: images -> scene description.
Therefore, training a neural network to behave like a graphics engine is interesting from both these perspectives. We are a LONG way from even scratching the surface.
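The graphics/vision duality above can be sketched in a few lines. This toy example is entirely my own illustration (not from the paper): "graphics" maps a scene parameter to an image, and "vision" inverts that map, here by brute-force search over candidate parameters rather than a learned network.

```python
import numpy as np

def render(angle, n=32):
    """Graphics: scene description (here, a single pose angle) -> image."""
    xs = np.linspace(0, 2 * np.pi, n)
    return np.sin(xs + angle)          # stand-in for a rendered image

def recognize(image, candidates):
    """Vision: image -> scene description, via nearest rendered candidate."""
    errors = [np.sum((render(a) - image) ** 2) for a in candidates]
    return candidates[int(np.argmin(errors))]

# Render an image from a known pose, then recover the pose from pixels alone.
candidates = np.linspace(0, np.pi, 181)
img = render(0.5)
estimate = recognize(img, candidates)   # close to the true angle 0.5
```

A trained inverse-graphics network replaces the brute-force search with a single forward pass, which is what makes the direction so appealing for recognition.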
Since projection is a lossy operation, a projected image potentially has multiple inverses. This makes me wonder how the system deals with the situation where two or more inverses exist and are equally likely.
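The ambiguity is easy to demonstrate with the simplest lossy projection (my own illustration): orthographic projection drops the depth coordinate, so distinct 3D scenes produce identical 2D images and the inverse map is not unique.

```python
import numpy as np

def project(point3d):
    """Orthographic projection: (x, y, z) -> (x, y); depth is discarded."""
    return point3d[:2]

# Two scenes differing only in depth...
a = np.array([1.0, 2.0, 5.0])
b = np.array([1.0, 2.0, -3.0])

# ...project to exactly the same image, so no function from images
# to scenes can recover both.
same_image = np.array_equal(project(a), project(b))
```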