Not only is this interesting for learning how to generate images; it is also a novel way to force a semantic internal representation, rather than leaving it up to a regularisation strategy and interpreting the sparse encoding post hoc. It forces the internal representation to be inherently "tweakable."
Consider: their proof-of-concept face-recognition model achieves performance comparable to traditional convnets on faces with varying degrees of pose, lighting, shape, and texture, even though it was trained completely unsupervised. I would expect this type of model to beat the state of the art in face recognition and other similar tasks when fine-tuned with supervised training in the not-too-distant future.
"Learning to Generate Chairs with Convolutional Neural Networks". http://arxiv.org/abs/1411.5928
They also have a very cool video of the generation process: https://youtu.be/QCSW4isBDL0
It's very interesting to see two groups independently developing almost identical networks for inverse graphics tasks, both using pose, shape, and view parameters to guide learning. I think that continuing in this direction could provide a lot of insight into how these deep networks work, and lead to new improvements for recognition tasks too.
@tejask - You should probably cite the above paper, and thanks for providing code! awesome!
Another interesting property is the "pipeline" and how they seem to have developed the math to make backpropagation work through it. Each step in the pipeline performs some convolution or transformation function.
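To make the "backprop through a pipeline" idea concrete, here is a hand-rolled sketch (the stage names and transforms are my own stand-ins, not the paper's actual operations): each stage implements a `forward` pass and a `backward` pass that applies the chain rule locally, so chaining stages automatically gives you end-to-end gradients.

```python
import numpy as np

class Scale:
    """Multiply the input by a learnable scalar (stand-in for one pipeline stage)."""
    def __init__(self, s):
        self.s = s
    def forward(self, x):
        self.x = x
        return self.s * x
    def backward(self, grad_out):
        self.grad_s = np.sum(grad_out * self.x)  # gradient w.r.t. the parameter
        return grad_out * self.s                 # gradient w.r.t. the input

class Shift:
    """Add a learnable offset (another stand-in stage)."""
    def __init__(self, b):
        self.b = b
    def forward(self, x):
        return x + self.b
    def backward(self, grad_out):
        self.grad_b = np.sum(grad_out)           # gradient w.r.t. the offset
        return grad_out                          # shift passes gradients through

# Chain the stages; backprop works because each stage only needs
# its own local derivative.
pipeline = [Scale(2.0), Shift(1.0)]
x = np.array([1.0, 2.0])
out = x
for stage in pipeline:
    out = stage.forward(out)

grad = np.ones_like(out)          # pretend dLoss/dOut = 1 everywhere
for stage in reversed(pipeline):
    grad = stage.backward(grad)
```

After the backward sweep, each stage holds the gradient for its own parameter, which is exactly what you need to train every transform in the pipeline jointly.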
I haven't read the paper, but I'd be curious to see whether they can reuse components of this pipeline in conjunction with one another. Perhaps it wouldn't be immediately possible (I imagine the parameters would have to be adjusted in some shape or form), but a plug-and-play system of pre-trained functions would be nothing short of amazing.
(I may be incorrect in my analysis. I'm drawing on the ML and image processing I took in undergrad.)
Apart from the interesting applications for computer graphics (like rendering novel viewpoints of an object), this can also be directly used for vision applications, because computer vision can be thought of as the inverse of computer graphics.
Goal of computer graphics: scene description -> images
and
Goal of vision: images -> scene description.
Therefore, training a neural network to behave like a graphics engine is interesting from both these perspectives. We are a LONG way from even scratching the surface.
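The graphics/vision duality above can be sketched in a few lines. This toy example is entirely my own illustration (not from the paper): "graphics" maps a scene parameter to an image, and "vision" inverts that map, here by brute-force search over candidate parameters rather than a learned network.

```python
import numpy as np

def render(angle, n=32):
    """Graphics: scene description (here, a single pose angle) -> image."""
    xs = np.linspace(0, 2 * np.pi, n)
    return np.sin(xs + angle)          # stand-in for a rendered image

def recognize(image, candidates):
    """Vision: image -> scene description, via nearest rendered candidate."""
    errors = [np.sum((render(a) - image) ** 2) for a in candidates]
    return candidates[int(np.argmin(errors))]

# Render an image from a known pose, then recover the pose from pixels alone.
candidates = np.linspace(0, np.pi, 181)
img = render(0.5)
estimate = recognize(img, candidates)   # close to the true angle 0.5
```

A trained inverse-graphics network replaces the brute-force search with a single forward pass, which is what makes the direction so appealing for recognition.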
Since projection is a lossy operation, a projected image potentially has multiple inverses. This makes me wonder how the system deals with the situation where two or more inverses exist and are equally likely.
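The ambiguity is easy to demonstrate with the simplest lossy projection (my own illustration): orthographic projection drops the depth coordinate, so distinct 3D scenes produce identical 2D images and the inverse map is not unique.

```python
import numpy as np

def project(point3d):
    """Orthographic projection: (x, y, z) -> (x, y); depth is discarded."""
    return point3d[:2]

# Two scenes differing only in depth...
a = np.array([1.0, 2.0, 5.0])
b = np.array([1.0, 2.0, -3.0])

# ...project to exactly the same image, so no function from images
# to scenes can recover both.
same_image = np.array_equal(project(a), project(b))
```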