The Confusion of Variational Autoencoders

(jaan.io)

52 points · jaan · 9y ago · 11 comments

What puzzles me about the variational autoencoder is that there is no reason to expect the covariance of p(z|x) to be diagonal. This seems like such a crude approximation that there ought to be little benefit in treating it as a distribution at all rather than as a point mass. And yet it seems to do rather well (though not as well as GANs, which do represent arbitrary distributions).

taliesinb · 9y ago

VAEs can be extended to make the latent variables dependent. OpenAI's inverse autoregressive flow (IAF) is one recent way that is particularly efficient: http://arxiv.org/pdf/1606.04934v1.pdf. Linear IAF is the simplest form of this; with it you can model a normally distributed z that has an arbitrary covariance matrix.
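A minimal sketch of the linear-IAF idea, not the paper's implementation: draw z0 from a diagonal Gaussian, then multiply by a lower-triangular matrix L with ones on its diagonal, so that z = L z0 has covariance L diag(sigma^2) L^T. In a real model the encoder would output L per input; here `L`, `mu`, `sigma`, and the helper names are fixed, hypothetical values chosen for illustration.

```python
import random


def implied_covariance(L, sigma):
    """Covariance of z = L z0 when z0 has independent components: L diag(sigma^2) L^T."""
    n = len(sigma)
    return [[sum(L[i][k] * sigma[k] ** 2 * L[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]


def sample_z(mu, sigma, L, rng):
    """Draw z0 ~ N(mu, diag(sigma^2)), then apply the linear map z = L z0."""
    z0 = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return [sum(L[i][j] * z0[j] for j in range(len(z0))) for i in range(len(z0))]


def empirical_covariance(samples):
    """Unbiased sample covariance matrix of a list of equal-length vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    return [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
             for j in range(d)] for i in range(d)]


rng = random.Random(0)
mu = [0.0, 0.0]
sigma = [1.0, 0.5]          # diagonal posterior, as in a plain VAE
L = [[1.0, 0.0],
     [0.8, 1.0]]            # unit lower-triangular; assumed fixed here

target = implied_covariance(L, sigma)
samples = [sample_z(mu, sigma, L, rng) for _ in range(20000)]
estimate = empirical_covariance(samples)
```

Because any symmetric positive-definite matrix factors as L D L^T with unit lower-triangular L and positive diagonal D, this construction can reach any Gaussian covariance, and since det(L) = 1 the map adds no log-determinant term to the density.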

But aside from that, there is an information-theoretic view on why you might prefer VAEs over AEs. In short, having p(z|x) not be a point mass (as it is in an ordinary AE) allows you to bound the information flow through the bottleneck. The KL loss on p(z|x) forces the network to be honest about how much information it is cramming into z for the purposes of reconstruction.

To unpack that a bit: in theory, even a single real-valued latent variable z could store an arbitrary amount of information (if the encoder and decoder conspired cleverly enough). But if you make z stochastic, or in other words if your encoder's job is to calculate the parameters of a distribution from which you sample z, you're essentially introducing a noisy channel in the middle of your network, and you can then bound how much information is flowing across that channel. But to do that you still need to use a KL divergence loss to encourage p(z|x) to approximate your chosen latent distribution; otherwise your encoder and decoder might cheat, e.g. by using a near-point-mass z to turn back into an ordinary AE.

Or in deep learning speak, it's a form of regularization with a particularly rich and interpretable statistical motivation.
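To make the "honesty" argument concrete, here is a small sketch of the closed-form KL divergence between a diagonal Gaussian p(z|x) and a standard normal prior, which is the usual choice but an assumption here since the comment only says "your chosen latent distribution". The function name is hypothetical.

```python
import math


def kl_diag_gaussian_to_std_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats, summed over latent dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))


# Posterior equal to the prior: nothing about x is transmitted, so the cost is zero.
kl_uninformative = kl_diag_gaussian_to_std_normal([0.0], [1.0])

# Near-point-mass posterior: the encoder tries to behave like an ordinary AE,
# and the KL term charges it roughly -log(sigma) nats for doing so.
kl_near_point_mass = kl_diag_gaussian_to_std_normal([0.0], [1e-6])

# The same cost expressed in bits.
bits_near_point_mass = kl_near_point_mass / math.log(2.0)
```

The cost grows without bound as sigma shrinks towards zero, which is why the encoder cannot collapse to a point mass for free.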

svantana · 9y ago

IMHO this comment is much better than the original blog post, well done!

murbard2 · 9y ago

I get the regularization part, but don't you get essentially the same regularization from using a sparse autoencoder? If the encoder realizes it doesn't have much information, it will turn on few units.

What I don't really intuit is: is it just basically doing regularization, or is the interpretation in terms of learning to infer the posterior meaningful?

phreeza · 9y ago

Isn't that a desirable feature, though? It means your latent features are uncorrelated, which arguably makes them more interpretable. For example, you could get gender and hair color instead of (0.5 gender + 0.5 hair color) and (0.5 gender - 0.5 hair color).

murbard2 · 9y ago

They shouldn't be uncorrelated given x. https://en.wikipedia.org/wiki/Berkson%27s_paradox
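A toy illustration of that point, under assumptions that are not from the thread: two latents z1, z2 that are independent standard normals a priori, and an observation x = z1 + z2 + noise with noise variance s^2. The exact posterior over (z1, z2) given x is Gaussian with precision I + (1/s^2) 1 1^T, and its off-diagonal term is not zero.

```python
def posterior_covariance(noise_var):
    """Exact covariance of (z1, z2) given x, for x = z1 + z2 + N(0, noise_var)
    and independent N(0, 1) priors on z1 and z2."""
    a = 1.0 / noise_var
    # Posterior precision = prior precision (identity) + likelihood precision.
    precision = [[1.0 + a, a],
                 [a, 1.0 + a]]
    det = precision[0][0] * precision[1][1] - precision[0][1] * precision[1][0]
    # Inverse of a 2x2 matrix.
    return [[precision[1][1] / det, -precision[0][1] / det],
            [-precision[1][0] / det, precision[0][0] / det]]


def correlation(cov):
    """Pearson correlation implied by a 2x2 covariance matrix."""
    return cov[0][1] / (cov[0][0] * cov[1][1]) ** 0.5


cov = posterior_covariance(noise_var=1.0)
rho = correlation(cov)   # negative: a larger z1 must be offset by a smaller z2
```

The latents are independent under the prior but negatively correlated once x is observed, so a diagonal p(z|x) cannot match this posterior even in such a simple model.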

conjectures · 9y ago

There were some nice things about this article. However I wouldn't recommend it as a cure for confusion. E.g.

> in mean-field variational inference, we have parameters for each datapoint ... In the variational autoencoder setting, we do amortized inference where there is a set of global parameters ...

Mean-field implies the variational posterior is modelled as factorising over the different latent variables involved. Some latent variables can be local (unique to a data point) and some can be global (shared across data points).
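One way to see that the two ideas are separate axes: the sketch below builds the same factorised (diagonal Gaussian) posterior family in two ways, once with free parameters per data point and once with a shared encoder. Both are mean-field; only the second is amortized. The linear encoder and all names here are hypothetical simplifications, not the article's model.

```python
import random


def per_datapoint_params(num_points, latent_dim):
    """Non-amortized mean-field: a free (mu, log_sigma) per latent dimension, per data point."""
    return [{"mu": [0.0] * latent_dim, "log_sigma": [0.0] * latent_dim}
            for _ in range(num_points)]


def make_linear_encoder(input_dim, latent_dim, rng):
    """Amortized mean-field: one shared weight set maps any x to (mu, log_sigma)."""
    def init():
        return [[rng.gauss(0.0, 0.1) for _ in range(input_dim)]
                for _ in range(latent_dim)]

    weights = {"W_mu": init(), "W_log_sigma": init()}

    def encode(x):
        mu = [sum(w * xi for w, xi in zip(row, x)) for row in weights["W_mu"]]
        log_sigma = [sum(w * xi for w, xi in zip(row, x))
                     for row in weights["W_log_sigma"]]
        return mu, log_sigma   # still a factorised Gaussian over the latents

    return weights, encode


num_points, input_dim, latent_dim = 1000, 5, 2
local = per_datapoint_params(num_points, latent_dim)
weights, encode = make_linear_encoder(input_dim, latent_dim, random.Random(0))

# Variational parameter counts: grows with the dataset vs. fixed.
n_local = sum(len(p["mu"]) + len(p["log_sigma"]) for p in local)
n_amortized = sum(len(row) for W in weights.values() for row in W)
mu, log_sigma = encode([1.0] * input_dim)
```

With these sizes the per-datapoint version carries 2 x 1000 x 2 variational parameters while the encoder carries 2 x 2 x 5, regardless of how many data points there are.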

jayajay · 9y ago

Recently, someone shared a link on Hacker News to this website: https://pomax.github.io/nrGrammar/. If you look carefully in section 1.1.4, which aims to visually compare the differences between the Hiragana and Katakana scripts, you can see that there is a "logic" in transitioning from a character in Hiragana to the same character in Katakana. In the same way, it seems that an autoencoder is capable of capturing this logic.

dietrichepp · 9y ago

Note that this is because a subset of the kana were derived from the same Han characters in both scripts, but this does not apply to all kana.

tiiualto · 9y ago

It sure is complicated; I huff and puff, but I still don't understand a thing!
