Deep dive into variational auto-encoder (Part 1)

This is a deep dive into the variational auto-encoder (VAE), a machine learning technique.

General Theory
- Model
- Objective function
Model for MNIST dataset
References

General Theory

Model

The digit images ($x$) are generated by an unknown process $p_{\theta^*}(x \mid z)$, where $z$ is an unobservable latent variable. When we plug a specific $z$ into this function, it specifies the intensity distribution of the pixels of the image $x$.

We don't know the true parameter $\theta^*$, but we know the general form of the parametric function $p_\theta(x \mid z)$. Note that $\theta$ denotes the free parameters, while $\theta^*$ denotes their true value. The latent variable $z$ follows the prior distribution $p_\theta(z)$.

The likelihood function $p_\theta(x)$ measures how well $\theta$ describes the observed digit $x$. By definition, $p_\theta(x)$ is maximized at $\theta = \theta^*$ because $\theta^*$ generates the observation. However, $p_\theta(x)$ is intractable, meaning that we cannot evaluate or differentiate it for every $\theta$ and $x$. The same goes for $p_\theta(x \mid z)$ and $p_\theta(z \mid x)$. Since $p_\theta(z \mid x)$ is hard to evaluate, we introduce a new function $q_\phi(z \mid x)$ to approximate it. The following diagram sums it all up:

Diagram: draw from the prior distribution $p_\theta(z)$ → latent variable $z$ → hidden process $p_{\theta^*}(x \mid z)$ → observed image $x$ → $q_\phi(z \mid x)$ → back to the prior.
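The generative story above can be sketched in a few lines of NumPy. This is only an illustration: the real hidden process is unknown, so the weights `W` and `b`, the latent size, and the sigmoid form of `pixel_probabilities` are all hypothetical stand-ins, and the pixels are assumed to be binary.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 2     # hypothetical size of the latent variable z
NUM_PIXELS = 784   # a flattened 28 x 28 image

# Hypothetical stand-ins for the unknown true parameters of the hidden process.
W = rng.normal(scale=0.5, size=(LATENT_DIM, NUM_PIXELS))
b = rng.normal(scale=0.1, size=NUM_PIXELS)


def pixel_probabilities(z):
    """Toy p(x|z): probability that each pixel is 'on', given z."""
    return 1.0 / (1.0 + np.exp(-(z @ W + b)))


def sample_image():
    """Draw z from the prior, then draw a binary image x from p(x|z)."""
    z = rng.standard_normal(LATENT_DIM)      # z ~ N(0, I)
    probs = pixel_probabilities(z)           # plug z into the hidden process
    x = (rng.random(NUM_PIXELS) < probs).astype(np.int64)
    return z, x


z, x = sample_image()
print(z.shape, x.shape, x.sum())
```

In a real problem we only ever see `x`; `z` and the parameters behind `pixel_probabilities` stay hidden, which is why the approximation $q_\phi(z \mid x)$ is needed.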

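In summary, here is what each symbol means.

| Symbol | Description |
| --- | --- |
| $x$ | Image of a digit |
| $z$ | Latent variable that generates the image |
| $\theta$ | Parameter of the model |
| $\theta^*$ | True parameter; $p_{\theta^*}(x)$ is the maximum |
| $p_\theta(x)$ | Likelihood function: how well $\theta$ describes $x$ |
| $p_\theta(z)$ | Prior distribution of $z$ |
| $p_\theta(x \mid z)$ | Likelihood or generative function of $x$ given $z$ |
| $p_\theta(z \mid x)$ | Posterior distribution of $z$ given $x$ |
| $q_\phi(z \mid x)$ | Approximation to $p_\theta(z \mid x)$ |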

Objective function

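In a variational auto-encoder, the objective function is the likelihood function of the true model, $p_\theta(x)$. Ideally we want to find the optimal $\theta$ that best matches the observed images $x$. The log-likelihood $\log p_\theta(x)$ can be expressed in terms of the Kullback-Leibler divergence $D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$, which measures how well $q_\phi(z \mid x)$ approximates $p_\theta(z \mid x)$:

$$\log p_\theta(x) = D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) + \mathcal{L}(\theta, \phi; x)$$

where

$$D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(z \mid x)\big]$$

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[-\log q_\phi(z \mid x) + \log p_\theta(x, z)\big]$$

To see this, substitute $p_\theta(z \mid x) = p_\theta(x, z) / p_\theta(x)$ into the KL divergence. $\log p_\theta(x)$ can be taken out of the expectation because it does not depend on $z$. Since $D_{KL}$ is non-negative, $\mathcal{L}$ serves as a lower bound to $\log p_\theta(x)$. In other words:

$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$$

The trick of VAE is to maximize $\mathcal{L}$ instead of $\log p_\theta(x)$ because it can be calculated for many problems.

But since $p_\theta(x, z)$ is hard to calculate directly, it is useful to rewrite $\mathcal{L}$ using $p_\theta(x, z) = p_\theta(x \mid z)\, p_\theta(z)$ as

$$\mathcal{L}(\theta, \phi; x) = -D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]$$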
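The decomposition of the log-likelihood into a KL divergence plus a lower bound is easy to check numerically on a toy problem. The sketch below assumes a discrete latent variable with `K` states (an assumption made purely so every sum can be computed exactly; it is not the MNIST model) and picks the prior, the likelihood of one fixed observation, and the approximation `q` at random.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # hypothetical number of discrete latent states


def random_distribution(size):
    """Return a random probability vector of the given size."""
    p = rng.random(size)
    return p / p.sum()


prior = random_distribution(K)             # p(z)
likelihood = rng.uniform(0.05, 0.95, K)    # p(x|z) for one fixed observed x
q = random_distribution(K)                 # arbitrary approximation q(z|x)

joint = prior * likelihood                 # p(x, z)
evidence = joint.sum()                     # p(x)
posterior = joint / evidence               # p(z|x)

# KL(q || posterior) and the lower bound, both as expectations under q.
kl_q_posterior = np.sum(q * (np.log(q) - np.log(posterior)))
lower_bound = np.sum(q * (np.log(joint) - np.log(q)))

# The rewritten form: -KL(q || prior) + E_q[log p(x|z)].
lower_bound_rewritten = (
    -np.sum(q * (np.log(q) - np.log(prior))) + np.sum(q * np.log(likelihood))
)

print("log p(x)          :", np.log(evidence))
print("KL + lower bound  :", kl_q_posterior + lower_bound)
print("lower bound       :", lower_bound)
print("rewritten bound   :", lower_bound_rewritten)
```

The first two printed numbers should agree, the last two should agree, and the bound should never exceed the log-likelihood, whatever `q` is chosen.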

Model for MNIST dataset

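To train the model, we will maximize the lower bound

$$\mathcal{L}(\theta, \phi; x) = -D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]$$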

For this problem, we choose the prior distribution of the latent variable, $p_\theta(z)$, to be the standard normal distribution, which has zero mean and unit variance, i.e. $p_\theta(z) = \mathcal{N}(z; 0, I)$. Why can we do that? Because we don't know the distribution and may as well choose to work with an easier one! But we will see in a moment that, for the purpose it serves, the choice doesn't really matter.

OK, that takes care of $p_\theta(z)$. How about $q_\phi(z \mid x)$? To be consistent, we also model it with a normal distribution, but it can have non-zero mean and non-unit variance. Mathematically,

$$q_\phi(z \mid x) = \mathcal{N}(z; \mu, \sigma^2 I)$$

How can it be different from the prior? The idea is, we have 10 digits to encode. Each one will have a distribution that deviates from zero, and collectively they form 10 distinct clusters. But if we look at $z$ over all digits, it will still follow the standard normal distribution $\mathcal{N}(0, I)$. In a moment, you will see that the prior regularizes the learned parameters $\mu$ and $\sigma$ and pulls them back towards the standard normal $\mathcal{N}(0, I)$.

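Now we can calculate the first term, $-D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)$. Using the identity for the KL divergence between two normal distributions

$$D_{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

with $\mu_2 = 0$ and $\sigma_2 = 1$, we get the first term

$$-D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big) = \frac{1}{2}\sum_{j=1}^{J}\big(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$$

$J$ is the dimension of the latent variable $z$.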
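As a sanity check, the closed-form expression for the negative KL divergence between a diagonal normal distribution and the standard normal can be compared against a brute-force Monte Carlo estimate. This is a minimal sketch; the function names and the example values of `mu` and `sigma` are made up for illustration.

```python
import numpy as np


def negative_kl_closed_form(mu, sigma):
    """-KL(N(mu, sigma^2 I) || N(0, I)) summed over the latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return 0.5 * np.sum(1.0 + np.log(sigma**2) - mu**2 - sigma**2)


def negative_kl_monte_carlo(mu, sigma, num_samples=200_000, seed=0):
    """Estimate E_q[log p(z) - log q(z)] by sampling z from q."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = mu + sigma * rng.standard_normal((num_samples, mu.size))
    log_q = -0.5 * np.sum(
        ((z - mu) / sigma) ** 2 + np.log(2.0 * np.pi * sigma**2), axis=1
    )
    log_p = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=1)
    return np.mean(log_p - log_q)


mu = [0.5, -1.0]      # hypothetical encoder means
sigma = [0.8, 1.5]    # hypothetical encoder standard deviations

print("closed form :", negative_kl_closed_form(mu, sigma))
print("monte carlo :", negative_kl_monte_carlo(mu, sigma))
```

The two numbers should agree to within sampling noise, and the closed form is exactly zero when `mu` is all zeros and `sigma` is all ones.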

Let’s build some intuition!

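- $\log\sigma_j^2 - \sigma_j^2$ is maximized at $\sigma_j = 1$ (set the derivative $2/\sigma_j - 2\sigma_j$ to zero)
- $-\mu_j^2$ is maximized when $\mu_j = 0$ (hope this is obvious…)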

So the learning prefers $q_\phi(z \mid x)$ to follow the standard normal distribution as much as possible. In other words, the first term is a regularization term that makes sure the learned distribution of $z$ is not too crazy.
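The same intuition can be confirmed with a crude grid search over one latent dimension. The grid ranges below are arbitrary choices for illustration.

```python
import numpy as np


def first_term(mu, sigma):
    """Per-dimension value of 0.5 * (1 + log sigma^2 - mu^2 - sigma^2)."""
    return 0.5 * (1.0 + np.log(sigma**2) - mu**2 - sigma**2)


mus = np.linspace(-3.0, 3.0, 121)      # candidate means, step 0.05
sigmas = np.linspace(0.1, 3.0, 291)    # candidate standard deviations, step 0.01

# Evaluate the term on the full (mu, sigma) grid via broadcasting.
values = first_term(mus[:, None], sigmas[None, :])
i, j = np.unravel_index(np.argmax(values), values.shape)

best_mu, best_sigma, best_value = mus[i], sigmas[j], values[i, j]
print("best mu:", best_mu, "best sigma:", best_sigma, "value:", best_value)
```

The maximum sits at a mean of zero and a standard deviation of one, where the term is zero; everywhere else it is negative, which is what makes it act as a penalty.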

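OK, let's understand the second term, $\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]$. For binary images (we will use binarized MNIST), $p_\theta(x \mid z)$ is a Bernoulli distribution with parameter $y$, the probability the model assigns to a pixel being on. For a single pixel $i$, it is

$$p_\theta(x_i \mid z) = y_i^{x_i}(1 - y_i)^{1 - x_i}$$

$x_i$ is the observed pixel intensity and can only be 0 or 1 (binary). Basically, this is a measure of how well the predicted probability $y_i$ matches the observed intensity. Suppose it is a dark pixel ($x_i = 0$) and the model predicts $y_i = 0.1$: the model would be considered to be doing pretty well and would score 0.9 (with 1.0 being full marks). Note that you don't see $z$ on the right-hand side of the equation because it is implicitly in $y_i$, as $y_i$ is generated from $z$. For an image with $D$ pixels,

$$\log p_\theta(x \mid z) = \sum_{i=1}^{D}\big[x_i \log y_i + (1 - x_i)\log(1 - y_i)\big]$$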
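Here is a minimal NumPy sketch of the Bernoulli log-likelihood for a binary image, including the dark-pixel example from the text. The function name and the small clipping constant `eps` (added only to avoid taking the log of zero) are assumptions for illustration.

```python
import numpy as np


def bernoulli_log_likelihood(x, y, eps=1e-7):
    """Sum over pixels of x*log(y) + (1 - x)*log(1 - y).

    x: observed binary pixels (0 or 1)
    y: predicted probability that each pixel is 1
    """
    x = np.asarray(x, dtype=float)
    y = np.clip(np.asarray(y, dtype=float), eps, 1.0 - eps)
    return np.sum(x * np.log(y) + (1.0 - x) * np.log(1.0 - y))


# Single dark pixel (x = 0) with predicted probability 0.1 of being on:
# the probability of the observation is 1 - 0.1, i.e. about 0.9.
single_pixel_score = np.exp(bernoulli_log_likelihood([0], [0.1]))

# A tiny hypothetical 4-pixel "image" and its predicted probabilities.
x = [1, 0, 1, 1]
y = [0.9, 0.2, 0.8, 0.6]
image_log_likelihood = bernoulli_log_likelihood(x, y)

print(single_pixel_score, image_log_likelihood)
```

Up to a sign, this is the binary cross-entropy, which is why maximizing the second term looks like minimizing a reconstruction loss.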

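How do we evaluate the expectation $\mathbb{E}_{q_\phi(z \mid x)}[\cdot]$? Since we draw $z$ from the approximate posterior $q_\phi(z \mid x)$, we can take a simple average over the drawn samples, or use only one sample per image per evaluation and not worry about it, because we average over many images anyway.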
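A sketch of that sampling estimate is below. Everything concrete in it is a placeholder: the linear-plus-sigmoid `decode` function stands in for the generative model, and the fixed `mu` and `sigma` stand in for whatever the approximate posterior would output for this image.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, NUM_PIXELS = 2, 784

# Hypothetical decoder parameters (a real model would learn these).
W = rng.normal(scale=0.5, size=(LATENT_DIM, NUM_PIXELS))
b = np.zeros(NUM_PIXELS)


def decode(z):
    """Toy decoder: map latent samples z to per-pixel probabilities y."""
    return 1.0 / (1.0 + np.exp(-(z @ W + b)))


def log_px_given_z(x, y, eps=1e-7):
    """Bernoulli log-likelihood of image x for each row of probabilities y."""
    y = np.clip(y, eps, 1.0 - eps)
    return np.sum(x * np.log(y) + (1.0 - x) * np.log(1.0 - y), axis=-1)


def expected_log_likelihood(x, mu, sigma, num_samples=1):
    """Monte Carlo estimate of E_q[log p(x|z)] with z ~ N(mu, sigma^2 I)."""
    z = mu + sigma * rng.standard_normal((num_samples, mu.size))
    return np.mean(log_px_given_z(x, decode(z)))


x = (rng.random(NUM_PIXELS) < 0.5).astype(float)   # fake binary image
mu = np.array([0.3, -0.2])                          # hypothetical posterior mean
sigma = np.array([0.9, 1.1])                        # hypothetical posterior std

one_sample = expected_log_likelihood(x, mu, sigma, num_samples=1)
many_samples = expected_log_likelihood(x, mu, sigma, num_samples=1000)
print(one_sample, many_samples)
```

The single-sample estimate is noisier than the 1000-sample one, but it is much cheaper, and the noise tends to wash out once the estimate is averaged over many images.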

References

Auto-Encoding Variational Bayes - original paper