
Variational Inference


Recap: KL Divergence

KL divergence is a way to measure how different two probability distributions $q$ and $p$ are:

\[\begin{aligned} \text{KL}(q \; \| \; p) = \mathbb{E}_{q(z)} \left[\text{ln}\frac{q(z)}{p(z)} \right] \end{aligned}\]


Key properties of KL divergence are nonnegativity and asymmetry. Because it is asymmetric, KL divergence is not a metric (distance), as the numeric sketch below also illustrates.

  • Nonnegative: $\text{KL}(q \; \| \; p) \geq 0$, with equality if and only if $q = p$
  • Asymmetric: $\text{KL}(q \; \| \; p) \neq \text{KL}(p \; \| \; q)$ in general
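
As a quick illustration, here is a minimal NumPy sketch, using an arbitrary pair of discrete distributions chosen only for this example, that computes $\text{KL}(q \; \| \; p)$ and shows both properties numerically.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) = sum_z q(z) * ln(q(z) / p(z)) for discrete distributions."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.8, 0.1, 0.1])   # arbitrary distribution
p = np.array([1/3, 1/3, 1/3])   # uniform distribution

print(kl_divergence(q, p))  # ~0.460  (nonnegative)
print(kl_divergence(p, q))  # ~0.511  (differs from the above: asymmetric)
print(kl_divergence(q, q))  # 0.0     (equality exactly when q = p)
```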

For a more detailed explanation of KL divergence, see [1].

Latent Variable Models

Let $x$ be the observations and $\theta$ an unknown parameter of their probability distribution. To estimate $\theta$, one of the most widely used estimators is the maximum likelihood estimator $\widehat{\theta}_{\text{MLE}}$:

\[\begin{aligned} \widehat{\theta}_{\text{MLE}} = \underset{\theta}{\text{argmax }} p_\theta (x) \end{aligned}\]

However, this optimization problem is sometimes tricky. For example, in practice we are often in a situation where part of the data is missing. One trick for handling this is to introduce latent variables $z$: unobserved variables that we incorporate into the model because they make the problem easier in some cases. We already saw such cases in the previous posts on GMM & K-Means Clustering.

Then, we can represent the likelihood $p_\theta (x)$ as

\[\begin{aligned} p_\theta (x) = \int p_\theta (x, z) dz \end{aligned}\]


After introducing $z$, we still want to maximize $p_\theta (x)$ w.r.t. $\theta$. However, this requires marginalizing out $z$ as above, and that integral is intractable in many cases. One way to deal with this is the Monte Carlo method; another is variational inference, which is the main topic of this post.
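
To make the marginalization concrete, here is a minimal sketch on a toy conjugate Gaussian model (the model, the observed value, and the noise level are assumptions made only for this example) that estimates $p_\theta (x) = \int p_\theta (x \mid z)\, p(z)\, dz$ by Monte Carlo and checks it against the closed form that this particular model happens to have.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy latent variable model (an assumption for illustration):
#   z ~ N(0, 1),   x | z ~ N(z, sigma^2)
sigma = 0.5
x_obs = 1.3

# Monte Carlo estimate of p(x) = E_{p(z)}[ p(x | z) ],
# averaging the likelihood over samples drawn from the prior p(z).
z_samples = rng.standard_normal(100_000)
p_x_mc = np.mean(norm.pdf(x_obs, loc=z_samples, scale=sigma))

# This toy model is conjugate, so the marginal has a closed form,
# x ~ N(0, 1 + sigma^2), which lets us check the estimate.
p_x_exact = norm.pdf(x_obs, loc=0.0, scale=np.sqrt(1 + sigma**2))

print(p_x_mc, p_x_exact)  # both close to 0.18
```

In realistic latent variable models there is no such closed form, and naive Monte Carlo from the prior can require a huge number of samples, which is one motivation for variational inference.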


Variational Inference

Variational Inference is an example of Variational Bayesian methods, a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. [2]

The posterior is an important element in various applications such as MAP estimation, and variational inference gives us a way to estimate a posterior distribution that is usually intractable to compute explicitly. Variational inference is most frequently used in situations where we introduce a latent random variable $z$, or where the model involves some hidden random variable $z$. In that case, consider the posterior distribution $p(z \mid x)$ given by Bayes’ Theorem:

\[\begin{aligned} p(z \mid x) = \frac{p(x \mid z) p(z)}{p(x)} \end{aligned}\]

where $x$ is the observed data.


In practice, we often have difficulty computing $p(x)$, since we must marginalize out $z$, i.e., $p(x) = \int p(x, z) dz$, and this integral may be intractable. In such cases, we frequently approximate $p(z \mid x)$ by a simpler distribution $q(z)$ instead of computing it directly. Variational inference is one of these methods of approximation.
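
Continuing the toy Gaussian model from the earlier sketch (again purely an assumption for illustration), the posterior happens to be available in closed form because that model is conjugate, so we can check Bayes’ Theorem numerically; the point of variational inference is precisely that most models do not allow this.

```python
import numpy as np
from scipy.stats import norm

# Toy model from before: z ~ N(0, 1), x | z ~ N(z, sigma^2)
sigma, x_obs = 0.5, 1.3

# Conjugacy gives the posterior in closed form:
#   p(z | x) = N( x / (1 + sigma^2), sigma^2 / (1 + sigma^2) )
post_mean = x_obs / (1 + sigma**2)
post_var = sigma**2 / (1 + sigma**2)

# Check Bayes' theorem numerically at a few points:
#   p(z | x) = p(x | z) p(z) / p(x)
p_x = norm.pdf(x_obs, loc=0.0, scale=np.sqrt(1 + sigma**2))  # the evidence
for z in [-1.0, 0.0, 1.0]:
    lhs = norm.pdf(z, loc=post_mean, scale=np.sqrt(post_var))
    rhs = norm.pdf(x_obs, loc=z, scale=sigma) * norm.pdf(z) / p_x
    print(z, lhs, rhs)  # lhs and rhs agree
```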


Why is it called “variational” inference?

The term "variational" comes from the area of calculus of variations. It studies optimization over functions of functions (for example, problems like: given a family of curves in 2D between two points, find one with smallest length).

Let $\mathcal{F}$ be a set of functions $f: U \to V$, and let $g$ be a function of functions (a functional) that maps $\mathcal{F}$ to some set $W$. Then we may want to solve

\[\begin{aligned} \underset{f \in \mathcal{F}}{\text{arg max }} g(f). \end{aligned}\]
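
For instance, the curve-length problem mentioned above fits this template (with $\text{arg min}$ in place of $\text{arg max}$): $f$ ranges over curves between two fixed endpoints, and $g$ is the arc-length functional

\[\begin{aligned} g(f) = \int_a^b \sqrt{1 + f'(u)^2} \; du, \end{aligned}\]

which the calculus of variations asks us to minimize over $f$; the minimizer here is simply the straight line between the endpoints.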

In variational inference, we want to minimize a function of functions, namely $\text{KL}(\cdot \; \| \; p(z \mid x))$, over a family of candidate distributions, so it corresponds to the kind of problem addressed in the calculus of variations.


Evidence Lower BOund (ELBO)

Let $q(z)$ be a probability distribution over $z$. The ELBO $F(q, \theta)$, defined in the derivation below, provides a lower bound on the log evidence, i.e.

\[\begin{aligned} \text{ln } p_\theta (x) \geq F(q, \theta). \end{aligned}\]

Note that maximizing the evidence $p_\theta (x)$ with respect to $\theta$ in this case is equivalent to maximum likelihood estimation.


So, if we maximize $F(q, \theta)$ w.r.t. $q$ and $\theta$, we can approximately optimize the evidence.

We can derive the ELBO as follows:

\[\begin{aligned} \text{ln } p_\theta (x) &= \mathbb{E}_{q(z)} [\text{ln } p_\theta (x)] \\ &= \mathbb{E}_{q(z)} \left[\text{ln} \frac{p_\theta (x) p_\theta (z \mid x)}{p_\theta (z \mid x)}\right] \\ &= \mathbb{E}_{q(z)} \left[\text{ln} \frac{p_\theta (x, z) q(z)}{p_\theta (z \mid x) q(z)}\right] \\ &= \mathbb{E}_{q(z)} \left[\text{ln} \frac{p_\theta (x, z)}{q(z)} \right] + \mathbb{E}_{q(z)} \left[\text{ln} \frac{q(z)}{p_\theta (z \mid x)}\right] \\ &= F(q, \theta) + \text{KL}(q(z) \; \| \; p_\theta (z \mid x)). \end{aligned}\]

Thus, since KL divergence is non-negative,

\[\begin{aligned} \text{ln } p_\theta (x) = F(q, \theta) + \text{KL}(q(z) \; \| \; p_\theta (z \mid x)) \geq F(q, \theta) \end{aligned}\]


Now it is clear that maximizing the ELBO approximately maximizes $p_\theta (x)$. But from the viewpoint of variational inference, maximizing the ELBO also minimizes the KL divergence between $q(z)$ and $p_\theta (z \mid x)$. In other words, approximating the posterior by $q(z)$ is equivalent to maximizing the ELBO:

\[\begin{aligned} \text{KL}(q(z) \; \| \; p_\theta (z | x)) = \text{ln } p_\theta (x) - F(q, \theta). \end{aligned}\]
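
As a numerical sanity check of this decomposition, the sketch below (still the toy conjugate Gaussian model; the variational family $q(z) = \mathcal{N}(m, s^2)$ and the particular values of $m$ and $s$ are assumptions for illustration) estimates the ELBO by Monte Carlo, computes the KL term in closed form, and confirms that the two add up to $\text{ln } p_\theta (x)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model from before: z ~ N(0, 1), x | z ~ N(z, sigma^2)
sigma, x_obs = 0.5, 1.3
post_mean = x_obs / (1 + sigma**2)           # exact posterior mean
post_std = sigma / np.sqrt(1 + sigma**2)     # exact posterior standard deviation
log_p_x = norm.logpdf(x_obs, loc=0.0, scale=np.sqrt(1 + sigma**2))  # ln p(x)

def elbo(m, s, n_samples=200_000):
    """Monte Carlo estimate of F(q) = E_q[ ln p(x, z) - ln q(z) ] for q = N(m, s^2)."""
    z = rng.normal(m, s, size=n_samples)
    log_joint = norm.logpdf(x_obs, loc=z, scale=sigma) + norm.logpdf(z)  # ln p(x|z) + ln p(z)
    return np.mean(log_joint - norm.logpdf(z, loc=m, scale=s))

def kl_gaussians(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# An arbitrary, deliberately suboptimal variational distribution q(z) = N(0.3, 0.8^2)
m, s = 0.3, 0.8
F = elbo(m, s)
kl = kl_gaussians(m, s, post_mean, post_std)

print(F + kl, log_p_x)  # agree up to Monte Carlo noise
print(F <= log_p_x)     # True: the ELBO lower-bounds the log evidence
```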


In summary, the following are equivalent (see the sketch after the list):

  • Maximizing ELBO
  • Maximizing Likelihood (approximately)
  • Approximating the posterior $p ( z \mid x)$ by $q(z)$
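
To illustrate this equivalence, here is a crude grid-search sketch over the variational parameters of $q(z) = \mathcal{N}(m, s^2)$ on the same toy model (grid ranges and sample sizes are arbitrary choices for this example): the $q$ with the highest ELBO is also, approximately, the true posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model from before: z ~ N(0, 1), x | z ~ N(z, sigma^2)
sigma, x_obs = 0.5, 1.3
post_mean = x_obs / (1 + sigma**2)           # exact posterior mean (= 1.04)
post_std = sigma / np.sqrt(1 + sigma**2)     # exact posterior std  (~ 0.447)

def elbo(m, s, n_samples=100_000):
    """Monte Carlo ELBO for q(z) = N(m, s^2) under the toy model."""
    z = rng.normal(m, s, size=n_samples)
    log_joint = norm.logpdf(x_obs, loc=z, scale=sigma) + norm.logpdf(z)
    return np.mean(log_joint - norm.logpdf(z, loc=m, scale=s))

# Crude grid search over the variational parameters (m, s).
grid_m = np.linspace(0.0, 2.0, 21)   # candidate means
grid_s = np.linspace(0.1, 1.0, 19)   # candidate standard deviations
best_elbo, best_m, best_s = max((elbo(m, s), m, s) for m in grid_m for s in grid_s)

print((best_m, best_s), (post_mean, post_std))  # ELBO-maximizing q ~ exact posterior
```

In practice one would use gradient-based optimization and richer variational families rather than a grid, but the principle is the same.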




Reference

[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
[2] Wikipedia, Variational Bayesian methods
[3] What does “variational” mean?
