
EM Algorithm

The EM algorithm is a technique with which we can iteratively obtain maximum likelihood estimates for latent variable models. One of the best-known examples is the Gaussian Mixture Model. To better follow this post, I recommend reading the previous post first: Variational Inference.


$\mathbf{Algorithm.}$ The EM algorithm alternates between an expectation step (E-step) and a maximization step (M-step).

  • E-step: Derive the expected complete-data log-likelihood as a function of the parameters, where the expectation is taken over the latent variables under their distribution given the current parameter estimate.
  • M-step: Compute the parameter values that maximize the expected log-likelihood obtained in the E-step.

The updated parameters are then used to determine the distribution of the latent variables in the next E-step.
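
Concretely, in the notation used throughout this post, the E-step constructs

\[\begin{aligned} Q(\theta, \theta_t) = \mathbb{E}_{p_{\theta_t} (z \mid x)} \left[ \text{ln } p_\theta (x, z) \right] = \int p_{\theta_t} (z \mid x) \cdot \text{ln } p_\theta (x, z) \; dz \end{aligned}\]

and the M-step sets $\theta_{t+1} = \underset{\theta}{\text{arg max }} Q(\theta, \theta_t)$.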


In fact, we can view the EM algorithm as a special case of variational inference, applicable when we can find $p_{\theta} (z \mid x)$ analytically. Recall that maximizing the ELBO approximately maximizes the likelihood $p_\theta (x)$, and that the main purpose of variational inference is to approximate the intractable posterior $p_\theta (z \mid x)$ by a tractable distribution $q(z)$. It is therefore natural to choose the posterior under the current parameter estimate, $p_{\theta_t} (z \mid x)$, as the candidate for $q$.

Then, the ELBO becomes

\[\begin{aligned} F(q, \theta) &= \mathbb{E}_{q(z)} \left[\text{ln } \frac{p_\theta (x, z)}{q(z)} \right] \\ &= \int q(z) \cdot \text{ln } p_\theta (x, z) \; dz - \int q(z) \cdot \text{ln } q(z) \; dz \\ &= \int p_{\theta_{t}} (z \mid x) \cdot \text{ln } p_\theta (x, z) \; dz - \int p_{\theta_{t}} (z \mid x) \cdot \text{ln } p_{\theta_{t}} (z \mid x) \; dz \\ &= Q(\theta, \theta_{t}) + \mathbb{H}(q) \end{aligned}\]

The second term, the entropy $\mathbb{H}(q) = - \int q(z) \cdot \text{ln } q(z) \; dz$, is constant with respect to $\theta$ and can therefore be ignored when optimizing over $\theta$.

$\mathbf{Remark.}$ Indeed, $q(z) = p_{\theta_t} (z \mid x)$ maximizes the ELBO at the current parameter estimate $\theta_t$, i.e., $\underset{q}{\text{arg max }} F(q, \theta_t) = p_{\theta_t} (z \mid x)$:

\[\begin{aligned} \underset{q}{\text{arg max }} F(q, \theta_t) &= \underset{q}{\text{arg max }} \mathbb{E}_{q(z)} \left[\text{ln } \frac{p_{\theta_t} (x, z)}{q(z)} \right] \\ &= \underset{q}{\text{arg min }} \text{KL} (q(z) \; \| \; p_{\theta_t} (z \mid x)) \\ &= p_{\theta_t} (z \mid x)._\blacksquare \end{aligned}\]
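
The second equality follows from the usual decomposition of the evidence, since $\text{ln } p_{\theta_t}(x)$ does not depend on $q$:

\[\begin{aligned} \mathbb{E}_{q(z)} \left[\text{ln } \frac{p_{\theta_t} (x, z)}{q(z)} \right] = \mathbb{E}_{q(z)} \left[\text{ln } \frac{p_{\theta_t} (z \mid x) \, p_{\theta_t}(x)}{q(z)} \right] = \text{ln } p_{\theta_t} (x) - \text{KL}(q(z) \; \| \; p_{\theta_t} (z \mid x)). \end{aligned}\]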

In summary, the EM algorithm is as follows (a minimal Gaussian mixture sketch is given after the list):

  • Loop
    • $t \leftarrow t + 1$
    • E-step: compute $p_{\theta_{t-1}} (z \mid x)$ and set $q(z) = p_{\theta_{t-1}} (z \mid x)$, which determines $F(q, \theta)$ as a function of $\theta$
    • M-step: evaluate $\theta_{t} \leftarrow \underset{\theta}{\text{arg max }} F(p_{\theta_{t-1}} (z \mid x), \theta) = \underset{\theta}{\text{arg max }} Q(\theta, \theta_{t-1})$
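
As a concrete illustration, here is a minimal sketch of EM for a two-component univariate Gaussian mixture in Python. The function name `em_gmm`, the initialization, and the synthetic data are my own illustrative choices; the point is that both the E-step responsibilities $p_{\theta_{t-1}}(z \mid x)$ and the M-step $\underset{\theta}{\text{arg max }} Q(\theta, \theta_{t-1})$ are available in closed form for this model.

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, n_iter=100):
    """EM for a 2-component univariate Gaussian mixture.

    Returns the parameters and the log-likelihood ln p_theta(x) evaluated
    at the parameters entering each iteration."""
    pi = np.array([0.5, 0.5])                        # mixture weights
    mu = np.array([x.min(), x.max()])                # crude initialization
    sigma = np.array([x.std(), x.std()])
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] = p(z_n = k | x_n, current params)
        dens = pi * norm.pdf(x[:, None], mu, sigma)  # shape (N, 2)
        r = dens / dens.sum(axis=1, keepdims=True)
        log_liks.append(np.log(dens.sum(axis=1)).sum())
        # M-step: closed-form argmax of Q(theta, theta_old)
        N_k = r.sum(axis=0)
        pi = N_k / len(x)
        mu = (r * x[:, None]).sum(axis=0) / N_k
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / N_k)
    return pi, mu, sigma, log_liks

# Toy data from a known two-component mixture
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 700)])
pi, mu, sigma, log_liks = em_gmm(x)
print(pi, mu, sigma)   # should roughly recover (0.3, 0.7), (-2, 3), (1, 0.5)
```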

Convergence of EM Algorithm

In many situations, the EM algorithm can be performed more efficiently than directly optimizing the likelihood. However, this does not mean that it always converges to the global optimum of the likelihood.

Nevertheless, the following chain of (in)equalities is guaranteed, so the likelihood never decreases from one iteration to the next:

\[\begin{aligned} \text{ln } p_{\theta_{t-1}} (X) &= F(p_{\theta_{t-1}}(Z \mid X), \theta_{t-1}) + \text{KL}(p_{\theta_{t-1}}(Z \mid X) \; \| \; p_{\theta_{t-1}}(Z \mid X)) \\ &= F(p_{\theta_{t-1}}(Z \mid X), \theta_{t-1}) \\ &= Q(\theta_{t-1}, \theta_{t-1}) + \mathbb{H}(p_{\theta_{t-1}} (\cdot \mid X)) \\ &\leq Q(\theta_{t}, \theta_{t-1}) + \mathbb{H}(p_{\theta_{t-1}} (\cdot \mid X)) \\ &= F(p_{\theta_{t-1}} (Z \mid X), \theta_{t}) \\ &\leq F(p_{\theta_{t}} (Z \mid X), \theta_{t}) \\ &= F(p_{\theta_{t}} (Z \mid X), \theta_{t}) + \text{KL}(p_{\theta_{t}}(Z \mid X) \; \| \; p_{\theta_{t}}(Z \mid X)) \\ &= \text{ln } p_{\theta_{t}} (X) \end{aligned}\]

Here, both KL terms vanish because their two arguments coincide; the first inequality holds because the M-step chooses $\theta_t = \underset{\theta}{\text{arg max }} Q(\theta, \theta_{t-1})$, and the second because $q = p_{\theta_t}(Z \mid X)$ maximizes $F(q, \theta_t)$ by the remark above.
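
This monotonicity is easy to check numerically with the `em_gmm` sketch above, which records $\text{ln } p_{\theta_t}(X)$ at the parameters entering each iteration:

```python
# Reuses log_liks from the em_gmm sketch above; EM guarantees that the
# log-likelihood sequence is non-decreasing (up to numerical tolerance).
import numpy as np
assert np.all(np.diff(log_liks) >= -1e-8)
print("final log-likelihood:", log_liks[-1])
```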



Variational EM Algorithm

Note that we can perform the EM algorithm only if we can find $p_{\theta} (z \mid x)$ analytically, which is generally intractable once the probabilistic model becomes complex. In that case, we should find another distribution $q(z)$ that maximizes the ELBO.

$\mathbf{Algorithm.}$ Variational EM Algorithm (a toy numerical sketch is given after the steps below).

  • E-step
    • Select the distribution $q(Z)$ that maximizes the ELBO, which gives the tightest bound on the likelihood (evidence) $\text{ln } p_{\theta_{t-1}} (X)$. It should also keep the corresponding $Q(\theta, \theta_{t-1}) = \mathbb{E}_{q(Z)}[\text{ln } p_\theta (X, Z)]$ tractable to compute.
    • Note that the family of variational distributions is usually parametrized; for example, a family of normal distributions with two parameters $\mu$ and $\sigma^2$.
    • Find the parameters $\theta_q^\star$ of $q$ that maximize the ELBO $F(q_{\theta_q}, \theta_{t-1})$, and set $q^\star (Z) = q_{\theta_q^\star} (Z)$.
  • M-step
    • $\theta_{t} \leftarrow \underset{\theta}{\text{arg max }} F(q^\star, \theta) = \underset{\theta}{\text{arg max }} Q(\theta, \theta_{t-1})._\blacksquare$
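
To make the alternation concrete, below is a minimal sketch on a toy non-conjugate model of my own choosing: $z_n \sim \mathcal{N}(0, 1)$ and $x_n \mid z_n \sim \text{Poisson}(e^{\theta + z_n})$, for which $p_\theta(z_n \mid x_n)$ has no closed form. The variational family is $q(z_n) = \mathcal{N}(m_n, s_n^2)$, so the E-step fits $(m_n, s_n)$ numerically, while the M-step happens to be available in closed form for this particular model. All names and constants are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Toy non-conjugate model: z_n ~ N(0, 1), x_n | z_n ~ Poisson(exp(theta + z_n))
rng = np.random.default_rng(0)
theta_true = 1.5
z = rng.normal(size=200)
x = rng.poisson(np.exp(theta_true + z)).astype(float)

def neg_elbo_n(params, x_n, theta):
    """Negative per-datum ELBO for q(z_n) = N(m, s^2), dropping theta- and
    q-independent constants such as -log(x_n!)."""
    m, log_s = params
    s2 = np.exp(2.0 * log_s)
    # E_q[log p(x_n | z_n, theta)] = x_n (theta + m) - exp(theta + m + s^2/2) + const
    ell = x_n * (theta + m) - np.exp(theta + m + 0.5 * s2)
    # E_q[log p(z_n)] + H(q(z_n)) = -0.5 (m^2 + s^2) + log s + const
    ell += -0.5 * (m ** 2 + s2) + log_s
    return -ell

theta = 0.0
m, log_s = np.zeros(len(x)), np.zeros(len(x))
for _ in range(20):
    # E-step: maximize the ELBO over the variational parameters of each q(z_n)
    for n in range(len(x)):
        m[n], log_s[n] = minimize(neg_elbo_n, [m[n], log_s[n]], args=(x[n], theta)).x
    # M-step: solve d(ELBO)/d(theta) = sum(x_n) - e^theta * sum(exp(m_n + s_n^2/2)) = 0
    theta = np.log(x.sum() / np.exp(m + 0.5 * np.exp(2.0 * log_s)).sum())

print("estimated theta:", theta)   # should land reasonably close to theta_true = 1.5
```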


Mean Field EM Algorithm

One of the best-known examples of variational EM algorithms is the mean-field approximation. The variational distribution $q(z)$ is assumed to factorize over some partition of the latent variables $z$ into $z_1, z_2, \cdots$. That is, we assume the latent variables are independent, so that their joint pdf is given by

\[\begin{aligned} q(z) = \prod_i q(z_i). \end{aligned}\]

Although this independence assumption is a crude approximation, it allows us to update the pdf of each latent variable separately, and it has been successful in many interesting problems, for example in statistical physics.

Then, we can perform the variational EM algorithm with this $q$. For each latent variable $z_j$, the ELBO becomes

\[\begin{aligned} F(q, \theta) &= \mathbb{E}_{q(z)} \left[\text{ln } \frac{p_{\theta} (x, z)}{q(z)} \right] \\ &= \int q(z) \cdot \text{ln } p_\theta (x, z) \; dz - \int q(z) \cdot \text{ln } q(z) \; dz \\ &= \int \prod_i q(z_i) \cdot \text{ln } p_\theta (x, z) \; dz_1 \, dz_2 \cdots - \sum_i \int q(z_i) \cdot \text{ln } q(z_i) \; dz_i \\ &= \int q(z_j) \left[ \int \prod_{i \neq j} q(z_i) \cdot \text{ln } p_\theta (x, z) \prod_{i \neq j} dz_i \right] dz_j - \int q(z_j) \cdot \text{ln } q(z_j) \; dz_j \\ &\quad - \sum_{i \neq j} \int q(z_i) \cdot \text{ln } q(z_i) \; dz_i \\ &= \int q(z_j) \cdot \text{ln } \left(\frac{\text{exp } \mathbb{E}_{i \neq j} [\text{ln }p_\theta (x, z)]}{q(z_j)} \right) dz_j - \sum_{i \neq j} \int q(z_i) \cdot \text{ln } q(z_i) \; dz_i \\ &= \int q(z_j) \cdot \text{ln } \left(\frac{\widetilde{p}_{i \neq j}}{q(z_j)} \right) dz_j - \sum_{i \neq j} \int q(z_i) \cdot \text{ln } q(z_i) \; dz_i + \text{ln } C \\ &= - \text{KL} (q(z_j) \; \| \; \widetilde{p}_{i \neq j}) - \sum_{i \neq j} \int q(z_i) \cdot \text{ln } q(z_i) \; dz_i + \text{ln } C \end{aligned}\]

where

$\widetilde{p}_{i \neq j} = \frac{1}{C} \text{exp } \mathbb{E}_{i \neq j} [\text{ln }p_\theta (x, z)]$


Since $\text{exp } \mathbb{E}_{i \neq j} [\text{ln }p_\theta (x, z)]$ is not a normalized pdf, the normalizing constant $C$ (and hence the term $\text{ln } C$) is introduced. As the KL divergence is nonnegative, the ELBO with respect to $q(z_j)$ is maximized when

\[\begin{aligned} q(z_j) = \frac{1}{C} \text{exp } \mathbb{E}_{i \neq j} [\text{ln }p_\theta (x, z)] \end{aligned}\]
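
As a worked example, the sketch below runs these coordinate updates for a univariate Gaussian with unknown mean $\mu$ and precision $\tau$, the classic example treated in Bishop [1], under the factorization $q(\mu, \tau) = q(\mu) \, q(\tau)$. The priors, data, and variable names are my own illustrative choices; the two updates are exactly the formula above applied to each factor, which turns out to give a Gaussian $q(\mu)$ and a Gamma $q(\tau)$.

```python
import numpy as np

# Mean-field VB for x_n ~ N(mu, 1/tau) with priors
#   p(mu | tau) = N(mu0, 1/(lam0 * tau)),  p(tau) = Gamma(a0, b0),
# and the factorized approximation q(mu, tau) = q(mu) q(tau).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=500)
N, xbar, xsq = len(x), x.mean(), (x ** 2).sum()

mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3   # weak priors (illustrative choice)
E_tau = 1.0                                # initial guess for E_q[tau]

for _ in range(50):
    # Update q(mu) = N(mu_N, 1/lam_N), which depends on q(tau) only via E_q[tau]
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N ** 2 + 1.0 / lam_N
    # Update q(tau) = Gamma(a_N, b_N), which depends on q(mu) via E_q[mu], E_q[mu^2]
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (xsq - 2 * E_mu * x.sum() + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2))
    E_tau = a_N / b_N

print("E_q[mu]  =", mu_N)    # close to the true mean 2.0
print("E_q[tau] =", E_tau)   # close to the true precision 1 / 0.5**2 = 4.0
```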




Reference

[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
[2] Expectation-maximization: theory and intuition
[3] Wikipedia, Variational Bayesian methods
