[ML] EM Algorithm
EM Algorithm
The EM algorithm is a technique for iteratively obtaining maximum likelihood estimates in latent variable models. One of the best-known examples is the Gaussian mixture model. To better understand the EM algorithm, I recommend reading the previous post: Variational Inference.
$\mathbf{Algorithm.}$
It alternates between performing an expectation step and a maximization step.
- E-step: Derive the expected log-likelihood, where the expectation is taken with respect to the distribution of the latent variables under the current estimate of the parameters.
- M-step: Compute the parameter values that maximize the expected log-likelihood obtained in the E-step. These estimates are then used to determine the distribution of the latent variables in the next E-step.
Actually, we can view the EM algorithm as a special case of variational inference that is applicable when we can find $p_{\theta} (z \mid x)$ analytically. Recall that maximizing the ELBO approximately maximizes the likelihood $p_\theta (x)$, and that the main purpose of variational inference is to approximate the intractable posterior $p_\theta (z \mid x)$ by a tractable distribution $q(z)$. It is then natural to choose the posterior under the current parameter estimate, $p_{\theta_t} (z \mid x)$, as the candidate for $q$.
Then, with $q(z) = p_{\theta_t} (z \mid x)$, the ELBO becomes

$$F(q, \theta) = \mathbb{E}_{q(z)} [\text{ln } p_\theta (x, z)] - \mathbb{E}_{q(z)} [\text{ln } q(z)] = \underbrace{\mathbb{E}_{p_{\theta_t} (z \mid x)} [\text{ln } p_\theta (x, z)]}_{=: \, Q(\theta, \theta_t)} + \mathbb{H}(q).$$

The second term $\mathbb{H}(q)$, the entropy of $q$, is constant with respect to $\theta$ and thus can be ignored when optimizing over $\theta$.
$\mathbf{Remark.}$
Indeed, $q(z) = p_{\theta_t} (z \mid x)$ maximizes the ELBO at the current parameter estimate $\theta_t$, i.e., $\underset{q}{\text{arg max }} F(q, \theta_t) = p_{\theta_t} (z \mid x)$.
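To see why, one can use the standard decomposition of the evidence (a short derivation, written in the notation of this post):

$$\text{ln } p_{\theta_t} (x) = \underbrace{\mathbb{E}_{q(z)} \left[ \text{ln } \frac{p_{\theta_t} (x, z)}{q(z)} \right]}_{F(q, \, \theta_t)} + \text{KL}\big( q(z) \,\|\, p_{\theta_t} (z \mid x) \big).$$

The left-hand side does not depend on $q$ and the KL term is nonnegative, so $F(q, \theta_t)$ is maximized exactly when the KL term vanishes, i.e., when $q(z) = p_{\theta_t} (z \mid x)$.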
In summary, the EM algorithm is as follows (a minimal code sketch for a Gaussian mixture is given after the list):
- Initialize $\theta_0$ and set $t \leftarrow 0$
- Loop
  - $t \leftarrow t + 1$
  - E-step: compute $p_{\theta_{t-1}} (z \mid x)$ and set $q \leftarrow p_{\theta_{t-1}} (z \mid x)$, which determines the new bound $F(q, \theta)$ and hence $Q(\theta, \theta_{t-1})$
  - M-step: $\theta_{t} \leftarrow \underset{\theta}{\text{arg max }} F(p_{\theta_{t-1}} (z \mid x), \theta) = \underset{\theta}{\text{arg max }} Q(\theta, \theta_{t-1})$
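To make the loop concrete, here is a minimal numeric sketch of EM for a one-dimensional mixture of two Gaussians, the example mentioned at the beginning of this post. The function name `em_gmm_1d`, the synthetic data, and the initialization are my illustrative choices, not something prescribed by the algorithm itself.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # theta = (pi, mu, var): mixing weight of component 0, means, variances.
    pi, mu = 0.5, np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities p_theta(z_n = k | x_n) under current theta.
        w = np.array([pi, 1.0 - pi])
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        joint = w * dens                                  # p_theta(x_n, z_n = k)
        resp = joint / joint.sum(axis=1, keepdims=True)
        log_liks.append(np.log(joint.sum(axis=1)).sum())  # ln p_theta(x)
        # M-step: closed-form arg max of Q(theta, theta_old) for a GMM.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk[0] / len(x)
    return (pi, mu, var), np.array(log_liks)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
(pi, mu, var), log_liks = em_gmm_1d(x)
```

The E-step only needs the posterior responsibilities, and the M-step has closed-form updates because $Q(\theta, \theta_{t-1})$ can be maximized analytically for Gaussian mixtures.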
Convergence of EM Algorithm
In many situations, the EM algorithm can be performed more efficiently than directly optimizing the likelihood. However, this does not mean that it always converges to the global optimum of the likelihood.
Nevertheless, the following inequalities are guaranteed at every iteration:

$$\text{ln } p_{\theta_{t-1}} (x) = F\big(p_{\theta_{t-1}} (z \mid x), \theta_{t-1}\big) \le F\big(p_{\theta_{t-1}} (z \mid x), \theta_{t}\big) \le \text{ln } p_{\theta_{t}} (x),$$

so the log-likelihood never decreases from one iteration to the next.
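As a quick sanity check of these inequalities, the log-likelihood trace recorded by the `em_gmm_1d` sketch above (my illustrative helper, not part of the original derivation) should be non-decreasing:

```python
# log_liks was returned by em_gmm_1d above; EM must not decrease ln p_theta(x)
# from one iteration to the next (up to numerical tolerance).
assert np.all(np.diff(log_liks) >= -1e-8)
```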
Variational EM Algorithm
Note that we can perform the EM algorithm only if we can find $p_{\theta} (z \mid x)$ analytically, which is generally not possible when the probabilistic model becomes complex. In that case, we instead look for another distribution $q(z)$ that maximizes the ELBO.
$\mathbf{Algorithm.}$ Variational EM Algorithm (a toy code sketch follows the steps below)
- E-step
  - Select the distribution $q(Z)$ that maximizes the ELBO, which gives the tightest bound on the likelihood (evidence) $\text{ln } p_{\theta_{t-1}} (X)$. It should also allow computing $Q(\theta, \theta_{t-1})$.
  - Note that the family of variational distributions is usually parametrized; for example, the family of normal distributions with two parameters $\mu$ and $\sigma^2$.
  - Find the variational parameter $\theta_q^\star$ that maximizes the ELBO and set $q^\star (Z) = q_{\theta_q^\star} (Z)$.
- M-step
  - $\theta_{t} \leftarrow \underset{\theta}{\text{arg max }} Q(\theta, \theta_{t-1})._\blacksquare$
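To make the two steps concrete, here is a toy sketch of the alternation. The model and all names are illustrative assumptions of mine: a single latent $z \sim \mathcal{N}(0, 1)$, observations $x_n \mid z \sim \mathcal{N}(z, \theta)$ with unknown noise variance $\theta$, and a variational family $q(z) = \mathcal{N}(m, s^2)$. The E-step fits $(m, s)$ by numerically maximizing the ELBO (which is available in closed form for this model); the M-step maximizes $\mathbb{E}_q [\text{ln } p_\theta (x, z)]$ over $\theta$. Since this Gaussian family happens to contain the exact posterior, the sketch reduces to exact EM; it is only meant to show the mechanics.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=rng.normal(), scale=1.5, size=200)   # synthetic data
N = len(x)

def neg_elbo(params, theta):
    """Negative ELBO F(q, theta) for q(z) = N(m, s^2) (closed form here)."""
    m, log_s = params
    s2 = np.exp(2 * log_s)
    exp_loglik = -0.5 * N * np.log(2 * np.pi * theta) \
                 - (np.sum((x - m) ** 2) + N * s2) / (2 * theta)   # E_q[ln p(x|z)]
    exp_logprior = -0.5 * np.log(2 * np.pi) - (m ** 2 + s2) / 2    # E_q[ln p(z)]
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)                  # H(q)
    return -(exp_loglik + exp_logprior + entropy)

theta, params = 1.0, np.array([0.0, 0.0])   # initial theta and (m, log s)
for _ in range(20):
    # E-step: maximize the ELBO over the variational parameters (m, log s).
    params = minimize(neg_elbo, params, args=(theta,)).x
    m, s2 = params[0], np.exp(2 * params[1])
    # M-step: arg max over theta of E_q[ln p_theta(x, z)], closed form here.
    theta = np.mean((x - m) ** 2) + s2

print("estimated noise variance:", theta)
```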
Mean Field EM Algorithm
One of the most well-known examples of the variational EM algorithm is the mean field approximation. The variational distribution $q(z)$ is assumed to factorize over some partition of the latent variables $z$ into blocks $z_1, z_2, \cdots$. That is, we assume the blocks of latent variables are independent under $q$, so that their joint pdf is given by

$$q(z) = \prod_j q_j (z_j).$$
Although this independence assumption is only an approximation, it allows us to update the pdf of each block of latent variables separately, and it has been successful in many problems, originally in statistical physics.
Then, we can perform the variational EM algorithm with this $q$, updating one factor $q_j (z_j)$ at a time. Viewed as a function of $q_j$ alone (with the other factors fixed), the ELBO becomes

$$F(q, \theta) = \int q_j (z_j)\, \mathbb{E}_{i \neq j} [\text{ln } p_\theta (x, z)]\, dz_j - \int q_j (z_j)\, \text{ln } q_j (z_j)\, dz_j + \text{const} = -\text{KL}\big(q_j \,\|\, \tilde{p}_j\big) + \text{const},$$

where

$$\text{ln } \tilde{p}_j (z_j) = \mathbb{E}_{i \neq j} [\text{ln } p_\theta (x, z)] + \text{ln } C,$$

and $\mathbb{E}_{i \neq j} [\cdot]$ denotes the expectation with respect to $\prod_{i \neq j} q_i (z_i)$. Since $\text{exp } \mathbb{E}_{i \neq j} [\text{ln } p_\theta (x, z)]$ is not a proper pdf in general, the constant $\text{ln } C$ is added so that $\tilde{p}_j$ integrates to one. As the KL divergence is nonnegative, the ELBO is maximized when

$$q_j^\star (z_j) = \tilde{p}_j (z_j) \propto \text{exp } \mathbb{E}_{i \neq j} [\text{ln } p_\theta (x, z)].$$
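As a concrete instance of the update $q_j^\star (z_j) \propto \text{exp } \mathbb{E}_{i \neq j} [\text{ln } p_\theta (x, z)]$, below is a sketch of the standard textbook example with two factors: a univariate Gaussian with unknown mean $\mu$ and precision $\tau$, where $q(\mu, \tau) = q(\mu)\, q(\tau)$ with $q(\mu)$ Gaussian and $q(\tau)$ Gamma. The hyperparameter values and variable names are my illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=100)      # data with unknown mean and precision
N, xbar = len(x), x.mean()

# Prior hyperparameters (illustrative): p(mu | tau) = N(mu0, 1/(lam0*tau)),
# p(tau) = Gamma(a0, rate=b0).
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

e_tau = 1.0                             # initial guess for E_q[tau]
for _ in range(50):
    # Update q(mu) = N(mu_n, 1/lam_n): ln q(mu) = E_tau[ln p(x, mu, tau)] + const.
    mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_n = (lam0 + N) * e_tau
    # Update q(tau) = Gamma(a_n, b_n): ln q(tau) = E_mu[ln p(x, mu, tau)] + const,
    # using E_mu[(x_n - mu)^2] = (x_n - mu_n)^2 + 1/lam_n.
    a_n = a0 + (N + 1) / 2
    b_n = b0 + 0.5 * (np.sum((x - mu_n) ** 2) + lam0 * (mu_n - mu0) ** 2
                      + (N + lam0) / lam_n)
    e_tau = a_n / b_n

print("E_q[mu] =", mu_n, " E_q[tau] =", e_tau)
```

Each pass of the loop performs the two coordinate updates in turn, which is exactly the block-wise maximization of the ELBO described above.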