
Gaussian Mixture Model (GMM)

A Gaussian Mixture Model (GMM) assumes that the data points $\mathbf{x}_1, \cdots, \mathbf{x}_N$ are generated from a mixture of $K$ Gaussian distributions:

$p(\mathbf{x}_i) = \sum_{k=1}^K \pi_k N(\mathbf{x}_i | \mathbf{\mu}_k, \mathbf{\Sigma}_k)$

where $\pi_k \geq 0$ and $\sum_{k=1}^K \pi_k = 1$. Here $\pi_k$ is the probability that the $k$-th Gaussian component is chosen.

Recall that, for $\mathbf{x_i} \in \mathbb{R}^d, N(\mathbf{x_i}; \mathbf{\mu_k}, \mathbf{\Sigma_k}) = \frac{1}{ \sqrt{(2\pi)^d \text{det}(\mathbf{\Sigma_k})}} e^{-\frac{1}{2}(\mathbf{x_i} - \mu_k)^T \mathbf{\Sigma_k}^{-1} (\mathbf{x_i} - \mu_k)}$
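
As a quick sanity check, the mixture density above is easy to evaluate directly. The snippet below is a minimal NumPy/SciPy sketch; the two-component parameters are made-up values for illustration only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) at a single point x."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

# Made-up two-component mixture in R^2, purely for illustration.
pis    = [0.6, 0.4]
mus    = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 1.0]])]

print(gmm_density(np.array([1.0, 1.0]), pis, mus, Sigmas))
```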

We use a latent random variable $\mathbf{z}_i = (z_{i1}, \cdots, z_{iK})$ with $z_{ik} \in \{0, 1\}$, $\sum_{k=1}^K z_{ik} = 1$, and

$p(z_{ik} = 1) = \pi_k$, $1 \leq k \leq K$.
$p(\mathbf{z}_i) = \prod_{k=1}^K \pi_{k}^{z_{ik}}$


That is, $\mathbf{z}_i$ is a one-hot vector: if $\mathbf{x}_i$ is generated from the $k$-th Gaussian distribution, then $z_{ik} = 1$.

Thus, note that

$p(\mathbf{x}_i | \mathbf{z}_i) = \prod_{k=1}^K N(\mathbf{x}_i | \mathbf{\mu}_k, \mathbf{\Sigma}_k)^{z_{ik}}$, i.e., $p(\mathbf{x}_i | z_{ik} = 1) = N(\mathbf{x}_i | \mathbf{\mu}_k, \mathbf{\Sigma}_k)$.


We now consider the joint probability of all data points $\mathbf{x} = \{ \mathbf{x}_1, \cdots, \mathbf{x}_N \}$ and $\mathbf{z} = \{ \mathbf{z}_1, \cdots, \mathbf{z}_N \}$:

\[\begin{align*} p(\mathbf{x}, \mathbf{z})& = \prod_{i=1}^N p(\mathbf{x}_i | \mathbf{z}_i) p(\mathbf{z_i}) \\ &= \prod_{i=1}^N \prod_{k=1}^K \pi_{k}^{z_{ik}} N (\mathbf{x}_i | \mathbf{\mu}_k, \mathbf{\Sigma}_k)^{z_{ik}}. \end{align*}\]

It is important to obtain the posterior $p(z_{ik} = 1 | \mathbf{x}_i)$. Observe that

\[\begin{align*} p(z_{ik} = 1 | \mathbf{x}_i) &= \frac{p(z_{ik} = 1, \mathbf{x}_i)}{p(\mathbf{x}_i)} \\ &= \frac{p(z_{ik} = 1, \mathbf{x}_i)}{\sum_{l=1}^K p(\mathbf{x}_i, z_{il} = 1)} \\ &= \frac{p(\mathbf{x}_i | z_{ik} = 1)p(z_{ik} = 1)}{\sum_{l=1}^K p(\mathbf{x}_i | z_{il} = 1)p(z_{il} = 1)} \\ &= \frac{\pi_k N(\mathbf{x}_i | \mu_k, \mathbf{\Sigma}_k)}{\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mu_l, \mathbf{\Sigma}_l)} \end{align*}\]
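
This posterior is often called the responsibility of component $k$ for point $\mathbf{x}_i$. Below is a minimal sketch of computing it for all points at once; it works in log space to avoid underflow, and the parameters passed in at the end are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pis, mus, Sigmas):
    """Return the (N, K) array of p(z_ik = 1 | x_i)."""
    # log(pi_k) + log N(x_i | mu_k, Sigma_k) for every i, k
    log_w = np.column_stack([
        np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    # Normalize over k in log space to avoid numerical underflow.
    return np.exp(log_w - logsumexp(log_w, axis=1, keepdims=True))

# Tiny usage with arbitrary parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
gamma = responsibilities(X, [0.6, 0.4],
                         [np.zeros(2), np.ones(2)],
                         [np.eye(2), np.eye(2)])
print(gamma.sum(axis=1))  # each row sums to 1
```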

In the GMM, we have the following parameters to learn:

$\mathbf{\theta} := \{ \pi_k, \mu_k, \Sigma_k | 1 \leq k \leq K \}.$


To this end, we consider the log-likelihood of $\mathbf{x}$ and find the MLE:

\[\begin{align*} \mathbf{\theta^*} &= \underset{\mathbf{\theta}}{\text{argmax }} p(\mathbf{x} | \mathbf{\theta}) \\ &= \underset{\mathbf{\theta}}{\text{argmax }} \sum_{i=1}^N \text{log }(\sum_{k=1}^K \pi_k N(\mathbf{x}_i | \mu_k, \mathbf{\Sigma}_k)) \end{align*}\]
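
The objective itself is cheap to evaluate; a small sketch using `logsumexp` for numerical stability is given below. Monitoring this quantity is a common convergence check for the EM iterations described later.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """sum_i log sum_k pi_k * N(x_i | mu_k, Sigma_k), computed in log space."""
    log_w = np.column_stack([
        np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    return logsumexp(log_w, axis=1).sum()
```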

Training the model

We are now ready to learn the model. The learning process of the GMM is very similar to that of $k$-means clustering: if we know the latent variables $\mathbf{z}_i$, we can estimate the Gaussian parameters $\mu_k$ and $\mathbf{\Sigma}_k$; conversely, if we know the Gaussian parameters $\mu_k$ and $\mathbf{\Sigma}_k$, we can infer the latent variables $\mathbf{z}_i$.

Let $J (\mathbf{\theta}) = \sum_{i=1}^N \text{log}(\sum_{k=1}^K \pi_k N(\mathbf{x}_i; \mu_k, \mathbf{\Sigma}_k)).$

Finding the optimal $\mathbf{\mu_k}$

We have

\[\begin{align*} \nabla_{\mathbf{\mu}_k} J (\mathbf{\theta}) &= \sum_{i=1}^N \frac {\pi_k N(\mathbf{x}_i | \mu_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mu_l, \mathbf{\Sigma}_l)} \mathbf{\Sigma}_{k}^{-1} (\mathbf{x}_i - \mathbf{\mu}_k) \\ & = \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i) \mathbf{\Sigma}_{k}^{-1} (\mathbf{x}_i - \mathbf{\mu}_k). \end{align*}\]

By letting $\nabla_{\mathbf{\mu}_k} J (\mathbf{\theta}) = \mathbf{0}$,

$\sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i) \mathbf{\Sigma}_{k}^{-1} \mathbf{x}_i = \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i) \mathbf{\Sigma}_{k}^{-1} \mathbf{\mu}_k$


Then,

$\mu_k = \sum_{i=1}^N \frac {p(z_{ik} = 1 | \mathbf{x_i}) } {\sum_{j=1}^N p(z_{jk} = 1 | \mathbf{x_j}) } \mathbf{x_i}$

The denominator indicates the effective number of data points in cluster $k$.
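
In code, this update is a responsibility-weighted average of the data. Here is a minimal NumPy sketch, assuming the responsibilities $p(z_{ik} = 1 | \mathbf{x}_i)$ have already been computed as an $(N, K)$ array `gamma`:

```python
import numpy as np

def update_means(X, gamma):
    """mu_k = sum_i gamma_ik x_i / sum_j gamma_jk, for X of shape (N, d), gamma of shape (N, K)."""
    N_k = gamma.sum(axis=0)              # effective number of points per cluster, shape (K,)
    return (gamma.T @ X) / N_k[:, None]  # means, shape (K, d)
```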

Finding the optimal $\mathbf{\Sigma_k}$

Similarly,

\[\begin{align*} \frac {\partial}{\partial \mathbf{\Sigma}_k} J (\mathbf{\theta}) = &-\frac{1}{2} \sum_{i=1}^N \frac {\pi_k N(\mathbf{x}_i | \mu_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mu_l, \mathbf{\Sigma}_l)} \mathbf{\Sigma}_{k}^{-1} \\ &+ \frac{1}{2} \sum_{i=1}^N \frac {\pi_k N(\mathbf{x}_i | \mu_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mu_l, \mathbf{\Sigma}_l)} \mathbf{\Sigma}_{k}^{-1} (\mathbf{x}_i - \mu_k) (\mathbf{x}_i - \mu_k)^T \mathbf{\Sigma}_{k}^{-1} \\ = &-\frac{1}{2} \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i) \mathbf{\Sigma}_{k}^{-1} \\ &+ \frac{1}{2} \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i) \mathbf{\Sigma}_{k}^{-1} (\mathbf{x}_i - \mu_k) (\mathbf{x}_i - \mu_k)^T \mathbf{\Sigma}_{k}^{-1} = \mathbf{0}, \end{align*}\]
we get $\mathbf{\Sigma_k} = \sum_{i=1}^N \frac {p(z_{ik} = 1 | \mathbf{x_i})} {\sum_{j=1}^N p(z_{jk} = 1 | \mathbf{x_j})} (\mathbf{x}_i - \mu_k) (\mathbf{x}_i - \mu_k)^T$.

For this derivation, we need some remarks on matrix derivatives:

$\mathbf{Remark\ 1.}$ Derivatives of determinant
Let $\text{det}(\mathbf{A})$ be the determinant of matrix $\mathbf{A} = (a_{ij})$ of size $n$, and $\mathbf{C} = (c_{ij})$ be the cofactor matrix of $\mathbf{A}$. We know that $\text{det}(\mathbf{A}) = \sum_{i=1}^n a_{ij} c_{ij}$ for any $j$ ( $= \sum_{j=1}^n a_{ij} c_{ij}$ for any $i$), and $\mathbf{A}^{-1} = \frac{1}{\text{det}(\mathbf{A})} \text{adj} (\mathbf{A})$, where $\text{adj} (\mathbf{A}) = \mathbf{C}^\top$. Then, the derivatives of the determinant are given by

$\frac {\partial} {\partial a_{ij}} \text{det} (\mathbf{A}) = c_{ij} = \text{det} (\mathbf{A}) (\mathbf{A}^{-1})_{ji}.$
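
This identity is easy to sanity-check numerically; the sketch below compares a finite-difference estimate of $\frac {\partial} {\partial a_{ij}} \text{det}(\mathbf{A})$ with $\text{det}(\mathbf{A}) (\mathbf{A}^{-1})_{ji}$ for a random matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
i, j, eps = 1, 2, 1e-6

# Finite-difference estimate of d det(A) / d a_ij.
A_eps = A.copy()
A_eps[i, j] += eps
fd = (np.linalg.det(A_eps) - np.linalg.det(A)) / eps

# Closed form from Remark 1: det(A) * (A^{-1})_{ji}.
closed = np.linalg.det(A) * np.linalg.inv(A)[j, i]

print(fd, closed)  # the two values agree up to finite-difference error
```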


$\mathbf{Remark\ 2.}$ Derivatives of inverse matrix
From $\mathbf{A} \mathbf{A}^{-1} = \mathbf{I}$, differentiating both sides gives $\frac {d \mathbf{A}^{-1}} {d \theta} = - \mathbf{A}^{-1} \frac {d \mathbf{A}} {d \theta} \mathbf{A}^{-1}$, and hence we see that

$\frac {\partial} {\partial a_{mn}} (\mathbf{A}^{-1})_{ij} = - (\mathbf{A}^{-1})_{im} (\mathbf{A}^{-1})_{nj}.$


$\mathbf{Remark\ 3.}$ Derivatives of $f(\mathbf{A}) = \mathbf{y}^\top \mathbf{A}^{-1} \mathbf{y}$
Since $\mathbf{y}^\top \mathbf{A}^{-1} \mathbf{y} = \sum_i \sum_j y_i y_j (\mathbf{A}^{-1})_{ij}$, we have

\[\begin{align*} \frac {\partial f(\mathbf{A})} {\partial a_{mn}} &= \sum_i \sum_j y_i y_j \frac {\partial} {\partial a_{mn}} (\mathbf{A}^{-1})_{ij} \\ &= - \sum_i \sum_j y_i y_j (\mathbf{A}^{-1})_{im} (\mathbf{A}^{-1})_{nj} \\ &= - (\mathbf{y}^\top \mathbf{A}^{-1})_m (\mathbf{A}^{-1} \mathbf{y})_n, \end{align*}\]

i.e.,

$\frac {\partial f(\mathbf{A})} {\partial \mathbf{A}} = - (\mathbf{A}^{-1})^\top \mathbf{y} \mathbf{y}^\top (\mathbf{A}^{-1})^\top$
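
Using the same $(N, K)$ responsibility array `gamma` as before, the $\mathbf{\Sigma}_k$ update derived above can be sketched as follows (a minimal illustration, with `mus` assumed to be a $(K, d)$ array of component means):

```python
import numpy as np

def update_covariances(X, gamma, mus):
    """Sigma_k = sum_i gamma_ik (x_i - mu_k)(x_i - mu_k)^T / sum_j gamma_jk."""
    N, d = X.shape
    K = gamma.shape[1]
    Sigmas = np.empty((K, d, d))
    for k in range(K):
        diff = X - mus[k]  # shape (N, d)
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / gamma[:, k].sum()
    return Sigmas
```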


Finding the optimal $\pi_k$

Now we are ready to solve the optimization problem for $\pi_k$. Taking the constraint $\sum_{k=1}^K \pi_k = 1$ into account with a Lagrange multiplier $\lambda$, we formulate

$L (\mathbf{\theta}) := \text{log } p(\mathbf{x} | \mathbf{\theta}) + \lambda (\sum_{k=1}^K \pi_k - 1) = J (\mathbf{\theta}) + \lambda (\sum_{k=1}^K \pi_k - 1).$


From $\frac {\partial} {\partial \pi_k} L(\mathbf{\theta}) = 0$, i.e.,

$\sum_{i=1}^N \frac{N(\mathbf{x}_i | \mathbf{\mu}_k, \Sigma_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mathbf{\mu}_l, \Sigma_l)} + \lambda = 0, 1 \leq k \leq K$


we get

$\frac{1}{\pi_k} \sum_{i=1}^N \frac{\pi_k N(\mathbf{x}_i | \mathbf{\mu}_k, \Sigma_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mathbf{\mu}_l, \Sigma_l)} + \lambda = 0, 1 \leq k \leq K$


Recall that

$p(z_{ik} = 1 | \mathbf{x}_i) = \frac{\pi_k N(\mathbf{x}_i | \mathbf{\mu}_k, \Sigma_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mathbf{\mu}_l, \Sigma_l)}$


So, it follows that $\pi_k = -\frac{1}{\lambda} \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i)$.

From $\sum_{k=1}^K \pi_k = 1$, $\lambda$ satisfies

\[\begin{align*} \lambda &= - \sum_{k=1}^K \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i) \\ &= - \sum_{i=1}^N \sum_{k=1}^K p (z_{ik} = 1 | \mathbf{x}_i) = - N. \end{align*}\]
$\therefore \pi_k = \frac{1}{N} \sum_{i=1}^N p(z_{ik}=1 | \mathbf{x}_i).$
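
In code this update is a one-liner: each mixing weight is simply the mean responsibility of its component over the data set, assuming, as before, an $(N, K)$ responsibility array `gamma`:

```python
def update_weights(gamma):
    """pi_k = (1/N) * sum_i gamma_ik, i.e. the column means of the responsibility array."""
    return gamma.mean(axis=0)  # shape (K,)
```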

EM Algorithm

| parameter | MLE |
| --- | --- |
| $\mu_k$ | $\sum_{i=1}^N \frac {p(z_{ik} = 1 \mid \mathbf{x_i})} {\sum_{j=1}^N p(z_{jk} = 1 \mid \mathbf{x_j})} \mathbf{x_i}$ |
| $\mathbf{\Sigma_k}$ | $\sum_{i=1}^N \frac {p(z_{ik} = 1 \mid \mathbf{x_i})} {\sum_{j=1}^N p(z_{jk} = 1 \mid \mathbf{x_j})} (\mathbf{x_i} - \mu_k) (\mathbf{x_i} - \mu_k)^T$ |
| $\pi_k$ | $\frac{1}{N} \sum_{i=1}^N p(z_{ik}=1 \mid \mathbf{x_i})$ |

Note that all of the above solutions depend on $p(z_{ik} = 1 | \mathbf{x}_i)$, which in turn depends on the parameters $\mathbf{\theta}$. From this, we have the following EM algorithm:

  • Initialization

$K$-means clustering is often used to initialize the parameters $\mathbf{\theta}$.

  • E step

Using the current parameters $\mathbf{\theta}$, compute

$p(z_{ik} = 1 | \mathbf{x}_i) = \frac{\pi_k N(\mathbf{x}_i | \mathbf{\mu}_k, \Sigma_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mathbf{\mu}_l, \Sigma_l)}$


  • M step

Update all parameters $\mathbf{\theta}$

$\mu_k = \sum_{i=1}^N \frac {p(z_{ik} = 1 | \mathbf{x_i})} {\sum_{j=1}^N p(z_{jk} = 1 | \mathbf{x_j})} \mathbf{x_i}$
$\mathbf{\Sigma_k} = \sum_{i=1}^N \frac {p(z_{ik} = 1 | \mathbf{x_i})} {\sum_{j=1}^N p(z_{jk} = 1 | \mathbf{x_j})} (\mathbf{x_i} - \mu_k) (\mathbf{x_i} - \mu_k)^T$
$\pi_k = \frac{1}{N} \sum_{i=1}^N p(z_{ik}=1 | \mathbf{x_i})$


Repeat E step and M step until convergence.
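
Putting everything together, here is a minimal, self-contained sketch of the full EM loop. It is illustrative only: randomly chosen data points stand in for the $k$-means initialization mentioned above, a small ridge term is added to each covariance for numerical stability (not part of the derivation), and the stopping rule monitors the log-likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def fit_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM for a GMM.  X: (N, d) data, K: number of components."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Simple initialization (random data points instead of k-means, for brevity).
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E step: responsibilities gamma_ik = p(z_ik = 1 | x_i).
        log_w = np.column_stack([
            np.log(pis[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])
        log_norm = logsumexp(log_w, axis=1, keepdims=True)
        gamma = np.exp(log_w - log_norm)

        # M step: closed-form updates derived above.
        N_k = gamma.sum(axis=0)
        pis = N_k / N
        mus = (gamma.T @ X) / N_k[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / N_k[k]
            Sigmas[k] += 1e-6 * np.eye(d)  # small ridge for numerical stability

        # Convergence check on the log-likelihood.
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return pis, mus, Sigmas, gamma

# Usage on synthetic data drawn from two Gaussians.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(4.0, 1.0, size=(200, 2))])
pis, mus, Sigmas, gamma = fit_gmm(X, K=2)
print(pis, mus)
```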

