[ML] Gaussian Mixture Model (GMM)
Gaussian Mixture Model (GMM) assumes that data points $\mathbf{x}_1, \cdots, \mathbf{x}_N$ are generated by different Gaussian distributions as

$p(\mathbf{x}_i) = \sum_{k=1}^K \pi_k N(\mathbf{x}_i; \mathbf{\mu}_k, \mathbf{\Sigma}_k),$

where $\sum_{k=1}^K \pi_k = 1$. Here, $\pi_k$ is the probability that the $k$-th Gaussian distribution is chosen.
Recall that, for $\mathbf{x_i} \in \mathbb{R}^d, N(\mathbf{x_i}; \mathbf{\mu_k}, \mathbf{\Sigma_k}) = \frac{1}{ \sqrt{(2\pi)^d \text{det}(\mathbf{\Sigma_k})}} e^{-\frac{1}{2}(\mathbf{x_i} - \mu_k)^T \mathbf{\Sigma_k}^{-1} (\mathbf{x_i} - \mu_k)}$
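As a concrete check of the density formula above, here is a minimal NumPy sketch that evaluates $N(\mathbf{x}_i; \mathbf{\mu}_k, \mathbf{\Sigma}_k)$ for a single point; the helper name `gaussian_pdf` and the example numbers are illustrative only.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate N(x; mu, Sigma) for a single point x in R^d, following the formula above."""
    d = x.shape[0]
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return norm_const * np.exp(exponent)

# Example: a 2-D Gaussian evaluated at one point.
x = np.array([1.0, 0.5])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_pdf(x, mu, Sigma))
```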
We use a latent random variable $\mathbf{z}_i = (z_{i1}, \cdots, z_{iK})$ with $z_{ik} \in \{0, 1\}$, $\sum_{k=1}^K z_{ik} = 1$, and
$p(\mathbf{z}_i) = \prod_{k=1}^K \pi_{k}^{z_{ik}}$
That is, $\mathbf{z}_i$ is a one-hot vector: if $\mathbf{x}_i$ is generated from the $k$-th Gaussian distribution, then $z_{ik} = 1$.
Thus, note that

$p(\mathbf{x}_i | \mathbf{z}_i) = \prod_{k=1}^K N(\mathbf{x}_i; \mathbf{\mu}_k, \mathbf{\Sigma}_k)^{z_{ik}}.$
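The generative story above (draw a one-hot $\mathbf{z}_i$ from $\text{Categorical}(\pi)$, then draw $\mathbf{x}_i$ from the chosen Gaussian) can be sketched in a few lines of NumPy; `sample_gmm` and the parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(N, pi, mus, Sigmas):
    """Draw N points from the GMM; returns the data and the latent component indices."""
    K = len(pi)
    ks = rng.choice(K, size=N, p=pi)  # which Gaussian generated each point (the latent z_i)
    xs = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return xs, ks

pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
Sigmas = np.stack([np.eye(2)] * 3)
X, labels = sample_gmm(500, pi, mus, Sigmas)
```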
We now consider the joint probability of all data points $\mathbf{x} = \{ \mathbf{x}_1, \cdots, \mathbf{x}_N \}$ and $\mathbf{z} = \{ \mathbf{z}_1, \cdots, \mathbf{z}_N \}$:
\[\begin{align*} p(\mathbf{x}, \mathbf{z})& = \prod_{i=1}^N p(\mathbf{x}_i | \mathbf{z}_i) p(\mathbf{z}_i) \\ &= \prod_{i=1}^N \prod_{k=1}^K \pi_{k}^{z_{ik}} N (\mathbf{x}_i | \mathbf{\mu}_k, \mathbf{\Sigma}_k)^{z_{ik}}. \end{align*}\]It is important to get $p(z_{ik} = 1 | \mathbf{x}_i)$. Observe that, by Bayes' rule,

$p(z_{ik} = 1 | \mathbf{x}_i) = \frac {\pi_k N(\mathbf{x}_i; \mathbf{\mu}_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i; \mathbf{\mu}_l, \mathbf{\Sigma}_l)},$

which is the posterior probability (responsibility) that $\mathbf{x}_i$ was generated by the $k$-th Gaussian.
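A small sketch of this responsibility computation, assuming SciPy is available; `responsibilities` is an illustrative name, not part of the original derivation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """Return an (N, K) matrix whose (i, k) entry is p(z_ik = 1 | x_i)."""
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])  # pi_k N(x_i; mu_k, Sigma_k)
        for k in range(len(pi))
    ])
    return dens / dens.sum(axis=1, keepdims=True)  # normalize over the K components
```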
In the GMM, we have the following parameters to learn: $\mathbf{\theta} = \{ \pi_k, \mathbf{\mu}_k, \mathbf{\Sigma}_k \}_{k=1}^K$.
To this end, we consider the log likelihood of $\mathbf{x}$ and find the MLE:

$\text{log}\, p(\mathbf{x}; \mathbf{\theta}) = \sum_{i=1}^N \text{log} \Big( \sum_{k=1}^K \pi_k N(\mathbf{x}_i; \mathbf{\mu}_k, \mathbf{\Sigma}_k) \Big).$
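In practice the log likelihood is computed with a log-sum-exp to avoid underflow of the densities; here is a sketch assuming SciPy, with `log_likelihood` as an illustrative name.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mus, Sigmas):
    """sum_i log( sum_k pi_k N(x_i; mu_k, Sigma_k) ), computed stably."""
    log_terms = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(len(pi))
    ])
    return logsumexp(log_terms, axis=1).sum()
```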
Training the model
We are now ready to learn the model. The learning process of the GMM is very similar to that of $k$-means clustering. That is, if we know latent variables $\mathbf{z}_i$, then we learn the Gaussian parameters $\mu_k$ and $\mathbf{\Sigma}_k$. In contrast, if we know the Gaussian parameters $\mu_k$ and $\mathbf{\Sigma}_k$, then we learn the latent variables $\mathbf{z}_i$.
Let $J (\mathbf{\theta}) = \sum_{i=1}^N \text{log}(\sum_{k=1}^K \pi_k N(\mathbf{x}_i; \mu_k, \mathbf{\Sigma}_k)).$
Finding the optimal $\mathbf{\mu_k}$
We have

$\nabla_{\mathbf{\mu}_k} J (\mathbf{\theta}) = \sum_{i=1}^N \frac {\pi_k N(\mathbf{x}_i; \mathbf{\mu}_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i; \mathbf{\mu}_l, \mathbf{\Sigma}_l)} \mathbf{\Sigma}_{k}^{-1} (\mathbf{x}_i - \mathbf{\mu}_k) = \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i)\, \mathbf{\Sigma}_{k}^{-1} (\mathbf{x}_i - \mathbf{\mu}_k).$

By letting $\nabla_{\mathbf{\mu}_k} J (\mathbf{\theta}) = \mathbf{0}$,

$\sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i) (\mathbf{x}_i - \mathbf{\mu}_k) = \mathbf{0}.$

Then,

$\mathbf{\mu}_k = \sum_{i=1}^N \frac {p(z_{ik} = 1 | \mathbf{x}_i)} {\sum_{j=1}^N p(z_{jk} = 1 | \mathbf{x}_j)} \mathbf{x}_i.$
The denominator indicates the effective number of data points in cluster $k$.
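In code, the $\mathbf{\mu}_k$ update is a responsibility-weighted average of the data. The sketch below assumes `resp` is the $(N, K)$ matrix of responsibilities $p(z_{ik} = 1 | \mathbf{x}_i)$ from the earlier snippet; the function name is illustrative.

```python
import numpy as np

def update_means(X, resp):
    """mu_k = sum_i p(z_ik=1|x_i) x_i / sum_j p(z_jk=1|x_j), for all k at once."""
    Nk = resp.sum(axis=0)              # effective number of points per cluster
    return (resp.T @ X) / Nk[:, None]  # shape (K, d)
```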
Finding the optimal $\mathbf{\Sigma_k}$
Similarly,
\[\begin{align*} \frac {\partial}{\partial \mathbf{\Sigma}_k} J (\mathbf{\theta}) = &-\frac{1}{2} \sum_{i=1}^N \frac {\pi_k N(\mathbf{x}_i | \mu_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mu_l, \mathbf{\Sigma}_l)} \mathbf{\Sigma}_{k}^{-1} \\ &+ \frac{1}{2} \sum_{i=1}^N \frac {\pi_k N(\mathbf{x}_i | \mu_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mu_l, \mathbf{\Sigma}_l)} \mathbf{\Sigma}_{k}^{-1} (\mathbf{x}_i - \mu_k) (\mathbf{x}_i - \mu_k)^T \mathbf{\Sigma}_{k}^{-1} \\ = &-\frac{1}{2} \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i) \mathbf{\Sigma}_{k}^{-1} \\ &+ \frac{1}{2} \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i) \mathbf{\Sigma}_{k}^{-1} (\mathbf{x}_i - \mu_k) (\mathbf{x}_i - \mu_k)^T \mathbf{\Sigma}_{k}^{-1} = \mathbf{0}, \end{align*}\]we get $\mathbf{\Sigma}_k = \sum_{i=1}^N \frac {p(z_{ik} = 1 | \mathbf{x}_i)} {\sum_{j=1}^N p(z_{jk} = 1 | \mathbf{x}_j)} (\mathbf{x}_i - \mu_k) (\mathbf{x}_i - \mu_k)^T$.
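As with the means, the $\mathbf{\Sigma}_k$ update is a responsibility-weighted average of outer products around $\mathbf{\mu}_k$. Here is a sketch with the illustrative name `update_covariances`, again assuming `resp` holds the responsibilities.

```python
import numpy as np

def update_covariances(X, resp, mus):
    """Sigma_k = sum_i p(z_ik=1|x_i) (x_i - mu_k)(x_i - mu_k)^T / sum_j p(z_jk=1|x_j)."""
    N, d = X.shape
    K = resp.shape[1]
    Nk = resp.sum(axis=0)
    Sigmas = np.empty((K, d, d))
    for k in range(K):
        diff = X - mus[k]                                    # (N, d)
        Sigmas[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    return Sigmas
```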
For the derivation above, we need some remarks:
$\mathbf{Remark\ 1.}$ Derivatives of determinant
Let $\text{det}(\mathbf{A})$ be the determinant of matrix $\mathbf{A} = (a_{ij})$ of size $n$, and $\mathbf{C} = (c_{ij})$ be the cofactor matrix of $\mathbf{A}$. We know that $\text{det}(\mathbf{A}) = \sum_{i=1}^n a_{ij} c_{ij}$ for any $j$ ( $= \sum_{j=1}^n a_{ij} c_{ij}$ for any $i$), and $\mathbf{A}^{-1} = \frac{1}{\text{det}(\mathbf{A})} \text{adj} (\mathbf{A})$, where $\text{adj} (\mathbf{A}) = \mathbf{C}^\top$. Then, the derivatives of the determinant are given by

$\frac{\partial\, \text{det}(\mathbf{A})}{\partial a_{ij}} = c_{ij}, \quad \text{i.e.,} \quad \frac{\partial\, \text{det}(\mathbf{A})}{\partial \mathbf{A}} = \mathbf{C} = \text{det}(\mathbf{A}) (\mathbf{A}^{-1})^\top.$
$\mathbf{Remark\ 2.}$ Derivatives of inverse matrix
From $\mathbf{A} \mathbf{A}^{-1} = \mathbf{I} \to \frac {d \mathbf{A}^{-1}} {d \theta} = - \mathbf{A}^{-1} \frac {d \mathbf{A}} {d \theta} \mathbf{A}^{-1}$, we see that

$\frac{\partial (\mathbf{A}^{-1})_{ij}}{\partial a_{kl}} = - (\mathbf{A}^{-1})_{ik} (\mathbf{A}^{-1})_{lj}.$
$\mathbf{Remark\ 3.}$ Derivatives of $f(\mathbf{A}) = \mathbf{y}^\top \mathbf{A}^{-1} \mathbf{y}$
$\mathbf{y}^\top \mathbf{A}^{-1} \mathbf{y} = \sum_i \sum_j y_i y_j (\mathbf{A}^{-1})_{ij}$.
i.e.,

$\frac{\partial f}{\partial a_{kl}} = - \sum_i \sum_j y_i y_j (\mathbf{A}^{-1})_{ik} (\mathbf{A}^{-1})_{lj}, \quad \text{so} \quad \frac{\partial f}{\partial \mathbf{A}} = - \mathbf{A}^{-\top} \mathbf{y} \mathbf{y}^\top \mathbf{A}^{-\top},$

which equals $- \mathbf{A}^{-1} \mathbf{y} \mathbf{y}^\top \mathbf{A}^{-1}$ for symmetric $\mathbf{A}$ such as $\mathbf{\Sigma}_k$.
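These identities are easy to check numerically. The sketch below (not part of the original notes) compares the closed form of Remark 3 against central finite differences on a random invertible matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n)) + n * np.eye(n)  # well-conditioned, invertible
y = rng.normal(size=n)

f = lambda M: y @ np.linalg.solve(M, y)      # f(A) = y^T A^{-1} y

# Remark 3: df/dA = -A^{-T} y y^T A^{-T}
analytic = -np.linalg.inv(A).T @ np.outer(y, y) @ np.linalg.inv(A).T

eps = 1e-6
numeric = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        numeric[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))    # should be on the order of 1e-8
```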
Finding the optimal $\pi_k$
Now, we are ready to solve the optimization problem of $\pi_k$. Considering $\sum_{k=1}^K \pi_k = 1$ and a Lagrange multiplier $\lambda$, we formulate

$L(\mathbf{\theta}) = J(\mathbf{\theta}) + \lambda \Big( \sum_{k=1}^K \pi_k - 1 \Big).$
From $\frac {\partial} {\partial \pi_k} L(\mathbf{\theta}) = 0$, i.e.,

$\sum_{i=1}^N \frac {N(\mathbf{x}_i | \mu_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mu_l, \mathbf{\Sigma}_l)} + \lambda = 0,$

we get, by multiplying both sides by $\pi_k$,

$\sum_{i=1}^N \frac {\pi_k N(\mathbf{x}_i | \mu_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mu_l, \mathbf{\Sigma}_l)} + \lambda \pi_k = 0.$

Recall that

$p(z_{ik} = 1 | \mathbf{x}_i) = \frac {\pi_k N(\mathbf{x}_i | \mu_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i | \mu_l, \mathbf{\Sigma}_l)}.$

So, it follows that $\pi_k = -\frac{1}{\lambda} \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i)$.
From $\sum_{k=1}^K \pi_k = 1$, $\lambda$ satisfies
\[\begin{align*} \lambda &= - \sum_{k=1}^K \sum_{i=1}^N p(z_{ik} = 1 | \mathbf{x}_i) \\ &= - \sum_{i=1}^N \sum_{k=1}^K p (z_{ik} = 1 | \mathbf{x}_i) = - N. \end{align*}\]$\therefore \pi_k = \frac{1}{N} \sum_{i=1}^N p(z_{ik}=1 | \mathbf{x}_i).$
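The $\pi_k$ update is simply the average responsibility of component $k$; a one-line sketch (with the illustrative name `update_weights`):

```python
import numpy as np

def update_weights(resp):
    """pi_k = (1/N) sum_i p(z_ik = 1 | x_i), where resp is the (N, K) responsibility matrix."""
    return resp.mean(axis=0)
```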
EM Algorithm
| Parameter | MLE |
| --- | --- |
| $\mathbf{\mu}_k$ | $\sum_{i=1}^N \frac {p(z_{ik} = 1 \mid \mathbf{x}_i)} {\sum_{j=1}^N p(z_{jk} = 1 \mid \mathbf{x}_j)} \mathbf{x}_i$ |
| $\mathbf{\Sigma}_k$ | $\sum_{i=1}^N \frac {p(z_{ik} = 1 \mid \mathbf{x}_i)} {\sum_{j=1}^N p(z_{jk} = 1 \mid \mathbf{x}_j)} (\mathbf{x}_i - \mu_k) (\mathbf{x}_i - \mu_k)^T$ |
| $\pi_k$ | $\frac{1}{N} \sum_{i=1}^N p(z_{ik}=1 \mid \mathbf{x}_i)$ |
Note that all the above solutions depend on $p (z_{ik} = 1 | \mathbf{x}_i)$. From this, we have the following EM Algorithm:
- Initialization: $K$-means clustering is often used to initialize the parameters $\mathbf{\theta}$.
- E step: Using the current parameters $\mathbf{\theta}$, compute
  $p(z_{ik} = 1 | \mathbf{x}_i) = \frac {\pi_k N(\mathbf{x}_i; \mu_k, \mathbf{\Sigma}_k)} {\sum_{l=1}^K \pi_l N(\mathbf{x}_i; \mu_l, \mathbf{\Sigma}_l)}.$
- M step: Update all parameters $\mathbf{\theta}$:
  $\mu_k = \sum_{i=1}^N \frac {p(z_{ik} = 1 | \mathbf{x}_i)} {\sum_{j=1}^N p(z_{jk} = 1 | \mathbf{x}_j)} \mathbf{x}_i$
  $\mathbf{\Sigma}_k = \sum_{i=1}^N \frac {p(z_{ik} = 1 | \mathbf{x}_i)} {\sum_{j=1}^N p(z_{jk} = 1 | \mathbf{x}_j)} (\mathbf{x}_i - \mu_k) (\mathbf{x}_i - \mu_k)^T$
  $\pi_k = \frac{1}{N} \sum_{i=1}^N p(z_{ik}=1 | \mathbf{x}_i)$
Repeat E step and M step until convergence.
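Putting the three steps together, here is a minimal end-to-end EM sketch under the assumptions above; `fit_gmm`, the random-mean initialization, and the small ridge added to each covariance are illustrative choices, and in practice a library routine such as scikit-learn's `GaussianMixture` would normally be used.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization (random data points as means; k-means is the better choice in practice).
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: responsibilities p(z_ik = 1 | x_i).
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: the closed-form updates derived above.
        Nk = resp.sum(axis=0)
        mus = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / N
    return pi, mus, Sigmas
```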