
$K$-Means Clustering

Clustering

Clustering is the process of grouping a set of observations into classes of similar observations. A good clustering should have:

  • High intra-class similarity
  • Low inter-class similarity

Clustering is the most common form of unsupervised learning. Note that what counts as a good clustering is subjective.



$K$-means clustering algorithm

$K$-means clustering aims to partition $n$ observations into $K$ clusters in which each observation belongs to the cluster with the nearest mean. It is an unsupervised learning algorithm: it assigns a cluster label to each observation, and these labels are derived exclusively from the observations themselves. It is described as follows:

For a given set of observations

$\{ \mathbf{x}_i | \mathbf{x}_i = (x_{i1}, ..., x_{ip}), 1 \leq i \leq n \},$


if centroids of $K$ clusters are denoted by $\mathbf{\mu} = \{ \mathbf{\mu}_1, \cdots, \mathbf{\mu}_k \}$, and partitions are denoted by $\mathbf{C} = \{ C_1, \cdots, C_k \}$, then $K$-means clustering aims to partition the $n$ observations into $k \, (\leq n)$ sets $C_1, \cdots, C_k$ so as to minimize the within-cluster sum of squares (WCSS) (i.e., variance).

Formally, the objective is to find

$\underset{\mathbf{C}, \mathbf{\mu}}{\text{argmin}} \; J(\mathbf{C}, \mathbf{\mu}), \quad \text{where} \quad J(\mathbf{C}, \mathbf{\mu}) := \sum_{i=1}^k \sum_{\mathbf{x}_j \in C_i} || \mathbf{x}_j - \mathbf{\mu}_i ||^2$


$\mathbf{Algorithm.}$ $K$-means clustering algorithm
(1) Let $t = 0$ and start with initial guesses $\mu_{1}^{(0)}, \cdots, \mu_{k}^{(0)}$ for cluster centers (centroids).
(2) For each observation, find the closest cluster centroid.

$C_{i}^{(t)} = \{ \mathbf{x}_l : ||\mathbf{x}_l - \mu_{i}^{(t)} ||^2 \leq ||\mathbf{x}_l - \mu_{j}^{(t)} ||^2, \forall 1 \leq j \leq k \}$


(Find $\mathbf{C}$ to minimize $J(\mathbf{C}, \mathbf{\mu})$ while fixing $\mathbf{\mu}$)

(3) Replace each centroid by the average of observations in its partition.

$\mu_{i}^{(t+1)} = \frac {1} {|C_{i}^{(t)}|} \sum_{\mathbf{x}_j \in C_{i}^{(t)}} \mathbf{x}_j$


(Find $\mathbf{\mu}$ to minimize $J(\mathbf{C}, \mathbf{\mu})$ while fixing $\mathbf{C}$)

(4) Set $t \leftarrow t + 1$ and iterate steps (2) and (3) until convergence (e.g., until the assignments no longer change).
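This iteration translates almost directly into code. Below is a minimal NumPy sketch of the algorithm; the function name, the random-subset initialization, and the convergence test are my own choices, not something prescribed by the text:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the (n, p) array X into k partitions; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # (1) Initialize the centroids, here as a random subset of k data points.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (2) Assign every observation to its closest centroid.
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # shape (n, k)
        labels = dist.argmin(axis=1)
        # (3) Replace each centroid by the mean of the observations assigned to it.
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        # (4) Stop once the centroids no longer move.
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu

labels, centers = kmeans(np.random.rand(300, 2), k=3)
```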

$K$-means clustering as Expectation Maximization

Recall that we defined the objective function

$J(\mathbf{C}, \mathbf{\mu}) := \sum_{i=1}^k \sum_{\mathbf{x}_j \in C_i} || \mathbf{x}_j - \mathbf{\mu}_i ||^2$


But it is hard to find a solution directly. So, for each data point $\mathbf{x}_j$, introduce a corresponding set of binary indicator variables $r_{ij}$: let $r_{ij} = 1$ if $\mathbf{x}_j \in C_i$ and $r_{ij} = 0$ if $\mathbf{x}_j \notin C_i$. We then have

$J(\mathbf{C}, \mathbf{\mu}) = \sum_{i=1}^k \sum_{\mathbf{x}_j \in C_i} || \mathbf{x}_j - \mathbf{\mu}_i ||^2$
$= \sum_{i=1}^k \sum_{j=1}^n r_{ij} || \mathbf{x}_j - \mathbf{\mu}_i ||^2$
$= \sum_{j=1}^n \sum_{i=1}^k r_{ij} || \mathbf{x}_j - \mathbf{\mu}_i ||^2$


We want to find $\mathbf{C}$ (equivalently, $r_{ij}$) and $\mathbf{\mu}$ that minimize this $J(\mathbf{C}, \mathbf{\mu})$.

Step 1: Find $r_{ij}$ to minimize $J(\mathbf{C}, \mathbf{\mu})$ while fixing $\mathbf{\mu}$. (Expectation)

\[\begin{align*} r_{ij} = & \begin{cases} 1 & \text{if } i = \underset{l}{\text{arg min }} || \mathbf{x_j} - \mathbf{\mu_l} ||^2 \\ 0 & \text{otherwise.} \end{cases} \end{align*}\]

Then, compute the expectation for all samples:

$J(\mathbf{C}, \mathbf{\mu}) = \sum_{j=1}^n \sum_{i=1}^k r_{ij} || \mathbf{x}_j - \mathbf{\mu}_i ||^2$
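Concretely, the E step is a single argmin over the point-to-centroid distances. A small NumPy sketch (the array names and data are purely illustrative):

```python
import numpy as np

X = np.random.rand(6, 2)    # n = 6 observations with p = 2 features
mu = np.random.rand(3, 2)   # k = 3 current centroids

# Squared distances ||x_j - mu_i||^2, arranged as an (n, k) matrix.
sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)

# r[j, i] = 1 exactly when centroid i is closest to x_j.
r = np.zeros_like(sq_dist)
r[np.arange(len(X)), sq_dist.argmin(axis=1)] = 1

J = (r * sq_dist).sum()     # value of the objective under this assignment
```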


Step 2: Find $\mathbf{\mu}$ to minimize the expectation while fixing $r_{ij}$. (Maximization / Minimization)

Since $J$ is a quadratic function of $\mathbf{\mu}_i$, setting its derivative with respect to $\mathbf{\mu}_i$ to zero gives

$2 \sum_{j=1}^n r_{ij} (\mathbf{x}_j - \mathbf{\mu}_i) = 0$
$\Rightarrow \; \mathbf{\mu}_i = \frac {\sum_{j=1}^n r_{ij} \mathbf{x}_j} {\sum_{j=1}^n r_{ij}}$


Note that the denominator in this expression is equal to the number of points assigned to cluster $i$, so this result has a simple interpretation: set $\mathbf{\mu}_i$ equal to the mean of all the data points $\mathbf{x}_j$ assigned to cluster $i$, which corresponds to the heuristic algorithm introduced above!
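In matrix form, this M step is just a weighted average of the data rows. A small NumPy sketch, with `r` the $(n, k)$ one-hot assignment matrix from the previous step (names illustrative):

```python
import numpy as np

X = np.random.rand(6, 2)                          # observations
r = np.eye(3)[np.random.randint(0, 3, size=6)]    # (n, k) one-hot assignments

counts = r.sum(axis=0)            # number of points assigned to each cluster
counts[counts == 0] = 1           # guard against empty clusters
mu = (r.T @ X) / counts[:, None]  # mu_i = sum_j r_ij x_j / sum_j r_ij
```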

Drawbacks

A direct implementation of the $K$-means algorithm as discussed here can be relatively slow, because in each E step it is necessary to compute the Euclidean distance between every data point and every cluster mean. Moreover, Euclidean distance is used as the metric and variance as the measure of cluster scatter, which limits the applicability of the algorithm (it would be inappropriate, for instance, when some or all of the variables represent categorical labels). It also makes the determination of the cluster means non-robust to outliers.

Also, the value of $k$ must be given as an input parameter, and the algorithm may converge to a local minimum of $J$ rather than the global minimum.

So, many alternative algorithms and improvements have been introduced, for example the $K$-medoids algorithm.
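In practice, the local-minimum issue is usually mitigated by running the algorithm from several random initializations and keeping the run with the smallest WCSS; scikit-learn's `KMeans` does exactly this through its `n_init` parameter. A minimal usage sketch (the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)

# n_clusters (= k) must be supplied; n_init restarts reduce the risk of a poor local minimum.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)  # the WCSS of the best of the 10 runs
```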

$\mathbf{Example.}$

(Reference: Christopher M. Bishop, Pattern Recognition and Machine Learning)

The $K$-means algorithm is illustrated using the Old Faithful data set in the figure:

(figures: successive E steps and M steps of $K$-means on the standardized Old Faithful data, followed by a plot of the cost function $J$ after each step)


For the purposes of this example, we have made a linear re-scaling of the data (known as standardizing), such that each of the variables has zero mean and unit standard deviation. For this example, we have chosen $K = 2$, and so the assignment of each data point to the nearest cluster center is equivalent to classifying the data points according to which side of the perpendicular bisector of the two cluster centers they lie on.
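Standardizing here is a per-variable operation: subtract each variable's mean and divide by its standard deviation. A short NumPy sketch (the array `X` below is only a stand-in for the raw data):

```python
import numpy as np

X = np.random.rand(272, 2) * [5.0, 90.0]        # stand-in for raw two-variable data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)    # each column: zero mean, unit standard deviation
```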

A plot of the cost function $J$ for the Old Faithful example is shown in the last figure, evaluated after each step; blue points indicate E steps and red points indicate M steps.

Note that we have deliberately chosen poor initial values for the cluster centers so that the algorithm takes several steps before convergence. In practice, a better initialization procedure would be to choose the cluster centers $\mu_k$ to be equal to a random subset of $K$ data points.
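That initialization is itself a one-liner; a sketch assuming the data sit in an $(n, p)$ array `X`:

```python
import numpy as np

X = np.random.rand(100, 2)
K = 2
rng = np.random.default_rng(0)
mu_init = X[rng.choice(len(X), size=K, replace=False)]  # K distinct data points as initial centroids
```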


$\mathbf{Example.}$ Image segmentation and compression

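Following Bishop's example, each pixel's RGB value is treated as a data point, $K$-means is run on these points, and every pixel is then replaced by the centroid colour of its cluster, which both segments the image and compresses it to $K$ distinct colours. A hedged sketch using Pillow and scikit-learn (the filename and the value of $K$ are placeholders):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("example.png").convert("RGB"), dtype=float) / 255.0
pixels = img.reshape(-1, 3)            # every pixel becomes an observation in RGB space

K = 8
km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(pixels)

# Replace each pixel with its cluster centroid: only K distinct colours remain.
compressed = km.cluster_centers_[km.labels_].reshape(img.shape)
Image.fromarray((compressed * 255).astype(np.uint8)).save("example_k8.png")
```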



Reference

[1] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer 2006.
