
Principal Component Analysis

Introduction

Principal Component Analysis (PCA) is a technique that is widely used for applications such as dimensionality reduction, lossy data compression, feature extraction and data visualization.

There are two commonly used definitions of PCA that give rise to the same algorithm: the maximum variance formulation, which seeks the orthogonal projection of the data onto a lower-dimensional linear space (known as the principal subspace) that maximizes the variance of the projected data, and the minimum-error formulation, which seeks the linear projection that minimizes the average projection cost.

The minimum-error formulation seems natural and intuitive. But why would we instead look for the subspace that maximizes the variance of the projected data? For intuition, consider projecting a 3D scene onto a 2D picture. The projection loses the information of one dimension, and it can also lose data points because some of them may overlap after the projection.

image

What we want is a projected subspace of the original space that preserves as much information as possible. Maximizing the variance achieves this: we keep the directions along which the dataset varies the most. [2]

One more remark before we start: the two definitions may look like complete opposites, but in the end they give the identical result. In fact, the two definitions can be shown to be equivalent! [3] So the two formulations produce the same mathematical solution.

Maximum Variance Formulation

Consider a data set of observations $\{ \mathbf{x}_n \}$ where $n = 1, 2, \cdots, N$ and each $\mathbf{x}_n$ is a $D$-dimensional vector.

The main idea of PCA is to project the data onto a space of lower dimension $M (M < D)$, which is called the principal subspace. The principal subspace is determined by maximizing the variance of the projected data.

image


Let $\mathbf{u}_i$, $1 \leq i \leq M$ be the basis vectors of the principal subspace.
Each $\mathbf{u_i}$ is a $D \times 1$ vector and satisfies $\mathbf{u_i}^{\top}\mathbf{u_i} = 1.$
Let $\mathbf{\bar{x}}, \mathbf{S}$ be the sample mean and covariance matrix of $\{ \mathbf{x}_n \}$, i.e.,

\[\begin{align*} &\mathbf{\bar{x}} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \\ &\mathbf{S} = \frac{1}{N} \sum_{n=1}^N (\mathbf{x}_n - \mathbf{\bar{x}}) (\mathbf{x}_n - \mathbf{\bar{x}})^{\top} \end{align*}\]

Note that $\mathbf{S}$ is symmetric. Let $\mathbf{z}_n = \mathbf{x}_n - \mathbf{\bar{x}}$.
For a unit basis vector $\mathbf{u}$, consider the scalar projection $\mathbf{x}_n^\top \mathbf{u}$ of each data point onto $\mathbf{u}$. The variance of the projected data is given by

\[\begin{align*} \frac{1}{N} \sum_{n=1}^N (\mathbf{x}_{n}^{\top} \mathbf{u} - \mathbf{\bar{x}}^{\top} \mathbf{u})^2 &= \frac{1}{N} \sum_{n=1}^N (\mathbf{z}_n^{\top} \mathbf{u})^2 \\ &= \frac{1}{N} \sum_{n=1}^N \mathbf{u}^{\top} \mathbf{z}_n \mathbf{z}_{n}^{\top} \mathbf{u} = \mathbf{u}^{\top} \mathbf{S} \mathbf{u} \end{align*}\]

image
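As a quick numerical check of this identity, here is a minimal NumPy sketch (the data and the direction $\mathbf{u}$ below are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))           # N = 500 observations in D = 3 dimensions
u = np.array([1.0, 2.0, -1.0])
u /= np.linalg.norm(u)                  # unit-length direction

x_bar = X.mean(axis=0)
Z = X - x_bar                           # centered data z_n = x_n - x_bar
S = Z.T @ Z / len(X)                    # sample covariance with the 1/N convention

proj_var = np.mean((Z @ u) ** 2)        # (1/N) * sum_n (z_n^T u)^2
print(np.isclose(proj_var, u @ S @ u))  # True: both equal u^T S u
```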

Then, we formulate an optimization problem as follows:

\[\begin{align*} &\underset{\mathbf{u}} {\text{max }} \mathbf{u}^\top \mathbf{Su} \\ &\text{s.t. } \mathbf{u}^\top \mathbf{u} = 1. \end{align*}\]

Using the Lagrange multiplier $\lambda$, we turn this into the unconstrained maximization of

$J(\mathbf{u}, \lambda) = \mathbf{u}^\top \mathbf{Su} + \lambda (1 - \mathbf{u}^\top \mathbf{u}).$


From $\nabla_{\mathbf{u}} J = 2 (\mathbf{Su} - \lambda \mathbf{u}) = \mathbf{0}$, we obtain $\mathbf{Su} = \lambda \mathbf{u}$.

Moreover, by left-multiplying by $\mathbf{u}^\top$, we also obtain $\mathbf{u}^\top \mathbf{Su} = \lambda \mathbf{u}^\top \mathbf{u} = \lambda$, so the variance of the projected data equals the eigenvalue. Hence the first principal component (the direction of largest variance) is the eigenvector of $\mathbf{S}$ with the largest eigenvalue.

We can define additional principal components in an incremental fashion by choosing each new direction to be that which maximizes the variance of the projected data amongst all possible directions orthogonal to those already considered. The principal subspace of dimension $M$ can be obtained with $M$ eigenvectors of the data covariance matrix $\mathbf{S}$ corresponding to the largest $M$ eigenvalues of $\mathbf{S}$.
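In other words, the whole procedure boils down to an eigendecomposition of $\mathbf{S}$. Here is a minimal NumPy sketch of that procedure (the function name and array shapes are my own choices, not from the text):

```python
import numpy as np

def pca(X, M):
    """Project the rows of X (an N x D array) onto the M-dimensional principal subspace."""
    x_bar = X.mean(axis=0)
    Z = X - x_bar                            # centered data z_n = x_n - x_bar
    S = Z.T @ Z / len(X)                     # D x D sample covariance (1/N convention)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]        # re-sort eigenvalues in descending order
    U = eigvecs[:, order[:M]]                # u_1, ..., u_M: the top-M eigenvectors (D x M)
    return Z @ U, U, eigvals[order]          # projections (N x M), basis, full spectrum

# toy usage
X = np.random.default_rng(1).normal(size=(200, 5))
X_proj, U, spectrum = pca(X, M=2)
print(X_proj.shape, U.shape)                 # (200, 2) (5, 2)
```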

Minimum-error Formulation

We now consider an alternative formulation of PCA based on projection error minimization. Let $\mathbf{u}_i$, $i = 1, 2, \cdots, D$ form an orthonormal basis of the $D$-dimensional space, so that $\mathbf{u}_i^\top \mathbf{u}_j = \delta_{ij}$. Using this orthonormal basis, $\mathbf{x}_n$ can be expressed as

$\mathbf{x}_n = \sum_{i=1}^D \alpha_{ni} \mathbf{u}_i, \quad \text{where } \alpha_{ni} = \mathbf{x}_n^\top \mathbf{u}_i.$

Now our objective is to approximate $\mathbf{x_n}$ using a representation in a lower dimension $M (M < D)$ as follows:

$\widetilde{\mathbf{x}}_n = \sum_{i=1}^M z_{ni} \mathbf{u}_i + \sum_{i=M+1}^D b_i \mathbf{u_i}$.


Note that $b_i$, $i = M + 1, \cdots, D$ are the same for all data points $\mathbf{x}_n$. The objective function is given by

$J = \frac{1}{N} \sum_{n=1}^N || \mathbf{x}_n - \widetilde{\mathbf{x}}_n ||^2$

Observe that

$\mathbf{x}_n - \widetilde{\mathbf{x}}_n = \sum_{i=1}^M (\alpha_{ni} - z_{ni}) \mathbf{u}_i + \sum_{i=M+1}^D (\alpha_{ni} - b_{i}) \mathbf{u}_i$


$|| \mathbf{x}_n - \widetilde{\mathbf{x}}_n ||^2 = \sum_{i=1}^M (\alpha_{ni} - z_{ni})^2 + \sum_{i=M+1}^D (\alpha_{ni} - b_{i})^2$


From $J = \frac{1}{N} \sum_{n=1}^N \left( \sum_{i=1}^M (\alpha_{ni} - z_{ni})^2 + \sum_{i=M+1}^D (\alpha_{ni} - b_{i})^2 \right)$, by differentiating $J$ with respect to $z_{ni}$ and $b_i$ and setting the derivatives to zero, we obtain

\[\begin{align*} &z_{ni} = \alpha_{ni} = \mathbf{x}_n^\top \mathbf{u}_i \\ &b_i = \frac{1}{N} \sum_{n=1}^N \alpha_{ni} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n^\top \mathbf{u}_i = \bar{\mathbf{x}}^\top \mathbf{u}_i \end{align*}\]

Thus, $J$ can be expressed as

\[\begin{align*} J &= \frac{1}{N} \sum_{n=1}^N \sum_{i = M+1}^D (\mathbf{x}_n^\top \mathbf{u}_i - \bar{\mathbf{x}}^\top \mathbf{u}_i)^2 \\ &= \frac{1}{N} \sum_{n=1}^N \sum_{i = M+1}^D \left((\mathbf{x}_n - \bar{\mathbf{x}})^\top \mathbf{u}_i\right)^2 = \frac{1}{N} \sum_{n=1}^N \sum_{i = M+1}^D (\mathbf{z}_n^\top \mathbf{u}_i)^2 \\ &= \frac{1}{N} \sum_{n=1}^N \sum_{i = M+1}^D \mathbf{u}_i^\top \mathbf{z}_n \mathbf{z}_n^\top \mathbf{u}_i = \sum_{i = M+1}^D \mathbf{u}_i^\top \left(\frac{1}{N} \sum_{n=1}^N \mathbf{z}_n \mathbf{z}_n^\top\right) \mathbf{u}_i \\ &= \sum_{i=M+1}^D \mathbf{u}_i^\top \mathbf{S} \mathbf{u}_i = \sum_{i=M+1}^D \lambda_i \end{align*}\]

To minimize $J$, the discarded directions $\mathbf{u}_{M+1}, \cdots, \mathbf{u}_D$ should be the eigenvectors of $\mathbf{S}$ corresponding to the $D - M$ smallest eigenvalues, so that $J$ equals the sum of those eigenvalues. In other words, for the approximation we keep the $M$ eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}$, which is the same principal subspace as in the maximum variance formulation.
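This identity is easy to verify numerically. The following NumPy check (on arbitrary toy data) confirms that the average reconstruction error equals the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))                # N = 300 points, D = 6 features
M = 2

x_bar = X.mean(axis=0)
Z = X - x_bar
S = Z.T @ Z / len(X)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
U_M = eigvecs[:, order[:M]]                  # kept directions u_1, ..., u_M

# x_tilde_n = x_bar + sum_{i<=M} ((x_n - x_bar)^T u_i) u_i
X_tilde = x_bar + (Z @ U_M) @ U_M.T
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

print(np.isclose(J, eigvals[order][M:].sum()))   # True: J = sum of the D - M smallest eigenvalues
```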

$\mathbf{Example.}$ The effectiveness of PCA

(Reference: hanson-ml [4])

PCA is widely used for data whose feature dimension is large. To see its effect, we consider two classification problems whose data cannot be visualized directly.

The first one is the well-known iris data. It has 4 features and 150 data points. For model testing, we split the dataset into a training set (112 points) and a test set (38 points). Here, we compare the test accuracies of logistic regression models which use
1) the whole data,
2) reduced data via PCA extraction (2 features), and
3) reduced data via heuristic selection (2 features), where we choose petal width and petal length.

Before the comparison, we visualize the two reduced datasets.

image

The upper figure is the visualization of the data reduced via heuristic selection, and the lower one is the visualization of the data reduced via PCA extraction. Even without a feature extraction method like PCA, the original data already contains two features that are important for classification.

image

From the previous figure, we can say that since the original data already has key features for classification, the effect of PCA is small. Moreover, as the training set is small, we cannot observe how much training time PCA can save.
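For reference, here is a minimal scikit-learn sketch of this comparison (the solver settings and random seed are assumptions, not necessarily the exact setup behind the figures above):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)                       # 150 points, 4 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=38, random_state=0)

models = {
    "all 4 features": LogisticRegression(max_iter=1000),
    "PCA (2 components)": make_pipeline(PCA(n_components=2),
                                        LogisticRegression(max_iter=1000)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))

# heuristic selection: petal length and petal width (columns 2 and 3)
clf = LogisticRegression(max_iter=1000).fit(X_tr[:, 2:4], y_tr)
print("petal length/width", clf.score(X_te[:, 2:4], y_te))
```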

We now deal with a dataset which is much bigger than the iris data: a hand-written digits dataset, which is a simplified version of the MNIST data. It has $8 \times 8 = 64$ features and 1797 data points. As with the iris data, we split the dataset into a training set (about 75%) and a test set (25%). For this data, we extract $6 \times 6 = 36$ features from the original 64 features. The following figure shows the result.

image

We again build two logistic regression models for classification, one using the original data and the other using the reduced data.

image

Here, we can check that by using PCA, we can significantly decrease the computation time for training the model. Moreover, as the data has more complicated features, a proper feature extraction can lead to better training.
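A sketch of this digits experiment, assuming scikit-learn's load_digits, PCA, and LogisticRegression; the accuracies and especially the timings will of course vary with the machine and solver settings:

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)                     # 1797 points, 64 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, model in {
    "64 raw features": LogisticRegression(max_iter=5000),
    "36 PCA components": make_pipeline(PCA(n_components=36),
                                       LogisticRegression(max_iter=5000)),
}.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    print(f"{name}: accuracy = {model.score(X_te, y_te):.3f}, "
          f"training time = {time.perf_counter() - start:.2f}s")
```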

$\mathbf{Example.}$ Face Decomposition and Approximation

We apply PCA to decompose a face dataset. Each face picture is a $64 \times 64$ matrix.

image

We approximate the face dataset with 2 principal components.

image

The average face and two principal components are given as follows.

image
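A sketch of such a decomposition, assuming scikit-learn's fetch_olivetti_faces (which provides $64 \times 64$ face images, flattened to 4096-dimensional vectors):

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()                    # 400 face images of shape 64 x 64
X = faces.data                                    # flattened to shape (400, 4096)

pca = PCA(n_components=2).fit(X)
mean_face = pca.mean_.reshape(64, 64)             # the "average face"
components = pca.components_.reshape(2, 64, 64)   # the two principal components as images

# approximate every face using only the 2 principal components
X_approx = pca.inverse_transform(pca.transform(X))
print(X_approx.shape)                             # (400, 4096)
```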

Kernel PCA

We now generalize the algorithm by replacing the scalar products with a nonlinear kernel. We can then perform standard PCA in the feature space, which implicitly defines a nonlinear principal component model in the original data space, as illustrated in the following figure from Kevin Murphy:

image

First consider nonlinearly mapping all data points $\mathbf{x}$ to $\phi (\mathbf{x})$ in a higher dimensional feature space $F$, where the covariance matrix can be estimated as

$\Sigma_{\phi} = \frac{1}{N} \sum_{n=1}^N \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^\top$


For simplicity, let's start from the case where the mapped data has zero mean in the feature space, i.e., $\frac{1}{N} \sum_{n=1}^N \phi(\mathbf{x}_n) = \mathbf{0}$.

1) Assuming the mapped data has zero mean.

Under this assumption, the eigenvalue equation $\Sigma_\phi \psi_i = \lambda_i \psi_i$ of the covariance matrix reads

$\left[\frac{1}{N} \sum_{n=1}^N \phi (\mathbf{x}_n) \phi (\mathbf{x}_n)^\top\right] \psi_i = \frac{1}{N} \sum_{n=1}^N \left(\phi (\mathbf{x}_n)^\top \psi_i\right) \phi (\mathbf{x}_n) = \lambda_i \psi_i$


where $\lambda_i$ and $\psi_i$ are an eigenvalue and an eigenvector of $\Sigma_\phi$. We see that the eigenvector $\psi_i$ is a linear combination of the $N$ mapped data points:

$\psi_i = \frac{1}{\lambda_i N} \sum_{n=1}^N \left(\phi (\mathbf{x}_n)^\top \psi_i\right) \phi (\mathbf{x}_n) = \sum_{n=1}^N a_{n}^{(i)} \phi (\mathbf{x}_n), \quad \text{where } a_n^{(i)} = \frac{1}{\lambda_i N} \phi(\mathbf{x}_n)^\top \psi_i$

Left-multiplying both sides of the equation above by $\phi (\mathbf{x}_m)^\top$, we get

$\phi (\mathbf{x}_m)^\top \psi_i (= \lambda_i N a_{m}^{(i)}) = \sum_{n=1}^N a_{n}^{(i)} \phi (\mathbf{x}_m)^\top \phi (\mathbf{x}_n) = \sum_{n=1}^N a_{n}^{(i)} k(\mathbf{x}_m, \mathbf{x}_n)$


where

$k(\mathbf{x}_m, \mathbf{x}_n) = \phi (\mathbf{x}_m)^\top \phi (\mathbf{x}_n), m, n = 1, \cdots, N$


is the kernel representing an inner product of two vectors in the space $F$. If we consider $m = 1, \cdots, N$, the above scalar equation becomes the $m$-th component of the following vector equation

$\lambda_i N \mathbf{a}_i = \mathbf{Ka}_i$


where $\mathbf{K} = (k(\mathbf{x_i}, \mathbf{x_j}))_{N \times N}$, and $\mathbf{a_i} = [a_1^{(i)} \cdots a_N^{(i)}]^\top (i = 1, \cdots, N)$ are the $N$ eigenvectors of $\mathbf{K}$, which can be obtained by solving the eigenvalue problem.

Any new data point $\mathbf{x}$ can now be mapped to $\phi(\mathbf{x})$ in the high-dimensional space $F$, where its $i$-th PCA component can be obtained as its projection onto the $i$-th eigenvector $\psi_i = \sum_{n=1}^N a_{n}^{(i)} \phi (\mathbf{x}_n)$:

$\phi(\mathbf{x})^\top \psi_i = \sum_{n=1}^N a_{n}^{(i)} \phi(\mathbf{x})^\top \phi(\mathbf{x}_n) = \sum_{n=1}^N a_{n}^{(i)} k(\mathbf{x}, \mathbf{x}_n)$

Note that we never need to know $\phi(\mathbf{x})$ explicitly here.

$\mathbf{Algorithm}$
1) Choose a kernel function $k(\mathbf{x}_m, \mathbf{x}_n)$.
2) Obtain $\mathbf{K}$ based on the training data $\mathbf{x}_n, n = 1, \cdots, N$.
3) Solve the eigenvalue problem of $\mathbf{K}$ to get $\lambda_i$ and $\mathbf{a}_i$.
4) For each given data point $\mathbf{x}$, obtain its principal components in the feature space:

$\phi(\mathbf{x})^\top \psi_i = \sum_{n=1}^N a_n^{(i)} k (\mathbf{x}, \mathbf{x}_n).$


5) Do whatever processing (e.g., feature selection, classification) in the feature space.
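A minimal NumPy sketch of steps 1)-4), assuming an RBF kernel with an arbitrary bandwidth gamma and, as in this part, zero-mean mapped data (centering is handled in the next part). The eigenvectors $\mathbf{a}_i$ are additionally rescaled so that the corresponding $\psi_i$ have unit norm:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_pca_fit(X, M, gamma=0.1):
    N = len(X)
    K = rbf_kernel(X, X, gamma)                    # step 2: N x N kernel matrix
    eigvals, A = np.linalg.eigh(K)                 # step 3: K a_i = (lambda_i N) a_i
    order = np.argsort(eigvals)[::-1][:M]
    eigvals, A = eigvals[order], A[:, order]
    A = A / np.sqrt(eigvals)                       # rescale so that each psi_i has unit norm
    return A, eigvals / N                          # columns a_i and feature-space eigenvalues

def kernel_pca_transform(X_train, A, X_new, gamma=0.1):
    # step 4: the i-th component of phi(x) is sum_n a_n^{(i)} k(x, x_n)
    return rbf_kernel(X_new, X_train, gamma) @ A

# toy usage
X = np.random.default_rng(3).normal(size=(100, 2))
A, lam = kernel_pca_fit(X, M=2)
print(kernel_pca_transform(X, A, X[:5]).shape)     # (5, 2)
```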

$\mathbf{Remark.}$ For $\mathbf{x}$, the approximation of $\phi(\mathbf{x})$ with $k$ PCA components is given by

$\phi(\mathbf{x}) \approx \sum_{i=1}^k \left(\phi(\mathbf{x})^\top \psi_i\right) \psi_i = \sum_{i=1}^k \left(\sum_{n=1}^N a_n^{(i)} k(\mathbf{x}, \mathbf{x}_n)\right)\psi_i$.

2) General Case

The above discussion is based on the assumption that the mapped data points have zero mean, in which case the covariance of two data points is the same as their correlation. For the general case, we consider the centered features

$\widetilde{\phi}(\mathbf{x}_m) = \phi(\mathbf{x}_m) - \frac{1}{N} \sum_{k=1}^N \phi(\mathbf{x}_k)$


Note that this centering cannot actually be carried out explicitly, as the mapping $\phi(\mathbf{x})$ is not given explicitly and $\phi(\mathbf{x}_k)$ is never available.

However, we can still obtain the kernel matrix $\widetilde{\mathbf{K}}$ for the zero-mean data points $\widetilde{\phi}(\mathbf{x})$:

\[\begin{align*} \widetilde{k}_{mn} &= \widetilde{\phi}(\mathbf{x}_m)^\top \widetilde{\phi}(\mathbf{x}_n) \\ &= \left[\phi(\mathbf{x}_m) - \frac{1}{N} \sum_{k=1}^N \phi(\mathbf{x}_k)\right]^\top \left[\phi(\mathbf{x}_n) - \frac{1}{N} \sum_{k=1}^N \phi(\mathbf{x}_k)\right] \\ &= \phi(\mathbf{x}_m)^\top \phi(\mathbf{x}_n) - \frac{1}{N} \sum_{k=1}^N \phi(\mathbf{x}_k)^\top \phi(\mathbf{x}_m) - \frac{1}{N} \sum_{k=1}^N \phi(\mathbf{x}_k)^\top \phi(\mathbf{x}_n) + \frac{1}{N^2} \sum_{k=1}^N \sum_{l=1}^N \phi(\mathbf{x}_k)^\top \phi(\mathbf{x}_l) \\ &= k_{mn} - \frac{1}{N} \sum_{k=1}^N k_{km} - \frac{1}{N} \sum_{k=1}^N k_{kn} + \frac{1}{N^2} \sum_{k=1}^N \sum_{l=1}^N k_{kl} \end{align*}\]

Let $\widetilde{\mathbf{K}} = (\widetilde{k}_{mn})$ and $\mathbf{1}_N$ be the $N \times N$ matrix with all elements equal to $\frac{1}{N}$. Then,

$\widetilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N \mathbf{K} - \mathbf{K} \mathbf{1}_N + \mathbf{1}_N \mathbf{K} \mathbf{1}_N$


and $\widetilde{\mathbf{K}}$ is used in place of $\mathbf{K}$ for kernel PCA.
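In code, this centering is a one-line matrix identity. A small NumPy sketch (where K is any precomputed $N \times N$ kernel matrix):

```python
import numpy as np

def center_kernel(K):
    """Compute K_tilde = K - 1_N K - K 1_N + 1_N K 1_N."""
    N = len(K)
    one_N = np.full((N, N), 1.0 / N)   # the matrix 1_N with every entry equal to 1/N
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N

# usage: K_tilde = center_kernel(K), then run the kernel PCA steps on K_tilde
```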

There are three key observations:

  • The mapping $\phi(\mathbf{x})$ in the kernel PCA algorithm is never explicitly specified, nor is the dimensionality of the feature space $F$.
  • Similarly, the covariance matrix $\Sigma_{\phi}$ and its eigenvectors $\psi_i$ appear only in the derivation above; they never need to be computed in the implementation.
  • The potential high dimensionality of $F$ does not cost extra computation, as only the kernel $k(\mathbf{x}, \mathbf{x}_n)$ is needed in the implementation.

Kernel PCA has all the advantages of regular PCA, together with an implicit nonlinear mapping to a feature space $F$ in which features representing the structure of the data may be better extracted. For example, data classification may be carried out much more easily in $F$ due to improved linear separability.

Reference

[1] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer 2006.
[2] Why do we maximize variance during Principal Component Analysis?
[3] PCA objective function: what is the connection between maximizing variance and minimizing error?
[4] Hanson-ml
