[Linear Algebra] Low Rank Approximations
Low-Rank Approximations
Motivation
In mathematics, low-rank approximation is a minimization problem in which the objective measures the fit between a given matrix (the data) and an approximating matrix (the optimization variable) of reduced rank. In other words, it asks for the best approximation of a given high-rank matrix by a matrix of low rank, and it is the most famous application of the SVD.
Low-rank approximation is mainly used for data compression: given an $m \times n$ matrix, we can reduce the number of stored entries from $mn$ to about $k(m + n)$, where $k$ is the target rank of the approximation. This is a substantial saving when $k$ is much smaller than $m$ and $n$.
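As a rough sketch of this saving (assuming NumPy; the matrix size and the target rank $k = 20$ below are arbitrary choices for illustration):

```python
import numpy as np

# Arbitrary example: compress a 1000 x 500 matrix to rank k = 20.
m, n, k = 1000, 500, 20
A = np.random.randn(m, n)

# Thin SVD, then keep only the k leading singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

print(m * n)        # 500000 entries stored for A itself
print(k * (m + n))  # 30000 entries for U_k and Vt_k (plus k singular values)

# The rank-k approximation can be re-assembled on demand.
A_k = U_k @ np.diag(s_k) @ Vt_k
```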
Low-Rank Approximations from the SVD
First, by SVD, a matrix can be represented as a sum of rank-one matrices:
$\mathbf{Thm\ 2.1.}$ Let $A = U \Sigma V^*$ be an SVD of $A \in \mathbb{C}^{m \times n}$ with singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, left singular vectors $u_i$, and right singular vectors $v_i$, where $r = \text{rank}(A)$. Then

$$A = \sum_{i=1}^r \sigma_i u_i v_i^*.$$

$\mathbf{Proof.}$ Expanding the product $U \Sigma V^*$, each diagonal entry $\sigma_i$ of $\Sigma$ contributes the rank-one term $\sigma_i u_i v_i^*$, and the terms with $i > r$ vanish because $\sigma_i = 0$ there. Summing the remaining terms gives the claim.
This allows us to represent a matrix as a weighted sum of components ordered by “importance”, as measured by the singular values. It also plays a role in reducing the number of entries needed to express the matrix.
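A quick numerical sanity check of this rank-one expansion (assuming NumPy; the small random matrix below is an arbitrary example):

```python
import numpy as np

# Arbitrary small example to check Thm 2.1 numerically.
A = np.random.randn(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = np.linalg.matrix_rank(A)

# Accumulate the rank-one terms sigma_i * u_i * v_i^*.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))

print(np.allclose(A, A_rebuilt))  # True, up to floating-point error
```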
Now, let’s prove the main theorem:
$\mathbf{Thm\ 2.2.}$ Eckart–Young–Mirsky theorem
For any $\nu$ with $0 \leq \nu \leq r$, define $A_\nu = \sum_{j=1}^\nu \sigma_j u_j v_j^*$; if $\nu = \min(m, n)$, define $\sigma_{\nu+1} = 0$. Then,

$$\| A - A_\nu \|_2 = \inf_{B \in \mathbb{C}^{m \times n},\ \text{rank}(B) \leq \nu} \| A - B \|_2 = \sigma_{\nu+1}.$$
$\mathbf{Proof.}$
Note first that $\| A - A_\nu \|_2 = \sigma_{\nu+1}$, since $A - A_\nu = \sum_{j=\nu+1}^r \sigma_j u_j v_j^*$ has largest singular value $\sigma_{\nu+1}$. Suppose, for contradiction, that there exists $B \in \mathbb{C}^{m \times n}$ with $\text{rank}(B) \leq \nu$ such that

$$\| A - B \|_2 < \| A - A_\nu \|_2 = \sigma_{\nu+1}.$$
Then, by the rank–nullity theorem, $\text{nullity}(B) \geq n - \nu$, so the null space $W = \{ w \in \mathbb{C}^n : Bw = 0 \}$ is a subspace of $\mathbb{C}^n$ of dimension at least $n - \nu$.
For all nonzero $w \in W$,

$$\| A w \|_2 = \| (A - B) w \|_2 \leq \| A - B \|_2 \, \| w \|_2 < \sigma_{\nu+1} \| w \|_2.$$
Next, consider the $(\nu + 1)$-dimensional subspace $V = \text{Span}(v_1, v_2, \cdots, v_{\nu+1}) \subseteq \mathbb{C}^n$. Then, for every nonzero $w \in V$,

$$\| A w \|_2 \geq \sigma_{\nu+1} \| w \|_2,$$

since writing $w = \sum_{j=1}^{\nu+1} c_j v_j$ gives $\| A w \|_2^2 = \sum_{j=1}^{\nu+1} \sigma_j^2 |c_j|^2 \geq \sigma_{\nu+1}^2 \| w \|_2^2$. Now, $\dim W + \dim V \geq (n - \nu) + (\nu + 1) = n + 1$, while $\dim(V + W) \leq n$ (as $V + W \subseteq \mathbb{C}^n$). Since $\dim W + \dim V = \dim(V + W) + \dim(V \cap W)$, we conclude $\dim(V \cap W) \geq 1$, i.e., $V \cap W \neq \{\mathbf{0}\}$.
In other words, there exists $w \neq \mathbf{0}$ with $w \in W \cap V$. Such a $w$ satisfies both $\| A w \|_2 < \sigma_{\nu+1} \| w \|_2$ and $\| A w \|_2 \geq \sigma_{\nu+1} \| w \|_2$, which is a contradiction. Therefore $\| A - B \|_2 \geq \sigma_{\nu+1} = \| A - A_\nu \|_2$ for every $B$ of rank at most $\nu$.
$\mathbf{Thm\ 2.3.}$ The analogous result holds for the Frobenius norm:

$$\| A - A_\nu \|_F = \inf_{B \in \mathbb{C}^{m \times n},\ \text{rank}(B) \leq \nu} \| A - B \|_F = \sqrt{\sigma_{\nu+1}^2 + \cdots + \sigma_r^2}.$$
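Both error formulas are easy to check numerically. The sketch below (assuming NumPy; the matrix size and the choice $\nu = 3$ are arbitrary) compares $\| A - A_\nu \|$ with the predicted values in the spectral and Frobenius norms:

```python
import numpy as np

# Arbitrary example: compare the rank-nu approximation error with Thm 2.2 / 2.3.
A = np.random.randn(8, 6)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

nu = 3
A_nu = U[:, :nu] @ np.diag(s[:nu]) @ Vt[:nu, :]

# Spectral-norm error equals the (nu+1)-th singular value.
print(np.isclose(np.linalg.norm(A - A_nu, 2), s[nu]))

# Frobenius-norm error equals sqrt(sigma_{nu+1}^2 + ... + sigma_r^2).
print(np.isclose(np.linalg.norm(A - A_nu, 'fro'), np.sqrt(np.sum(s[nu:] ** 2))))
```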
How to Choose the Target Rank?
We can think of the target rank $k$ as a hyperparameter of the algorithm, as in deep learning, and we should select an appropriate and efficient $k$ for the approximation. How, then, should $k$ be decided? In practice, the SVD often provides a clear criterion: the obvious approach is to take $k$ equal to the number of large singular values. If the top few singular values are large and the rest are not, it is intuitive to drop the small ones, since they have only a small impact on the matrix. Otherwise, we usually choose $k$ such that the sum of the top $k$ singular values is at least $c$ times as big as the sum of the remaining singular values, where $c$ is a domain-dependent constant like $10$.
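A minimal sketch of this selection rule (assuming NumPy; the helper `choose_rank` and the example spectrum are hypothetical, purely for illustration):

```python
import numpy as np

def choose_rank(singular_values, c=10.0):
    """Return the smallest k such that the sum of the top-k singular values
    is at least c times the sum of the remaining ones (c is domain-dependent)."""
    s = np.asarray(singular_values, dtype=float)
    head = np.cumsum(s)      # partial sums of the top-k singular values
    tail = s.sum() - head    # sums of the remaining singular values
    for k, (h, t) in enumerate(zip(head, tail), start=1):
        if h >= c * t:
            return k
    return len(s)

# Example: a fast-decaying spectrum is captured well by a small k.
spectrum = np.array([100.0, 50.0, 2.0, 1.0, 0.5, 0.1])
print(choose_rank(spectrum))  # 2 with the default c = 10
```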
PCA vs. SVD
Recall that for any given $m \times n$ matrix, an SVD exists and is unique up to complex signs (when the singular values are distinct), as we proved in the previous post. Therefore, unlike the eigenvalue decomposition, which requires the matrix to be square and diagonalizable (e.g., symmetric), we can always perform low-rank approximation via the SVD.
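As a small illustration of this point (assuming NumPy; the matrix below is an arbitrary non-square example), `np.linalg.svd` accepts any rectangular matrix, while `np.linalg.eig` requires a square one:

```python
import numpy as np

A = np.random.randn(5, 3)       # non-square, non-symmetric example

U, s, Vt = np.linalg.svd(A)     # the SVD always exists

try:
    np.linalg.eig(A)            # eigendecomposition needs a square matrix
except np.linalg.LinAlgError as err:
    print(err)                  # raises LinAlgError for non-square input
```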
References
[1] Lloyd N. Trefethen & David Bau III, Numerical Linear Algebra, 1997
[2] Stanford CS168: The Modern Algorithmic Toolbox. Lecture #9: The Singular Value Decomposition (SVD) and Low-Rank Matrix Approximations.
[3] Wikipedia, Low-rank approximation