
Singular Value Decomposition (SVD)

Motivation 1: geometric interpretation

SVD is a popular matrix decomposition that is widely used in various linear-algebra-related algorithms. There are several ways to motivate the SVD, one of which starts from the following geometric fact: the image of the unit sphere under any square matrix is a hyperellipse.


$\mathbf{Fig\ 1.}$ Illustration of SVD of $M \in \mathbb{R}^{2 \times 2}$.


Let $S$ be the unit sphere in $\mathbb{R}^n$, and consider the linear map induced by any matrix $A \in \mathbb{R}^{m \times n}$ with $m \geq n$. Although any matrix admits an SVD regardless of its rank and of whether it is over $\mathbb{R}$ or $\mathbb{C}$, we suppose $A$ has full rank and work over $\mathbb{R}$ instead of $\mathbb{C}$ for simplicity and intuition.


$\mathbf{Fig\ 2.}$ Our situation

Then the image hyperellipse $AS$ has $n$ principal semiaxes (since we assumed full rank) \(\{ \sigma_1 \mathbf{u}_1, \cdots, \sigma_n \mathbf{u}_n \}\), where the unit vectors \(\mathbf{u}_j\) indicate the directions of the semiaxes and the $\sigma_j$ are their lengths. We then look for the preimages of the semiaxes, \(\{ \mathbf{v}_1, \cdots, \mathbf{v}_n \} \subset S\).
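
To make this concrete, here is a minimal numerical sketch (assuming NumPy is available; the particular $2 \times 2$ matrix below is only an illustrative choice, not taken from the text). It checks that each right singular vector is mapped to a principal semiaxis $\sigma_j \mathbf{u}_j$ of the image ellipse.

```python
import numpy as np

# An example full-rank 2 x 2 matrix mapping the unit circle to an ellipse.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Full SVD: A = U @ np.diag(s) @ Vt, with s sorted in decreasing order.
U, s, Vt = np.linalg.svd(A)

# Each right singular vector v_j is sent to sigma_j * u_j,
# i.e. to a principal semiaxis of the image ellipse A S.
for j in range(2):
    v_j = Vt[j]        # j-th right singular vector (a row of Vt)
    u_j = U[:, j]      # j-th left singular vector
    assert np.allclose(A @ v_j, s[j] * u_j)

# Points on the unit circle map onto the ellipse with semiaxes s[0]*U[:,0] and s[1]*U[:,1].
theta = np.linspace(0.0, 2.0 * np.pi, 200)
circle = np.stack([np.cos(theta), np.sin(theta)])   # shape (2, 200)
ellipse = A @ circle
print("semiaxis lengths:", s)
```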

Motivation 2: Best-fit Subspaces

Another motivation: the SVD solves the best-fit subspace problem for a set of $n$ vectors. [2]

Suppose we have $n$ data points, each represented as a $d$-dimensional vector. Construct $A \in \mathbb{R}^{n \times d}$ from them, whose $i$-th row is \(\mathbf{a}_i\).

In terms of data reduction, we can consider the problem of finding a best-fit $k$-dimensional subspace $W$ that represents $A$ well. Best-fitting is in the sense of minimizing the sum of squared distances from the points to the subspace, or equivalently maximizing the sum of squared lengths of their projections onto it. That is, we want to find an orthonormal basis \(\{ \mathbf{v}_1, \cdots, \mathbf{v}_k \}\) of $W$ (since Gram-Schmidt constructs an orthonormal basis from any basis, we may assume the basis is orthonormal).

Fortunately, if we choose the best-fitting lines one by one under orthogonality constraints, we obtain the best-fit subspace. In other words, the set \(\{ \mathbf{v}_1, \cdots, \mathbf{v}_k \}\) where

\[\begin{array}{lll} & \mathbf{v}_1:=\arg \max _{|\mathbf{v}|=1}|A \mathbf{v}|, & \sigma_1(A):=\left|A \mathbf{v}_1\right|, \\ \text { and } & \mathbf{v}_{k+1}:=\arg \max _{|\mathbf{v}|=1, \mathbf{v} \perp \text { span }\left\{\mathbf{v}_1, \ldots, \mathbf{v}_k\right\}}|A \mathbf{v}|, & \sigma_{k+1}(A):=\left|A \mathbf{v}_{k+1}\right|, \end{array}\]

is our desired basis.
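
Before the proof, a quick numerical sanity check of this greedy characterization may help (a minimal sketch assuming NumPy; the random data matrix below is hypothetical). It verifies that the first right singular vector returned by `np.linalg.svd` beats random unit directions in the objective $|A\mathbf{v}|$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))      # 50 data points in R^5, one per row

# Right singular vectors are the rows of Vt, ordered by singular value.
_, s, Vt = np.linalg.svd(A, full_matrices=False)
v1, v2 = Vt[0], Vt[1]

# v1 should maximize |A v| over unit vectors: compare against random directions.
dirs = rng.normal(size=(1000, 5))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
assert np.linalg.norm(A @ v1) >= np.linalg.norm(A @ dirs.T, axis=0).max()

# v2 is the maximizer among unit vectors orthogonal to v1.
assert abs(v1 @ v2) < 1e-10
print("sigma_1, sigma_2 =", s[0], s[1])
```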

$\mathbf{Proof.}$

For $1 \leq r \leq k$, let $V_r$ be the subspace spanned by \(\mathbf{v}_1, \cdots, \mathbf{v}_r\). For each $r$, we prove that $V_r$ is the best-fit $r$-dimensional subspace for $A$ by induction on $r$.

The case $r = 1$ is immediate from the definition of \(\mathbf{v}_1\).

Assume $V_{k-1}$ is the best-fit $(k-1)$-dimensional subspace, and suppose $W$ is a best-fit $k$-dimensional subspace. Choose an orthonormal basis \(\{ \mathbf{w}_1, \cdots, \mathbf{w}_k \}\) of $W$ such that \(\mathbf{w}_k\) is orthogonal to \(\mathbf{v}_1, \cdots, \mathbf{v}_{k-1}\). Then, by the induction hypothesis applied to the subspace spanned by \(\mathbf{w}_1, \cdots, \mathbf{w}_{k-1}\),

\[\begin{aligned} \left|A \mathbf{w}_1\right|^2+\left|A \mathbf{w}_2\right|^2+\cdots+\left|A \mathbf{w}_{k-1}\right|^2 \leq\left|A \mathbf{v}_1\right|^2+\left|A \mathbf{v}_2\right|^2+\cdots+\left|A \mathbf{v}_{k-1}\right|^2 \end{aligned}\]

and since \(\mathbf{w}_k\) is a unit vector orthogonal to \(\mathbf{v}_1, \cdots, \mathbf{v}_{k-1}\), the definition of \(\mathbf{v}_k\) gives \(| A \mathbf{w}_k | \leq | A \mathbf{v}_k |\). Adding the two, we obtain

\[\begin{aligned} \left|A \mathbf{w}_1\right|^2+\left|A \mathbf{w}_2\right|^2+\cdots+\left|A \mathbf{w}_{k-1}\right|^2 + \left|A \mathbf{w}_{k}\right|^2 \leq\left|A \mathbf{v}_1\right|^2+\left|A \mathbf{v}_2\right|^2+\cdots+\left|A \mathbf{v}_{k-1}\right|^2+\left|A \mathbf{v}_{k}\right|^2 \end{aligned}\]

Hence the sum of squared projection lengths onto $V_k$ is at least that onto $W$, so $V_k$ is also a best-fit $k$-dimensional subspace. The existence of a \(\mathbf{w}_k\) orthogonal to \(\mathbf{v}_1, \cdots, \mathbf{v}_{k-1}\) is ensured by the following statement:

If \(\mathcal{B} = \{ \mathbf{v}_1, \cdots, \mathbf{v}_k \}\) is orthonormal and $W$ is an arbitrary $(k+1)$-dimensional subspace, there exists $\mathbf{0} \neq \mathbf{w} \in W$ such that $\mathbf{w}$ is orthogonal to every vector in $\mathcal{B}$.

The proof is simple. Let \(\{ \mathbf{w}_1, \cdots, \mathbf{w}_{k+1} \}\) be a basis of $W$. Then we can express any vector $\mathbf{w} \in W$ as \(\mathbf{w} = x_1 \mathbf{w}_1 + \cdots + x_{k+1} \mathbf{w}_{k+1}\). We want to find $\mathbf{w}$ such that

\[\begin{aligned} \left\langle\mathbf{v}_i, \mathbf{w}\right\rangle=x_1\left\langle\mathbf{v}_i, \mathbf{w}_1\right\rangle+\cdots+x_{k+1}\left\langle\mathbf{v}_i, \mathbf{w}_{k+1}\right\rangle=0, \quad i=1, \ldots, k \end{aligned}\]

Since this system of equations has more variables ($k+1$) than equations ($k$), it has a nonzero solution. Done.


The $i$-th unit vector \(\mathbf{v}_i\) of the best-fit subspace problem is called the $i$-th right singular vector, and it defines the $i$-th singular value \(\sigma_i = \| A \mathbf{v}_i \|\). Finally, \(\mathbf{u}_i = \frac{1}{\sigma_i} A \mathbf{v}_i\) is called the $i$-th left singular vector.
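
The following sketch (assuming NumPy; the random matrix and the energy helper are illustrative only) checks these definitions: the left singular vectors are the normalized images of the right ones, and the span of the top-$k$ right singular vectors captures at least as much squared projection length as a random $k$-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 10))                 # rows are data points

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# u_i = A v_i / sigma_i is a unit vector (the i-th left singular vector).
assert np.allclose(A @ Vt[0] / s[0], U[:, 0])

# Squared projection energy onto the span of the columns of an orthonormal B.
def energy(B):
    return np.linalg.norm(A @ B) ** 2

k = 3
Vk = Vt[:k].T                                  # top-k right singular vectors
Q, _ = np.linalg.qr(rng.normal(size=(10, k)))  # a random k-dimensional subspace
assert energy(Vk) >= energy(Q)
print("captured energy:", energy(Vk), "vs. random subspace:", energy(Q))
```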

Singular Value Decomposition

Let $m$ and $n$ be arbitrary. Given $A \in \mathbb{C}^{m \times n}$, an SVD of $A$ is a factorization

\[\begin{aligned} A = U \Sigma V^\star \end{aligned}\]

where the matrix of left singular vectors $U \in \mathbb{C}^{m \times m}$ is unitary, $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal, and the matrix of right singular vectors $V \in \mathbb{C}^{n \times n}$ is unitary. In addition, the diagonal entries $\sigma_i$ of $\Sigma$ are assumed to be ordered: $\sigma_1 \geq \cdots \geq \sigma_r > 0$ and $\sigma_{r+1} = \cdots = \sigma_p = 0$, where $r = \text{rank}(A)$ and $p = \min(m, n)$.
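
For reference, this is how these dimensions look when computing an SVD numerically (a sketch assuming NumPy; note that `np.linalg.svd` returns $V^\star$ rather than $V$, and the singular values as a vector rather than as the matrix $\Sigma$).

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 3
A = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))

# Full SVD: U is m x m, Vh = V* is n x n, s holds the min(m, n) singular values.
U, s, Vh = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)            # embed the singular values into an m x n matrix

assert np.allclose(A, U @ Sigma @ Vh)
assert np.allclose(U.conj().T @ U, np.eye(m))      # U unitary
assert np.allclose(Vh @ Vh.conj().T, np.eye(n))    # V unitary
assert np.all(np.diff(s) <= 0) and np.all(s >= 0)  # ordered, nonnegative
```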

Note that every matrix of any size has an SVD, by the following theorem.

$\mathbf{Thm\ 1.1.}$ Existence & Uniqueness of SVD
Every $A \in \mathbb{C}^{m \times n}$ has an SVD. Furthermore, the singular values \(\{ \sigma_j \}\) are uniquely determined, and if $A$ is square and the $\sigma_j$ are distinct, the left and right singular vectors \(\{ \mathbf{u}_j \}\) and \(\{ \mathbf{v}_j \}\) are uniquely determined up to complex signs (complex scalar factors of absolute value $1$).

$\mathbf{Proof.}$
  • Existence

Let \(\sigma_1 = \| A \|_2 = \underset{\| \mathbf{x} \|_2 = 1}{\sup} \| A\mathbf{x} \|_2\). By compactness of the unit sphere, there exist \(\mathbf{v}_1 \in \mathbb{C}^n\) and \(\mathbf{u}_1 \in \mathbb{C}^m\) with \(\| \mathbf{v}_1 \|_2 = \| \mathbf{u}_1 \|_2 = 1\) such that \(A \mathbf{v}_1 = \sigma_1 \mathbf{u}_1\).

Extend \(\mathbf{u}_1\) and \(\mathbf{v}_1\) to orthonormal bases \(\{ \mathbf{u}_1, \cdots, \mathbf{u}_m \}\) and \(\{ \mathbf{v}_1, \cdots, \mathbf{v}_n \}\) of $\mathbb{C}^m$ and $\mathbb{C}^n$, respectively, and let \(U_1 = [ \mathbf{u}_1 \; \; \cdots \; \; \mathbf{u}_m ]\), \(V_1 = [ \mathbf{v}_1 \; \; \cdots \; \; \mathbf{v}_n ]\). Then $U_1, V_1$ are clearly unitary, and
\[\begin{aligned} U_1^\star A V_1 =\left[\begin{array}{ll} \sigma_1 & \mathbf{w}^\star \\ \mathbf{0} & B \end{array}\right] \end{aligned}\]

Set $S = U_1^\star A V_1$. We want to show $\mathbf{w} = \mathbf{0}$. Since

\[\begin{aligned} \left\|\left[\begin{array}{cc} \sigma_1 & \mathbf{w}^\star \\ \mathbf{0} & B \end{array}\right]\left[\begin{array}{l} \sigma_1 \\ \mathbf{w} \end{array}\right]\right\|_2=\left\|\left[\begin{array}{c} \sigma_1^2+\mathbf{w}^\star \mathbf{w} \\ B \mathbf{w} \end{array}\right]\right\|_2 \geq \sigma_1^2+\mathbf{w}^\star \mathbf{w}=\sqrt{\sigma_1^2+\mathbf{w}^\star \mathbf{w}}\left\|\left[\begin{array}{c} \sigma_1 \\ \mathbf{w} \end{array}\right]\right\|_2 \end{aligned}\]

we have \(\| S \|_2 \geq \sqrt{\sigma_1^2 + \mathbf{w}^\star \mathbf{w}}\). As $U_1, V_1$ are unitary, \(\| S \|_2 = \| A \|_2 = \sigma_1\). Thus, $\mathbf{w} = \mathbf{0}$.

If $m=1$ or $n=1$, we are done. If not, $B$ describes the action of $A$ on the subspace orthogonal to \(\mathbf{v}_1\). By the induction hypothesis, $B = U_2 \Sigma_2 V_2^\star$. We are done by

\[A=U_1\left[\begin{array}{cc} 1 & 0 \\ 0 & U_2 \end{array}\right]\left[\begin{array}{cc} \sigma_1 & 0 \\ 0 & \Sigma_2 \end{array}\right]\left[\begin{array}{cc} 1 & 0 \\ 0 & V_2 \end{array}\right]^\star V_1^\star\]



  • Uniqueness

The geometric justification is straightforward: if the semiaxis lengths of a hyperellipse are distinct, then the semiaxes themselves are determined by the geometry, up to signs.

First, \(\sigma_1 = \| A \|_2\), so it is uniquely determined. Suppose there exists a $\mathbf{w}$, linearly independent of \(\mathbf{v}_1\), with \(\| \mathbf{w} \|_2 = 1\) such that \(\| A\mathbf{w} \|_2 = \sigma_1\).

Define the unit vector \(\mathbf{v}_2 = \frac{\mathbf{w} - (\mathbf{v}_1^\star \mathbf{w}) \mathbf{v}_1}{\| \mathbf{w} - (\mathbf{v}_1^\star \mathbf{w}) \mathbf{v}_1 \|_2} = \frac{\mathbf{w} - s \mathbf{v}_1}{c}\).




Matrix Properties via SVD

This section addresses important properties of a matrix that can be read off from its SVD. Although the SVD is slightly different from diagonalizations such as the eigenvalue decomposition (we discuss the relationship between the SVD and the eigenvalue decomposition in a later section), it provides plenty of information about the matrix.

$\mathbf{Thm\ 1.2.}$ $\text{rank}(A)$ equals the number of nonzero singular values.

$\mathbf{Proof.}$

Since $\text{col}(AB) \subset \text{col}(A)$, we have $\text{rank}(AB) \leq \text{rank}(A)$. If $B$ is invertible, then also $\text{rank}(A) = \text{rank}(ABB^{-1}) \leq \text{rank}(AB)$.
Thus, $\text{rank}(A) = \text{rank}(AB)$.

Similarly, $\text{rank}(A) = \text{rank}(BA)$ when $B$ is invertible. Hence $\text{rank}(A) = \text{rank}(U \Sigma V^\star) = \text{rank}(\Sigma)$, which is the number of nonzero diagonal entries of $\Sigma$. Done.
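
Numerically, this is exactly how the rank is usually computed (a sketch assuming NumPy; the rank-2 example and the tolerance choice are illustrative): count the singular values above a small tolerance.

```python
import numpy as np

rng = np.random.default_rng(4)
# A 6 x 5 matrix of rank 2, built as a sum of two outer products.
A = (rng.normal(size=(6, 1)) @ rng.normal(size=(1, 5))
     + rng.normal(size=(6, 1)) @ rng.normal(size=(1, 5)))

s = np.linalg.svd(A, compute_uv=False)
tol = max(A.shape) * np.finfo(A.dtype).eps * s[0]   # one common tolerance choice
print("singular values:", s)
print("rank from SVD:", int(np.sum(s > tol)))               # 2
print("np.linalg.matrix_rank:", np.linalg.matrix_rank(A))   # also computed via SVD
```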


$\mathbf{Thm\ 1.3.}$ $\text{range}(A) = \left< u_1, \cdots, u_r \right>$ and $\text{null}(A) = \left< v_{r+1}, \cdots, v_n \right>$. ( $r = \text{rank}(A)$ )

$\mathbf{Proof.}$

\(\text{null}(A) = \{ x \in \mathbb{C}^n \mid U \Sigma V^* x = 0 \} = \{ x \in \mathbb{C}^n \mid \Sigma V^* x = 0 \}\) as $U$ is invertible.
Let $x = \sum_{j=1}^n c_j v_j$, so that $V^* x = (c_1, \cdots, c_n)^\top$ and $\Sigma V^* x = (\sigma_1 c_1, \cdots, \sigma_r c_r, 0, \cdots, 0)^\top$. Hence $x \in \text{null}(A)$ if and only if $c_1 = \cdots = c_r = 0$, i.e., $\text{null}(A) = \left< v_{r+1}, \cdots, v_n \right>$.

Note that $\text{range}(A) = \text{range}(AB)$ when $B$ is invertible.
$\text{range}(A) = \text{col}(A) = \text{col}(U \Sigma V^*) = \text{col}(U \Sigma) = \text{col}([ \sigma_1 u_1 \ \cdots \ \sigma_r u_r \ 0 \cdots 0]) = \left< u_1, \cdots, u_r \right>._\blacksquare$
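
A small numerical check of $\mathbf{Thm\ 1.3}$ (a sketch assuming NumPy; the rank-2 matrix is an arbitrary example): the trailing right singular vectors are annihilated by $A$, and every vector $A\mathbf{x}$ lies in the span of the leading left singular vectors.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))   # rank 2, shape 5 x 4

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))                              # numerical rank (= 2)

# null(A) = <v_{r+1}, ..., v_n>: the trailing right singular vectors.
for j in range(r, A.shape[1]):
    assert np.allclose(A @ Vt[j], 0.0, atol=1e-10)

# range(A) = <u_1, ..., u_r>: projecting A x onto them loses nothing.
x = rng.normal(size=4)
b = A @ x
assert np.allclose(U[:, :r] @ (U[:, :r].T @ b), b)
```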


$\mathbf{Thm\ 1.4.}$ $\| A \|_2 = \sigma_1$, $\| A \|_F = \sqrt{\sigma_1^2 + \cdots + \sigma_r^2}$.

$\mathbf{Proof.}$

Note that unitary matrices preserve both the $2$-norm and the Frobenius norm.

$\| A \|_F = \| \Sigma \|_F = \sqrt{\sigma_1^2 + \cdots + \sigma_r^2}$.
$\| A \|_2 = \| \Sigma \|_2 = \underset{\| x \|_2 = 1}{\sup} \| \Sigma x \|_2 = \underset{\| x \|_2 = 1}{\sup} \sqrt{\sigma_1^2 |x_1|^2 + \cdots + \sigma_r^2 |x_r|^2} = \sigma_1._\blacksquare$
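
A quick numerical confirmation of $\mathbf{Thm\ 1.4}$ (a sketch assuming NumPy; the random matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(7, 4))
s = np.linalg.svd(A, compute_uv=False)

assert np.isclose(np.linalg.norm(A, 2), s[0])                       # spectral norm
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2)))  # Frobenius norm
```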



$\mathbf{Thm\ 1.5.}$ The nonzero singular values of $A$ are the square roots of the nonzero eigenvalues of $A^* A$ or $A A^*$.

$\mathbf{Proof.}$

$A = U \Sigma V^\star, A^\star = V \Sigma^\star U^\star$. Then $AA^\star = U (\Sigma \Sigma^\star) U^\star$ and $A^\star A = V (\Sigma^\star \Sigma) V^\star$, which are unitary diagonalizations whose nonzero eigenvalues are the $\sigma_i^2._\blacksquare$
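
The relation in $\mathbf{Thm\ 1.5}$ can be checked numerically as follows (a sketch assuming NumPy; the complex test matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(5, 3)) + 1j * rng.normal(size=(5, 3))
s = np.linalg.svd(A, compute_uv=False)

# Eigenvalues of the Hermitian matrix A* A are the squared singular values.
eigvals = np.linalg.eigvalsh(A.conj().T @ A)     # returned in ascending order
assert np.allclose(np.sort(s**2), eigvals)
```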


$\mathbf{Thm\ 1.6.}$ If $A = A^*$, the singular values of $A$ are the absolute values of eigenvalues of $A$.

$\mathbf{Proof.}$

Since $A = A^*$, it admits a unitary diagonalization $A = Q \Lambda Q^*$ with real $\Lambda$. Sorting the diagonal entries of $\Lambda$ by absolute value in decreasing order, we obtain an SVD: $A = Q \Lambda Q^* = Q | \Lambda | \, \text{sign}(\Lambda) Q^* = Q \, | \Lambda | \, (Q \, \text{sign}(\Lambda))^*._\blacksquare$
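
A numerical check of $\mathbf{Thm\ 1.6}$ on a random symmetric matrix (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(8)
B = rng.normal(size=(4, 4))
A = B + B.T                              # real symmetric, hence A = A*

eigvals = np.linalg.eigvalsh(A)          # real eigenvalues
s = np.linalg.svd(A, compute_uv=False)   # singular values, descending

assert np.allclose(np.sort(np.abs(eigvals))[::-1], s)
```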


$\mathbf{Thm\ 1.7.}$ For $A \in \mathbb{C}^{m \times m}$, $| \text{det}(A) | = \prod_{i=1}^m \sigma_i$.

$\mathbf{Proof.}$

$| \text{det}(A) | = | \text{det} (U \Sigma V^\star) | = | \text{det}(U) | \, | \text{det}(\Sigma) | \, | \text{det}(V^\star) | = | \text{det}(\Sigma) | = \prod_{i=1}^m \sigma_i$, since unitary matrices have determinants of absolute value $1._\blacksquare$
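
And a one-line check of $\mathbf{Thm\ 1.7}$ (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(4, 4))
s = np.linalg.svd(A, compute_uv=False)

assert np.isclose(abs(np.linalg.det(A)), np.prod(s))
```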




Change of Bases

The SVD also tells us that every matrix can be regarded as a diagonal matrix if one uses the proper bases for the domain and range spaces. For $\mathbf{b} = A \mathbf{x}$ and $A = U \Sigma V^\star$ (full SVD), $\mathbf{b}$ and $\mathbf{x}$ can be expressed in the new coordinates

\[\begin{aligned} \mathbf{b}^\prime & = U^\star \mathbf{b} \\ \mathbf{x}^\prime & = V^\star \mathbf{x} \end{aligned}\]

and then the relation becomes \(\mathbf{b}^\prime = \Sigma \mathbf{x}^\prime\).
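
The change of coordinates can be verified numerically (a sketch assuming NumPy; the random $5 \times 3$ system is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.normal(size=(5, 3))
x = rng.normal(size=3)
b = A @ x

U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)

# In the coordinates b' = U* b and x' = V* x, the map is just the diagonal Sigma.
b_prime = U.T @ b      # real case, so U* = U^T
x_prime = Vt @ x
assert np.allclose(b_prime, Sigma @ x_prime)
```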



SVD vs. Eigenvalue decomposition

The following table summarizes the main differences between the SVD and the EVD:

| SVD | EVD |
| --- | --- |
| uses two different bases $U$, $V$ | uses one basis (the eigenvectors) |
| bases are orthonormal | generally not orthogonal for a non-symmetric matrix |
| exists for all matrices | exists only for non-defective matrices |


And if a matrix $A$ is symmetric, so that it admits an orthonormal eigendecomposition \(A = \sum_{i=1}^n \lambda_i \mathbf{v}_i \mathbf{v}_i^\top\), we can construct a singular value decomposition from it, as we saw in the proof of $\mathbf{Thm\ 1.6}$.
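
As a sketch of that construction (assuming NumPy; the random symmetric matrix is an arbitrary example), one can build an SVD directly from the eigendecomposition of a symmetric matrix by taking absolute values of the eigenvalues and absorbing their signs into the right singular vectors:

```python
import numpy as np

rng = np.random.default_rng(11)
B = rng.normal(size=(4, 4))
A = B + B.T                                  # symmetric: A = Q diag(lam) Q^T

lam, Q = np.linalg.eigh(A)

# Sort by |lambda| in decreasing order, as singular values must be ordered.
order = np.argsort(-np.abs(lam))
lam, Q = lam[order], Q[:, order]

U = Q                                        # left singular vectors
Sigma = np.diag(np.abs(lam))                 # singular values = |eigenvalues|
V = Q * np.sign(lam)                         # flip columns with negative eigenvalues

assert np.allclose(A, U @ Sigma @ V.T)
```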



Reference

[1] Lloyd N. Trefethen and David Bau III, Numerical Linear Algebra, SIAM, 1997.
[2] Avrim Blum, John Hopcroft, and Ravindran Kannan, Foundations of Data Science, Cambridge University Press, 2020.
