[ML] Support Vector Machine (SVM)
Linear SVM
Introduction
A hyperplane in an $n$-dimensional feature space can be represented by the following equation:
\[f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w} + b = 0\]
Dividing by $\Vert \mathbf{w} \Vert$, we get
\[\frac {\mathbf{x}^\top \mathbf{w}} {\Vert \mathbf{w} \Vert} = P_{\mathbf{w}} (\mathbf{x}) = - \frac {b}{\Vert \mathbf{w} \Vert}\]indicating that the projection of any point $\mathbf{x}$ on the plane onto the vector $\mathbf{w}$ always has the length $\frac {\vert b \vert}{\Vert \mathbf{w} \Vert}$. Moreover, we see that $\mathbf{w}$ is the normal direction of the plane.
Note that the equation of the hyperplane is not unique: $c \cdot f(\mathbf{x}) = 0$ represents the same plane for any $c \neq 0$. The hyperplane partitions the $n$-dimensional space into two regions. Specifically, we define a map
\[y = \text{sign}(f(\mathbf{x})) \in \{ -1, 1 \}\]A point $\mathbf{x}$ of unknown class will be classified to P (Positive) if $f(\mathbf{x}) > 0$, or N (Negative) if $f(\mathbf{x}) < 0$.
For example, consider the line in a $2$-dimensional space given by $$ f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w} + b = x_1 + 2x_2 - 1 = 0 $$ with $\mathbf{w} = (1, 2)^\top$ and $b = -1$. It divides the $2$-dimensional plane into two halves. The distance between the origin and the line is
$$ \frac {\vert b \vert}{\Vert \mathbf{w} \Vert} = \frac {1} {\sqrt{w_1^2 + w_2^2} } = \frac{1}{\sqrt{5}} = 0.447 $$
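As a quick numerical check of this distance, here is a minimal sketch (using NumPy, which is only an illustrative choice):

```python
import numpy as np

# Hyperplane f(x) = x^T w + b with w = (1, 2) and b = -1
w = np.array([1.0, 2.0])
b = -1.0

# Distance from the origin to the hyperplane: |b| / ||w||
print(abs(b) / np.linalg.norm(w))  # 0.447... = 1 / sqrt(5)
```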
Given a set of $K$ training samples from two linearly separable classes P and N:
\[\{ (\mathbf{x}_k, y_k) | y_k \in \{ -1, 1\}, k = 1, \cdots, K \}\]where $y_k$ indicates which of the two classes $\mathbf{x}_k$ belongs to. We want to find a hyperplane, in terms of $\mathbf{w}$ and $b$, that linearly separates the two classes. Before the classifier is properly trained, the actual output $\widehat{y} = \text{sign}(f(\mathbf{x}))$ may not be the same as the desired output $y$. There are four possible cases:
| Case | Input $(\mathbf{x}, y)$ | Output $\widehat{y} = \text{sign}(f(\mathbf{x}))$ | Result |
|---|---|---|---|
| 1 | $(\mathbf{x}, y=1)$ | $\widehat{y} = 1 = y$ | correct |
| 2 | $(\mathbf{x}, y=-1)$ | $\widehat{y} = 1 \neq y$ | incorrect |
| 3 | $(\mathbf{x}, y=1)$ | $\widehat{y} = -1 \neq y$ | incorrect |
| 4 | $(\mathbf{x}, y=-1)$ | $\widehat{y} = -1 = y$ | correct |
Note that a correct decision hyperplane satisfies
\[y_i \, f(\mathbf{x}_i) > 0 \quad \forall i = 1, \cdots, K.\]
Among all correct decision hyperplanes, we want to find the optimal one, $H_0$, that separates the two classes with the maximal margin (the distance between the decision plane and the closest sample points).
It is reasonable that the optimal hyperplane should lie in the middle of the two classes, so that the distance from the hyperplane to the closest point on either side is the same. So, we define two additional hyperplanes $H_+$ and $H_-$ that are parallel to $H_0$ and pass through the point closest to $H_0$ on either side.
Note that these two hyperplanes can uniquely determine $H_0$.
All points $\mathbf{x}_i \in P$ on the positive side should satisfy $$ \mathbf{x}_{i}^\top \mathbf{w} + b \geq 1, y_i = 1 $$ and all points $\mathbf{x}_i \in N$ on the negative side should satisfy $$ \mathbf{x}_{i}^\top \mathbf{w} + b \leq -1, y_i = -1 $$ These can be combined into one inequality: $$ y_i (\mathbf{x}_{i}^\top \mathbf{w} + b) \geq 1 (i = 1, \cdots, K) $$ The equality holds for those points on the hyperplanes $H_+$ and $H_-$. Such points are called **support vectors**, for which
$$ \mathbf{x}_{i}^\top \mathbf{w} + b = y_i $$
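To make this concrete, the sketch below fits a linear SVM with a very large $C$ (which approximates the hard-margin SVM described here) on a small assumed toy data set using scikit-learn, and checks that the support vectors satisfy $y_i (\mathbf{x}_i^\top \mathbf{w} + b) = 1$. The data and the use of scikit-learn are illustrative assumptions, not part of the original derivation.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy data: two linearly separable classes in 2D
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],
              [5.0, 5.0], [6.0, 6.5], [7.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Support vectors lie on H+ or H-, so y_i (x_i^T w + b) should be ~1
margins = y[clf.support_] * (X[clf.support_] @ w + b)
print(margins)  # each entry approximately 1
```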
Finding SVM
Our goal is to maximize the margin $\frac{1}{\Vert\mathbf{w}\Vert}$ (the distance from $H_0$ to the closest points), or equivalently to minimize the norm $\Vert\mathbf{w}\Vert$, from which we have the following constrained optimization problem:
\[\begin{array}{cl} \min & \frac{1}{2} \mathbf{w}^\top \mathbf{w} = \frac{1}{2} \Vert\mathbf{w}\Vert^2 \\ \textrm{ s.t. } & y_i (\mathbf{x}_{i}^\top \mathbf{w} + b) \geq 1 \; \forall i = 1, \cdots, K \end{array}\]Next, for simplicity, we convert this into an unconstrained optimization problem by using a penalty term as follows:
\[\underset{\mathbf{w}, b}{\min} \left\{ \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^K C\big( 1 - y_i (\mathbf{x}_i^\top \mathbf{w} + b) \big) \right\}\]where the penalty term $C$ is given by
\[C(u) = \underset{\alpha \geq 0}{\max} \; \alpha u = \begin{cases} 0 & \text{if } u \leq 0 \\ \infty & \text{otherwise,} \end{cases}\]so that any violated constraint incurs an infinite penalty. Now we have:
\[\begin{aligned} & \underset{\mathbf{w}, b}{\min} \left\{ \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^K \underset{\alpha_i \geq 0}{\max} \alpha_i (1 - y_i ( \mathbf{x}_i^\top \mathbf{w} + b)) \right\} \\ = & \; \underset{\mathbf{w}, b}{\min} \underset{\alpha_i \geq 0}{\max} \left\{ \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^K \alpha_i (1 - y_i ( \mathbf{x}_i^\top \mathbf{w} + b)) \right\} \\ \geq & \; \underset{\alpha_i \geq 0}{\max} \underset{\mathbf{w}, b}{\min} \left\{ \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^K \alpha_i (1 - y_i ( \mathbf{x}_i^\top \mathbf{w} + b)) \right\} \end{aligned}\]For the last inequality, see the max-min inequality. Here, the $\alpha_i$ are called Lagrange multipliers, and equality holds under certain conditions (strong duality). Let:
\[L_p (\mathbf{w}, b, \mathbf{\alpha}) = \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^K \alpha_i ( 1 - y_i (\mathbf{x}_i^\top \mathbf{w} + b))\]Solving the minimization problem first by using $\nabla_\mathbf{w} L_p (\mathbf{w}, b, \mathbf{\alpha}) = 0, \frac{\partial}{\partial b} L_p (\mathbf{w}, b, \mathbf{\alpha}) = 0$, we get
\[\mathbf{w} = \sum_{i=1}^K \alpha_i y_i \mathbf{x}_i \textrm{ and } \sum_{i=1}^K \alpha_i y_i = 0.\]Substituting these two equations back into the expression of $L_p (\mathbf{w}, b, \mathbf{\alpha})$, we get the following maximization problem with respect to $\alpha_i$:
\[\underset{\alpha_i \geq 0}{\max} \; L_D (\mathbf{\alpha}) = \underset{\alpha_i \geq 0}{\max} \left\{ \sum_{i=1}^K \alpha_i - \frac{1}{2} \sum_{i=1}^K \sum_{j=1}^K \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \right\}\]where $\sum_{i=1}^K \alpha_i y_i = 0$.
From the solution $\alpha_i \ (i = 1, 2, \cdots, K)$ of the above problem, we get the optimal $\mathbf{w}^* = \sum_{i=1}^K \alpha_i y_i \mathbf{x}_i$. To get the constant $b$, observe that a support vector $\mathbf{x}_i$ satisfies $y_i (\mathbf{x}_i^\top \mathbf{w} + b) = 1$, which is equivalent to $\mathbf{x}_i^\top \mathbf{w} + b = y_i$ (since $y_i^2 = 1$). Thus, the optimal $b^* = y_i - \mathbf{x}_i^\top \mathbf{w}^*$.
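For a small data set, this dual problem can be solved directly with a generic quadratic-programming routine (practical implementations use specialized algorithms such as SMO instead). The sketch below uses `scipy.optimize` on an assumed toy data set and recovers $\mathbf{w}^*$ and $b^*$ from the resulting $\alpha_i$.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy data: two linearly separable classes
X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = X.shape[0]

# Q_ij = y_i y_j x_i^T x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T

# Maximize L_D(alpha) = sum_i alpha_i - 1/2 alpha^T Q alpha
# <=> minimize its negative, subject to alpha_i >= 0 and sum_i alpha_i y_i = 0
def neg_dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(K), method="SLSQP",
               bounds=[(0.0, None)] * K,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# Recover the primal solution
w = (alpha * y) @ X              # w* = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))       # index of a support vector (alpha_i > 0)
b = y[sv] - X[sv] @ w            # b* = y_i - x_i^T w*
print(alpha, w, b)
```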
From the support-vector condition above, we can also derive the length of the optimal weight vector $\mathbf{w}$. Let $I$ denote the index set of the support vectors, i.e., the indices $i$ with $\alpha_i > 0$ (for non-support vectors, $\alpha_i = 0$). First, note that for any $i \in I$,
\[\mathbf{w} = \sum_{j=1}^K \alpha_j y_j \mathbf{x}_{j} = \sum_{j \in I} \alpha_j y_j \mathbf{x}_{j} \implies y_i \Big( \sum_{j \in I} \alpha_j y_j \mathbf{x}_i^\top \mathbf{x}_{j} + b \Big) = 1\]Then:
\[\begin{aligned} \Vert \mathbf{w} \Vert^2 = \mathbf{w}^\top \mathbf{w} & = \Big( \sum_{i \in I} \alpha_i y_i \mathbf{x}_i \Big)^\top \Big( \sum_{j \in I} \alpha_j y_j \mathbf{x}_j \Big) \\ & = \sum_{i \in I} \alpha_i y_i \sum_{j \in I} \alpha_j y_j \mathbf{x}_i^\top\mathbf{x}_j \\ & = \sum_{i \in I} \alpha_i (1 - y_i b) = \sum_{i \in I} \alpha_i - b \sum_{i \in I} \alpha_i y_i \\ & = \sum_{i \in I} \alpha_i \end{aligned}\]where we use the constraint that $\sum_{i \in I} \alpha_i y_i = \sum_{i=1}^K \alpha_i y_i = 0$.
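This identity is easy to verify numerically. In scikit-learn (an assumption of this sketch, not something the derivation depends on), `dual_coef_` stores $\alpha_i y_i$ for the support vectors, so $\alpha_i = \vert \texttt{dual\_coef\_}_i \vert$; with a very large $C$ approximating the hard margin, $\Vert\mathbf{w}\Vert^2$ and $\sum_{i \in I} \alpha_i$ should agree.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed linearly separable toy data
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, -1.0],
              [4.0, 4.0], [5.0, 3.0], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

# dual_coef_ holds alpha_i * y_i over the support vectors
alphas = np.abs(clf.dual_coef_[0])
w = clf.coef_[0]

# ||w||^2 should (approximately) equal the sum of the alpha_i
print(w @ w, alphas.sum())
```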
Logistic Regression vs. SVM
Logistic regression takes the decision hyperplane that maximizes the probability of the data (we use the likelihood function). So the farther the data lies from the decision hyperplane (on the correct side), the happier the logistic regression is.
Support vector machine takes the decision hyperplane that maximizes the distance of the closest points (the margin of the support vectors) from the hyperplane. If a point is not a support vector, it doesn’t really matter.
Soft Margin SVM
When the two classes are not linearly separable (e.g., due to noise), the condition for the optimal hyperplane can be relaxed by including an extra slack term $\xi_i$:
\[y_i (\mathbf{x}_{i}^\top \mathbf{w} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad (i = 1, \cdots, n)\]For minimum error, the slack variables $\xi_i \geq 0$ should be minimized as well as $\Vert \mathbf{w} \Vert$, and the objective function becomes:
\[\underset{\mathbf{w}, b, \xi_i}{\min} \left\{ \frac{1}{2} \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^n \xi_i \right\} \quad \text{s.t. } y_i (\mathbf{x}_{i}^\top \mathbf{w} + b) \geq 1 - \xi_i, \; \xi_i \geq 0
\]
Here, $C$ is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error. A small value of $C$ tends to emphasize the margin while ignoring the outliers in the training data, while a large value of $C$ may tend to overfit the training data.
Using Lagrange multipliers $\alpha_i \geq 0$ and $\mu_i \geq 0$, we have the following unconstrained optimization problem:
\[\underset{\mathbf{w}, b, \xi_i}{\min} \; \underset{\alpha_i \geq 0, \, \mu_i \geq 0}{\max} \; L_p (\mathbf{w}, b, \xi_i, \alpha_i, \mu_i)\]where \(L_p (\mathbf{w}, b, \xi_i, \alpha_i, \mu_i) = \frac{1}{2} \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i (y_i (\mathbf{w}^\top \mathbf{x}_i + b) - 1 + \xi_i) - \sum_{i=1}^n \mu_i \xi_i\).
Observe that:
\[\begin{aligned} \frac {\partial L_p} {\partial \mathbf{w}} = 0 & \implies \mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \\ \frac {\partial L_p} {\partial b} = 0 & \implies \sum_{i=1}^n \alpha_i y_i = 0 \\ \frac {\partial L_p} {\partial \xi_i} = 0 & \implies C - \alpha_i - \mu_i = 0 \\ \end{aligned}\]Using the above results we have the following optimization problem:
\[\underset{\alpha_i}{\max} L_D (\alpha_i) = \underset{\alpha_i}{\max} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j\]subject to $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^n \alpha_i y_i = 0$.
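The box constraint $0 \leq \alpha_i \leq C$ can be observed directly in a fitted model. In scikit-learn (again an assumption used only for illustration), `dual_coef_` stores $\alpha_i y_i$, so $\vert \alpha_i y_i \vert = \alpha_i \leq C$ for every support vector, and margin violators sit at the bound $\alpha_i = C$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Assumed toy data: two overlapping Gaussian blobs (not linearly separable)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# Every alpha_i lies in the box [0, C]
alphas = np.abs(clf.dual_coef_[0])
print(alphas.min() >= 0, alphas.max() <= C + 1e-9)  # True True
print(np.sum(np.isclose(alphas, C)))                # support vectors at the bound
```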
Kernel SVM
The hard margin algorithm above converges only for linearly separable data. If the data set is not linearly separable, we can map the samples $\mathbf{x}$ into a feature space of higher dimensions:
\[\mathbf{x} \to \phi (\mathbf{x})\]in which the classes can be linearly separated.
The decision function in the new space becomes:
\[f(\mathbf{x}) = \phi (\mathbf{x})^\top \mathbf{w} + b = \sum_{i=1}^K \alpha_i y_i ( \phi(\mathbf{x})^\top \phi(\mathbf{x}_i) ) + b\]where:
\[\mathbf{w} = \sum_{j=1}^K \alpha_j y_j \phi (\mathbf{x}_j)\]and $b$ are the parameters of the decision plane in the new space. Let \(K(\mathbf{x_1}, \mathbf{x_2}) = \phi(\mathbf{x_1})^\top \phi(\mathbf{x_2})\). It then follows that $f(\mathbf{x}) = \phi (\mathbf{x})^\top \mathbf{w} + b = \sum_{j=1}^K \alpha_j y_j K(\mathbf{x}, \mathbf{x}_j) + b$.
The parameter $b$ can be found from any support vectors $\mathbf{x}_i$:
\[b = y_i - \phi( \mathbf{x}_i )^\top \mathbf{w} = y_i - \sum_{j=1}^K \alpha_j y_j \phi (\mathbf{x}_i)^\top \phi (\mathbf{x}_j) = y_i - \sum_{j=1}^K \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j).\]As in the linear SVM, we solve the following maximization problem:
\[\underset{\alpha_i \geq 0}{\max} \left\{ \sum_{i=1}^K \alpha_i - \frac{1}{2} \sum_{i=1}^K \sum_{j=1}^K \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \right\}\]Unlike the linear SVM, we cannot compute $\mathbf{w}$ explicitly in this case, since the feature map $\phi$ may be implicit or even infinite-dimensional; the decision function is evaluated through the kernel instead. Even though kernels allow very flexible models, we have to determine the kernel and its parameters, which is one of the drawbacks.
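As a sanity check of the kernelized decision function $f(\mathbf{x}) = \sum_j \alpha_j y_j K(\mathbf{x}, \mathbf{x}_j) + b$, the sketch below reconstructs it by hand from a fitted model and compares it against the library's own output. The RBF kernel, the toy data, and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)

# Assumed toy data: a ring around a blob (not linearly separable)
X = rng.normal(0.0, 1.0, (100, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

# f(x) = sum_j (alpha_j y_j) K(x, x_j) + b, summed over the support vectors
K_test = rbf_kernel(X, clf.support_vectors_, gamma=gamma)
f_manual = K_test @ clf.dual_coef_[0] + clf.intercept_[0]

# Should match the library's decision_function up to numerical error
print(np.allclose(f_manual, clf.decision_function(X)))  # True
```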
$\mathbf{Example\ 1.}$ Hard margin SVM - linearly separable case
Consider a classification problem in a 2D space, which is linearly separable. Here, the goal is to find a linear decision boundary.
Here, the black contour is the linear decision boundary which comes from a hard margin support vector machine.
$\mathbf{Example\ 2.}$ Soft margin SVM - linearly non-separable case
Now, by adding one outlier to the previous data set, we make the data no longer linearly separable. In this case, if we still want a linear decision boundary, we should use a soft margin support vector machine.
In this case, the soft margin SVM model suggests that misclassifying the positive outlier is the best option.
$\mathbf{Example\ 3.}$ Kernelized SVM
We now consider a classification problem with a data set that is far from linearly separable and therefore needs a non-linear decision boundary, as shown in the following two cases.
To get a non-linear decision boundary, we use the kernel trick in support vector machine learning. For both cases, we use a polynomial kernel of degree $3$.
We then get the following results.
$\mathbf{Example\ 4.}$ Kernelized SVM 2
In the kernelized SVM, not only the choice of a proper kernel but also the selection of its hyperparameters is important. Consider the following radial basis function, which is a widely used kernel.
\[K(\mathbf{x}, \mathbf{z}) = e^{-\Vert\mathbf{x} - \mathbf{z}\Vert^2 / 2 \gamma^2}\]For a soft margin kernelized SVM with the radial basis function kernel, choosing $\gamma$ and the regularization parameter $C$ is very important, as shown in the following example.
By varying the values of $C$ and $\gamma$, we get different decision boundaries as follows.
Here, the score denotes the accuracy of the decision boundary on the training data.
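A sketch of how such a comparison might be produced (the data is an assumed toy set; note that scikit-learn parametrizes the RBF kernel as $e^{-\gamma \Vert\mathbf{x} - \mathbf{z}\Vert^2}$, which corresponds to $\gamma_{\text{sklearn}} = 1 / (2\gamma^2)$ in the formula above):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Assumed toy data: two overlapping classes
X = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(1.5, 1.0, (60, 2))])
y = np.hstack([-np.ones(60), np.ones(60)])

# Training accuracy ("score") for several (C, gamma) combinations
for C in [0.1, 1.0, 100.0]:
    for gamma in [0.01, 1.0, 100.0]:
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
        print(f"C={C:6.1f}  gamma={gamma:6.2f}  score={clf.score(X, y):.3f}")
```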