
Motivation: Overfitting

We have a data sample $\{ (x_1, y_1), \cdots, (x_n, y_n) \}$ and want to find a polynomial of degree $M$ that fits the data well, i.e.

\[\begin{aligned} y = \beta_0 + \beta_1 x + \cdots + \beta_M x^M \end{aligned}\]

This is just a simple linear regression whose features are $1, x, x^2, \cdots, x^M$, with objective function $f(\boldsymbol{\beta}) := \| \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \|^2$.
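As a minimal sketch (assuming 1-D inputs `x` and targets `y` as NumPy arrays; the helper names are illustrative), this fit is just ordinary least squares on the polynomial feature matrix:

```python
import numpy as np

def fit_polynomial(x, y, M):
    """Least-squares fit of a degree-M polynomial using the features [1, x, ..., x^M]."""
    X = np.vander(x, M + 1, increasing=True)       # columns 1, x, x^2, ..., x^M
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes ||y - X beta||^2
    return beta

def predict_polynomial(beta, x):
    """Evaluate the fitted polynomial at new inputs x."""
    return np.vander(x, len(beta), increasing=True) @ beta
```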


The following figures show the results of polynomial regression for $M = 1, 3, 5, 9$. Among these, the training MSE (= $\frac{1}{n} f(\boldsymbol{\beta})$) is smallest when $M = 9$, even though $M = 3$ and $M = 5$ already fit the data quite well. This is overfitting.

[Figure: polynomial fits for $M = 1, 3, 5, 9$]


We want our model to generalize well, so let's measure the generalization performance of each model on a separate dataset (validation set) using the RMS error $\mathcal{L}_{\mathrm{RMS}} = \sqrt{\frac{f(\boldsymbol{\beta})}{n}}$:

[Figure: training and validation RMS error as a function of $M$]

We can observe that the gap between the training curve and the test (validation) curve grows as $M$ increases for $M \geq 5$.
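A rough sketch of this comparison, reusing the hypothetical `fit_polynomial` / `predict_polynomial` helpers above and assuming separate arrays `x_train`, `y_train`, `x_val`, `y_val`:

```python
import numpy as np

def rms_error(beta, x, y):
    """L_RMS = sqrt(f(beta) / n)."""
    residual = y - predict_polynomial(beta, x)
    return np.sqrt(np.mean(residual ** 2))

for M in (1, 3, 5, 9):
    beta = fit_polynomial(x_train, y_train, M)
    print(f"M={M}: train={rms_error(beta, x_train, y_train):.3f}, "
          f"val={rms_error(beta, x_val, y_val):.3f}")
```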


Regularized Regression Model

One possible way to control such overfitting is regularization: we penalize large coefficients in the error function $J[f]$, for example, as

\[\begin{aligned} J[f] = \frac{1}{2} f(\boldsymbol{\beta}) + \frac{\lambda}{2} \|\boldsymbol{\beta}\|^2. \end{aligned}\]

From this table, we can check that as $\lambda \to 0$, i.e. $\ln \lambda \to -\infty$, we run into the overfitting problem. But with proper regularization, the test (validation) error and the training error show a similar trend!

[Figures: fitted curves and RMS error for different values of $\ln \lambda$]


Hence, we will consider a number of regularizers:

\[\begin{aligned} J[f] = \frac{1}{2} f(\boldsymbol{\beta}) + \frac{\lambda}{2} \sum_{j=0}^M |\beta_j|^q. \end{aligned}\]
  • $q = 1$: LASSO (Least Absolute Shrinkage and Selection Operator) regression
  • $q = 2$: Ridge regression (or, Parameter Shrinkage regression)

[Figure: contours of the regularizer for different values of $q$]


For example, for the objective function

\[\begin{aligned} J[f] = \frac{1}{2} f(\boldsymbol{\beta}) + \frac{\lambda}{2} \sum_{j=0}^M |\beta_j|^2 = \frac{1}{2} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta} ) + \frac{\lambda}{2} \boldsymbol{\beta}^\top \boldsymbol{\beta}, \end{aligned}\]

by letting $\nabla_{\boldsymbol{\beta}} J[f] = \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} - \mathbf{X}^\top \mathbf{y} + \lambda \boldsymbol{\beta} = 0$, we obtain $\widehat{\boldsymbol{\beta}} = (\lambda \mathbf{I} + \mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$.
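A minimal sketch of this closed-form ridge solution (assuming a NumPy design matrix `X`, a target vector `y`, and a hypothetical regularization strength `lam` standing in for $\lambda$):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (lam * I + X^T X) beta = X^T y rather than forming the inverse explicitly."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
```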

However, the LASSO objective is not differentiable at $0$ because of the absolute value. To handle this, we usually use the concept of the subdifferential.
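For intuition, working out the subdifferential leads to the soft-thresholding operator used by coordinate-descent and proximal-gradient LASSO solvers; here is a minimal sketch (the function name is illustrative):

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * |.|: shrinks z toward zero, exactly zero when |z| <= tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
```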

$L_2$ / Ridge Regression

MSE with $L_2$ regularization can be written as

\[\begin{aligned} L(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^N \frac{1}{2} (\mathbf{w}^\top \mathbf{x}^{(n)} + b - y^{(n)})^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2 \end{aligned}\]

That is, it prevents the scale of $\mathbf{w}$ from growing. Note that $\mathbf{w}$ acts as a slope, so the fitted function cannot have too steep a slope. (For neural networks, this is called weight decay.)
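To see where the name comes from, a gradient-descent step on $L(\mathbf{w})$ with learning rate $\eta$ first decays the weights by a factor slightly less than one and then applies the usual data gradient:

\[\begin{aligned} \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} L(\mathbf{w}) = (1 - \eta \lambda)\, \mathbf{w} - \frac{\eta}{N} \sum_{n=1}^N \left(\mathbf{w}^\top \mathbf{x}^{(n)} + b - y^{(n)}\right) \mathbf{x}^{(n)}. \end{aligned}\]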


What about regularizing the bias term? The role of the bias is just to shift the graph, and restricting its magnitude may degrade model performance. Also, the bias term does not affect the model complexity. Thus, we usually do not regularize the bias.


$L_1$ / LASSO Regression

MSE with $L_1$ regularization can be written as

\[\begin{aligned} L(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^N \frac{1}{2} (\mathbf{w}^\top \mathbf{x}^{(n)} + b - y^{(n)})^2 + \lambda \sum_{i=1}^d | w_i | \end{aligned}\]


An important intuition about $L_1$ regularization is that it encourages sparsity, i.e. it drives some weights exactly to zero. This can be seen as automatic feature selection. It can be understood in two ways. First, since the $L_1$ norm is minimized only when every entry is $0$, gradient descent pushes the weights toward zero in order to reduce the $L_1$ penalty.

Alternatively, we can think of it geometrically. The level sets of the $L_1$ norm are square diamonds, so the point where such a level set first touches a level set of the unregularized regression loss is very likely to be a sparse solution (a corner). For example, consider the situation below:

[Figure: the $L_1$ constraint region (diamond) meeting the contours of the regression loss at a corner]
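As a quick illustration of this sparsity effect, here is a sketch using scikit-learn's `Lasso` and `Ridge` estimators on hypothetical arrays `X`, `y` (their `alpha` argument plays the role of $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# LASSO typically zeros out some coefficients; Ridge only shrinks them.
print("non-zero LASSO weights:", np.count_nonzero(lasso.coef_))
print("non-zero Ridge weights:", np.count_nonzero(ridge.coef_))
```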


Kernel Regression Model

The regression models considered so far are linear in the input, which is very restrictive in many applications. If the relationship between inputs and targets is not linear, we can map the samples $\mathbf{x}$ into a feature space of higher dimension:

\[\begin{aligned} \mathbf{x} \to \phi(\mathbf{x}) \end{aligned}\]

in which a linear model fits well (or, for classification, in which the classes become linearly separable).

[Figure: mapping samples into a higher-dimensional feature space via $\phi$]

In (linear) ridge regression we considered the function $f(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\beta}$; we now consider the function $f(\mathbf{x}) = \phi(\mathbf{x})^\top \boldsymbol{\beta}$.


We then have the following objective function:

\[\begin{aligned} J[f] = \frac{1}{2}\left(\mathbf{y} - \phi(\mathbf{X})^\top \boldsymbol{\beta}\right)^\top \left(\mathbf{y} - \phi (\mathbf{X})^\top \boldsymbol{\beta}\right) + \frac{\lambda}{2} \boldsymbol{\beta}^\top \boldsymbol{\beta} \end{aligned}\]

where $\phi(\mathbf{X}) = [\phi(\mathbf{x}_1), \cdots, \phi(\mathbf{x}_n)]$.


By expanding $J[f]$, we obtain

\[\begin{aligned} J[f] & = \frac{1}{2}\left(\mathbf{y} - \phi(\mathbf{X})^\top \boldsymbol{\beta}\right)^\top \left(\mathbf{y} - \phi (\mathbf{X})^\top \boldsymbol{\beta}\right) + \frac{\lambda}{2} \boldsymbol{\beta}^\top \boldsymbol{\beta} \\ & = \frac{1}{2}\left(\mathbf{y}^\top \mathbf{y} - \mathbf{y}^\top \phi (\mathbf{X})^\top \boldsymbol{\beta} - \boldsymbol{\beta}^\top \phi (\mathbf{X}) \mathbf{y} + \boldsymbol{\beta}^\top \phi (\mathbf{X}) \phi (\mathbf{X})^\top \boldsymbol{\beta}\right) + \frac{\lambda}{2} \boldsymbol{\beta}^\top \boldsymbol{\beta} \end{aligned}\]


From $\nabla_{\boldsymbol{\beta}} J[f] = - \phi (\mathbf{X}) \mathbf{y} + \phi (\mathbf{X}) \phi (\mathbf{X})^\top \boldsymbol{\beta} + \lambda \boldsymbol{\beta} = 0$, we obtain

\[\begin{aligned} \boldsymbol{\beta}^* = \left(\lambda \mathbf{I} + \phi (\mathbf{X}) \phi (\mathbf{X})^\top \right)^{-1} \phi (\mathbf{X}) \mathbf{y}. \end{aligned}\]


From $\left(\lambda \mathbf{I} + \phi (\mathbf{X}) \phi (\mathbf{X})^\top \right) \phi (\mathbf{X}) = \phi (\mathbf{X}) \left(\lambda \mathbf{I} + \phi (\mathbf{X})^\top \phi (\mathbf{X}) \right)$ (both sides equal $\lambda \phi(\mathbf{X}) + \phi(\mathbf{X}) \phi(\mathbf{X})^\top \phi(\mathbf{X})$), multiplying on the left by $\left(\lambda \mathbf{I} + \phi (\mathbf{X}) \phi (\mathbf{X})^\top \right)^{-1}$ and on the right by $\left(\lambda \mathbf{I} + \phi (\mathbf{X})^\top \phi (\mathbf{X}) \right)^{-1}$, we have

\[\begin{aligned} \left(\lambda \mathbf{I} + \phi (\mathbf{X}) \phi (\mathbf{X})^\top \right)^{-1} \phi (\mathbf{X}) = \phi (\mathbf{X}) (\lambda \mathbf{I} + \phi (\mathbf{X})^\top \phi (\mathbf{X}))^{-1}. \end{aligned}\]


Thus,

\[\begin{aligned} \boldsymbol{\beta}^* = \phi (\mathbf{X}) (\lambda \mathbf{I} + \phi (\mathbf{X})^\top \phi (\mathbf{X}))^{-1} \mathbf{y} \end{aligned}\]

and the optimal function $f^* (\mathbf{x})$ becomes

\[\begin{aligned} f^* (\mathbf{x}) = \phi (\mathbf{x})^\top \phi (\mathbf{X}) (\lambda \mathbf{I} + \phi (\mathbf{X})^\top \phi (\mathbf{X}))^{-1} \mathbf{y}. \end{aligned}\]


$\color{blue}{\mathbf{Definition.}}$ (Kernel) A kernel is a function that takes two vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ as inputs and returns the value of the inner product of $\phi (\mathbf{x}_i)$ and $\phi (\mathbf{x}_j)$, i.e.
\[\begin{aligned} K(\mathbf{x}_i, \mathbf{x}_j) = \phi (\mathbf{x}_i)^\top \phi (\mathbf{x}_j). \end{aligned}\]
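For instance, the degree-$2$ polynomial kernel on $\mathbb{R}^2$ admits an explicit feature map (a standard worked example):

\[\begin{aligned} K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2 = (x_1 z_1 + x_2 z_2)^2 = \phi(\mathbf{x})^\top \phi(\mathbf{z}), \quad \text{where } \phi(\mathbf{x}) = \left(x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2\right). \end{aligned}\]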


So, from the definition, we see that the optimal function $f^* (\mathbf{x})$ can be written as

\[\begin{aligned} f^* (\mathbf{x}) = K(\mathbf{x}, \mathbf{X}) \left(\lambda \mathbf{I} + K(\mathbf{X}, \mathbf{X}) \right)^{-1} \mathbf{y} \end{aligned}\]

where $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_n]$, $K(\mathbf{x}, \mathbf{X}) = [K(\mathbf{x}, \mathbf{x}_1), \cdots, K(\mathbf{x}, \mathbf{x}_n)]$, and $K(\mathbf{X}, \mathbf{X})$ is the matrix whose $ij$-th entry is $K(\mathbf{x}_i, \mathbf{x}_j)$.


For convenience, let $\boldsymbol{\alpha}^* = \left(\lambda \mathbf{I} + K(\mathbf{X}, \mathbf{X})\right)^{-1} \mathbf{y}$. Then, the optimal function $f^* (\mathbf{x})$ can be written as

\[\begin{aligned} f^* (\mathbf{x}) = K(\mathbf{x}, \mathbf{X}) \boldsymbol{\alpha}^* = \phi (\mathbf{x})^\top \boldsymbol{\beta}^*. \end{aligned}\]

by observing that $K(\mathbf{x}, \mathbf{X}) = \phi (\mathbf{x})^\top \phi (\mathbf{X})$.
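A minimal sketch of kernel ridge regression in this form (assuming training samples `X_train` with rows $\mathbf{x}_i$, targets `y`, a Gram-matrix function `kernel(A, B)`, and a regularization strength `lam`; all names are illustrative):

```python
import numpy as np

def fit_kernel_ridge(X_train, y, kernel, lam):
    """alpha* = (lam * I + K(X, X))^{-1} y."""
    K = kernel(X_train, X_train)                    # n x n Gram matrix
    n = K.shape[0]
    return np.linalg.solve(lam * np.eye(n) + K, y)

def predict_kernel_ridge(X_new, X_train, alpha, kernel):
    """f*(x) = K(x, X) alpha* for each row x of X_new."""
    return kernel(X_new, X_train) @ alpha
```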


By introducing the mapping $\phi$, a nonlinear problem is reduced to a simple linear problem in the feature space. However, finding an appropriate $\phi$ can be cumbersome, and working in a high-dimensional feature space may increase the computational cost.

With the kernel trick, as we saw, we don't have to worry about this issue: the objective function $J[f]$ is expressed through the kernel matrix $K(\mathbf{X}, \mathbf{X})$, so the mapping $\phi: \mathbb{R}^n \to \mathbb{R}^m$ never needs to be specified explicitly. All we need is the inner product of the vectors $\phi (\mathbf{x})$ in the new space. This is a very important feature of kernel methods, and it is made rigorous by one mathematical theorem: Mercer's Theorem.


$\color{red}{\mathbf{Thm.}}$ Mercer's Theorem
If a function $K(\mathbf{x}, \mathbf{y})$ satisfies a few mathematical conditions called Mercer's conditions (e.g. $K$ must be continuous and symmetric in its arguments so that $K(\mathbf{x}, \mathbf{y}) = K(\mathbf{y}, \mathbf{x})$, etc.), then there exists a function $\phi$ that maps $\mathbf{x}$ and $\mathbf{y}$ into another space (possibly of much higher dimension) such that $K(\mathbf{x}, \mathbf{y}) = \phi (\mathbf{x})^\top \phi (\mathbf{y})$.


We can use such a $K$ as a kernel because we know $\phi$ exists (even though we don't know its exact form). In the case of the Gaussian RBF kernel, it can be shown that it maps each training instance to an infinite-dimensional space, so it is fortunate that we never need to perform the mapping explicitly. Note that some frequently used kernels (such as the sigmoid kernel) don't satisfy all of Mercer's conditions, yet they generally work well in practice. This kernel trick is also exploited in other ML algorithms, e.g., SVM.



We end this post by summarizing notable features of the kernel trick and introducing some popular kernels.

$\mathbf{Remark.}$ Important Facts about the Kernel Trick
  1. The mapping function $\phi (\mathbf{x})$ doesn't need to be explicitly specified.
  2. Non-parametric: it doesn't introduce new parameters to be trained.
  3. Kernels measure similarity.


$\color{green}{\mathbf{Example.}}$ Popular Kernel Functions


Linear Kernel

\[\begin{aligned} K(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_{1}^\top \mathbf{x}_2 \end{aligned}\]

Polynomial Kernel

\[\begin{aligned} K(\mathbf{x}_1, \mathbf{x}_2) = (a\mathbf{x}_{1}^\top \mathbf{x}_2 + b)^c \end{aligned}\]

Sigmoid Kernel (Hyperbolic Tangent Kernel)

\[\begin{aligned} K(\mathbf{x}_1, \mathbf{x}_2) = \tanh (a\mathbf{x}_{1}^\top \mathbf{x}_2 + b) \end{aligned}\]

Gaussian Kernel (Radial Basis Function Kernel; RBF Kernel)

\[\begin{aligned} K(\mathbf{x}_1, \mathbf{x}_2) = \exp \left( - \frac{\| \mathbf{x}_1 - \mathbf{x}_2 \|_2^2}{2\sigma^2} \right) \end{aligned}\]
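A sketch of these kernels as NumPy functions operating on matrices whose rows are samples (the parameters `a`, `b`, `c`, and `sigma` mirror the formulas above); any of them could be plugged into the kernel ridge sketch earlier:

```python
import numpy as np

def linear_kernel(X1, X2):
    return X1 @ X2.T

def polynomial_kernel(X1, X2, a=1.0, b=1.0, c=3):
    return (a * (X1 @ X2.T) + b) ** c

def sigmoid_kernel(X1, X2, a=1.0, b=0.0):
    return np.tanh(a * (X1 @ X2.T) + b)

def rbf_kernel(X1, X2, sigma=1.0):
    # ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v, computed pairwise
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-sq_dists / (2 * sigma**2))
```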




Reference

[1] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer 2006.
[2] Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed., O'Reilly 2019.
