[ML] K-Nearest Neighbors (KNN)
$K$ Nearest Neighbors Algorithm
The $K$-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. It rests on the assumption that similar things exist in close proximity, i.e., that similar data points lie near each other in feature space.
Consider a new, unlabeled data point; the objective is to classify it by a majority vote among the classes of its $K$ nearest neighbors.
When KNN is used for regression, the value of a new data point $z$ is given by the average of the values of its $K$ nearest neighbors in the training set $N$.
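To make this concrete, here is a minimal NumPy sketch of both uses; the helper `knn_predict` and the toy data are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, classify=True):
    """Predict the label (majority vote) or value (mean) of x_new
    from its k nearest neighbors under Euclidean distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distances to all training points
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    if classify:
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]              # majority vote
    return y_train[nearest].mean()                    # average for regression

# Toy example: two clusters labeled 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))   # predicts 0
```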
The KNN algorithm has to compute the distance between a new data point and every point in the training set, which results in a high computational cost. Several techniques exist to reduce this cost (a small approximate-search sketch follows the list below), such as
- Locality-sensitive hashing (LSH)
- Network-based indexing
- Optimized product quantization
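As an illustration of the first technique, the sketch below compares brute-force search against an LSH index, assuming the `faiss` library is available; the data sizes and parameters are made up for the example.

```python
import numpy as np
import faiss  # assumes the faiss-cpu package is installed

d = 64                                           # dimensionality
rng = np.random.default_rng(0)
xb = rng.random((10_000, d), dtype=np.float32)   # database vectors
xq = rng.random((5, d), dtype=np.float32)        # query vectors

exact = faiss.IndexFlatL2(d)                     # brute-force baseline
exact.add(xb)

lsh = faiss.IndexLSH(d, 256)                     # locality-sensitive hashing, 256-bit codes
lsh.add(xb)

k = 5
_, I_exact = exact.search(xq, k)                 # exact k nearest neighbors
_, I_lsh = lsh.search(xq, k)                     # approximate neighbors, cheaper at scale
print(I_exact[0], I_lsh[0])
```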
$\mathbf{Example.}$ Decision boundary - Comparison with Different $K$
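One way such a comparison can be produced is with scikit-learn's `KNeighborsClassifier` fitted for several values of $K$ on a toy dataset; the dataset choice and plotting details below are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-2, 2, 300))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, k in zip(axes, [1, 5, 25]):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)          # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)    # training points
    ax.set_title(f"K = {k}")                   # small K: jagged boundary, large K: smoother
plt.show()
```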
Distance metrics
Similarity can be measured in terms of distance. There are a number of different distance metrics as given below.
Euclidean Distance ($L_2$ Distance)
$d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$
Manhattan Distance ($L_1$ Distance)
$d(\mathbf{x}, \mathbf{y}) = |x_1 - y_1| + \cdots + |x_n - y_n|$
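Both distances are one-liners in NumPy; the vectors below are arbitrary examples.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.linalg.norm(x - y)         # L2: sqrt of sum of squared differences
manhattan = np.linalg.norm(x - y, ord=1)  # L1: sum of absolute differences
print(euclidean, manhattan)               # 3.605..., 5.0
```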
Mahalanobis Distance
It takes into account the covariances of the data set. For two vectors $\mathbf{x}$ and $\mathbf{y}$,
$d_{\text{Mahalanobis}}(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^\top \Sigma^{-1} (\mathbf{x} - \mathbf{y})}$
where $\Sigma$ denotes the covariance matrix of the data set.
To understand the Mahalanobis distance, consider two points $\mathbf{x} = (x_1, x_2)$ and $\mathbf{y} = (0, 0)$ in $\mathbf{R}^2$. Writing $\Sigma^{-1} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$ with $a_{12} = a_{21}$,
if $d_{\text{Mahalanobis}}(\mathbf{x}, \mathbf{y}) = d$, then we have
$a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2 = d^2,$
so the points at a fixed Mahalanobis distance $d$ from the origin form an ellipse whose shape and orientation are determined by $\Sigma$.
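A small sketch computing the Mahalanobis distance with SciPy and checking it against the formula above; the synthetic data set is an illustrative assumption.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 2.0]], size=500)

Sigma_inv = np.linalg.inv(np.cov(data, rowvar=False))  # inverse covariance of the data set
x, y = np.array([1.0, 2.0]), np.array([0.0, 0.0])

d = mahalanobis(x, y, Sigma_inv)                    # SciPy implementation
d_manual = np.sqrt((x - y) @ Sigma_inv @ (x - y))   # the formula above
print(d, d_manual)                                  # equal up to floating point
```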
Pearson Correlation Distance
For $\mathbf{x} = (x_1, \cdots, x_n)$ and $\mathbf{y} = (y_1, \cdots, y_n)$,
$d_{\text{Pearson}}(\mathbf{x}, \mathbf{y}) = 1 - \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_{i=1}^n (x_i - \overline{x})^2} \sqrt{\sum_{i=1}^n (y_i - \overline{y})^2}}$
where $\overline{x} = \frac{1}{n} \sum_{i=1}^n x_i$ and $\overline{y} = \frac{1}{n} \sum_{i=1}^n y_i$.
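In NumPy this distance can be computed from `np.corrcoef`; the vectors below are arbitrary examples.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 7.9])

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(1 - r)                  # Pearson correlation distance, close to 0 here
```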
Spearman’s Rank Correlation
$\rho = 1 - \frac{6 \sum_{i=1}^n (r_i - s_i)^2}{n(n^2 - 1)}$
where $r_i$ and $s_i$ denote the ranks of $x_i$ and $y_i$ in $\mathbf{x}$ and $\mathbf{y}$, respectively.
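SciPy's `spearmanr` implements this directly; the vectors below are arbitrary examples.

```python
import numpy as np
from scipy.stats import spearmanr

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

rho, _ = spearmanr(x, y)   # Spearman's rank correlation via SciPy
print(rho)                 # 0.8 for these ranks
```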
$\mathbf{Note.}$ When there are no ties, the rank-difference formula above coincides with the Pearson correlation computed on the ranks $r_i$ and $s_i$. To see this, first observe that
$\sum_{i=1}^n \sum_{j=1}^n (r_j - r_i) (s_j - s_i)$
$= \sum_{i=1}^n \sum_{j=1}^n r_i s_i + \sum_{i=1}^n \sum_{j=1}^n r_j s_j - \sum_{i=1}^n \sum_{j=1}^n r_i s_j - \sum_{i=1}^n \sum_{j=1}^n r_j s_i$
$= 2n \sum_{i=1}^n r_i s_i - 2 \sum_{i=1}^n r_i \sum_{j=1}^n s_j$
$= 2n \sum_{i=1}^n r_i s_i - 2 (\frac{1}{2} n(n+1))^2$
$= 2n \sum_{i=1}^n r_i s_i - \frac{1}{2} n^2 (n+1)^2$
From $\sum_{i=1}^n (r_i - s_i)^2 = 2 \sum_{i=1}^n r_{i}^2 - 2 \sum_{i=1}^n r_i s_i$ (which holds because $\sum_{i=1}^n s_i^2 = \sum_{i=1}^n r_i^2$ when there are no ties), we have
$\sum_{i=1}^n \sum_{j=1}^n (r_j - r_i) (s_j - s_i)$
$= 2n \sum_{i=1}^n r_{i}^2 - \frac{1}{2} n^2 (n+1)^2 - n \sum_{i=1}^n (r_i - s_i)^2$
$= \frac{1}{6} n^2 (n^2 - 1) - n \sum_{i=1}^n (r_i - s_i)^2$, using $\sum_{i=1}^n r_{i}^2 = \frac{1}{6} n(n+1)(2n+1)$.
Further,
$\sum_{i=1}^n \sum_{j=1}^n (r_j - r_i)^2 = 2n \sum_{i=1}^n r_{i}^2 - 2 \sum_{i=1}^n \sum_{j=1}^n r_i r_j$
$= 2n \sum_{i=1}^n r_{i}^2 - 2 (\sum_{i=1}^n r_i)^2 = \frac{1}{6} n^2 (n^2 - 1)$
Finally, since $\sum_{i=1}^n \sum_{j=1}^n (r_j - r_i)(s_j - s_i) = 2n \sum_{i=1}^n (r_i - \overline{r})(s_i - \overline{s})$ and $\sum_{i=1}^n \sum_{j=1}^n (r_j - r_i)^2 = 2n \sum_{i=1}^n (r_i - \overline{r})^2$ (and similarly for $s$), the Pearson correlation of the ranks can be written as
$\rho = \frac{\sum_{i=1}^n \sum_{j=1}^n (r_j - r_i)(s_j - s_i)}{\sqrt{\sum_{i=1}^n \sum_{j=1}^n (r_j - r_i)^2} \sqrt{\sum_{i=1}^n \sum_{j=1}^n (s_j - s_i)^2}}$.
Substituting the results above gives
$\rho = \frac{\frac{1}{6} n^2 (n^2 - 1) - n \sum_{i=1}^n (r_i - s_i)^2}{\frac{1}{6} n^2 (n^2 - 1)} = 1 - \frac{6 \sum_{i=1}^n (r_i - s_i)^2}{n(n^2 - 1)}$. $_\blacksquare$
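A quick numerical check of this equivalence on randomly permuted ranks; the random seed and $n$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
r = rng.permutation(np.arange(1, n + 1))   # ranks 1..n, no ties
s = rng.permutation(np.arange(1, n + 1))

pearson_of_ranks = np.corrcoef(r, s)[0, 1]                       # Pearson correlation of the ranks
rank_formula = 1 - 6 * np.sum((r - s) ** 2) / (n * (n**2 - 1))   # Spearman's rank-difference formula
print(np.isclose(pearson_of_ranks, rank_formula))                # True
```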