
Introduction to ML

What is Machine Learning?

“A field of study that gives computers the ability to learn without being explicitly programmed.”

Arthur Samuel, 1959

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

Tom Mitchell, 1997

image


  • ML vs Traditional Programming
    • Traditional Programming
      • solve a problem by explicitly programming the solution algorithm found by humans
      • e.g., sorting algorithms
    • Machine Learning
      • Programming by observing = Let data program!
      • solve a problem by programming a learning algorithm that can learn an (implicit) solution algorithm by observing data

General Form in ML

Let’s consider Machine Learning with a data set

$D = \{(x_1, y_1), \dots, (x_n, y_n)\}$
  • $x_1$, $x_2$, $…$, $x_n$ : input vectors
  • $y_1$, $y_2$, $…$, $y_n$ : target vectors
  • the dataset $D$ is called a training set (used to tune the parameters of a ML model)

The objective of a machine learning algorithm is to find $y(x)$ with $D$.

$\mathbf{y = f(x) + \epsilon}$
  • $\mathbf{y}$ is the observation. It can be divided into the true function value $f(\mathbf{x})$ and the noise $\epsilon$.
  • $f$ is an unknown function of $\mathbf{x} = (x_1, x_2, …, x_p)$
  • $\epsilon$ is a random error, independent of $\mathbf{x}$, with zero mean.

Thus, $f(\mathbf{x})$ is the true function value and $\epsilon$ is noise due to, e.g., measurement or computational limitations.
We may want to find an approach to estimate $f$.

Parametric method

Parametric method selects the form of $f$ before estimation.
For instance, assume that $f$ is linear in $\mathbf{x}$, $f(\mathbf{x}) = \sum_{i=0}^{p} \beta_i x_i$ with $x_0 = 1$.
Now, our objective reduces to estimating the parameters $\beta_0, \beta_1, …, \beta_p$.
In other words, the problem of estimating $f$ is reduced to estimating a set of parameters.
However, the model may not fit the data well if the assumed form is too restrictive.
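
As a minimal illustration, the least-squares estimate of the parameters can be computed with plain NumPy; the data below is synthetic and only serves to show the idea:

```python
import numpy as np

# Synthetic data: n samples with p features, generated from a known linear model
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, 2.0, -1.0, 0.5])          # [beta_0, beta_1, beta_2, beta_3]
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Add the constant feature x_0 = 1 so that beta_0 acts as the intercept
X_design = np.hstack([np.ones((n, 1)), X])

# Least-squares estimate of (beta_0, ..., beta_p)
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)   # close to [3.0, 2.0, -1.0, 0.5]
```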

Nonparametric method

In contrast, a nonparametric method makes no explicit assumption about the form of $f$.
It has great potential to fit the data well, but it needs a large amount of data to train, and overfitting can occur.

Gradient Descent

Gradient Descent method

In machine learning, we are trying to minimize the error $J$ in the estimation of $f$. One good way is to use the Gradient Descent method.

Recall that the gradient of the error $J$, $\nabla J$, points in the direction along which $J$ is increasing the fastest.
So, if we wish to move in a direction in which $J$ decreases the fastest, we should move in the direction $-\nabla J$.

image


For instance, assume that $J = J(\beta_0, \beta_1)$. Then, two parameters $\beta_0$ and $\beta_1$ can be updated as

$\beta_{i}^{(t+1)} = \beta_{i}^{(t)} - \alpha \frac{\partial J(\beta_0, \beta_1)}{\partial \beta_i}$

( $\alpha$ is called the learning rate.)

In summary, the gradient descent method can find the minimum of a convex function by starting at an arbitrary point and repeatedly taking steps in the downward direction. After several iterations, we eventually reach the minimum, for which we have the best fit of $f$.

In addition, gradient descent is a first-order optimization algorithm, since it uses only the gradient. Empirically, first-order methods are well known to be sufficient in practice.
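
To make this concrete, here is a minimal NumPy sketch of gradient descent for a simple model $f(x) = \beta_0 + \beta_1 x$ with a mean-squared-error cost $J$; the data is synthetic and the learning rate $\alpha$ is just a reasonable guess:

```python
import numpy as np

# Synthetic 1-D data: y ≈ 4 + 3x + noise
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(size=100)

beta0, beta1 = 0.0, 0.0      # arbitrary starting point
alpha = 0.1                  # learning rate
n = len(x)

for _ in range(1000):
    y_hat = beta0 + beta1 * x
    # Gradients of J = (1/n) * sum((y_hat - y)^2) w.r.t. beta0 and beta1
    grad0 = (2 / n) * np.sum(y_hat - y)
    grad1 = (2 / n) * np.sum((y_hat - y) * x)
    # Step in the direction of -gradient
    beta0 -= alpha * grad0
    beta1 -= alpha * grad1

print(beta0, beta1)   # roughly 4 and 3
```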

Local Optima for nonconvex functions

For nonconvex functions, gradient descent may only reach a local optimum. A common remedy is to run it from several different starting points and keep the best of the resulting local optima.

image


Another popular variant is the stochastic gradient method.

Mini-Batch/Stochastic Gradient Method

Instead of taking a step using the entire training set, we sample a small batch of training data at random to determine our next step.
(Strictly speaking, the stochastic gradient method uses a mini-batch size of 1, but the terms “stochastic” and “mini-batch” gradient method are often used interchangeably nowadays.)
It is computationally more efficient and may lead to faster convergence.
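
A minimal mini-batch version of the previous sketch, where each update uses a random batch of 32 samples instead of the whole training set (again on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = 2 * rng.random((1000, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=1000)
X_b = np.hstack([np.ones((1000, 1)), X])      # add the bias column x_0 = 1

theta = rng.normal(size=2)                    # [beta_0, beta_1]
alpha, batch_size, n_epochs = 0.05, 32, 50

for epoch in range(n_epochs):
    idx = rng.permutation(len(y))             # shuffle the data each epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size] # a random mini-batch
        Xb, yb = X_b[batch], y[batch]
        grad = (2 / len(batch)) * Xb.T @ (Xb @ theta - yb)
        theta -= alpha * grad                 # step based on the mini-batch only

print(theta)   # roughly [4, 3]
```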

image


Learning & Validation of The Model

Hyperparameter & Model Selection

Hyperparameters are parameters of a learning algorithm that are not automatically learnable from data but still affect the performance.

Hyperparameters $=$ Model Hyperparameters $+$ Non-model Hyperparameters
e.g.,

  • degree $M$ in polynomial regression (Model)
  • number of clusters $K$ in $K$-means (Non-model)
  • learning rate in GD (Non-model)
  • number of layers in DNN (Model)

Data Split

Say, you’re developing a new face recognition algorithm at a startup and need to report its accuracy to the CEO. Your project starts with a dataset (grey).

image


Is it good or bad to use all data to train the model?
The answer is No. Although the training error is low, the generalization error may be high.

A better option is to split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set, and you test it using the test set.

image


Ok, now we can report the test-set accuracy to the CEO. Suppose that the test performance of model 1 did not outperform the existing model (model 0), so you make model 2 by changing a hyperparameter. Then, you get a test performance outperforming model 0. Is it good or bad?

Suppose that you find the best hyperparameter value that produces a model 2 with the lowest generalization error, say just 5% error. So you launch this model into production, but unfortunately it does not perform as well as expected and produces 15% errors. Why?

It is because, you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that particular set. This means that the model is unlikely to perform as well on new data.

A common solution is to split the training data once more, into a training set and a validation set.

image


The validation set (also, development set or held-out set) is aimed to evaluate several candidate models and select the best one.

In summary,

  • training set: a dataset used for learning to fit the parameters
  • validation set: a dataset used to select the best model
  • test set: a dataset that is independent of the training dataset, but that follows the same probability distribution as the training dataset

image


Train the multiple models with various hyperparameters
$\to$ Select the best model on the validation set
$\to$ Train the best model on the full training set (including the validation set)
$\to$ Evaluate the final model on the test set
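
A rough scikit-learn sketch of this whole workflow; the digits dataset, the logistic-regression model, and the candidate values of C are placeholders chosen only for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Hold out a test set first, then carve a validation set out of the remainder
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Train several candidate models and select the best one on the validation set
best_C, best_score = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

# Retrain the selected model on the full training set (train + validation) ...
final_model = LogisticRegression(C=best_C, max_iter=5000).fit(X_trainval, y_trainval)
# ... and touch the test set only once, for the final report
print(final_model.score(X_test, y_test))
```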

image


This solution usually works quite well. However,

  • if the validation set is too small, then model evaluations will be imprecise: you may end up selecting a suboptimal model by mistake.
  • if the validation set is too large, then the remaining training set will be much smaller than the full training set. Since the final model will be trained on the full training set, it is not ideal to compare candidate models trained on a much smaller training set.

One way to solve this problem is to perform repeated cross-validation, using many small validation sets.

K-fold Cross-Validation

Instead of a single fixed validation set, we often use cross-validation, one of the most frequently used techniques.

image


When the dataset is not large, partitioning the available data into three fixed sets drastically reduces the training set size, and the model can also overfit to one specific validation set. K-fold cross-validation mitigates both issues: it works well for small datasets but is computationally expensive.
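
A rough sketch of 5-fold cross-validation with scikit-learn (using the same placeholder dataset and model as in the earlier sketch):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# 5-fold CV: each of the 5 folds serves once as the validation set
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print(scores.mean(), scores.std())
```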

Types of Machine Learning

The types of ML algorithms can be classified based on the following criteria:

  1. The amount and type of supervision they get during training.
  2. Whether they can learn incrementally from a stream of incoming data.
  3. How they generalize?

Type of supervision

Supervised Learning: input/target vectors in the training set

The training set $\mathbf{D}$ you feed to the algorithm includes the desired solutions, called labels (or targets)

image

Examples) Classification & Regression

  • Classification
    • label is categorical
      • discrete
      • No order relationship
      • e.g., classify digit images

image


  • Regression
    • label is a continuous (and ordinal) number
      • continuous
      • Order relationship
      • e.g., Car Price Prediction

image


How do we decide whether to use a regression or a classification model?

Here are some of the most important supervised learning algorithms:

  • k-Nearest Neighbors
  • Linear Regression
  • Logistic Regression
  • SVM
  • Decision Trees and Random Forests
  • Neural Networks

Unsupervised learning: no target vectors in the training set.

Unsupervised learning is a self-organized learning from data that has not been labeled, classified, or categorized.

Example) Clustering
Grouping of similar data. (e.g., Grouping customers and experiment outcomes, etc.)

  • K-Means
    Finding $K$ (a hyperparameter) clusters; see the sketch after this list.
image

  • DBSCAN
  • Hierarchical Cluster Analysis (HCA)
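
A minimal K-Means sketch with scikit-learn, on synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K (n_clusters) is a hyperparameter we must choose ourselves
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
print(kmeans.labels_[:10])    # cluster index assigned to each point
```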

Example) Anomaly detection and novelty detection
The system is shown mostly normal instances during training, so it learns to recognize them and when it sees a new instance it can tell whether it looks like a normal one or whether it is likely an anomaly.
For example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm.

image


A very similar task is novelty detection: the difference is that novelty detection algorithms expect to see only normal data during training, while anomaly detection algorithms are usually more tolerant; they can often perform well even with a small percentage of outliers in the training set.

  • One-class SVM
  • Isolation Forest
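
For example, a rough Isolation Forest sketch on synthetic data, with a few outliers injected on purpose:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0, scale=1, size=(200, 2))     # mostly normal instances
outliers = rng.uniform(low=-6, high=6, size=(10, 2))   # a few injected anomalies
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
print(iso.predict(outliers))     # mostly -1 (flagged as anomalies)
print(iso.predict(normal[:5]))   # mostly +1 (normal)
```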

Example) Visualization and dimensionality reduction
Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted.

image

A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information by reducing the dimension or the number of random variables to consider. (e.g., merge several correlated features into one: a car’s mileage may be very correlated with its age.)

  • Principal Component Analysis (PCA); see the sketch after this list
  • Kernel PCA
  • Locally-Linear Embedding (LLE)
  • t-distributed Stochastic Neighbor Embedding (t-SNE)
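
A minimal PCA sketch with scikit-learn, reducing the 64-dimensional digit images to 2D so they can be plotted:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)           # 64-dimensional image vectors

pca = PCA(n_components=2)                     # keep only 2 components for plotting
X_2d = pca.fit_transform(X)
print(X_2d.shape)                             # (1797, 2)
print(pca.explained_variance_ratio_)          # fraction of variance each component keeps
```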

Semi-Supervised Learning

Labeling data is time-consuming and expensive. However, some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semi-supervised learning.

image

Reinforcement learning: actions and rewards

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
e.g., DeepMind’s AlphaGo

image


Whether it can learn incrementally

Batch Learning

The system is incapable of learning incrementally. It must be trained using all the available data.

  1. Collect & Store the whole data
  2. Train an ML model using the stored data
  3. Deploy the trained model

If you want to add new data, you need to train a new version of the system from scratch on the enlarged dataset. Since this generally takes a lot of time and computing resources (GPU, CPU, memory, etc.), it is practical only when the dataset is not too big and the cost of computing and storage is acceptable.

Online Learning

You train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. (Data instances are streamed to the learning system.)
It is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly and autonomously. It is also a good option if you have limited resources, since a data point or a mini-batch of data is seen only once and not reused $\to$ no need for a large storage.
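
A rough sketch of online learning with scikit-learn's SGDRegressor.partial_fit, treating synthetic data as a stream of mini-batches that are seen once and then discarded:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(learning_rate="constant", eta0=0.01)

# Pretend data arrives as a stream of mini-batches; each batch is used once and dropped
for _ in range(200):
    X_batch = rng.random((32, 3))
    y_batch = X_batch @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=32)
    model.partial_fit(X_batch, y_batch)       # incremental update, no need to store old data

print(model.coef_, model.intercept_)          # close to [1, -2, 0.5] and 0
```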

image


How they generalize?

At test time (i.e., once the trained model is deployed), the model is asked to make predictions on unseen data. We say that a learning algorithm or model generalizes well if the prediction accuracy on these unseen data is good. Otherwise, we say it does not generalize well.

image


Instance-based Learning

The system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them), using a similarity measure.
In other words, it generalizes by making a decision by seeing similar neighbors.

$\mathbf{Example)}$ Spam filtering

Possibly the most trivial form of learning is simply to learn by heart. If you were to create a spam filter this way, it would just flag all emails that are identical to emails that have already been flagged by users—not the worst solution, but certainly not the best.

Instead of just flagging emails that are identical to known spam emails, your spam filter could be programmed to also flag emails that are very similar to known spam emails. This requires a measure of similarity between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have in common. The system would flag an email as spam if it has many words in common with a known spam email.

image


So, Instance-based learning requires a measure of similarity and storing instances to make a decision.

e.g., KNN; K-Nearest Neighbor

  • Make predictions based on the consensus of $k$ closest neighbors (e.g., by Euclidean distance)

  • Voting to class 1: $N_1$, Voting to class 2: $N_2$, pick class 1 if $N_1 > N_2$, otherwise class 2.
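
A minimal from-scratch KNN sketch on a tiny synthetic 2-D dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest neighbors
    votes = Counter(y_train[nearest])                 # N_1, N_2, ... votes per class
    return votes.most_common(1)[0][0]                 # class with the most votes

# Tiny synthetic 2-D dataset with two classes
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([1, 1, 1, 2, 2, 2])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))   # -> 1
```

Note that all training instances must be stored to make a prediction, which is exactly the “storing instances” requirement mentioned above.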

image


Model-based Learning

Another way is to build a model of these examples, then use that model to make predictions.

image


For example, suppose you want to know if money makes people happy.

image


Now, select a model class (e.g., linear, polynomial, trigonometric, etc.) that looks like a good match for the data. It seems that the data would fit well to a linear model:

$\text{life satisfaction} = \theta_0 + \theta_1 \times \text{GDP per capita}$


This model has 2 model parameters $\theta_0$ and $\theta_1$. By tweaking these parameters, you can make your model represent any linear function.

image


To do this, we need to define a function $L(\theta_0, \theta_1)$, called the cost function (also called loss function, objective function, or utility function), to measure how well (or how poorly) the current model parameters fit the data.

Now, find the parameters that make the linear model fit best to your data. This is called training the model. So, training(= learning) is an optimization problem (often gradient-based).
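
A rough scikit-learn sketch of this model-based workflow; the (GDP per capita, life satisfaction) numbers below are made-up placeholders, not real OECD figures:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical (GDP per capita, life satisfaction) pairs for illustration only
gdp = np.array([[9054], [9438], [12240], [27195], [37675], [50962]])
satisfaction = np.array([4.9, 5.8, 5.6, 6.5, 7.3, 7.3])

# Training = finding theta_0, theta_1 that minimize the cost on the data
model = LinearRegression().fit(gdp, satisfaction)
print(model.intercept_, model.coef_)          # theta_0, theta_1
print(model.predict([[22587]]))               # inference on an unseen country
```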

image


Then we have our prediction model to use at test time. (Making predictions at test time is called inference, but note that in probabilistic ML, ‘inference’ has a different meaning.) We hope the trained model generalizes well.

Challenges to Machine Learning

Bad data

Insufficient Quantity of Training Data

Most ML algorithms require a lot of data to work properly. It has been shown that different ML algorithms tend to perform similarly once they are given enough data.

image


But, it’s not always easy to collect a large amount of extra data.

Nonrepresentative Training Data

Not only the amount of data matters; the data should also represent the whole phenomenon. For example,

image


The previous linear model (blue dotted) became inaccurate after adding more data points (red). In this case, we say that the samples (in the training set) are biased.

Poor Quality

Also, each data sample may contain lots of noise.

image


Irrelevant Features

In traditional ML (non-DL), feature engineering is often the key to the success of an ML project.

  • Feature engineering
    • Making a good set of features to train on.
      • Price of a car $\Leftarrow$ (year, brand, color, owner’s weight?)
    • Feature selection
      • Selecting the most useful features to train on among existing features.
    • Feature extraction
      • Combining existing features to produce a more useful one.
      • e.g., polynomial feature extraction.

image


The power of deep learning is to replace feature engineering with feature learning (representation learning). It lets the data do the feature engineering.

image


Overfitting

  • Model Complexity

The complexity of a model describes how complex the patterns it can represent are. A more complex model can capture more complex data patterns.

image


For example, in polynomial regression, we can control the complexity by the degree (M).

$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j = 0}^{M} w_j x^j$
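
A quick sketch of how the degree $M$ controls complexity, using scikit-learn's PolynomialFeatures on synthetic data; the training fit keeps improving as $M$ grows, which already hints at overfitting:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.random(20))[:, None]
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.2, size=20)

for M in (1, 3, 9):                          # degree M controls the model complexity
    model = make_pipeline(PolynomialFeatures(degree=M), LinearRegression()).fit(x, y)
    print(M, model.score(x, y))              # training R^2 typically rises with M
```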


  • Overfitting

We say that a model is overfitting if the model performs well on training data but not on unseen data. (Poor generalization)

image


The key factors causing overfitting are:

(1) Model Complexity $\to$ potential to overfit
(2) The amount of data $\to$ can generalize better as it sees more
(3) Optimization performance $\to$ ability to unleash the potential to overfit

  • The more accurately the cost function is minimized, the greater the chance of overfitting.

image


So, we can prevent overfitting by:

  1. (Model Selection) make model less complex
    • Reduce the number of model parameters
    • Constrain the range of model parameter values (e.g., the slope); see the regularization sketch after this list
  2. (Data Collection) Provide more training data
    • Data augmentation
  3. (Lazy Learning) Prevent too accurate optimization (e.g., early stopping)
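
As a rough illustration of constraining parameter values (option 1), we can compare an unregularized degree-9 polynomial with a ridge-regularized one on the same small synthetic dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x_train = np.sort(rng.random(15))[:, None]
y_train = np.sin(2 * np.pi * x_train[:, 0]) + rng.normal(scale=0.2, size=15)
x_test = np.linspace(0, 1, 100)[:, None]
y_test = np.sin(2 * np.pi * x_test[:, 0])

plain = make_pipeline(PolynomialFeatures(9), LinearRegression()).fit(x_train, y_train)
ridge = make_pipeline(PolynomialFeatures(9), Ridge(alpha=1e-3)).fit(x_train, y_train)

print(plain.score(x_test, y_test))   # often poor: the degree-9 fit overfits the 15 points
print(ridge.score(x_test, y_test))   # usually better: the penalty constrains the weights
```
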

Then, how can we tell whether the model generalizes to new cases? One of the better options: data split!

Bias and Variance Trade-off

When we evaluate the model, we should consider the trade-off between bias and variance.
Given a model, can we quantify its generalization error? That is, how good is the model?

  • Consistency: a model is good if the training error is maintained at test time (No overfitting)
  • Flexibility: a model is good if it is flexible enough to capture the patterns in the training set (Low bias)

An ideal model will achieve both. Can we?

$\mathbf{Example.}$
Consider the following experiment:

  • $L = 100$ datasets (only 20 are shown), NOT DATA POINTS.
  • Each having $N = 25$ datapoints
  • 24 Gaussian basis functions in the model
  • $M = 25$ parameters (coefficients) including bias

The graphs below show models trained with different values of $\ln \lambda$ on the $L$ different datasets.
The right column shows the average of the $L$ trained models.

image


The more flexible the model is, the higher its variance and the lower its bias. What exactly are bias and variance, and why does this behavior occur?

Bias
Bias is the difference between the average prediction of our model and the true value to predict. (Reference: Bias (statistics), Wikipedia)

  • Bias is due to the simplifying assumption made by a model to make the target function easier to learn.
  • Since a model with high bias does not fit well even for a given training set, it usually results in underfitting.

Variance
Variance is the variability of model prediction for a given dataset.

  • In general, the more complex (flexible) the model is, the more variance it has.
  • So, a model with high variance is usually an overfitted model

image


Bias-Variance trade-off
Let $f$ be the target function we want to find, $\mathbf{x}$ the input data in the training dataset $\mathbf{D}$, and $\widehat{f}$ (written $y(\mathbf{x};\mathbf{D})$ in the derivation below) the prediction of the model trained with $\mathbf{D}$.

  • $f(x)$: true function (We never know what this is).
  • $y = f(x) + \epsilon$: target value $y$ contains intrinsic error $\epsilon$.
  • $\widehat{f} (x ; \mathbf{D})$: the estimated function learned from a data $\mathbf{D}$.
  • $\mathbb{E}_{\mathbf{D}} [\widehat{f} (x ; \mathbf{D})]$: expectation of the estimated function.
  • The expected generalization error on an unseen sample $x$ can be written as $\text{Bias}^2 + \text{Variance} + \sigma^2$, where $\sigma^2$ is the irreducible error (the variance of $\epsilon$).

$E_{\mathbf{D}, \epsilon} [ (y - \widehat{f} (x; \mathbf{D}))^2 ] = (\text{Bias } [\widehat{f} (x; \mathbf{D})])^2 + \text{Var } [\widehat{f} (x; \mathbf{D})] + \sigma^2$

$\text{Bias } [\widehat{f} (x; \mathbf{D})] = E_{\mathbf{D}} [\widehat{f} (x; \mathbf{D})] - f(x)$

$\text{Var } [\widehat{f} (x; \mathbf{D})] = E_{\mathbf{D}} [ ( E_{\mathbf{D}} [ \widehat{f} (x; \mathbf{D} )] - \widehat{f} (x; \mathbf{D}) )^2 ].$
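
We can also estimate the decomposition numerically. The sketch below mimics the experiment above, but for simplicity it varies the polynomial degree instead of using Gaussian basis functions with regularization; all data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)             # true function (unknown in practice)

x_eval = np.linspace(0, 1, 50)               # points where bias/variance are measured

def estimate_bias_variance(degree, L=100, N=25, sigma=0.3):
    preds = np.empty((L, len(x_eval)))
    for l in range(L):                        # L independent training sets D
        x = rng.random(N)
        y = f(x) + rng.normal(scale=sigma, size=N)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x[:, None], y)
        preds[l] = model.predict(x_eval[:, None])
    avg_pred = preds.mean(axis=0)             # E_D[f_hat(x; D)]
    bias2 = np.mean((avg_pred - f(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for degree in (1, 3, 9):
    # Typically: bias^2 falls and variance rises as the model gets more flexible
    print(degree, estimate_bias_variance(degree))
```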

image


For the derivation, observe that the MSE between the prediction $y(\mathbf{x};\mathbf{D})$ and $f(\mathbf{x})$ can be decomposed as follows (the cross term disappears in the last step because $E_{\mathbf{D}} [y(\mathbf{x};\mathbf{D}) - E_{\mathbf{D}}[y(\mathbf{x};\mathbf{D})]] = 0$):

$\displaylines{E_{\mathbf{D}} [(y(\mathbf{x} ; \mathbf{D}) - f(\mathbf{x}))^2] \\ = E_{\mathbf{D}} [(y(\mathbf{x} ; \mathbf{D}) - E_{\mathbf{D}} [y(\mathbf{x} ; \mathbf{D})] + E_{\mathbf{D}} [y(\mathbf{x} ; \mathbf{D})] - f(\mathbf{x}))^2] \\ = E_{\mathbf{D}} [(y(\mathbf{x} ; \mathbf{D}) - E_{\mathbf{D}} [y(\mathbf{x} ; \mathbf{D})])^2 + 2 (y(\mathbf{x} ; \mathbf{D}) - E_{\mathbf{D}} [y(\mathbf{x} ; \mathbf{D})] ) (E_{\mathbf{D}} [y(\mathbf{x} ; \mathbf{D})] - f(\mathbf{x})) + (E_{\mathbf{D}} [y(\mathbf{x} ; \mathbf{D})] - f(\mathbf{x}))^2] \\ = E_{\mathbf{D}} [(y(\mathbf{x} ; \mathbf{D}) - E_{\mathbf{D}} [y(\mathbf{x} ; \mathbf{D})])^2] + (E_{\mathbf{D}} [y(\mathbf{x} ; \mathbf{D})] - f(x))^2}$


Since

$E_{\mathbf{D}} [(y(\mathbf{x} ; \mathbf{D}) - E_{\mathbf{D}} [y(\mathbf{x} ; \mathbf{D})])^2] = Var(y(\mathbf{x} ; \mathbf{D}))$


and

$E_{\mathbf{D}} [y(\mathbf{x} ; \mathbf{D})] - f(x) = Bias(y(\mathbf{x} ; \mathbf{D}))$


Thus, $E_{\mathbf{D}} [(y(\mathbf{x} ; \mathbf{D}) - f(\mathbf{x}))^2] = Var(y(\mathbf{x} ; \mathbf{D})) + Bias(y(\mathbf{x} ; \mathbf{D}))^2$

The bias-variance decomposition is a way of analyzing a learning algorithm’s expected generalization error with respect to a particular problem as a sum of three terms, the bias, variance, and intrinsic error. The bias-variance tradeoff is the property of a model that the variance of the parameter estimated across samples can be reduced by increasing the bias in the estimated parameter. The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set.

Summary
If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias. So we need to find the right balance without overfitting or underfitting the data. This tradeoff in complexity is why there is a tradeoff between bias and variance. Thus, we need to find the most appropriate model by trading off bias and variance!


The Curse of Dimensionality

Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data.

As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.

We can gain further insight into the problems of high-dimensional spaces. For example, consider the problem of polynomial curve fitting. If we have $D$ input variables, then a general polynomial with coefficients up to order 3 would take the form

$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij}x_i x_j + \sum_{i=1}^{D} \sum_{j=1}^{D} \sum_{k=1}^{D} w_{ijk} x_i x_j x_k.$


As $D$ increases, the number of independent coefficients grows proportionally to $D^3$ (not all of the coefficients are independent, due to interchange symmetries amongst the $x$ variables). For a polynomial of order $M$, the growth in the number of coefficients is like $D^M$. Although this is just a power-law growth rather than an exponential growth, it still points to the method becoming rapidly unwieldy and of limited practical utility.

Also, not all the intuitions developed in spaces of low dimensionality will generalize to spaces of high dimensionality.

Consider a sphere of radius $r = 1$ in a $D$-dimensional space. What is the fraction of the volume of the sphere that lies between radius $r = 1 - \epsilon$ and $r = 1$?
Note that the volume of a sphere of radius $r$ in a $D$-dimensional space is $V_D (r) = K_D r^D$. So, the required fraction is given by

$\frac{V_D (1) - V_D (1-\epsilon)}{V_D (1)} = 1 - (1 - \epsilon)^D$


In spaces of high dimensionality, most of the volume of a sphere is concentrated in a thin shell near the surface!
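
A quick numerical check of this formula:

```python
# Fraction of a unit D-ball's volume within epsilon of the surface: 1 - (1 - eps)^D
eps = 0.01
for D in (1, 2, 10, 100, 1000):
    print(D, 1 - (1 - eps) ** D)
# D=1 -> 0.01, D=100 -> ~0.63, D=1000 -> ~0.99996: almost all volume sits in the thin shell
```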

  • Solutions

Inductive Bias

Inductive bias (a.k.a. learning bias) is the set of assumptions that the model designer uses to make a prediction model that can predict outputs of given inputs that it has not encountered. That is, the additional assumption to generalize a prediction model beyond the seen training data.

For example, when we pick polynomial functions as the model class, we assume smooth interpolation between data points will be effective. And, in KNN, we assume that the classification of an instance x will be most similar to the classification of other instances that are nearby in Euclidean distance.

The following is a list of common inductive biases in machine learning algorithms

  • Minimum cross-validation error
    when trying to choose among hypotheses, select the hypothesis with the lowest cross-validation error. Although cross-validation may seem to be free of bias, the “no free lunch” theorems show that cross-validation must be biased.

  • Maximum margin
    when drawing a boundary between two classes, attempt to maximize the width of the boundary. This is the bias used in SVM. The assumption is that distinct classes tend to be separated by wide boundaries.

  • Minimum description length
    when forming a hypothesis, attempt to minimize the length of the description of the hypothesis.

  • Minimum features
    unless there is good evidence that a feature is useful, it should be deleted. This is the assumption behind feature selection algorithms.

  • Nearest neighbors
    assume that most of the cases in a small neighborhood in feature space belong to the same class. Given a case for which the class is unknown, guess that it belongs to the same class as the majority in its immediate neighborhood. This is the bias used in the k-nearest neighbors algorithm. The assumption is that cases that are near each other tend to belong to the same class.

In DL, we usually consider relational inductive bias, which is focusing on the relation between input and output.

image

Relational inductive biases, deep learning, and graph networks:
https://arxiv.org/pdf/1806.01261.pdf


For instance, because we know in advance that adjacent pixels in visual data are locally correlated, a CNN is designed with a locality inductive bias: it extracts information from adjacent pixels and captures spatial structure in local areas well.
Similarly, RNNs are designed to handle sequential information well, with a sequentiality inductive bias.

On the other hand, a fully connected network (MLP) relates every input to every output, so all weights are independent and not shared; its inductive bias is therefore much weaker than that of the others.

The I.I.D. Assumption

One of the most foundational assumptions of modern ML is the i.i.d. (independent and identically distributed) assumption. It means that the data points in the training data and the test data are sampled independently from the identical distribution. In other words, at test time the model will see samples similar to those seen during training.
However, this assumption does not hold in many cases.

image


No Free Lunch Theorem

There is no universally best model. We should use assumptions (i.e. inductive bias) that fit well to the given specific task or dataset.

image


Occam’s Razor

A classical example of an inductive bias is Occam’s razor, assuming that the simplest consistent hypothesis about the target function is actually the best. Here consistent means that the hypothesis of the learner yields correct outputs for all of the examples that have been given to the algorithm.

image

