
When we evaluate a model, can we quantify its generalization error? How can we evaluate the performance of a model? There are various criteria, but one can think of the following:

  • Consistency: a model is good if its training error is maintained at test time (no overfitting)
  • Flexibility: a model is good if it is flexible enough to capture the patterns in the training set (low bias)

An ideal model would achieve both. But can we actually get both at once?

$\mathbf{Example.}$
Consider the following experiment:

  • $L = 100$ datasets (datasets, not data points); only 20 of them are shown
  • Each dataset has $N = 25$ data points
  • 24 Gaussian basis functions in the model
  • $M = 25$ parameters (coefficients) including bias

The graph below shows the models trained on the $L$ different datasets for several values of $\text{ln }\lambda$, where $\lambda$ is the regularization coefficient.
The right column shows the average of the $L$ trained models.

image

Experimentally, the more flexible the model is, the higher its variance and the lower its bias. What are bias and variance, and why does this happen?
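
The experiment above can be reproduced in a few lines. The following is a minimal sketch, not the code behind the figure: the true function $\sin(2\pi x)$, the noise level, the basis width, the three values of $\text{ln }\lambda$, and all function names are assumptions chosen for illustration. It fits a regularized least-squares model on each of the $L$ datasets and estimates the squared bias and the variance averaged over a grid of inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

L, N, n_basis = 100, 25, 24           # datasets, points per dataset, Gaussian bases
noise_std = 0.3                       # assumed noise level (not given in the text)
width = 0.1                           # assumed width of each Gaussian basis function
centers = np.linspace(0.0, 1.0, n_basis)


def f_true(x):
    """Assumed true function; the text does not specify it."""
    return np.sin(2 * np.pi * x)


def design(x):
    """Design matrix with M = 25 columns: one bias column plus 24 Gaussian bases."""
    phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.size, 1)), phi])


def fit_ridge(x, t, lam):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^{-1} Phi^T t.

    For simplicity the bias coefficient is regularized as well.
    """
    Phi = design(x)
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)


x_grid = np.linspace(0.0, 1.0, 200)
Phi_grid = design(x_grid)


def bias2_and_variance(ln_lam):
    """Estimate squared bias and variance, averaged over x_grid, from L datasets."""
    lam = np.exp(ln_lam)
    preds = np.empty((L, x_grid.size))
    for i in range(L):
        x = rng.uniform(0.0, 1.0, N)
        t = f_true(x) + rng.normal(0.0, noise_std, N)
        preds[i] = Phi_grid @ fit_ridge(x, t, lam)
    mean_pred = preds.mean(axis=0)                    # average of the L trained models
    bias2 = np.mean((mean_pred - f_true(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance


# Hypothetical values of ln(lambda): strong, medium, and weak regularization.
results = {ln_lam: bias2_and_variance(ln_lam) for ln_lam in (2.6, -0.31, -2.4)}
for ln_lam, (b2, var) in results.items():
    print(f"ln lambda = {ln_lam:5.2f}   bias^2 = {b2:.4f}   variance = {var:.4f}")
```

Under these assumptions, a smaller $\lambda$ (a more flexible model) should give a smaller squared bias and a larger variance, which matches the trend visible in the figure.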

Bias and Variance

Bias is the difference between the average prediction of our model and the true value to predict. (Reference: Bias (statistics), Wikipedia)

  • Bias is due to the simplifying assumptions a model makes so that the target function is easier to learn.
  • Since a model with high bias does not fit well even for a given training set, it usually results in underfitting.

Variance is the variability of the model's prediction at a given input when the model is trained on different datasets.

  • In general, the more complex (flexible) the model is, the more variance it has.
  • So a model with high variance is usually an overfitted model.


image


Bias-Variance trade-off

Suppose $f$ is the target function we want to find, $\mathbf{x}$ is an input, $y$ is the observed target value, and $\widehat{f} (x ; \mathcal{D})$ is the prediction of the model trained with the training dataset $\mathcal{D}$.

  • $f(x)$: true function (We never know what this is).
  • $y = f(x) + \varepsilon$: target value $y$ contains intrinsic error $\varepsilon$.
  • $\widehat{f} (x ; \mathcal{D})$: the estimated function learned from a dataset $\mathcal{D}$.
  • $\mathbb{E}_{\mathcal{D}} [\widehat{f} (x ; \mathcal{D})]$: expectation of the estimated function.
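  • The expected generalization error on an unseen sample $x$ can be written as $\text{Bias}^2 + \text{Variance} + \sigma^2$, where $\sigma^2 = \mathbb{E}_{\varepsilon} [\varepsilon^2]$ is the variance of the intrinsic error $\varepsilon$: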
\[\begin{aligned} \mathbb{E}_{\mathcal{D}, \varepsilon} \left[ (y - \widehat{f} (x; \mathcal{D}))^2 \right] = \left( \text{Bias } [\widehat{f} (x; \mathcal{D})] \right)^2 + \text{Var } [\widehat{f} (x; \mathcal{D})] + \sigma^2 \end{aligned}\]

where

\[\begin{aligned} &\text{Bias } [\widehat{f} (x; \mathcal{D})] = \mathbb{E}_{\mathcal{D}} [\widehat{f} (x; \mathcal{D})] - f(x) \\ &\text{Var } [\widehat{f} (x; \mathcal{D})] = \mathbb{E}_{\mathcal{D}} \left[ \left( \mathbb{E}_{\mathcal{D}} [ \widehat{f} (x; \mathcal{D} )] - \widehat{f} (x; \mathcal{D}) \right)^2 \right] \end{aligned}\]

image


For the derivation, observe that the MSE between $y$ and $\widehat{f}$ can be decomposed as follows:

\[\begin{aligned} & \mathbb{E}_{\mathcal{D}, \varepsilon} \left[ \left(\widehat{f}(\mathbf{x} ; \mathcal{D}) - f(\mathbf{x}) - \varepsilon \right)^2 \right] \\ & = \mathbb{E}_{\mathcal{D}, \varepsilon} \left[\left(\widehat{f}(\mathbf{x} ; \mathcal{D}) - \mathbb{E}_{\mathcal{D}} \left[\widehat{f}(\mathbf{x} ; \mathcal{D}) \right] + \mathbb{E}_{\mathcal{D}} \left[ \widehat{f}(\mathbf{x} ; \mathcal{D}) \right] - f(\mathbf{x}) - \varepsilon \right)^2 \right] \\ & = \mathbb{E}_{\mathcal{D}, \varepsilon} \left[ \left(\widehat{f}(\mathbf{x} ; \mathcal{D}) - \mathbb{E}_{\mathcal{D}} [\widehat{f}(\mathbf{x} ; \mathcal{D})] \right)^2 + 2 \times \left( \widehat{f}(\mathbf{x} ; \mathcal{D}) - \mathbb{E}_{\mathcal{D}} [\widehat{f}(\mathbf{x} ; \mathcal{D})] \right) \times \left( \mathbb{E}_{\mathcal{D}} [\widehat{f}(\mathbf{x} ; \mathcal{D})] - f(\mathbf{x}) - \varepsilon \right) + \left( \mathbb{E}_{\mathcal{D}} \left[ \widehat{f}(\mathbf{x} ; \mathcal{D}) \right] - f(\mathbf{x}) - \varepsilon \right)^2 \right] \\ & = \mathbb{E}_{\mathcal{D}} \left[\left(\widehat{f}(\mathbf{x} ; \mathcal{D}) - \mathbb{E}_{\mathcal{D}} [\widehat{f}(\mathbf{x} ; \mathcal{D})] \right)^2 \right] + \mathbb{E}_{\varepsilon} \left[ \left( \mathbb{E}_{\mathcal{D}} \left[\widehat{f}(\mathbf{x} ; \mathcal{D}) \right] - f(\mathbf{x}) - \varepsilon \right)^2 \right] \\ & = \mathbb{E}_{\mathcal{D}} \left[\left(\widehat{f}(\mathbf{x} ; \mathcal{D}) - \mathbb{E}_{\mathcal{D}} [\widehat{f}(\mathbf{x} ; \mathcal{D})] \right)^2 \right] + \left( \mathbb{E}_{\mathcal{D}} \left[\widehat{f}(\mathbf{x} ; \mathcal{D}) \right] - f(\mathbf{x}) \right)^2 + \mathbb{E}_{\varepsilon} [\varepsilon^2] \end{aligned}\]

which follows from $\mathbb{E}_{\varepsilon} [\varepsilon] = 0$.
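
The derivation compresses two steps, which can be spelled out as follows. This is a sketch under one extra assumption that the text does not state explicitly: the noise $\varepsilon$ on the unseen sample is independent of the training dataset $\mathcal{D}$. The shorthand $\bar{f}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}} [\widehat{f}(\mathbf{x} ; \mathcal{D})]$ is introduced here only to keep the lines short. The first line shows why the cross term vanishes, and the second shows how the remaining term splits into the squared bias and the noise variance.

\[\begin{aligned} & \mathbb{E}_{\mathcal{D}, \varepsilon} \left[ \left( \widehat{f}(\mathbf{x} ; \mathcal{D}) - \bar{f}(\mathbf{x}) \right) \left( \bar{f}(\mathbf{x}) - f(\mathbf{x}) - \varepsilon \right) \right] = \underbrace{\mathbb{E}_{\mathcal{D}} \left[ \widehat{f}(\mathbf{x} ; \mathcal{D}) - \bar{f}(\mathbf{x}) \right]}_{= 0} \, \mathbb{E}_{\varepsilon} \left[ \bar{f}(\mathbf{x}) - f(\mathbf{x}) - \varepsilon \right] = 0 \\ & \mathbb{E}_{\varepsilon} \left[ \left( \bar{f}(\mathbf{x}) - f(\mathbf{x}) - \varepsilon \right)^2 \right] = \left( \bar{f}(\mathbf{x}) - f(\mathbf{x}) \right)^2 - 2 \left( \bar{f}(\mathbf{x}) - f(\mathbf{x}) \right) \mathbb{E}_{\varepsilon} [\varepsilon] + \mathbb{E}_{\varepsilon} [\varepsilon^2] = \left( \bar{f}(\mathbf{x}) - f(\mathbf{x}) \right)^2 + \sigma^2 \end{aligned}\]

Since $\varepsilon$ has zero mean, $\mathbb{E}_{\varepsilon} [\varepsilon^2]$ equals its variance $\sigma^2$.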


Since

\[\begin{aligned} \mathbb{E}_{\mathcal{D}} \left[ \left(\widehat{f}(\mathbf{x} ; \mathcal{D}) - \mathbb{E}_{\mathcal{D}} [\widehat{f}(\mathbf{x} ; \mathcal{D})] \right)^2 \right] = \text{Var}(\widehat{f}(\mathbf{x} ; \mathcal{D})) \end{aligned}\]

and

\[\begin{aligned} \mathbb{E}_{\mathcal{D}} \left[\widehat{f}(\mathbf{x} ; \mathcal{D}) \right] - f(x) = \text{Bias}(\widehat{f}(\mathbf{x} ; \mathcal{D})) \end{aligned}\]

we have $\mathbb{E}_{\mathcal{D}, \varepsilon} [(y - \widehat{f}(\mathbf{x} ; \mathcal{D}))^2] = \text{Var}(\widehat{f}(\mathbf{x} ; \mathcal{D})) + \text{Bias}(\widehat{f}(\mathbf{x} ; \mathcal{D}))^2 + \sigma^2$, where $\sigma^2 = \mathbb{E}_{\varepsilon} [\varepsilon^2]$.
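
The decomposition can be checked numerically on a toy problem where every term is known in closed form. The sketch below is an illustration, not part of the original experiment: it assumes a constant true function $f(x) = c$ and a deliberately biased estimator $\widehat{f} = \alpha \cdot \text{mean}(y_1, \dots, y_N)$, and the values of `c`, `sigma`, `N`, and `alpha` are hypothetical. For this estimator the bias is $(\alpha - 1) c$ and the variance is $\alpha^2 \sigma^2 / N$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: constant true function f(x) = c, Gaussian noise.
c, sigma, N, alpha = 2.0, 1.0, 25, 0.8
n_trials = 100_000                      # number of simulated training datasets

# Each row is one training dataset D: N noisy observations of c.
train = c + sigma * rng.standard_normal((n_trials, N))

# Shrunken-mean estimator: one prediction f_hat(x; D) per dataset.
f_hat = alpha * train.mean(axis=1)

# A fresh noisy target y = f(x) + eps for each trial, independent of D.
y = c + sigma * rng.standard_normal(n_trials)

# Left-hand side: expected squared error over datasets and noise.
mse = np.mean((y - f_hat) ** 2)

# Right-hand side: Bias^2 + Variance + sigma^2, estimated from the same draws.
bias = f_hat.mean() - c
variance = f_hat.var()
decomposed = bias**2 + variance + sigma**2

# Closed-form value for comparison: 0.16 + 0.0256 + 1 = 1.1856.
analytic = ((alpha - 1) * c) ** 2 + alpha**2 * sigma**2 / N + sigma**2

print(f"MSE estimate:             {mse:.4f}")
print(f"Bias^2 + Var + sigma^2:   {decomposed:.4f}")
print(f"Closed form:              {analytic:.4f}")
```

The three numbers should agree up to Monte Carlo error. Setting `alpha = 1.0` removes the bias and raises the variance to $\sigma^2 / N$, which is a small instance of the trade-off discussed in this post.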

This bias-variance decomposition is a way of analyzing a learning algorithm's expected generalization error on a particular problem as a sum of three terms: the squared bias, the variance, and the intrinsic (irreducible) error. The bias-variance tradeoff is the property of a model that the variance of the parameter estimated across samples can be reduced by increasing the bias in the estimated parameter. The bias-variance dilemma, or bias-variance problem, is the conflict in trying to simultaneously minimize these two sources of error, both of which prevent supervised learning algorithms from generalizing beyond their training set.


Summary

If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it tends to have high variance and low bias. So we need to find a good balance that neither overfits nor underfits the data. This tradeoff in complexity is why there is a tradeoff between bias and variance, and we need to choose the most suitable model by trading off the two.
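
The effect of the number of parameters can be illustrated by sweeping the degree of a polynomial model. This is a minimal sketch under stated assumptions: the true function $\sin(2\pi x)$, the noise level, the fixed training inputs, and the three degrees are all hypothetical choices, and the irreducible noise term is left out of the comparison because it is the same for every model.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

L, N, noise_std = 200, 25, 0.3         # datasets, points per dataset, assumed noise


def f_true(x):
    """Assumed true function, chosen only for illustration."""
    return np.sin(2 * np.pi * x)


# Assumption: training inputs are fixed and only the noise is redrawn per dataset.
x_train = np.linspace(0.0, 1.0, N)
x_grid = np.linspace(0.0, 1.0, 101)


def decompose(degree):
    """Estimate squared bias and variance of a degree-`degree` polynomial fit."""
    preds = np.empty((L, x_grid.size))
    for i in range(L):
        t = f_true(x_train) + rng.normal(0.0, noise_std, N)
        coefs = P.polyfit(x_train, t, degree)       # least-squares fit
        preds[i] = P.polyval(x_grid, coefs)
    bias2 = np.mean((preds.mean(axis=0) - f_true(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance


# Degree 1 has 2 parameters, degree 3 has 4, degree 9 has 10.
table = {degree: decompose(degree) for degree in (1, 3, 9)}
for degree, (b2, var) in table.items():
    print(f"degree {degree}: bias^2 = {b2:.4f}  variance = {var:.4f}  sum = {b2 + var:.4f}")

# The model with the smallest bias^2 + variance balances the two error sources.
best = min(table, key=lambda d: sum(table[d]))
print("best degree:", best)
```

In this setup the straight line underfits (large bias), the degree-9 polynomial has the largest variance, and the intermediate degree gives the smallest sum of the two.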

image

