
Bayes Classifier

The Bayes classifier is a simple generative approach to classification. As the name suggests, the model is built on Bayes’ Theorem, making it a representative example of a statistical, supervised method of classification.

Consider a general classification problem. We have to find a classifier, or target function, $f$ that takes a feature vector $X$ as input and outputs a class $Y$. Let $\mathcal{C}$ be the set of all possible classes. Then, for a given input feature vector $X$, the probability that $X$ corresponds to some class $Y = y$ is as follows:

\[\begin{aligned} p(Y = y | X) = \frac{p(Y = y) \cdot p(X | Y = y)}{p(X)} \end{aligned}\]

by Bayes’ Theorem. Intuitively, we can assign $X$ a class with the highest probability. In other words,

\[\begin{aligned} f(X) = \underset{y \in \mathcal{C}}{\text{arg max }} \frac{p(Y = y) \cdot p(X | Y = y)}{p(X)} \end{aligned}\]

Note that $\frac{p(Y = y) \cdot p(X | Y = y)}{p(X)} \propto p(Y = y) \cdot p(X | Y = y)$ since $p(X)$ does not depend on $y$, so the denominator can be dropped from the maximization. Hence, we can define the Bayes classifier $f^\star$ as

\[\begin{aligned} f^\star (X) = \underset{y \in \mathcal{C}}{\text{arg max }} p(Y = y) \cdot p(X | Y = y)._\blacksquare \end{aligned}\]

Therefore, we can construct a Bayes classifier if we know the distribution of the class $Y$ and the conditional distribution of $X | Y = y$ for all possible classes $y \in \mathcal{C}$.
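
To make the decision rule concrete, here is a minimal sketch in Python, assuming the prior and the class-conditional densities are fully known; the particular numbers and the two Gaussian class-conditionals are purely hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical known distributions, purely for illustration:
# p(Y = 0) = 0.6, p(Y = 1) = 0.4
priors = {0: 0.6, 1: 0.4}
# p(X | Y = 0) = N(0, 1), p(X | Y = 1) = N(2, 1)
cond = {0: norm(loc=0.0, scale=1.0), 1: norm(loc=2.0, scale=1.0)}

def bayes_classifier(x):
    # f*(x) = argmax_y p(Y = y) * p(x | Y = y)
    return max(priors, key=lambda y: priors[y] * cond[y].pdf(x))

print(bayes_classifier(0.3))  # -> 0
print(bayes_classifier(1.7))  # -> 1
```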

Optimality of Bayes Classifier

Above, I defined the Bayes classifier simply by intuition. But in fact, this definition is optimal with respect to the risk of misclassification $p(f(X) \neq Y)$, where $Y$ is the true class label of $X$. In other words, for any probabilistic classifier $f$, $p(f^\star(X) \neq Y) \leq p(f(X) \neq Y).$ The proof proceeds as follows.

\[\begin{aligned} p(f(X) \neq y) &= \mathbb{E}_{X, Y} [\mathbb{I}_{\{ f(X) \neq Y \}}] \\ &= \int_X \sum_{y \in \mathcal{C}} \mathbb{I}_{\{ f(X) \neq y \}} \cdot p(X, Y = y) \cdot dX \\ &= \int_X \sum_{y \in \mathcal{C}} \mathbb{I}_{\{ f(X) \neq y \}} \cdot p(Y = y | X) p(X) \cdot dX \\ &= \mathbb{E}_X [\sum_{y \in \mathcal{C}} \mathbb{I}_{\{ f(X) \neq y \}} \cdot p(Y = y | X)] \end{aligned}\]

It therefore suffices to show that

\[\begin{aligned} \mathbb{E}_X [\sum_{y \in \mathcal{C}} \mathbb{I}_{\{ f^\star(X) \neq y \}} \cdot p(Y = y | X)] \leq \mathbb{E}_X [\sum_{y \in \mathcal{C}} \mathbb{I}_{\{ f(X) \neq y \}} \cdot p(Y = y | X)] \end{aligned}\]

Finally,

\[\begin{aligned} &\mathbb{E}_X [\sum_{y \in \mathcal{C}} \mathbb{I}_{\{ f(X) \neq y \}} \cdot p(Y = y | X)] - \mathbb{E}_X [\sum_{y \in \mathcal{C}} \mathbb{I}_{\{ f^\star(X) \neq y \}} \cdot p(Y = y | X)] \\ &= \mathbb{E}_X [\sum_{y \neq f(X)} p(Y = y | X) - \sum_{y \neq f^\star(X)} p(Y = y | X)] \\ &= \mathbb{E}_X [(1 - p(Y = f(X) | X)) - (1 - p(Y = f^\star(X) | X))] \\ &= \mathbb{E}_X [p(Y = f^\star(X) | X) - p(Y = f(X) | X)] \\ &\geq 0 \end{aligned}\]

since $f^\star(X) = \underset{y \in \mathcal{C}}{\text{arg max }} p(Y = y | X)$, so the term inside the expectation is nonnegative for every $X._\blacksquare$
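
As a quick sanity check (not a substitute for the proof), a Monte Carlo simulation on a hypothetical two-class Gaussian problem shows the Bayes rule achieving a lower empirical misclassification rate than an arbitrarily chosen alternative threshold rule:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy generative model: p(Y = 1) = 0.5, X | Y = y ~ N(2y, 1)
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y, scale=1.0)

bayes_pred = (x > 1.0).astype(int)   # Bayes rule: threshold at the midpoint of the two means
other_pred = (x > 1.8).astype(int)   # some other (suboptimal) threshold

print("Bayes rule error:", np.mean(bayes_pred != y))   # ~0.159
print("Other rule error:", np.mean(other_pred != y))   # noticeably larger
```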

Naive Bayes Classifier

Now we will discuss a special case of the Bayes classifier, which assumes the features are conditionally independent given the class label. “Naive” refers to the fact that we do not actually expect the features to be independent given the class label. But even when the assumption does not hold, this classifier often works quite well! One reason for this is that the model is quite simple (it only has $O(CD)$ parameters for $C$ classes and $D$ features), so it is relatively immune to overfitting. [1]

Let $X = (X_1, \cdots, X_D)^\top \in \mathbb{R}^D$. By the conditional independence assumption, we can write the Bayes classifier as

\[\begin{aligned} f^\star (X) = \underset{y \in \mathcal{C}}{\text{arg max }} p(Y = y) \cdot \prod_{i = 1}^D p(X_i | Y = y)._\blacksquare \end{aligned}\]

This is known as the Naive Bayes classifier.

$\mathbf{Example.}$ Bernoulli distribution

(Reference: Kevin P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press 2022.)

In the case of binary features, $X_d = 0$ or $X_d = 1$, we can use the Bernoulli distribution for $p(X | Y = y)$.

\[\begin{aligned} p(X | Y = y) = \prod_{d = 1}^D \text{Ber}(X_d | \theta_{yd}) \end{aligned}\]

where $\theta_{yd}$ is the probability that $X_d = 1$ in class $Y = y$.
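
A minimal sketch of how this Bernoulli class-conditional enters the naive Bayes decision rule, assuming the parameters $\theta_{yd}$ and the prior are already known (the specific values below are hypothetical); log-probabilities are used to avoid numerical underflow:

```python
import numpy as np

# Hypothetical known parameters for 2 classes and 3 binary features:
# theta[y, d] = p(X_d = 1 | Y = y)
theta = np.array([[0.1, 0.7, 0.5],
                  [0.8, 0.2, 0.5]])
log_prior = np.log(np.array([0.5, 0.5]))  # log p(Y = y)

def predict(x):
    # log p(Y = y) + sum_d log Ber(x_d | theta_{yd})
    log_lik = np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta), axis=1)
    return np.argmax(log_prior + log_lik)

print(predict(np.array([0, 1, 1])))  # -> 0
print(predict(np.array([1, 0, 0])))  # -> 1
```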





Model fitting of (Naive) Bayes Classifier

Recall that we can construct a Bayes classifier if we know the distribution of the class $Y$ and the conditional distribution of $X | Y = y$ for every class $y \in \mathcal{C}$. In general, however, these distributions are unknown and must be estimated from the training data, for example by MLE, MAP estimation, or minimization of a loss criterion. From now on, we discuss how to fit a (naive) Bayes classifier using Maximum Likelihood Estimation (MLE).

Estimation of $p(Y)$

Let $\mathcal{C}$ be a set of possible classes for $Y$:

\[\begin{aligned} \mathcal{C} = \{1, 2, \cdots, C \} \end{aligned}\]

Then, we can view the distribution of $Y$ as a categorical distribution.

The MLE of the categorical distribution is just the empirical frequency, $\widehat{p}(Y = k) = \frac{N_k}{N}$, where $N_k$ is the number of training examples with label $k$ and $N$ is the total number of examples. The proof is simple: maximize the multinomial log-likelihood subject to the constraint that the probabilities sum to one (e.g., with a Lagrange multiplier).
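
As a sketch, the empirical-count estimate on a hypothetical label array looks like this:

```python
import numpy as np

Y = np.array([0, 1, 1, 2, 1, 0, 2, 1])   # hypothetical training labels
classes, counts = np.unique(Y, return_counts=True)
prior_hat = counts / len(Y)              # p_hat(Y = k) = N_k / N
print(dict(zip(classes.tolist(), prior_hat)))  # {0: 0.25, 1: 0.5, 2: 0.25}
```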

Estimation of $p(X | Y)$

The MLEs of $p(X_i | Y)$ depend on the choice of the class conditional density for feature $X_i$. We discuss some common choices below.

Discrete features

If a feature $X_i$ can be represented as a discrete random variable, we can use a categorical distribution, or a Bernoulli distribution for binary features.

The MLE of the parameters of the categorical (or Bernoulli) distribution is

\[\begin{aligned} \widehat{p}(X_i = k | Y = y) &= \frac{N_{X_i = k, Y = y}}{N_{Y = y}} \\ &= \frac{\sum_{n = 1}^N \mathbb{I}(X_{ni} = k, Y_n = y)}{\sum_{n = 1}^N \mathbb{I}(Y_n = y)} \end{aligned}\]

i.e., the empirical fraction of class-$y$ training examples whose $i$-th feature equals $k$.
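
A sketch of this conditional count estimate on hypothetical discrete data:

```python
import numpy as np

# Hypothetical discrete training data: N samples, D features
X = np.array([[0, 1], [1, 1], [0, 0], [1, 1], [0, 1]])
Y = np.array([0, 0, 0, 1, 1])

def conditional_mle(X, Y, y, i, k):
    # p_hat(X_i = k | Y = y) = count(X_i = k, Y = y) / count(Y = y)
    mask = (Y == y)
    return np.mean(X[mask, i] == k)

print(conditional_mle(X, Y, y=0, i=1, k=1))  # 2/3
```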

Continuous features

If a feature $X_i$ can be represented as a continuous, real-valued random variable, we can use a Gaussian distribution.

\[\begin{aligned} X_i | Y = y \sim N(\mu_{iy}, \sigma^2_{iy}) \end{aligned}\]

The MLEs of the Gaussian parameters are the per-class sample mean and variance of feature $X_i$:

\[\begin{aligned} &\widehat{\mu}_{iy} = \frac{1}{\sum_{n=1}^N \mathbb{I}(Y_n = y)} \sum_{n=1}^N \mathbb{I}(Y_n = y) \cdot X_{ni} \\ &\widehat{\sigma}_{iy}^2 = \frac{1}{\sum_{n=1}^N \mathbb{I}(Y_n = y)} \sum_{n=1}^N \mathbb{I}(Y_n = y) \cdot [X_{ni} - \widehat{\mu}_{iy}]^2 \end{aligned}\]
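
Putting the pieces together, here is a minimal sketch of fitting and applying a Gaussian naive Bayes classifier with these MLEs on hypothetical data (note that `np.var` with its default `ddof=0` is exactly the MLE above):

```python
import numpy as np

def fit_gaussian_nb(X, Y):
    """MLE fit of a Gaussian naive Bayes model; X is (N, D), Y is (N,)."""
    classes = np.unique(Y)
    prior = np.array([np.mean(Y == y) for y in classes])        # N_y / N
    mu = np.array([X[Y == y].mean(axis=0) for y in classes])    # per-class, per-feature means
    var = np.array([X[Y == y].var(axis=0) for y in classes])    # per-class, per-feature variances (MLE)
    return classes, prior, mu, var

def predict_gaussian_nb(x, classes, prior, mu, var):
    # log p(Y = y) + sum_i log N(x_i | mu_iy, var_iy)
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)
    return classes[np.argmax(np.log(prior) + log_lik)]

# Hypothetical toy data: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
Y = np.array([0] * 50 + [1] * 50)
params = fit_gaussian_nb(X, Y)
print(predict_gaussian_nb(np.array([2.5, 2.8]), *params))  # -> 1
```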




Reference

[1] Kevin P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press 2022.
[2] Bayes classifier, Wikipedia
