[ML] Bayes Classifier
Bayes Classifier
The Bayes classifier is a simple generative approach to classification. As the name suggests, the model is built on Bayes’ Theorem; it is a representative example of supervised learning and a statistical method of classification.
Consider a general classification problem. We have to find a classifier, or target function $f$, which takes a feature vector $X$ as input and outputs a class $Y$. Let $\mathcal{C}$ be the set of all possible classes. Then, for a given input feature vector $X$, the probability that $X$ corresponds to some class $Y = y$ is

\[p(Y = y | X) = \frac{p(Y = y) \cdot p(X | Y = y)}{p(X)}\]
by Bayes’ Theorem. Intuitively, we can assign to $X$ the class with the highest posterior probability. In other words,

\[\widehat{y} = \underset{y \in \mathcal{C}}{\text{arg max }} p(Y = y | X)\]
Note that $\frac{p(Y = y) \cdot p(X | Y = y)}{p(X)} \propto p(Y = y) \cdot p(X | Y = y)$, since $p(X)$ does not depend on $y$ and is therefore irrelevant to the optimization over $Y$. So, we can define the Bayes classifier $f^\star$ as

\[f^\star(X) = \underset{y \in \mathcal{C}}{\text{arg max }} p(Y = y) \cdot p(X | Y = y)\]
Therefore, we can construct a Bayes classifier if we know the distribution of the class $Y$ and the conditional distribution of $X | Y = y$ for all possible classes $y \in \mathcal{C}$.
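For concreteness, here is a minimal Python sketch of this decision rule, assuming (purely for illustration) that the class prior and the class-conditional densities are known one-dimensional Gaussians; `priors`, `class_means`, and `class_stds` are hypothetical toy values, not part of any library.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical known distributions for a 1-D, two-class toy problem.
priors = {0: 0.6, 1: 0.4}          # p(Y = y)
class_means = {0: -1.0, 1: 2.0}    # mean of X | Y = y
class_stds = {0: 1.0, 1: 1.5}      # std of X | Y = y

def bayes_classify(x):
    """Return arg max_y p(Y = y) * p(x | Y = y)."""
    scores = {y: priors[y] * norm.pdf(x, class_means[y], class_stds[y])
              for y in priors}
    return max(scores, key=scores.get)

print(bayes_classify(0.2))   # falls in class 0's region
print(bayes_classify(1.8))   # falls in class 1's region
```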
Optimality of Bayes Classifier
Above, I defined the Bayes classifier simply by intuition. But in fact, this definition is optimal with respect to the risk of misclassification $p(f^\star(X) \neq Y)$, where $Y$ is the true class label of $X$. In other words, for any probabilistic classifier $f$, $p(f^\star(X) \neq Y) \leq p(f(X) \neq Y)$. The proof proceeds as follows. First, condition on $X$:

\[p(f(X) \neq Y) = \mathbb{E}_X \left[ p(f(X) \neq Y | X) \right] = \mathbb{E}_X \left[ 1 - p(Y = f(X) | X) \right]\]
Then the problem changes as follows: minimizing the risk of misclassification is equivalent to maximizing the probability of correct classification,

\[\underset{f}{\text{arg min }} p(f(X) \neq Y) = \underset{f}{\text{arg max }} \mathbb{E}_X \left[ p(Y = f(X) | X) \right]\]

Finally, for every $X$,

\[p(Y = f(X) | X) \leq \underset{y \in \mathcal{C}}{\text{max }} p(Y = y | X) = p(Y = f^\star(X) | X)\]

since $f^\star(X) = \underset{y \in \mathcal{C}}{\text{arg max }} p(Y = y | X)$. Taking the expectation over $X$ gives $p(f^\star(X) \neq Y) \leq p(f(X) \neq Y)$. $\blacksquare$
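As an informal sanity check (not part of the proof), a short simulation on synthetic one-dimensional Gaussian data illustrates the inequality: the Bayes rule should misclassify no more often than a perturbed decision rule. The setup below is a made-up toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Synthetic data: Y ~ Bernoulli(0.5), X | Y=0 ~ N(0,1), X | Y=1 ~ N(2,1).
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y, scale=1.0)

# With equal priors and equal variances, the Bayes rule thresholds at the
# midpoint of the class means (x > 1); a shifted threshold is suboptimal.
bayes_pred = (x > 1.0).astype(int)
other_pred = (x > 1.5).astype(int)

print("Bayes rule error:  ", np.mean(bayes_pred != y))
print("Shifted rule error:", np.mean(other_pred != y))
```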
Naive Bayes Classifier
Now, we will discuss a special case of the Bayes classifier which assumes the features are conditionally independent given the class label. The assumption is called “naive” because we do not really expect the features to be independent, even conditional on the class label. But even when the assumption does not hold, this classifier often works quite well! One reason is that the model is quite simple (it only has $O(CD)$ parameters for $C$ classes and $D$ features), so it is relatively immune to overfitting. [1]
Let $X = (X_1, \cdots, X_D)^\top \in \mathbb{R}^D$. By conditional independence, we can write the Bayes classifier as

\[f^\star(X) = \underset{y \in \mathcal{C}}{\text{arg max }} p(Y = y) \prod_{d = 1}^D p(X_d | Y = y)\]

and this is known as the Naive Bayes Classifier.
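As a practical aside, the product over features can underflow when $D$ is large, so implementations usually sum log probabilities instead of multiplying probabilities. Here is a minimal sketch of that idea; `log_prior` and `log_likelihoods` are hypothetical precomputed arrays, not the output of any particular library.

```python
import numpy as np

def naive_bayes_predict(log_prior, log_likelihoods):
    """Pick arg max_y [ log p(Y=y) + sum_d log p(X_d | Y=y) ].

    log_prior:       shape (C,)    -- log p(Y = y)
    log_likelihoods: shape (C, D)  -- log p(X_d | Y = y) for the observed X_d
    """
    scores = log_prior + log_likelihoods.sum(axis=1)
    return int(np.argmax(scores))

# Toy usage with 2 classes and 3 features (made-up numbers).
log_prior = np.log([0.5, 0.5])
log_lik = np.log([[0.2, 0.7, 0.9],
                  [0.6, 0.1, 0.4]])
print(naive_bayes_predict(log_prior, log_lik))  # class with the larger score
```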
$\mathbf{Example.}$ Bernoulli distribution
(Reference: Kevin P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press 2022.)
In the case of binary features, $X_d \in \{ 0, 1 \}$, we can use the Bernoulli distribution for $p(X | Y = y)$:

\[p(X | Y = y) = \prod_{d = 1}^D \text{Ber}(X_d | \theta_{yd}) = \prod_{d = 1}^D \theta_{yd}^{X_d} (1 - \theta_{yd})^{1 - X_d}\]
where $\theta_{yd}$ is the probability that $X_d = 1$ in class $Y = y$.
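For concreteness, here is a short sketch that evaluates $p(Y = y) \prod_d \theta_{yd}^{X_d} (1 - \theta_{yd})^{1 - X_d}$ for each class and returns the arg max; the parameter values below are made up for illustration.

```python
import numpy as np

def bernoulli_nb_predict(x, prior, theta):
    """x: binary feature vector, shape (D,)
    prior: p(Y = y), shape (C,)
    theta: theta[y, d] = p(X_d = 1 | Y = y), shape (C, D)
    """
    # Class-conditional likelihood: prod_d theta^x_d * (1 - theta)^(1 - x_d)
    likelihood = np.prod(theta ** x * (1.0 - theta) ** (1 - x), axis=1)
    return int(np.argmax(prior * likelihood))

# Hypothetical parameters: 2 classes, 3 binary features.
prior = np.array([0.5, 0.5])
theta = np.array([[0.9, 0.2, 0.1],
                  [0.3, 0.8, 0.7]])
print(bernoulli_nb_predict(np.array([1, 0, 0]), prior, theta))  # -> class 0
print(bernoulli_nb_predict(np.array([0, 1, 1]), prior, theta))  # -> class 1
```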
Model fitting of (Naive) Bayes Classifier
Recall that we can construct a Bayes classifier if we know the distribution of the class $Y$ and the conditional distribution of $X | Y = y$ for all possible classes $y \in \mathcal{C}$. In general, however, we do not know these distributions and must estimate them. To select the distributions for the classifier from the training data, we may use an estimation method such as MLE, MAP estimation, or optimization of a loss criterion. From now on, we discuss how to fit a (naive) Bayes classifier using Maximum Likelihood Estimation (MLE).
Estimation of $p(Y)$
Let $\mathcal{C}$ be the set of possible classes for $Y$:

\[\mathcal{C} = \{ 1, 2, \cdots, C \}\]
Then, we can view the distribution of $Y$ as a categorical distribution:

\[p(Y = k) = \pi_k, \quad k \in \mathcal{C}, \quad \sum_{k = 1}^C \pi_k = 1\]
And the MLE for the categorical distribution is just the empirical frequency, $\widehat{p}(Y = k) = \widehat{\pi}_k = \frac{N_k}{N}$, where $N_k = \sum_{n = 1}^N \mathbb{I}(Y_n = k)$ is the number of training examples in class $k$ and $N$ is the total number of training examples. The proof is simple. You can refer to the following link.
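In case the linked derivation is not at hand, here is a brief sketch of the standard Lagrange-multiplier argument (my summary, not quoted from the link): maximize the log-likelihood $\sum_{k = 1}^C N_k \log \pi_k$ subject to the constraint $\sum_{k = 1}^C \pi_k = 1$.

\[\begin{aligned} \mathcal{L}(\pi, \lambda) &= \sum_{k = 1}^C N_k \log \pi_k + \lambda \Big( 1 - \sum_{k = 1}^C \pi_k \Big) \\ \frac{\partial \mathcal{L}}{\partial \pi_k} &= \frac{N_k}{\pi_k} - \lambda = 0 \implies \pi_k = \frac{N_k}{\lambda} \end{aligned}\]

Substituting into the constraint gives $\lambda = \sum_{k = 1}^C N_k = N$, and therefore $\widehat{\pi}_k = \frac{N_k}{N}$.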
Estimation of $p(X | Y)$
The MLEs of $p(X_i | Y)$ depend on the choice of the class conditional density for feature $X_i$. We discuss some common choices below.
Discrete features
If a feature $X_i$ is a discrete random variable, we can use a categorical distribution (or a Bernoulli distribution for binary features).
The MLE of the parameters of the categorical (or Bernoulli) distribution is simply the empirical conditional frequency:

\[\begin{aligned} \widehat{p}(X_i = k | Y = y) &= \frac{N_{X_i = k, Y = y}}{N_{Y = y}} \\ &= \frac{\sum_{n = 1}^N \mathbb{I}(X_{ni} = k, Y_n = y)}{\sum_{n = 1}^N \mathbb{I}(Y_n = y)} \end{aligned}\]
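Since these estimates are just count ratios, they can be computed in a few lines. The sketch below computes $\widehat{p}(X_i = 1 | Y = y)$ for binary features with NumPy; the toy data and variable names are illustrative only.

```python
import numpy as np

# Toy training set: N = 6 samples, D = 3 binary features, labels in {0, 1}.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 1, 1],
              [1, 1, 0]])
Y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(Y)
# theta[y, i] = #{X_ni = 1 and Y_n = y} / #{Y_n = y}
theta = np.array([X[Y == y].mean(axis=0) for y in classes])
prior = np.array([np.mean(Y == y) for y in classes])   # MLE of p(Y = y)

print(prior)   # [0.5 0.5]
print(theta)   # per-class empirical frequencies of X_i = 1
```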
Continuous features
If a feature $X_i$ is a continuous, real-valued random variable, we can use a Gaussian distribution:

\[p(X_i | Y = y) = \mathcal{N}(X_i | \mu_{iy}, \sigma_{iy}^2)\]

The MLE of the parameters of the Gaussian distribution is clearly the per-class sample mean and variance:

\[\begin{aligned} \widehat{\mu}_{iy} &= \frac{\sum_{n = 1}^N \mathbb{I}(Y_n = y) \, X_{ni}}{\sum_{n = 1}^N \mathbb{I}(Y_n = y)} \\ \widehat{\sigma}_{iy}^2 &= \frac{\sum_{n = 1}^N \mathbb{I}(Y_n = y) \, (X_{ni} - \widehat{\mu}_{iy})^2}{\sum_{n = 1}^N \mathbb{I}(Y_n = y)} \end{aligned}\]
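Putting the pieces together, here is a compact Gaussian naive Bayes fit-and-predict sketch based on the MLE formulas above. It is an illustrative implementation under this post's assumptions, not a reference to any particular library's API.

```python
import numpy as np

class GaussianNaiveBayes:
    def fit(self, X, Y):
        self.classes = np.unique(Y)
        # MLE of p(Y = y) and of the per-class, per-feature Gaussian parameters.
        self.prior = np.array([np.mean(Y == y) for y in self.classes])
        self.mu = np.array([X[Y == y].mean(axis=0) for y in self.classes])
        self.var = np.array([X[Y == y].var(axis=0) for y in self.classes])
        return self

    def predict(self, X):
        # log p(Y=y) + sum_i log N(x_i | mu_iy, var_iy), evaluated per class.
        log_post = np.log(self.prior) + np.array([
            -0.5 * np.sum(np.log(2 * np.pi * self.var[c])
                          + (X - self.mu[c]) ** 2 / self.var[c], axis=1)
            for c in range(len(self.classes))
        ]).T
        return self.classes[np.argmax(log_post, axis=1)]

# Toy usage with two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, size=(50, 2))
X1 = rng.normal([3, 3], 1.0, size=(50, 2))
X, Y = np.vstack([X0, X1]), np.array([0] * 50 + [1] * 50)
model = GaussianNaiveBayes().fit(X, Y)
print(np.mean(model.predict(X) == Y))  # training accuracy, should be high
```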
Reference
[1] Kevin P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press 2022.
[2] Bayes classifier, Wikipedia