
Bayes Classifier

The Bayes classifier is a simple generative approach to classification. As the name suggests, the model is governed by Bayes’ Theorem, and it is a representative example of a statistical, supervised method of classification.

Consider a general classification problem. We have to find a classifier, or target function $f$, which takes a feature vector $X$ as input and outputs a class $Y$. Let $C$ be the set of all possible classes. Then, for a given input feature vector $X$, the probability that $X$ corresponds to some class $Y=y$ is

$$p(Y=y \mid X) = \frac{p(Y=y)\,p(X \mid Y=y)}{p(X)}$$

by Bayes’ Theorem. Intuitively, we can assign to $X$ the class with the highest posterior probability. In other words,

$$f(X) = \arg\max_{y \in C} \frac{p(Y=y)\,p(X \mid Y=y)}{p(X)}$$

Note that $\frac{p(Y=y)\,p(X \mid Y=y)}{p(X)} \propto p(Y=y)\,p(X \mid Y=y)$, since $p(X)$ is independent of the optimization problem with respect to $Y$. So, we can define the Bayes classifier $f$ as

$$f(X) = \arg\max_{y \in C}\, p(Y=y)\,p(X \mid Y=y).$$

Therefore, we can construct a Bayes classifier if we know the distribution of the class $Y$ and the conditional distribution of $X \mid Y=y$ for all possible classes $y \in C$.
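
For a quick feel for the rule, here is a toy numeric example (the numbers are made up for illustration). Suppose there are two classes with priors $p(Y=1)=0.6$ and $p(Y=2)=0.4$, and for an observed $x$ the likelihoods are $p(x \mid Y=1)=0.2$ and $p(x \mid Y=2)=0.5$. Then $p(Y=1)\,p(x \mid Y=1)=0.12$ while $p(Y=2)\,p(x \mid Y=2)=0.20$, so $f(x)=2$: the less frequent class wins because its likelihood is large enough to overcome the smaller prior.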

Optimality of Bayes Classifier

Above, I defined the Bayes classifier simply by intuition. But this definition is in fact optimal with respect to the risk of misclassification $p(f(X) \neq Y)$, where $Y$ is the true class label of $X$. In other words, for any other classifier $f'$, $p(f(X) \neq Y) \leq p(f'(X) \neq Y)$. The proof proceeds as follows.

$$\begin{aligned}
p(f'(X) \neq Y) &= E_{X,Y}\left[ I\{f'(X) \neq Y\} \right] \\
&= \int \sum_{y \in C} I\{f'(X) \neq y\}\, p(X, Y=y)\, dX \\
&= \int \sum_{y \in C} I\{f'(X) \neq y\}\, p(Y=y \mid X)\, p(X)\, dX \\
&= E_X\left[ \sum_{y \in C} I\{f'(X) \neq y\}\, p(Y=y \mid X) \right]
\end{aligned}$$

Then the claim reduces to showing the following inequality.

$$E_X\left[ \sum_{y \in C} I\{f(X) \neq y\}\, p(Y=y \mid X) \right] \leq E_X\left[ \sum_{y \in C} I\{f'(X) \neq y\}\, p(Y=y \mid X) \right]$$

Finally,

$$\begin{aligned}
\sum_{y \in C} I\{f(X) \neq y\}\, p(Y=y \mid X) - \sum_{y \in C} I\{f'(X) \neq y\}\, p(Y=y \mid X)
&= \sum_{y \neq f(X)} p(Y=y \mid X) - \sum_{y \neq f'(X)} p(Y=y \mid X) \\
&= \bigl(1 - p(Y=f(X) \mid X)\bigr) - \bigl(1 - p(Y=f'(X) \mid X)\bigr) \\
&= p(Y=f'(X) \mid X) - p(Y=f(X) \mid X) \\
&\leq 0
\end{aligned}$$

since $f(X) = \arg\max_{y \in C} p(Y=y \mid X)$.
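
As a quick sanity check of this optimality claim, the sketch below enumerates every deterministic classifier on a small discrete problem (the joint distribution is made up purely for illustration) and confirms that none of them achieves a lower misclassification risk than the Bayes rule.

```python
import itertools
import numpy as np

# Toy joint distribution p(X, Y): rows are X in {0, 1, 2}, columns are Y in {0, 1}.
# The numbers are illustrative only; they just need to sum to 1.
p_xy = np.array([[0.10, 0.25],
                 [0.30, 0.05],
                 [0.15, 0.15]])

def risk(rule):
    """Misclassification probability p(rule(X) != Y) for a rule given as
    a tuple (label for X=0, label for X=1, label for X=2)."""
    return sum(p_xy[x, y] for x in range(3) for y in range(2) if rule[x] != y)

# Bayes rule: for each x, pick the class maximizing p(Y=y | X=x),
# equivalently the joint p(X=x, Y=y), since p(X=x) is a common factor.
bayes_rule = tuple(int(np.argmax(p_xy[x])) for x in range(3))

# Compare against every deterministic classifier on this feature space.
assert all(risk(bayes_rule) <= risk(r)
           for r in itertools.product([0, 1], repeat=3))
print("Bayes rule:", bayes_rule, "risk:", risk(bayes_rule))  # risk = 0.3
```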

Naive Bayes Classifier

Now, we will discuss a special case of the Bayes classifier which assumes the features are conditionally independent given the class. The model is called “naive” because we do not really expect the features to be independent, even given the class label. But even when the assumption does not hold, this classifier often works quite well! One reason for this is that the model is quite simple (it has only $O(CD)$ parameters for $C$ classes and $D$ features), so it is relatively immune to overfitting. [1]

Let $X = (X_1, \ldots, X_D) \in \mathbb{R}^D$. By the conditional independence assumption, we can write the Bayes classifier as

$$f(X) = \arg\max_{y \in C}\, p(Y=y) \prod_{i=1}^{D} p(X_i \mid Y=y)$$

This is known as the naive Bayes classifier.

Example. Bernoulli distribution

(Reference: Kevin P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press 2022.)

In the case of binary features, $X_d \in \{0, 1\}$, we can use the Bernoulli distribution for $p(X \mid Y=y)$:

$$p(X \mid Y=y) = \prod_{d=1}^{D} \mathrm{Ber}(X_d \mid \theta_{yd})$$

where $\theta_{yd}$ is the probability that $X_d = 1$ in class $Y=y$.
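
To make the Bernoulli case concrete, here is a minimal prediction sketch in log space, assuming the priors and the $\theta_{yd}$ table are already known; the numbers below are invented for illustration, not taken from any fitted model.

```python
import numpy as np

# Hypothetical parameters for 2 classes and 3 binary features.
priors = np.array([0.5, 0.5])          # p(Y = y)
theta = np.array([[0.8, 0.1, 0.3],     # theta[y, d] = p(X_d = 1 | Y = y)
                  [0.2, 0.7, 0.6]])

def predict(x):
    """Return argmax_y of log p(Y=y) + sum_d log Ber(x_d | theta[y, d])."""
    x = np.asarray(x)
    log_lik = x * np.log(theta) + (1 - x) * np.log(1 - theta)   # shape (2, 3)
    return int(np.argmax(np.log(priors) + log_lik.sum(axis=1)))

print(predict([1, 0, 0]))  # 0: the features match the class-0 pattern
print(predict([0, 1, 1]))  # 1: the features match the class-1 pattern
```

Working with log-probabilities avoids numerical underflow when $D$ is large, and it does not change the argmax since the logarithm is monotonic.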





Model fitting of (Naive) Bayes Classifier

Recall that we can construct a Bayes classifier if we know the distribution of the class $Y$ and the conditional distribution of $X \mid Y=y$ for all possible classes $y \in C$. But in general we do not know these distributions, and we have to estimate them. To select the distributions for the classifier from the training data, we may use an estimation method such as MLE, MAP, or optimization of a loss criterion. From now on, we discuss how to fit a (naive) Bayes classifier using maximum likelihood estimation (MLE).

Estimation of p(Y)

Let $C$ be the set of possible classes for $Y$:

$$C = \{1, 2, \ldots, C\}$$

Then, we can view the distribution of $Y$ as a categorical distribution.

And the MLE for a categorical distribution is just the empirical count, $\hat{p}(Y=k) = \frac{N_k}{N}$, where $N_k$ is the number of training samples with label $k$ and $N$ is the total number of samples. The proof is simple; you can refer to the following link.
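
As a small illustration, the empirical-count estimate can be computed directly; the label array below is hypothetical.

```python
import numpy as np

# Hypothetical training labels; the MLE of p(Y=k) is the class frequency N_k / N.
y = np.array([0, 2, 1, 1, 0, 1, 2, 1])
classes, counts = np.unique(y, return_counts=True)
prior_hat = counts / counts.sum()
print(dict(zip(classes.tolist(), prior_hat.tolist())))  # {0: 0.25, 1: 0.5, 2: 0.25}
```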

Estimation of p(X|Y)

The MLEs of $p(X_i \mid Y)$ depend on the choice of the class-conditional density for feature $X_i$. We discuss some common choices below.

Discrete features

If an independent feature $X_i$ can be represented as a discrete random variable, we can use a categorical distribution, or a Bernoulli distribution for binary features.

The MLE of the parameters of the categorical (or Bernoulli) distribution is clearly the ratio of counts:

$$\hat{p}(X_i = k \mid Y = y) = \frac{N_{X_i = k,\, Y = y}}{N_{Y = y}} = \frac{\sum_{n=1}^{N} I(X_{ni} = k,\, Y_n = y)}{\sum_{n=1}^{N} I(Y_n = y)}$$
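
For example, this ratio of counts can be computed as below for a single discrete feature; the training column and labels are made up for illustration.

```python
import numpy as np

# Hypothetical values of one discrete feature X_i and the labels Y.
x_i = np.array([0, 1, 1, 0, 2, 1, 0, 2])
y   = np.array([0, 0, 1, 1, 1, 0, 0, 1])

def cond_mle(x_i, y, k, c):
    """MLE of p(X_i = k | Y = c): #{n : X_ni = k and Y_n = c} / #{n : Y_n = c}."""
    mask = (y == c)
    return (x_i[mask] == k).mean()

print(cond_mle(x_i, y, k=1, c=0))  # 0.5: two of the four samples with y=0 have x_i=1
```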

Continuous features

If an independent feature $X_i$ can be represented as a continuous, real-valued random variable, we can use a Gaussian distribution:

$$X_i \mid Y = y \sim \mathcal{N}(\mu_y, \sigma_y^2)$$

The MLEs of the parameters of the Gaussian distribution are clearly

$$\hat{\mu}_y = \frac{1}{\sum_{n=1}^{N} I(Y_n = y)} \sum_{n=1}^{N} I(Y_n = y)\, X_{ni},
\qquad
\hat{\sigma}_y^2 = \frac{1}{\sum_{n=1}^{N} I(Y_n = y)} \sum_{n=1}^{N} I(Y_n = y)\, \bigl[X_{ni} - \hat{\mu}_y\bigr]^2$$
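
In code, these are just the per-class sample mean and the (biased) sample variance of the feature; the sketch below uses made-up data for illustration.

```python
import numpy as np

# Hypothetical continuous feature values and class labels.
x_i = np.array([1.0, 2.0, 0.5, 3.0, 2.5, 1.5])
y   = np.array([0,   1,   0,   1,   1,   0  ])

for c in np.unique(y):
    xc = x_i[y == c]
    mu_hat = xc.mean()                      # estimate of mu_y
    var_hat = ((xc - mu_hat) ** 2).mean()   # MLE of sigma^2_y: divides by N_y, not N_y - 1
    print(f"class {c}: mean {mu_hat:.2f}, variance {var_hat:.3f}")
```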




Reference

[1] Kevin P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022.
[2] Bayes classifier, Wikipedia.
