[ML] Information Theory and Entropy
Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. It is also applied in machine learning to analyze probability distributions, for example to measure the similarity between two distributions. In this post, we will see some key notions from information theory that are required to understand further topics in ML.
Information
In information theory, the self-information (also called surprisal or Shannon information) of some data means ‘the degree of uncertainty, the degree of surprise’. The basic intuition for quantifying this abstract value is that an unlikely event gives us more information than a likely event.
For example, a message saying “the sun rose this morning” is uninformative because we witness a sunrise every morning, but a message saying “there was a solar eclipse this morning” is much more informative.
With this intuition, the self-information of an event $X = x$ can be defined as follows:
$$ \begin{aligned} I(x) = -\text{ln } p(x) \end{aligned} $$
Note that we can also use a logarithm with another base, such as $2$. If the logarithm has base $e$, we measure $I(x)$ in units of nats; if the logarithm has base $2$, we measure $I(x)$ in units of bits or shannons.
It is reasonable and obvious that independent events should have additive information, i.e.
$$ \begin{aligned} I(x, y) = I(x) + I(y) \quad \text{when } p(x, y) = p(x)\, p(y). \end{aligned} $$
For instance, learning that a coin has come up heads twice should reveal twice as much information as learning that it has come up heads only once. Our definition clearly captures this intuition through the product rule of the logarithm: $-\text{ln } p(x)p(y) = -\text{ln } p(x) - \text{ln } p(y)$.
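As a quick sanity check, here is a minimal Python sketch (the helper `self_information` is our own, not from any particular library) that computes self-information in nats and bits and confirms the additivity for two independent coin flips:

```python
import math

def self_information(p: float, base: float = math.e) -> float:
    """Self-information -log_base p(x) of an event with probability p."""
    return -math.log(p, base)

# One fair coin flip: 1 bit (= ln 2 ≈ 0.693 nats) of information.
one_flip_bits = self_information(0.5, base=2)    # 1.0 bit
one_flip_nats = self_information(0.5)            # ≈ 0.693 nats

# Two independent flips both landing heads: p = 0.5 * 0.5 = 0.25.
two_flips_bits = self_information(0.25, base=2)  # 2.0 bits

print(one_flip_bits, one_flip_nats, two_flips_bits)
assert math.isclose(two_flips_bits, 2 * one_flip_bits)  # additivity for independent events
```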
Entropy
Observe that self-information deals only with a single outcome. To quantify the amount of uncertainty in a whole probability distribution, we define the (Shannon) entropy $\mathbb{H}$:
$$ \begin{aligned} \mathbb{H}(X) = \mathbb{E}_{X \sim p} [I(x)] = - \mathbb{E}_{X \sim p} [\text{ln } p(x)] \end{aligned} $$
In other words, entropy can be considered as the expected amount of information in an event drawn from that distribution, i.e., a measure of how random the random variable is.
In fact, it attains its maximum value when $X$ has the uniform distribution, in which all events occur with the same probability. [3] Two basic properties are immediate (a numerical sketch follows the list below):
- $\mathbb{H}(X) \geq 0$ since $p(x) \in [0, 1]$
- $\mathbb{H}(X) = 0$ if there exists some $x$ with $p(x) = 1$
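Here is a minimal sketch of the discrete entropy in nats, using a few hand-picked distributions over four outcomes; it illustrates that the uniform distribution gives the largest entropy and a deterministic one gives zero:

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p(x) ln p(x) in nats (with 0 ln 0 := 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximal uncertainty over 4 outcomes
skewed  = [0.70, 0.10, 0.10, 0.10]   # less uncertain
certain = [1.00, 0.00, 0.00, 0.00]   # no uncertainty at all

print(entropy(uniform))   # ln 4 ≈ 1.386
print(entropy(skewed))    # ≈ 0.940 < ln 4
print(entropy(certain))   # 0.0
```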
Conditional & Joint Entropy
Sometimes we also need the amount of information when two events or distributions are related. Just as with probability distributions, we can define conditional and joint entropy:
For $(X, Y) \sim p(x, y)$, the conditional entropy $\mathbb{H}(Y \mid X)$ is defined by
$$ \begin{aligned} \mathbb{H}(Y \mid X) = \mathbb{E}_{(X, Y) \sim p(x, y)} [I(Y \mid X)] = - \mathbb{E}_{(X, Y) \sim p(x, y)}[\text{ln }p(Y \mid X)]. \end{aligned} $$ For example, in the discrete case,
$$ \begin{aligned} \mathbb{H}(Y \mid X) = -\sum_{x, y} p(x, y) \text{ ln } p(y \mid x). \end{aligned} $$
For $(X, Y) \sim p(x, y)$, the joint entropy $\mathbb{H}(X, Y)$ is defined by
$$ \begin{aligned} \mathbb{H}(X, Y) = \mathbb{E}_{(X, Y) \sim p(x, y)} [I(X, Y)] = - \mathbb{E}_{(X, Y) \sim p(x, y)}[\text{ln } p(X, Y)]. \end{aligned} $$
We see that $$ \begin{aligned} \mathbb{H}(X, Y) = \mathbb{H}(Y \mid X) + \mathbb{H}(X) \end{aligned} $$ as
$$ \begin{aligned} \mathbb{H}(X, Y) &= - \mathbb{E}_{(X, Y) \sim p(x, y)}\left[\text{ln }p(X, Y)\right]= - \mathbb{E}_{(X, Y) \sim p(x, y)}\left[\text{ln }p(Y \mid X) p(X)\right] \\ &= - \mathbb{E}_{(X, Y) \sim p(x, y)}\left[\text{ln }p(Y \mid X) + \text{ln } p(X)\right] \\ &= - \mathbb{E}_{(X, Y) \sim p(x, y)}\left[\text{ln }p(Y \mid X)\right] - \mathbb{E}_{X \sim p(x)}\left[\text{ln } p(X)\right] \\ &= \mathbb{H}(Y \mid X) + \mathbb{H}(X). \end{aligned} $$
Thus, if $X$ and $Y$ are independent, then $\mathbb{H}(Y \mid X) = \mathbb{H}(Y)$ and the chain rule becomes
$$ \begin{aligned} \mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y), \end{aligned} $$
which implies that independent uncertainties are additive.
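We can also verify the chain rule $\mathbb{H}(X, Y) = \mathbb{H}(Y \mid X) + \mathbb{H}(X)$ numerically on a small, made-up joint table:

```python
import math

# A toy joint distribution p(x, y) over X in {0, 1}, Y in {0, 1}.
p_xy = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

# Marginal p(x) obtained by summing out y.
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}

def H_joint(p_xy):
    return -sum(p * math.log(p) for p in p_xy.values() if p > 0)

def H_marginal(p_x):
    return -sum(p * math.log(p) for p in p_x.values() if p > 0)

def H_cond(p_xy, p_x):
    # H(Y | X) = -sum_{x,y} p(x, y) ln p(y | x), where p(y | x) = p(x, y) / p(x)
    return -sum(p * math.log(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

lhs = H_joint(p_xy)
rhs = H_cond(p_xy, p_x) + H_marginal(p_x)
assert math.isclose(lhs, rhs)   # chain rule holds
print(lhs, rhs)
```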
Kullback-Leibler (KL) Divergence
We may want to measure the similarity of two different probability distributions $p$ and $q$. We use the KL divergence to measure this:
The Kullback-Leibler (KL) divergence $\text{KL}(p \; \| \; q)$ between two probability mass (density) functions $p(x)$ and $q(x)$ is defined by
$$ \begin{aligned} \text{KL}(p \; \| \; q) = \mathbb{E}_{X \sim p(X)} \left[\text{log } \frac{p(X)}{q(X)} \right] \end{aligned} $$
Note that it may be undefined in some special cases. Here, we adopt the following conventions, for $a, b > 0$: $0 \text{ log } \frac{0}{0} = 0$, $\; 0 \text{ log } \frac{0}{b} = 0$, and $a \text{ log } \frac{a}{0} = \infty$.
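A minimal discrete implementation of the definition (the helper below is our own and sums only over the support of $p$, consistent with the conventions above):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)), skipping terms with p(x) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.4, 0.4, 0.2]
q = [0.3, 0.3, 0.4]
print(kl_divergence(p, q))   # ≈ 0.092 > 0 since p != q
print(kl_divergence(p, p))   # exactly 0
```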
There are two useful and notable properties of the KL divergence: non-negativity and asymmetry.
The KL divergence is always nonnegative, i.e.,
$$ \begin{aligned} \text{KL}( p \; \| \; q) \geq 0. \end{aligned} $$ Moreover, $\text{KL}( p \; \| \; q) = 0$ if and only if $p(x) = q(x)$ for all $x.$
$\mathbf{Proof.}$
Define a set
\[\begin{aligned} \mathcal{S} = \{ x \mid p(x) > 0 \} \end{aligned}\]Then,
\[\begin{aligned} \text{KL}(p \; \| \; q) &= \sum_{x \in \mathcal{S}} p(x) \text{ log }\frac{p(x)}{q(x)} \\ &= - \sum_{x \in \mathcal{S}} p(x) \text{ log }\frac{q(x)}{p(x)} \\ &\geq - \text{ log } \left[\sum_{x \in \mathcal{S}} p(x) \frac{q(x)}{p(x)} \right] \quad \text{ by Jensen's Inequality } \\ &= - \text{ log } \left[ \sum_{x \in \mathcal{S}} q(x) \right] \geq 0. \end{aligned}\]From the above derivation, we see that the equality holds if and only if $\frac{q(x)}{p(x)} = c$ for all possible values of $x$. In this case, $1 = \sum_x q(x) = c \sum_x p(x) = c$ and hence $q(x) = p(x)$ for all $x$.
\[\tag*{$\blacksquare$}\]The KL divergence $\text{KL}(p \; \| \; q)$ is not a true distance because it is not symmetric and does not satisfy the triangle inequality.
For instance, consider two Bernoulli distributions $p$ and $q$ with respective success probabilities $r$ and $s$ ($r \neq s$). Then
$$ \begin{aligned} \text{KL}(p \; \| \; q) & = (1-r) \text{ log }(\frac{1-r}{1-s}) + r \text{ log }(\frac{r}{s}) \\ \text{KL}(q \; \| \; p) & = (1-s) \text{ log }(\frac{1-s}{1-r}) + s \text{ log }(\frac{s}{r}) \end{aligned} $$
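Plugging in concrete numbers (the values $r = 0.5$ and $s = 0.9$ are arbitrary choices for illustration) makes the asymmetry explicit:

```python
import math

def kl_bernoulli(r: float, s: float) -> float:
    """KL(Ber(r) || Ber(s)) in nats."""
    return (1 - r) * math.log((1 - r) / (1 - s)) + r * math.log(r / s)

r, s = 0.5, 0.9
print(kl_bernoulli(r, s))   # KL(p || q) ≈ 0.511
print(kl_bernoulli(s, r))   # KL(q || p) ≈ 0.368  -> different value: not symmetric
```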
How can we determine the order of $p$ and $q$ in the KL divergence?
Suppose we have $p(x)$ and wish to approximate it with another distribution $q(x)$. We have to choose which KL divergence to minimize: \(\text{KL}(p \; \| \; q)\) or \(\text{KL}(q \; \| \; p)\).
The bottom line is that the choice of direction is problem dependent. Sometimes we want an approximation that places high probability wherever the true distribution places high probability, while in other cases we want an approximation that rarely places high probability where the true distribution places low probability. The direction of the KL divergence reflects which of these considerations takes priority in each case.
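As a rough illustration, the sketch below is entirely our own toy construction: a bimodal target $p$ on $\{0, \dots, 20\}$ and a single-bump candidate family $q_\theta = \text{Binomial}(20, \theta)$. Grid-searching $\theta$ under each direction shows that minimizing $\text{KL}(p \; \| \; q)$ prefers a $q$ that spreads out to cover both modes, while minimizing $\text{KL}(q \; \| \; p)$ prefers a $q$ that concentrates on a single mode.

```python
import math

N = 20
xs = list(range(N + 1))

# Bimodal target p: nearly all mass at x = 2 and x = 18, plus a tiny floor
# everywhere so that log p(x) stays finite in the reverse direction.
floor = 1e-6
p = [floor] * (N + 1)
p[2] = p[18] = (1.0 - floor * (N - 1)) / 2
total = sum(p)
p = [v / total for v in p]

def q_pmf(theta):
    """Single-bump candidate family: Binomial(N, theta)."""
    return [math.comb(N, x) * theta**x * (1 - theta)**(N - x) for x in xs]

def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

thetas = [i / 100 for i in range(1, 100)]
theta_forward = min(thetas, key=lambda t: kl(p, q_pmf(t)))  # minimize KL(p || q)
theta_reverse = min(thetas, key=lambda t: kl(q_pmf(t), p))  # minimize KL(q || p)

print(theta_forward)  # 0.5: q spreads between the two modes ("mode covering")
print(theta_reverse)  # near 0.1 (or 0.9 by symmetry): q locks onto one mode ("mode seeking")
```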
Mutual Information
Now we can measure the similarity of two different distributions with the KL divergence. Recall that two random variables $X$ and $Y$ are independent if and only if $p(x, y) = p(x) p(y)$. That means, by measuring the similarity between $p(x, y)$ and $p(x)p(y)$, we can quantify how far $X$ and $Y$ are from being independent.
For $(X, Y) \sim p(x, y)$, the mutual information $I(X; Y)$ of $X$ and $Y$ is defined by
$$ \begin{aligned} I(X; Y) = \text{KL}\left(p(x, y) \; \| \; p(x)p(y) \right) = \mathbb{E}_{(X, Y) \sim p(x, y)} \left[ \text{log }\frac{p(X,Y)}{p(X)p(Y)} \right]. \end{aligned} $$
The reason why it is called mutual information can be inferred from the fact that
$$ \begin{aligned} I(X; Y) = \mathbb{H}(X) - \mathbb{H}(X \mid Y) = \mathbb{H}(Y) - \mathbb{H}(Y \mid X). \end{aligned} $$
So, $I(X; Y)$ is the reduction in the uncertainty of $X$ (or $Y$) obtained by being told the value of $Y$ (or $X$).
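With the same kind of made-up joint table as before, one can check numerically that $\text{KL}(p(x, y) \; \| \; p(x)p(y))$ indeed equals $\mathbb{H}(X) - \mathbb{H}(X \mid Y)$:

```python
import math

p_xy = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

# I(X; Y) = KL(p(x, y) || p(x) p(y))
mi = sum(p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

# H(X) - H(X | Y)
H_x = -sum(p * math.log(p) for p in p_x.values())
H_x_given_y = -sum(p * math.log(p / p_y[y]) for (_, y), p in p_xy.items() if p > 0)

print(mi, H_x - H_x_given_y)          # the two quantities coincide
assert math.isclose(mi, H_x - H_x_given_y)
```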
Cross Entropy
A quantity that is closely related to the KL divergence is the cross-entropy.
The cross entropy $\mathbb{H}_p (q)$ of a distribution $q(x)$ relative to a distribution $p(x)$ is defined by $$ \begin{aligned} \mathbb{H}_p (q) = \mathbb{H}_p (p) + \text{KL} (p \; \| \; q) = - \mathbb{E}_p [\text{log }q(X)]. \end{aligned} $$
Note that $\mathbb{H}_p(q) \geq \mathbb{H}_p(p)$:
\[\begin{aligned} \text{KL} (p \; \| \; q) = \sum_{x \in E} p(x) \text{ log } \frac{p(x)}{q(x)} = - \sum_{x \in E} p(x) \text{ log }\frac{q(x)}{p(x)} = \mathbb{H}_p(q) - \mathbb{H}_p(p). \end{aligned}\]That means the cross entropy reaches its lower bound $\mathbb{H}_p(p)$ exactly when $p = q$. In fact, minimizing the cross entropy with respect to $q$ is equivalent to minimizing the KL divergence $\text{KL}(p \; \| \; q)$. Because of this, cross entropy is often used as a loss function in ML.
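A small check of the identity $\mathbb{H}_p(q) = \mathbb{H}_p(p) + \text{KL}(p \; \| \; q)$ with two made-up discrete distributions; the cross entropy exceeds its lower bound whenever $q \neq p$:

```python
import math

def cross_entropy(p, q):
    """H_p(q) = -sum p(x) log q(x) (natural log)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model prediction

# Identity: H_p(q) = H_p(p) + KL(p || q)
assert math.isclose(cross_entropy(p, q), cross_entropy(p, p) + kl(p, q))

# The lower bound H_p(p) is attained only at q = p.
print(cross_entropy(p, q), cross_entropy(p, p))   # ≈ 0.887 > ≈ 0.802
```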
Reference
[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
[2] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[3] Maximum entropy probability distribution
[4] Kevin P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022.