[Statistics] Probability and Distributions - Part II
(This post is a summary of Chapter 1 of [1])
1.6 Discrete Random Variable
The first type of random variable we consider is the discrete random variable, which is defined next.
$\mathbf{Def\ 1.6.1}$ (Discrete Random Variable). We say a random variable is a discrete random variable if its space is either finite or countable.
$\mathbf{Def\ 1.6.2}$ (Probability Mass Function (pmf)). Let $X$ be a discrete random variable with space $\mathcal{D}$. The probability mass function (pmf) of $X$ is given by
\[\begin{align*} p_X (x) = P[X = x], \quad x \in \mathcal{D}. \end{align*}\]
Note that pmfs satisfy the following 2 properties:
(i) $0 \leq p_X (x) \leq 1$
(ii) $\sum_{x \in \mathcal{D}} p_X (x) = 1$
In a more advanced class it can be shown that if a function satisfies properties (i) and (ii) for a discrete set $\mathcal{D}$, then this function uniquely determines the distribution of a random variable.
$\mathbf{Definition}$ (Support). We define the support of a discrete random variable $X$ to be the points in the space of $X$ which have positive probability. We often use $\mathcal{S}$ to denote the support of $X$.
Note that $\mathcal{S} \subseteq \mathcal{D}$.
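As a quick illustration (my own sketch, not from [1]), the following Python snippet encodes a pmf on a finite space and checks properties (i) and (ii), together with the support $\mathcal{S}$; the specific probabilities are made up for the example.

```python
# Minimal sketch (illustrative pmf, not from [1]): a pmf on the space D = {1, 2, 3, 4}.
# One point of the space carries zero probability, so the support is smaller than D.
pmf = {1: 0.5, 2: 0.3, 3: 0.2, 4: 0.0}

# Property (i): 0 <= p_X(x) <= 1 for every x in D.
assert all(0.0 <= p <= 1.0 for p in pmf.values())

# Property (ii): the probabilities sum to 1 over D.
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# Support S = {x in D : p_X(x) > 0}; here S is a proper subset of D.
support = {x for x, p in pmf.items() if p > 0}
print(support)  # {1, 2, 3}
```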
$\mathbf{Remark}$. As $\mathbf{Thm\ 1.5.3}$ shows, discontinuities of $F_X (x)$ define a mass; that is, if $x$ is a point of discontinuity of $F_X$, then $P(X = x) > 0$.
1.6.1 Transformations
A problem often encountered in statistics is the following. We have a random variable $X$ and we know its distribution. We are interested, though, in a random variable $Y$ which is some transformation of $X$, say, $Y = g(X)$. In particular, we want to determine the distribution of $Y$. In this section, we consider two cases of transformations of a discrete random variable.
$g$ is one-to-one
Clearly, the pmf of $Y$ is obtained as
\[\begin{align*} p_Y (y) = P[Y=y] = P[g(X)=y] = P[X=g^{-1}(y)]=p_X (g^{-1}(y)). \end{align*}\]$\mathbf{Example\ 1.6.3,\ 1.6.4.}$
$g$ is not one-to-one
In this case, rather than developing an overall rule, we note that for most applications involving discrete random variables the pmf of $Y$ can be obtained in a straightforward manner.
$\mathbf{Example}$
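As a sketch of both cases (my own, not from [1]): for a discrete $X$, the pmf of $Y = g(X)$ can always be obtained by summing $p_X$ over the preimage $\{x : g(x) = y\}$; when $g$ is one-to-one each preimage is a single point and this reduces to $p_Y(y) = p_X(g^{-1}(y))$. The function name and the example pmf below are mine.

```python
from collections import defaultdict

def pmf_of_transform(pmf_x, g):
    """pmf of Y = g(X): sum p_X(x) over every x with g(x) = y.
    For one-to-one g each y receives exactly one term, namely p_X(g^{-1}(y))."""
    pmf_y = defaultdict(float)
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p
    return dict(pmf_y)

# X uniform on {-2, -1, 0, 1, 2}; Y = X^2, so g is not one-to-one.
pmf_x = {x: 0.2 for x in (-2, -1, 0, 1, 2)}
print(pmf_of_transform(pmf_x, lambda x: x * x))
# {4: 0.4, 1: 0.4, 0: 0.2}
```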
1.7 Continuous Random Variables
Another class of random variables important in statistical applications is the class of continuous random variables, which we define next.
$\mathbf{Def\ 1.7.1}$ (Continuous Random Variable). We say a random variable is a continuous random variable if its cdf $F_X (x)$ is a continuous function for all $x \in \mathbb{R}$.
$\mathbf{Note.}$ Recall from $\mathbf{Thm\ 1.5.3}$ that $P(X = x) = F_X (x) - F_X (x-)$, for any random variable $X$. Hence, for a continuous random variable $X$, there are no points of discrete mass; i.e., if $X$ is continuous, then $P(X=x) = 0$ for all $x \in \mathbb{R}$.
$\mathbf{Definition.}$ (Probability Density Function, pdf). If for continuous random variable $X$ we have that the cdf $F_X$ satisfies $F_X (x) = \int_{-\infty}^{x} f_X (t) \cdot dt$ for some function $f_X$, then $f_X$ is the probability density function (pdf) of $X$.
In this case, the support of $X$ is
\[\begin{align*} \mathcal{S} = \{ x \in \mathbb{R} | f_X (x) > 0 \} \end{align*}\]$\mathbf{Note.}$ If $f_X (x)$ is also continuous, then the Fundamental Theorem of Calculus implies that $\frac{d}{dx} F_X(x) = f_X(x)$.
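A small numerical sanity check (my own sketch, assuming SciPy is available; the exponential distribution is just my choice of illustration): integrating the pdf up to $x$ reproduces the cdf, and a finite-difference derivative of the cdf recovers the pdf where it is continuous.

```python
from scipy import integrate, stats

# Illustration: standard exponential pdf and cdf.
f = stats.expon.pdf
F = stats.expon.cdf

x = 1.7
# F_X(x) = integral of f_X(t) dt up to x (the support starts at 0 here).
cdf_by_integration, _ = integrate.quad(f, 0.0, x)
print(cdf_by_integration, F(x))               # both ~ 0.8173

# Fundamental Theorem of Calculus: d/dx F_X(x) = f_X(x) where f_X is continuous.
h = 1e-6
print((F(x + h) - F(x - h)) / (2 * h), f(x))  # both ~ 0.1827
```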
1.7.1 Quantiles
Quantiles (percentiles) are easily interpretable characteristics of a distribution.
$\mathbf{Def\ 1.7.2}$ (Quantile). Let $0 < p < 1$. The quantile of order $p$ of the distribution of a random variable $X$ is a value $\xi_p$ such that $P(X < \xi_p) \leq p$ and $P(X \leq \xi_p) \geq p$. It is also known as the $(100p)-\text{th}$ percentile of $X$.
The median of a random variable $X$ is the quantile of order $p = 1/2$, $\xi_{1/2}$. The first quartile, $q_1$, is the quantile of order $p = 1/4$, $\xi_{1/4}$, and the third quartile, $q_3$, is the quantile of order $p = 3/4$, $\xi_{3/4}$. The difference $\xi_{3/4} - \xi_{1/4}$ is the interquartile range of $X$.
The median is often used as a measure of center of the distribution of $X$ , while the interquartile range is used as a measure of spread or dispersion of the distribution of $X$.
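A short sketch (my own): for a continuous distribution with a strictly increasing cdf, $\xi_p$ is simply $F_X^{-1}(p)$, which SciPy exposes as `ppf`; the quartiles and interquartile range follow directly. The standard normal is just an example choice.

```python
from scipy import stats

X = stats.norm()                 # standard normal, for illustration

median = X.ppf(0.5)              # xi_{1/2}
q1, q3 = X.ppf(0.25), X.ppf(0.75)
iqr = q3 - q1

print(median)                    # 0.0
print(q1, q3, iqr)               # about -0.6745, 0.6745, 1.3490

# Defining property of a quantile: P(X <= xi_p) >= p (equality here).
print(X.cdf(q1))                 # 0.25
```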
1.7.2 Transformations
Let $X$ be a continuous random variable with a known pdf $f_X$. As in the discrete case, we are often interested in the distribution of a random variable $Y$ which is some transformation of $X$, say, $Y = g(X)$.
Often we can obtain the pdf of $Y$ by first obtaining its cdf.
$\mathbf{Example\ 1.7.4,\ 1.7.5.}$
$\mathbf{Example\ 1.7.4.}$ Let $X$ be the distance from the origin to the random point selected in the unit circle. Suppose instead that we are interested in the square of the distance; let $Y = X^2$.
Note that the support of $Y$ is the same as that of $X$, $\mathbf{S}_Y = (0,1)$. And, the cdf of $X$ is
\[\begin{align*} F_X (x) = \begin{cases} 0 & x < 0 \\ x^2 & 0 \leq x < 1 \\ 1 & x \geq 1. \end{cases} \end{align*}\]Let $y$ be in the support of $Y$; i.e., $0 < y < 1$. Then, the cdf of $Y$ is
\[\begin{align*} F_Y (y) = P(Y \leq y) = P(X^2 \leq y) = P(X \leq \sqrt{y}) = F_X (\sqrt{y}) = y, \end{align*}\]by the fact that the support of $X$ contains only positive numbers.
So, the pdf of $Y$ is
\[\begin{align*} f_Y (y) = \begin{cases} 1 & 0 < y < 1 \\ 0 & \text{elsewhere}; \end{cases} \end{align*}\]that is, $Y$ has a uniform distribution on $(0, 1)$.
$\mathbf{Example\ 1.7.5.}$ Let $f_X (x) = \frac{1}{2}, -1 < x < 1,$ zero elsewhere, be the pdf of a random variable $X$. Define the random variable $Y = X^2$. We wish to find the pdf of $Y$. If $y \geq 0$, the probability $P(Y \leq y)$ is equivalent to
\[\begin{align*} P(X^2 \leq y) = P(-\sqrt{y} \leq X \leq \sqrt{y}). \end{align*}\]Accordingly, the cdf of $Y$, $F_Y(y) = P(Y \leq y)$, is
\[\begin{align*} F_Y (y) = \begin{cases} 0 & y < 0 \\ \int_{-\sqrt{y}}^{\sqrt{y}} \frac{1}{2} \cdot dx = \sqrt{y} & 0 \leq y < 1 \\ 1 & y \geq 1. \end{cases} \end{align*}\]Hence, the pdf of $Y$ is given by
\[\begin{align*} f_Y (y) = \begin{cases} \frac{1}{2\sqrt{y}} & 0 < y < 1 \\ 0 & \text{elsewhere}. \end{cases} \end{align*}\]
These examples illustrate the cumulative distribution function technique.
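A Monte Carlo check of Example 1.7.5 (my own sketch, assuming NumPy): draw $X$ uniform on $(-1, 1)$, set $Y = X^2$, and compare the empirical cdf of $Y$ with the derived $F_Y(y) = \sqrt{y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200_000)  # X with pdf 1/2 on (-1, 1)
y = x ** 2                                # Y = X^2

# Example 1.7.5 gives F_Y(y) = sqrt(y) for 0 <= y < 1.
for y0 in (0.04, 0.25, 0.64):
    empirical = np.mean(y <= y0)
    print(y0, round(float(empirical), 3), np.sqrt(y0))  # empirical ~ sqrt(y0)
```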
The transformation in $\mathbf{Example\ 1.7.4.}$ is one-to-one, and in such cases we can obtain a simple formula for the pdf of $Y$ in terms of the pdf of $X$, which we record in the next theorem.
$\mathbf{Thm\ 1.7.1.}$ Let $X$ be a continuous random variable with pdf $f_X (x)$ and support $\mathbf{S_X}$. Let $Y = g(X)$, where $g(x)$ is a one-to-one differentiable function on the support of $X$, $\mathbf{S_X}$. Denote the inverse of $g$ by $x = g^{-1}(y)$ and let $dx / dy = d[g^{-1}(y)]/dy$. Then the pdf of $Y$ is given by
\[\begin{align*} f_Y (y) = f_X (g^{-1}(y)) \left\vert \frac{dx}{dy} \right\vert, \quad y \in \mathbf{S_Y}, \end{align*}\]where the support of $Y$ is the set $\mathbf{S_Y} = \{ y = g(x) : x \in \mathbf{S_X} \}$.
$\mathbf{Proof.}$
Since $g(x)$ is one-to-one and continuous, it is either strictly monotonically increasing or strictly monotonically decreasing. Assume that it is strictly monotonically increasing, for now. The cdf of $Y$ is given by
\[\begin{align*} F_Y(y) = P[Y \leq y] = P[g(X) \leq y] = P[X \leq g^{-1} (y)] = F_X (g^{-1} (y)). \end{align*}\]Hence, the pdf of $Y$ is
\[\begin{align*} f_Y(y) = f_X (g^{-1} (y)) \frac {dx}{dy}. \end{align*}\]In this case, since $g$ is increasing, $\frac{dx}{dy} > 0$, so we can write $\frac{dx}{dy} = \left\vert \frac{dx}{dy} \right\vert$.
Otherwise, $g(x)$ is strictly monotonically decreasing. Then,
\[\begin{align*} F_Y(y) = P[Y \leq y] = P[g(X) \leq y] = P[X \geq g^{-1} (y)] = 1 - F_X (g^{-1} (y)). \end{align*}\]Hence, the pdf of $Y$ is $f_Y(y) = f_X (g^{-1} (y)) \left( -\frac {dx}{dy} \right)$. Since $g$ is decreasing, $dx/dy < 0$, so that $-\frac{dx}{dy} = \left\vert \frac{dx}{dy} \right\vert. _\blacksquare$
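A sketch applying $\mathbf{Thm\ 1.7.1}$ to a concrete map (my example, not from [1]): let $X$ be uniform on $(0,1)$ and $Y = g(X) = -\log(1 - X)$, which is one-to-one and increasing with $g^{-1}(y) = 1 - e^{-y}$ and $dx/dy = e^{-y}$, so the theorem gives $f_Y(y) = e^{-y}$ on $(0, \infty)$. Simulation agrees with this pdf.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(200_000)          # X uniform on (0, 1), so f_X = 1 on the support
y = -np.log(1.0 - x)             # Y = g(X), g one-to-one and increasing

# Thm 1.7.1: g^{-1}(y) = 1 - e^{-y}, dx/dy = e^{-y} > 0,
# hence f_Y(y) = f_X(g^{-1}(y)) * |dx/dy| = e^{-y} on (0, infinity).

# Compare empirical bin probabilities with integrals of f_Y over the bins.
edges = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
for a, b in zip(edges[:-1], edges[1:]):
    empirical = np.mean((y >= a) & (y < b))
    exact = np.exp(-a) - np.exp(-b)          # integral of e^{-y} over [a, b)
    print(round(float(empirical), 3), round(float(exact), 3))
```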
$\mathbf{Example\ 1.7.6.}$
1.7.3 Mixtures of Discrete and Continuous Type Distributions
Note that distributions that are mixtures of the continuous and discrete type do, in fact, occur frequently in practice.
$\mathbf{Example\ 1.7.7.}$
1.8 Expectation of a Random Variable
In this section we introduce the expectation operator, which we use throughout the remainder of the text. For the definition, recall from calculus that absolute convergence of sums or integrals implies their convergence.
$\mathbf{Def\ 1.8.1}$ (Expectation). Let $X$ be a random variable. If $X$ is a continuous random variable with pdf $f(x)$ and
\[\begin{align*} \int_{-\infty}^{\infty} |x| f(x) \cdot dx < \infty, \end{align*}\]then the expectation of $X$ is
\[\begin{align*} \mathbb{E}(X) = \int_{-\infty}^{\infty} x f(x) \cdot dx. \end{align*}\]If $X$ is a discrete random variable with pmf $p(x)$ and
\[\begin{align*} \sum_{x} |x| p(x) < \infty \end{align*}\]then the expectation of $X$ is
\[\begin{align*} \mathbb{E}(X) = \sum_{x} x p(x). \ _\blacksquare \end{align*}\]Note that sometimes the expectation $\mathbb{E}(X)$ is called the mathematical expectation of $X$, the expected value of $X$, or the mean of $X$. When the mean designation is used, we often denote $\mathbb{E}(X)$ by $\mu$; i.e., $\mu = \mathbb{E}(X)$.
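A quick illustration of both cases of the definition (my own sketch, assuming SciPy for the integral; the pmf and the exponential pdf are example choices): the discrete mean is a weighted sum, the continuous mean an integral of $x f(x)$.

```python
from scipy import integrate, stats

# Discrete: E(X) = sum of x * p(x).  Example pmf (illustration only).
pmf = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}
mean_discrete = sum(x * p for x, p in pmf.items())
print(mean_discrete)                    # 1.7

# Continuous: E(X) = integral of x * f(x).  Standard exponential has mean 1.
mean_continuous, _ = integrate.quad(lambda x: x * stats.expon.pdf(x), 0.0, float("inf"))
print(mean_continuous)                  # ~ 1.0
```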
$\mathbf{Remark\ 1.8.1.}$ Origin of the terminology "expectation"
Expectation of a transformed variable
$\mathbf{Thm\ 1.8.1.}$ Let $X$ be a random variable and let $Y = g(X)$ for some function $g$.
(a) Suppose $X$ is continuous with pdf $f_X (x)$. If $\int_{-\infty}^{\infty} |g(x)|f_X (x) dx < \infty$, then
\[\begin{align*} \mathbb{E}(Y) = \int_{-\infty}^{\infty} g(x) f_X (x) \cdot dx. \end{align*}\]
(b) Suppose $X$ is discrete with pmf $p_X (x)$. Suppose the support of $X$ is denoted by $\mathbf{S_X}$. If $\sum_{x \in \mathbf{S_X}} | g(x) | p_X (x) < \infty$, then
\[\begin{align*} \mathbb{E}(Y) = \sum_{x \in \mathbf{S_X}} g(x)p_X (x). \end{align*}\]$\mathbf{Proof.}$
Let’s prove the discrete case; the proof for the continuous case requires some advanced results in analysis.
- Existence
Because $\sum_{x \in \mathbf{S_X}} | g(x) | p_X (x)$ converges, it follows by a theorem in calculus that any rearrangement of the terms of the series converges to the same limit. Grouping the terms according to the distinct values $y$ of $g(x)$,
\[\begin{align*} \sum_{x \in \mathbf{S_X}} | g(x) | p_X (x) = \sum_{y \in \mathbf{S_Y}} \sum_{\{x : g(x) = y\}} | g(x) | p_X (x) = \sum_{y \in \mathbf{S_Y}} |y| \sum_{\{x : g(x) = y\}} p_X (x) = \sum_{y \in \mathbf{S_Y}} |y| p_Y (y), \end{align*}\]where $\mathbf{S_Y}$ denotes the support of $Y$. So $\mathbb{E}(Y)$ exists; i.e., $\sum_{x \in \mathbf{S_X}} g(x)p_X (x)$ converges.
- Equality
Since $\sum_{x \in \mathbf{S_X}} g(x)p_X (x)$ converges and also converges absolutely, the same theorem from calculus can be used, which shows that the above equalities hold without the absolute values. $_\blacksquare$
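A numerical check of the discrete case of $\mathbf{Thm\ 1.8.1}$ (my sketch; the pmf and $g$ are made up): computing $\mathbb{E}[g(X)]$ directly from $p_X$ agrees with first deriving the pmf of $Y = g(X)$ and then computing $\sum_{y} y \, p_Y(y)$.

```python
from collections import defaultdict

# Example pmf of X (illustration only) and the transformation g(x) = x^2.
pmf_x = {-2: 0.2, -1: 0.3, 0: 0.1, 1: 0.25, 2: 0.15}
g = lambda x: x * x

# Route 1: E[g(X)] = sum over x of g(x) * p_X(x).
e_direct = sum(g(x) * p for x, p in pmf_x.items())

# Route 2: rearrange the terms into the pmf of Y, then E(Y) = sum over y of y * p_Y(y).
pmf_y = defaultdict(float)
for x, p in pmf_x.items():
    pmf_y[g(x)] += p
e_via_y = sum(y * p for y, p in pmf_y.items())

print(e_direct, e_via_y)   # both 1.95 (up to floating-point rounding)
```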
$\mathbf{Example\ 1.8.4.,\ 1.8.5.}$
Expectation of linear combination of random variables
We can show that the expectation operator $\mathbb{E}$ is a linear operator; this follows from the linearity of integration and summation.
$\mathbf{Thm\ 1.8.2.}$ Let $g_1 (X)$ and $g_2 (X)$ be functions of a random variable $X$.
Suppose the expectations of $g_1 (X)$ and $g_2 (X)$ exist. Then, for any constants $k_1$ and $k_2$, the expectation of $k_1 g_1 (X) + k_2 g_2 (X)$ exists and is given by
\[\begin{align*} \mathbb{E}[k_1 g_1 (X) + k_2 g_2 (X)] = k_1 \mathbb{E}[g_1 (X)] + k_2 \mathbb{E}[g_2 (X)]. \end{align*}\]
$\mathbf{Proof.}$
For the continuous case, existence follows from the hypothesis, the triangle inequality, and the linearity of the integral; i.e.,
\[\begin{align*} \int_{-\infty}^{\infty} |k_1 g_1 (x) + k_2 g_2(x)| f_X (x) \cdot dx &\leq |k_1| \int_{-\infty}^{\infty} |g_1 (x)| f_X (x) \cdot dx + |k_2| \int_{-\infty}^{\infty} |g_2 (x)| f_X (x) \cdot dx \\ &< \infty. \end{align*}\]The stated equality then follows in the same way from the linearity of the integral. The proof for the discrete case is analogous, using the linearity of sums. $_\blacksquare$
$\mathbf{Example\ 1.8.6.,\ 1.8.7.,\ 1.8.8.}$
1.9 Some Special Expectations
In this section we use the expectation operator to define the mean, variance, and moment generating function of a random variable. (These measures are related to the moments of the probability distribution.)
$\mathbf{Def\ 1.9.1}$ (Mean). Let $X$ be a random variable whose expectation exists. The mean value $\mu$ of $X$ is defined as $\mu = \mathbb{E}[X]$.
$\mathbf{Def\ 1.9.2}$ (Variance and Standard Deviation). Let $X$ be a random variable with finite mean $\mu$ and such that $\mathbb{E}[(X-\mu)^2]$ is finite. Then the variance of $X$ is $\mathbb{E}[(X-\mu)^2]$. It is commonly denoted by $\sigma^2$ or $\text{Var}(X)$. The standard deviation of $X$ is $\sigma = \sqrt{\sigma^2} = \sqrt{\mathbb{E}[(X-\mu)^2]}$.
$\mathbf{Note.}$ We have, by the linearity of the operator $\mathbb{E}$ given in $\mathbf{Thm\ 1.8.2}$, that
\[\begin{align*} \text{Var}(X) &= \mathbb{E}[(X-\mu)^2] \\ &= \mathbb{E}[X^2 - 2 \mu X + \mu^2] \\ &= \mathbb{E}[X^2] - 2\mu \mathbb{E}[X] + \mathbb{E}[\mu^2] \\ &= \mathbb{E}[X^2] - \mu^2. \end{align*}\]Also note that the variance operator is not linear. But we have the following result:
$\mathbf{Thm\ 1.9.1.}$ Let $X$ be a random variable with finite mean $\mu$ and finite variance $\sigma^2$. Then for all constants $a$ and $b$ we have
\[\begin{align*} \text{Var}(aX+b) = a^2 \text{Var}(X). \end{align*}\]$\mathbf{Proof.}$
By the linearity of the expectation operator, $\text{Var}(aX+b) = \mathbb{E}[(aX + b - a \mu - b)^2] = \mathbb{E}[a^2 (X- \mu)^2] = a^2 \mathbb{E}[(X-\mu)^2] = a^2 \text{Var}(X)$. $_\blacksquare$
$\mathbf{Remark.}$ Based on this theorem, $\sigma_{aX+b} = | a | \sigma_X$.
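A small check of the two identities above (my own sketch; the pmf and the constants $a, b$ are arbitrary): the shortcut $\text{Var}(X) = \mathbb{E}[X^2] - \mu^2$ and the scaling rule $\text{Var}(aX + b) = a^2 \text{Var}(X)$.

```python
# Example pmf (illustration only).
pmf = {1: 0.2, 2: 0.5, 3: 0.3}

mu = sum(x * p for x, p in pmf.items())                    # E[X] = 2.1
var = sum((x - mu) ** 2 * p for x, p in pmf.items())       # E[(X - mu)^2] = 0.49
ex2 = sum(x * x * p for x, p in pmf.items())               # E[X^2] = 4.9
print(var, ex2 - mu ** 2)                                  # shortcut: both ~ 0.49

# Var(aX + b) = a^2 Var(X): transform the pmf by y = a*x + b and recompute.
a, b = 3.0, -5.0
pmf_y = {a * x + b: p for x, p in pmf.items()}
mu_y = sum(y * p for y, p in pmf_y.items())
var_y = sum((y - mu_y) ** 2 * p for y, p in pmf_y.items())
print(var_y, a ** 2 * var)                                 # both ~ 4.41
```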
$\mathbf{Example\ 1.9.1,\ 1.9.2.}$
$\mathbf{Def\ 1.9.3}$ (Moment Generating Function). Let $X$ be a random variable such that, for some $h > 0$, the expectation of $e^{tX}$ exists for $-h < t < h$. The moment generating function (mgf) of $X$ is defined to be the function $M(t) = \mathbb{E}[e^{tX}]$ for $-h < t < h$.
$\mathbf{Note.}$ When a mgf exists, we must have for $t=0$ that $M(0) = \mathbb{E}[1] = 1$.
$\mathbf{Example\ 1.9.4.}$
The next theorem shows that mgfs uniquely identify distributions; i.e., a distribution that has a moment generating function $M$ is completely determined by $M$.
$\mathbf{Thm\ 1.9.2.}$ Let $X$ and $Y$ be random variables with mgfs $M_X$ and $M_Y$, respectively, existing in open intervals about $0$. Then,
\[\begin{align*} F_X (z) = F_Y (z) \text{ for all } z \in \mathbb{R} \iff M_X (t) = M_Y (t) \text{ for all } t \in (-h, h) \end{align*}\]for some $h > 0$.
$\mathbf{Proof.}$
The forward direction is obvious. The proof of the converse, though, is beyond the scope of this text; see Chung (1974).
Since a distribution is completely determined by $M(t)$, it would not be surprising if we could obtain some properties of the distribution directly from $M(t)$.
For example, the first derivative of $M(t)$ is
\[\begin{align*} M'(t) = \frac{d}{dt} \mathbb{E}[e^{tX}] = \mathbb{E}[X e^{tX}]. \end{align*}\]Upon setting $t = 0$, we have
\[\begin{align*} M'(0) = \mathbb{E}(X) = \mu. \end{align*}\]Similarly, the second derivative is
\[\begin{align*} M''(t) = \mathbb{E}[X^2 e^{tX}], \end{align*}\]so that
\[\begin{align*} M''(0) = \mathbb{E}(X^2) \Rightarrow \text{Var}(X) = \mathbb{E}(X^2) -[\mathbb{E}(X)]^2 = M''(0) - [M'(0)]^2. \end{align*}\]In general, if $m$ is a positive integer and $M^{(m)}(t)$ denotes the $m$-th derivative of $M(t)$, we have $M^{(m)}(0) = \mathbb{E}(X^m)$.
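A small numerical sketch (mine, not from [1]): the standard exponential distribution has mgf $M(t) = 1/(1-t)$ for $t < 1$, and finite-difference derivatives at $t = 0$ recover $\mathbb{E}(X) = 1$ and $\mathbb{E}(X^2) = 2$, hence $\text{Var}(X) = 1$.

```python
# mgf of the standard exponential distribution: E[e^{tX}] = 1 / (1 - t) for t < 1.
M = lambda t: 1.0 / (1.0 - t)

h = 1e-4
M1 = (M(h) - M(-h)) / (2 * h)               # ~ M'(0)  = E(X)   = 1
M2 = (M(h) - 2 * M(0.0) + M(-h)) / h ** 2   # ~ M''(0) = E(X^2) = 2

print(M1)             # ~ 1.0
print(M2)             # ~ 2.0
print(M2 - M1 ** 2)   # ~ 1.0 = Var(X) = M''(0) - [M'(0)]^2
```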
$\mathbf{Definition.}$ For a random variable $X$ and $m \in \mathbb{N}$, the $m$-th moment of the distribution of $X$ is $\mathbb{E}[X^m]$.
1.10 Important Inequalities
In this section, we discuss some popular inequalities involving expectations. (We make use of these inequalities in the remainder of the text.)
$\mathbf{Thm\ 1.10.1.}$ Let $X$ be a random variable and let $m$ be a positive integer. Suppose $\mathbb{E}[X^m]$ exists. If $k$ is a positive integer and $k \leq m$, then $\mathbb{E}[X^k]$ exists.
$\mathbf{Proof.}$
We prove it for the continuous case; the proof is similar for the discrete case if we replace integrals by sums. Let $f(x)$ be the pdf of $X$. Then
\[\begin{align*} \int_{-\infty}^{\infty} |x|^k f(x) \cdot dx &= \int_{|x| \leq 1} |x|^k f(x) \cdot dx + \int_{|x| > 1} |x|^k f(x) \cdot dx \\ &\leq \int_{|x| \leq 1} f(x) \cdot dx + \int_{|x| > 1} |x|^m f(x) \cdot dx \\ &\leq 1 + \mathbb{E}[|X|^m] < \infty, \end{align*}\]so $\mathbb{E}[X^k]$ exists. $_\blacksquare$
$\mathbf{Thm\ 1.10.2}$ (Markov’s Inequality). Let $u(X)$ be a nonnegative function of the random variable $X$. If $\mathbb{E}[u(X)]$ exists, then for every positive constant $c$,
\[\begin{align*} P[u(X) \geq c] \leq \frac{\mathbb{E}[u(X)]}{c}. \end{align*}\]$\mathbf{Proof.}$
Let $f(x)$ denote the pdf of $X$ and
\[\begin{align*} A = \{ x : u(x) \geq c \} \end{align*}\]Then,
\[\begin{align*} \mathbb{E}[u(X)] = \int_{-\infty}^{\infty} u(x) f(x) \cdot dx = \int_A u(x) f(x) \cdot dx + \int_{A^c} u(x) f(x) \cdot dx \end{align*}\]Since each of the integrals in the extreme right-hand member of the preceding equation is nonnegative, the left-hand member is greater than or equal to either of them. In particular,
\[\begin{align*} \mathbb{E}[u(X)] \geq \int_A u(x) f(x) \cdot dx \geq c \int_A f(x) \cdot dx = c \, P[u(X) \geq c], \end{align*}\]since $u(x) \geq c$ on $A$. Dividing both sides by $c > 0$ gives the result. $_\blacksquare$
The preceding theorem is a generalization of an inequality often called Chebyshev’s Inequality.
$\mathbf{Thm\ 1.10.3}$ (Chebyshev’s Inequality). Let $X$ be a random variable with finite variance $\sigma^2$ (by $\mathbf{Thm\ 1.10.1}$, this implies that the mean $\mu = \mathbb{E}(X)$ exists). Then for every positive constant $k$,
\[\begin{align*} P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}, \end{align*}\]or equivalently,
\[\begin{align*} P(|X - \mu| < k\sigma) \geq 1 - \frac{1}{k^2}. \end{align*}\]$\mathbf{Proof.}$
In $\mathbf{Thm\ 1.10.2}$ take $u(X) = (X - \mu)^2$ and $c = k^2 \sigma^2$. Then we have
\[\begin{align*} P[(X - \mu)^2 \geq k^2 \sigma^2] \leq \frac{\mathbb{E}[(X-\mu)^2]}{k^2 \sigma^2}. \end{align*}\]Since the numerator of the right-hand member of the preceding inequality is $\sigma^2$, the inequality may be written
\[\begin{align*} P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}. \ _\blacksquare \end{align*}\]
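A simulation check of Chebyshev’s inequality (my own sketch, assuming NumPy; the exponential distribution is an arbitrary choice with $\mu = \sigma = 1$): the observed tail frequency always stays below the bound $1/k^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=500_000)   # mean 1 and standard deviation 1
mu, sigma = 1.0, 1.0

for k in (1.5, 2.0, 3.0):
    tail = np.mean(np.abs(x - mu) >= k * sigma)   # empirical P(|X - mu| >= k sigma)
    print(k, round(float(tail), 4), "<=", round(1 / k ** 2, 4))
```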
$\mathbf{Example\ 1.10.2.}$
Next, before stating and proving Jensen’s Inequality, we need a preliminary definition and result.
$\mathbf{Def\ 1.10.1}$ (convex function). A function $\phi$ defined on an interval $(a, b), -\infty \leq a < b \leq \infty$, is said to be a convex function if for all $x, y \in (a, b)$ and all $\gamma \in (0, 1)$,
\[\begin{align*} \phi [\gamma x + (1 - \gamma)y] \leq \gamma \phi (x) + (1 - \gamma) \phi (y). \end{align*}\]We say $\phi$ is strictly convex if the above inequality is strict.
$\mathbf{Thm\ 1.10.4.}$ If $\phi$ is differentiable on $(a, b)$, then
(a) $\phi$ is convex iff $\phi'(x) \leq \phi'(y)$, for all $a < x < y < b$;
(b) $\phi$ is strictly convex iff $\phi'(x) < \phi'(y)$, for all $a < x < y < b$.
If $\phi$ is twice differentiable on $(a, b)$, then
(a) $\phi$ is convex iff $\phi^{\prime \prime}(x) \geq 0$, for all $a < x < b$,
(b) $\phi$ is strictly convex iff $\phi^{\prime \prime}(x) > 0$, for all $a < x < b$.
Now, let’s see the main inequality.
$\mathbf{Thm\ 1.10.5}$ (Jensen’s Inequality). If $\phi$ is convex on an open interval $I$ and $X$ is a random variable whose support is contained in $I$ and has finite expectation, then
\[\begin{align*} \phi [\mathbb{E}(X)] \leq \mathbb{E}[\phi(X)]. \end{align*}\]If $\phi$ is strictly convex, then the inequality is strict unless $X$ is a constant random variable.
$\mathbf{Proof.}$
For our proof we assume that $\phi$ has a 2nd derivative, but in general only convexity is required. Expand $\phi (x)$ into a Taylor series about $\mu = \mathbb{E}[X]$ of order 2:
\[\begin{align*} \phi (x) = \phi (\mu) + \phi^\prime (\mu) (x-\mu) + \frac{\phi^{\prime \prime}(\epsilon)(x-\mu)^2}{2}, \end{align*}\]where $\epsilon$ is between $x$ and $\mu$. Since $\phi$ is convex, $\phi^{\prime \prime}(\epsilon) \geq 0$, so the last term on the RHS is nonnegative and we have
\[\begin{align*} \phi (x) \geq \phi (\mu) + \phi^\prime (\mu) (x-\mu). \end{align*}\]Taking expectations of both sides leads to the result, since $\mathbb{E}[X - \mu] = 0$. The inequality is strict if $\phi^{\prime \prime}(x) > 0$, for all $x \in (a, b)$, provided $X$ is not a constant random variable. $_\blacksquare$
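A numerical illustration of Jensen’s inequality (my own sketch; the pmf is arbitrary): with the strictly convex $\phi(x) = e^x$ and a non-degenerate discrete $X$, $\phi(\mathbb{E}(X)) < \mathbb{E}[\phi(X)]$.

```python
import math

# Example pmf of X (illustration only) and the strictly convex phi(x) = e^x.
pmf = {0: 0.5, 1: 0.3, 2: 0.2}
phi = math.exp

mean_x = sum(x * p for x, p in pmf.items())       # E(X) = 0.7
lhs = phi(mean_x)                                 # phi(E(X))  ~ 2.014
rhs = sum(phi(x) * p for x, p in pmf.items())     # E[phi(X)]  ~ 2.793

print(lhs, rhs, lhs < rhs)   # strict inequality, since X is not constant
```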
$\mathbf{Example\ 1.10.3,\ 1.10.4.}$
Reference
[1] Hogg, R. V., McKean, J. W., and Craig, A. T., *Introduction to Mathematical Statistics*, Pearson, 2019.