[Statistics] Multivariate Distributions - Part II
(This post is a summary of Chapter 2 of [1])
2.5 The Correlation Coefficient
Let $(X, Y)$ denote a random vector. In the last section, we discussed the concept of independence between $X$ and $Y$. But what if $X$ and $Y$ are dependent, and if so, how are they related?
$\mathbf{Def\ 2.5.1}$ (Covariance). Let $(X, Y)$ have a joint distribution. Denote the means of $X$ and $Y$ respectively by $\mu_1$ and $\mu_2$ and their respective variances by $\sigma_1^2$ and $\sigma_2^2$. The covariance of $(X, Y)$ is denoted by $\text{Cov}(X, Y)$ and is defined by the expectation
$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_1)(Y - \mu_2)].$
$\mathbf{Remark.}$ By the linearity of expectation,
\(\begin{aligned}
\text{Cov}(X, Y) = \mathbb{E}[XY - \mu_2 X - \mu_1 Y + \mu_1 \mu_2] = \mathbb{E}[XY] - \mu_1 \mu_2.
\end{aligned}\)
$\mathbf{Def\ 2.5.2}$ (Correlation coefficient). If each of $\sigma_1$ and $\sigma_2$ is positive, then the correlation coefficient between $X$ and $Y$ is defined by
$\rho = \frac{\text{Cov}(X, Y)}{\sigma_1 \sigma_2}.$
$\mathbf{Note.}$ $\mathbb{E}[XY] = \mu_1 \mu_2 + \text{Cov}(X, Y) = \mu_1 \mu_2 + \rho \sigma_1 \sigma_2$.
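As a quick numerical sketch of these definitions (my own illustration, not from the text), the following Python snippet estimates $\text{Cov}(X, Y)$ and $\rho$ from simulated data and checks the identity $\mathbb{E}[XY] = \mu_1 \mu_2 + \rho \sigma_1 \sigma_2$; the bivariate normal used here is just an arbitrary test distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a bivariate normal with known correlation (illustrative choice).
rho_true = 0.6
cov = [[1.0, rho_true], [rho_true, 1.0]]
x, y = rng.multivariate_normal([2.0, -1.0], cov, size=200_000).T

mu1, mu2 = x.mean(), y.mean()
cov_xy = ((x - mu1) * (y - mu2)).mean()      # sample version of E[(X - mu1)(Y - mu2)]
rho_hat = cov_xy / (x.std() * y.std())       # Cov(X, Y) / (sigma1 * sigma2)

print(rho_hat)                               # ~0.6
# Check E[XY] = mu1*mu2 + rho*sigma1*sigma2, up to Monte Carlo error.
print((x * y).mean(), mu1 * mu2 + rho_hat * x.std() * y.std())
```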
$\mathbf{Example\ 2.5.1.}$
We next establish that, in general, $| \rho | \leq 1$.
$\mathbf{Thm\ 2.5.1.}$ For all jointly distributed random variables $(X, Y)$ whose correlation coefficient $\rho$ exists, $-1 \leq \rho \leq 1$.
$\mathbf{Proof.}$ Consider the polynomial in $v$ given by
$h(v) = \mathbb{E}\{ [(X - \mu_1) v + (Y - \mu_2)]^2 \} = \sigma_1^2 v^2 + 2 \rho \sigma_1 \sigma_2 v + \sigma_2^2.$
Then $h(v) \geq 0$ for all $v$. Hence, the discriminant of $h(v)$ is less than or equal to $0$; i.e., $4 \rho^2 \sigma_1^2 \sigma_2^2 \leq 4 \sigma_1^2 \sigma_2^2$, or $\rho^2 \leq 1._\blacksquare$
$\mathbf{Thm\ 2.5.2.}$ If $X$ and $Y$ are independent random variables then $\text{Cov}(X, Y) = 0$ and, hence $\rho = 0.$
$\mathbf{Proof.}$ By $\mathbf{Thm\ 2.4.4.}$, $\mathbb{E}[XY] = \mathbb{E}[X] \mathbb{E}[Y] = \mu_1 \mu_2._\blacksquare$
But the converse of this theorem is not true:
$\mathbf{Example\ 2.5.3.}$
This is because the correlation coefficient $\rho$ measures the degree to which a pair of variables is linearly related, not whether they are dependent.
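A minimal numerical sketch of this point (my own illustration, not Example 2.5.3 from the text): take $X$ symmetric about $0$ and $Y = X^2$; then $\text{Cov}(X, Y) = 0$ even though $Y$ is completely determined by $X$.

```python
import numpy as np

rng = np.random.default_rng(1)

# X symmetric about 0; Y = X^2 is a deterministic function of X, yet uncorrelated with it.
x = rng.standard_normal(500_000)
y = x ** 2

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy)                     # ~0: E[X^3] = 0 for symmetric X, so Cov(X, X^2) = 0
print(np.corrcoef(x, y)[0, 1])    # ~0, although Y depends on X perfectly
```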
$\mathbf{Note.}$ By considering the case when the discriminant of the polynomial $h(v)$ is $0$, we can show that if $\rho = 1$ then $Y = (\sigma_2 / \sigma_1) X - (\sigma_2 / \sigma_1) \mu_1 + \mu_2$ with probability $1$, and if $\rho = -1$ then $Y = -(\sigma_2 / \sigma_1) X + (\sigma_2 / \sigma_1) \mu_1 + \mu_2$ with probability $1$. That is, $X$ and $Y$ are linearly related with probability $1$.
More generally, we summarize these thoughts in the next theorem.
$\mathbf{Notation.}$ Let $f(x, y)$ be the joint pdf of two random variables $X$ and $Y$ and let $f_1 (x)$ denote the marginal pdf of $X$. Recall that the conditional pdf of $Y$ given $X = x$ is $f_{2 \mid 1} (y \mid x) = f(x, y) / f_1 (x)$ at points where $f_1 (x) > 0$, and the conditional mean of $Y$ given $X = x$ is given by
$\mathbb{E}[Y \mid x] = \int_{-\infty}^{\infty} y\, f_{2 \mid 1} (y \mid x)\, dy.$
$\mathbf{Thm\ 2.5.3.}$ Suppose $(X, Y)$ have a joint distribution with the variances of $X$ and $Y$ finite and positive. Denote the means and variances of $X$ and $Y$ by $\mu_1, \mu_2$ and $\sigma_1^2, \sigma_2^2$, respectively, and let $\rho$ be the correlation coefficient between $X$ and $Y$. If $\mathbb{E} [Y \mid X]$ is linear in $X$, then
$\mathbb{E}[Y \mid X] = \mu_2 + \rho \frac{\sigma_2}{\sigma_1} (X - \mu_1)$
and
$\mathbb{E}[\text{Var}(Y \mid X)] = \sigma_2^2 (1 - \rho^2).$
$\mathbf{Proof.}$ Let $\mathbb{E}[Y \mid x] = a + bx$. We have
$\int_{-\infty}^{\infty} y \frac{f(x, y)}{f_1 (x)}\, dy = a + bx, \quad \text{i.e.,} \quad \int_{-\infty}^{\infty} y f(x, y)\, dy = (a + bx) f_1 (x).$
If both members of the above equation are “integrated” on $x$, it is seen that
$\mu_2 = a + b \mu_1.$
If both members of the above equation are “multiplied” by $x$, and then “integrated” on $x$, we have
$\mathbb{E}[XY] = a \mu_1 + b \mathbb{E}[X^2], \quad \text{i.e.,} \quad \rho \sigma_1 \sigma_2 + \mu_1 \mu_2 = a \mu_1 + b (\sigma_1^2 + \mu_1^2).$
Solving these two equations for $a$ and $b$ gives
$b = \rho \frac{\sigma_2}{\sigma_1}, \quad a = \mu_2 - \rho \frac{\sigma_2}{\sigma_1} \mu_1,$
which yields the first result.
Next, the conditional variance of $Y$ given $X = x$ is
$\text{Var}(Y \mid x) = \int_{-\infty}^{\infty} \left[ y - \mu_2 - \rho \frac{\sigma_2}{\sigma_1} (x - \mu_1) \right]^2 f_{2 \mid 1} (y \mid x)\, dy.$
It is nonnegative and is at most a function of $x$ alone. If it is multiplied by $f_1 (x)$ and integrated on $x$, the result obtained is also nonnegative.
$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [y - \mu_2 - \rho \frac{\sigma_2}{\sigma_1} (x - \mu_1)]^2 f_{2 | 1} (y | x) f_1 (x)\, dy\, dx$
$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [ (y-\mu_2)^2 - 2\rho \frac{\sigma_2}{\sigma_1} (y-\mu_2)(x-\mu_1) + \rho^2 \frac{\sigma_2^2}{\sigma_1^2} (x - \mu_1)^2] f(x, y) dydx$
$= E[(Y-\mu_2)^2] - 2\rho \frac{\sigma_2}{\sigma_1} E[(X - \mu_1)(Y- \mu_2)] + \rho^2 \frac{\sigma_2^2}{\sigma_1^2} E[(X-\mu_1)^2]$
$= \sigma_2^2 (1 - \rho^2)._\blacksquare$
$\mathbf{Note.}$ Since $M_{X, Y} (t_1, t_2) = E[e^{t_1 X + t_2 Y}] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{t_1 x + t_2 y} f(x, y)\, dy\, dx$, differentiating under the integral sign gives
$\frac{\partial^{k+m} M(t_1, t_2)}{\partial t_1^k \partial t_2^m}$
$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{\partial^{k+m}}{\partial t_1^k \partial t_2^m} [e^{t_1 x + t_2 y} f(x, y)] dy dx$
$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^k y^m e^{t_1 x + t_2 y} f(x, y) dy dx$.
With $t_1 = t_2 = 0$ we have
$\frac{\partial^{k+m} M(0, 0)}{\partial t_1^k \partial t_2^m} = E(X^k Y^m)$
For instance,
$E(X) = \frac{\partial M(0, 0)}{\partial t_1}$
$E(Y) = \frac{\partial M(0, 0)}{\partial t_2}$
$Var(X) = \frac{\partial^2 M(0, 0)}{\partial t_1^2} - [E(X)]^2$
$Cov(X, Y) = \frac{\partial^2 M(0, 0)}{\partial t_1 \partial t_2} - E(X)E(Y)$
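As a hedged symbolic check of these moment formulas (not part of the text), the sketch below differentiates a joint mgf with sympy and recovers the means, variance, and covariance. The mgf used is that of a bivariate normal, a standard result quoted here only as a convenient test case.

```python
import sympy as sp

t1, t2 = sp.symbols('t1 t2')
mu1, mu2, s1, s2, rho = sp.symbols('mu1 mu2 sigma1 sigma2 rho')

# Joint mgf of a bivariate normal (standard result, used here only as a test case).
M = sp.exp(mu1*t1 + mu2*t2
           + sp.Rational(1, 2)*(s1**2*t1**2 + 2*rho*s1*s2*t1*t2 + s2**2*t2**2))

def moment(k, m):
    """E[X^k Y^m] via partial derivatives of M evaluated at (0, 0)."""
    return sp.simplify(sp.diff(M, t1, k, t2, m).subs({t1: 0, t2: 0}))

EX, EY = moment(1, 0), moment(0, 1)
print(EX, EY)                                # mu1, mu2
print(sp.simplify(moment(2, 0) - EX**2))     # Var(X) = sigma1**2
print(sp.simplify(moment(1, 1) - EX*EY))     # Cov(X, Y) = rho*sigma1*sigma2
```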
$\mathbf{Example\ 2.5.6.}$
2.6 Extension to Several Random Variables
The notions about two random variables can be extended immediately to $n$ random variables. We make the following definition of the space of $n$ random variables.
$\mathbf{Def\ 2.6.1}$ (Random Vector). Consider a random experiment with the sample space $\mathbf{C}$. Let the random variable $X_i$ assign to each element $c \in \mathbf{C}$ one and only one real number $X_i (c) = x_i$, $i = 1, 2, …, n$. We say that $(X_1, …, X_n)$ is an $n$-dimensional random vector. The space of this random vector is the set of ordered $n$-tuples $\mathbf{D} = \{ (x_1, …, x_n) : x_1 = X_1 (c), …, x_n = X_n (c),\ c \in \mathbf{C} \}.$ Furthermore, let $A$ be a subset of the space $\mathbf{D}$. Then $P[(X_1, …, X_n) \in A] = P(C)$, where $C = \{ c : c \in \mathbf{C} \text{ and } (X_1 (c), …, X_n (c)) \in A \}.$
$\mathbf{Note.}$ In this section, we often use vector notation. We denote the random vector $(X_1, …, X_n)'$ by the $n$-dimensional column vector $\mathbf{X}$ and its observed value by $\mathbf{x} = (x_1, …, x_n)'$.
The joint cdf is defined to be
$F_{\mathbf{X}}(\mathbf{x}) = P[X_1 \leq x_1, …, X_n \leq x_n].$
We say that the $n$ random variables $X_1, …, X_n$ are of the discrete type or of the continuous type, and have a distribution of that type, according to whether the joint cdf can be expressed as
$F_{\mathbf{X}}(\mathbf{x}) = \sum_{w_1 \leq x_1, \dots, w_n \leq x_n} p(w_1, …, w_n)$
or as
$F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f(w_1, …, w_n)\, dw_1 \cdots dw_n.$
For the continuous case,
$\frac{\partial^n F_{\mathbf{X}}(\mathbf{x})}{\partial x_1 \cdots \partial x_n} = f(x_1, …, x_n),$
except possibly on points that have probability 0.
$\mathbf{Example\ 2.6.1.}$
$\mathbf{Definition}$ (Expected value). Let $(X_1, …, X_n)$ be a random vector and let $Y = u(X_1, …, X_n)$ for some function $u$. If
$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} |u(x_1, …, x_n)|\, f(x_1, …, x_n)\, dx_1 \cdots dx_n < \infty,$
then the expected value of $Y$ is
$E(Y) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_1, …, x_n)\, f(x_1, …, x_n)\, dx_1 \cdots dx_n.$
The definition for the discrete case is analogous: just replace the integrals by sums.
$\mathbf{Note.}$ As in the case of two random variables, the expectation operator is linear here as well.
$\mathbf{Definition}$ (Marginal probability density function). For continuous random variables $X_1, X_2, …, X_n$ with joint pdf $f(x_1, x_2, …, x_n)$, the marginal probability density function of $X_i$ is
$f_i (x_i) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, …, x_n)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n.$
The marginal pmf of $X_i$ is similarly defined for discrete random variables.
$\mathbf{Note.}$ For any group of $k < n$ of the random variables, we can also find their joint pdf by integrating out the other $n - k$ variables. (It is called the marginal pdf of this particular group of $k$ variables.)
For example, take $n = 6$, $k = 3$, and select $X_2, X_4, X_5$. Then their marginal pdf is
$f_{245} (x_2, x_4, x_5) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x_1, x_2, …, x_6)\, dx_1\, dx_3\, dx_6.$
$\mathbf{Definition}$ (Conditional probability density function). For random variables $X_1, X_2, …, X_n$ with joint pdf $f(x_1, x_2, …, x_n)$ and with $f_1 (x_1) > 0$, the joint conditional probability density function of $X_2, …, X_n$ given $X_1 = x_1$ is
$f_{2, \dots, n \mid 1} (x_2, …, x_n \mid x_1) = \frac{f(x_1, x_2, …, x_n)}{f_1 (x_1)}.$
$\mathbf{Note.}$ We can similarly define a joint conditional pdf given the values of several of the variables by dividing the joint pdf $f(x_1, …, x_n)$ by the marginal pdf of the particular given group of variables. For example,
$f_{3, \dots, n \mid 1, 2} (x_3, …, x_n \mid x_1, x_2) = \frac{f(x_1, x_2, …, x_n)}{f_{12} (x_1, x_2)},$
provided $f_{12} (x_1, x_2) > 0$.
Because a conditional pdf is the pdf of a certain number of random variables, the expectation of a function of these random variables has been defined.
$\mathbf{Definition}$ (Conditional expectation). Let $X_1, …, X_n$ be continuous random variables. The conditional expectation of $u(X_2, …, X_n)$ given $X_1 = x_1$ is
$E[u(X_2, …, X_n) \mid x_1] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_2, …, x_n)\, f_{2, \dots, n \mid 1} (x_2, …, x_n \mid x_1)\, dx_2 \cdots dx_n,$
provided $f_1 (x_1) > 0$ and the integral converges (absolutely).
The above discussion of marginal and conditional distributions generalizes to random variables of the discrete type by using pmfs and summations instead of integrals.
$\mathbf{Definition}$ (Mutually independent). Let $X_1, …, X_n$ be random variables with joint pdf $f(x_1, …, x_n)$ and marginal pdfs $f_1 (x_1), …, f_n (x_n)$, respectively. The random variables $X_1, …, X_n$ are said to be mutually independent iff
$f(x_1, …, x_n) \equiv f_1 (x_1) f_2 (x_2) \cdots f_n (x_n)$
for the continuous case. In the discrete case, $X_1, …, X_n$ are said to be mutually independent iff
$p(x_1, …, x_n) \equiv p_1 (x_1) p_2 (x_2) \cdots p_n (x_n).$
$\mathbf{Note.}$ As in Thm 2.4.4, if the random variables $X_1, …, X_n$ are mutually independent and $E[u_i(X_i)]$ exists for $i = 1, …, n$, then
$E[u_1(X_1) u_2(X_2) \cdots u_n(X_n)] = E[u_1(X_1)] E[u_2(X_2)] \cdots E[u_n(X_n)].$
$\mathbf{Definition}$ (moment-generating function). The mgf of the joint distribution of $n$ random variables $X_1, …, X_n$ is defined as the expectation
$E\left[ \exp\left( \sum_{i=1}^{n} t_i X_i \right) \right],$
if it exists for $-h_i < t_i < h_i$, $i = 1, …, n$, where each $h_i$ is positive. This expectation is denoted by $M(t_1, …, t_n)$.
In vector notation, we can write $M_{\mathbf{X}}(\mathbf{t}) = E[e^{\mathbf{t}' \mathbf{X}}]$.
Also, the mgf of the marginal distribution of $X_i$ is $M(0, …, 0, t_i, 0, …, 0)$ for $i = 1, 2, …, n$, the mgf of the marginal distribution of $X_i$ and $X_j$ is $M(0, …, 0, t_i, 0, …, 0, t_j, 0, …, 0)$, and so forth.
$\mathbf{Note.}$ As in the cases of one and two variables, this mgf is unique and uniquely determines the joint distribution of the $n$ variables (and hence all marginal distributions). Thm 2.4.5 can also be generalized: $X_1, …, X_n$ are mutually independent iff
$M(t_1, …, t_n) = \prod_{i=1}^{n} M(0, …, 0, t_i, 0, …, 0).$
We can find the moment generating function of a linear combination of mutually independent random variables, as given in the following theorem, which proves useful in the sequel.
$\mathbf{Thm\ 2.6.1.}$ Suppose $X_1, …, X_n$ are $n$ mutually independent random variables. Suppose the mgf of $X_i$ is $M_i (t)$ for $-h_i < t < h_i$, where $h_i > 0$. Let $T = \sum_{i = 1}^{n} k_i X_i$, where $k_1, …, k_n$ are constants. Then $T$ has the mgf given by
$M_T(t) = \prod_{i=1}^{n} M_i (k_i t).$
$\mathbf{Proof.}$ Assume $t$ is in the interval $(-\min_i \{ h_i \}, \min_i \{ h_i \})$. Then, by independence,
$M_T(t) = E\left[ e^{t \sum_{i=1}^{n} k_i X_i} \right] = E\left[ \prod_{i=1}^{n} e^{t k_i X_i} \right] = \prod_{i=1}^{n} E\left[ e^{t k_i X_i} \right] = \prod_{i=1}^{n} M_i (k_i t). \ _\blacksquare$
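A quick Monte Carlo sanity check of Thm 2.6.1 (my own illustration): take independent $\text{Exp}(1)$ variables, whose mgf $1/(1 - t)$ for $t < 1$ is standard, and compare the empirical mgf of $T = \sum_i k_i X_i$ with the product $\prod_i M_i(k_i t)$. The constants and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Illustrative choice: independent Exp(1) variables, each with mgf 1/(1 - t), t < 1.
x1, x2, x3 = rng.exponential(1.0, (3, n))
k = np.array([0.5, 1.0, 2.0])
T = k[0]*x1 + k[1]*x2 + k[2]*x3

def mgf_exp(t):                        # mgf of an Exp(1) random variable
    return 1.0 / (1.0 - t)

t = 0.2                                # keep each k_i * t inside the mgf's domain
empirical = np.mean(np.exp(t * T))     # Monte Carlo estimate of E[e^{tT}]
product = np.prod([mgf_exp(ki * t) for ki in k])
print(empirical, product)              # both close to 1/(0.9 * 0.8 * 0.6)
```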
$\mathbf{Example\ 2.6.2.}$
$\mathbf{Remark\ 2.6.1.}$ Mutual independence is much stronger than pairwise independence.
In other words, a finite collection of random variables $X_1, …, X_n$ can be pairwise independent (i.e., $X_i$ and $X_j$ are independent for any $i \neq j$) but not mutually independent.
For a concrete counterexample (the classical one), let $(X_1, X_2, X_3)$ take each of the values $(1, 0, 0)$, $(0, 1, 0)$, $(0, 0, 1)$, and $(1, 1, 1)$ with probability $1/4$. If $i \neq j$, we have $p_{ij}(x_i, x_j) = p_i (x_i) p_j (x_j)$, and thus $X_i$ and $X_j$ are independent. However, $p(x_1, x_2, x_3) \neq p_1 (x_1) p_2 (x_2) p_3 (x_3)$; for instance, $p(1, 1, 1) = 1/4 \neq 1/8$.
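The short script below (my own check, not from the text) verifies both claims for this four-point pmf by brute force.

```python
import numpy as np
from itertools import product

# Classical four-point counterexample: each listed outcome has probability 1/4.
support = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
p = {pt: 0.25 for pt in support}

def joint(x1, x2, x3):
    return p.get((x1, x2, x3), 0.0)

def marginal(idx, val):                 # P(X_idx = val)
    return sum(joint(*pt) for pt in product((0, 1), repeat=3) if pt[idx] == val)

def pair(i, j, vi, vj):                 # P(X_i = vi, X_j = vj)
    return sum(joint(*pt) for pt in product((0, 1), repeat=3)
               if pt[i] == vi and pt[j] == vj)

# Pairwise independence holds ...
print(all(np.isclose(pair(i, j, a, b), marginal(i, a) * marginal(j, b))
          for i, j in [(0, 1), (0, 2), (1, 2)] for a in (0, 1) for b in (0, 1)))
# ... but mutual independence fails:
print(joint(1, 1, 1), marginal(0, 1) * marginal(1, 1) * marginal(2, 1))  # 0.25 vs 0.125
```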
$\mathbf{Definition}$ (Independent and identically distributed). If several random variables are mutually independent and have the same distribution, we say that they are independent and identically distributed, which we abbreviate as i.i.d.
The following is a useful corollary to Theorem 2.6.1 for iid random variables.
$\mathbf{Corollary\ 2.6.1.}$ Suppose $X_1, …, X_n$ are iid random variables with the common mgf $M(t)$, for $-h < t < h$, where $h > 0$. Let $T = \sum_{i=1}^{n} X_i$. Then $T$ has the mgf given by
$M_T(t) = [M(t)]^n, \quad -h < t < h.$
2.6.1 Multivariate Variance-Covariance Matrix
In this subsection we extend the discussion of variances and covariances to the $n$-variate case.
$\mathbf{Definition.}$ Let $\mathbf{X} = (X_1, …, X_n)'$ be an $n$-dimensional random vector. We define the expectation of $\mathbf{X}$ as $E[\mathbf{X}] = (E(X_1), E(X_2), …, E(X_n))'$.
For a set of random variables $\{ W_{ij} : 1 \leq i \leq m,\ 1 \leq j \leq n \}$, define the $m \times n$ random matrix $\mathbf{W} = [W_{ij}]$ and the expectation of the random matrix $\mathbf{W}$ as the $m \times n$ matrix $E[\mathbf{W}] = [E(W_{ij})]$.
As the following theorem shows, the linearity of the expectation operator easily
follows from this definition.
$\mathbf{Thm\ 2.6.2.}$ Let $\mathbf{W_1}$ and $\mathbf{W_2}$ be $m \times n$ matrices of random variables, let $\mathbf{A_1}$ and $\mathbf{A_2}$ be $k \times m$ matrices of constants, and let $\mathbf{B}$ be an $n \times l$ matrix of constants. Then
$E[\mathbf{A_1} \mathbf{W_1} + \mathbf{A_2} \mathbf{W_2}] = \mathbf{A_1} E[\mathbf{W_1}] + \mathbf{A_2} E[\mathbf{W_2}]$
and
$E[\mathbf{A_1} \mathbf{W_1} \mathbf{B}] = \mathbf{A_1} E[\mathbf{W_1}] \mathbf{B}.$
$\mathbf{Proof.}$ For the $(i, j)$th component of the first expression, we have
$E\left[ \sum_{s = 1}^{m} a_{1is} W_{1sj} + \sum_{s = 1}^{m} a_{2is} W_{2sj} \right] = \sum_{s = 1}^{m} a_{1is} E[W_{1sj}] + \sum_{s = 1}^{m} a_{2is} E[W_{2sj}] = (\mathbf{A_1} E[\mathbf{W_1}] + \mathbf{A_2} E[\mathbf{W_2}])_{ij}.$ The second expression can be derived in the same manner. $_\blacksquare$
$\mathbf{Definition.}$ Let $X_1, …, X_n$ be random variables, let $\mathbf{X} = (X_1, …, X_n)'$, and suppose $\sigma_i^2 = Var(X_i) < \infty$ for $i = 1, …, n$. The mean of $\mathbf{X}$ is $\boldsymbol{\mu} = E[\mathbf{X}]$. The variance-covariance matrix of $\mathbf{X}$ is
$Cov(\mathbf{X}) = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'] = [\sigma_{ij}],$
where $\sigma_{ii} = \sigma_i^2$ and, for $i \neq j$, $\sigma_{ij} = Cov(X_i, X_j)$.
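As a numerical sketch of this definition (my own illustration), the snippet below estimates $E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})']$ from simulated data; the particular $3 \times 3$ covariance matrix is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw an n = 3 dimensional random vector many times (one row per draw).
X = rng.multivariate_normal(mean=[0.0, 1.0, 2.0],
                            cov=[[2.0, 0.3, 0.0],
                                 [0.3, 1.0, 0.5],
                                 [0.0, 0.5, 1.5]],
                            size=200_000)

mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / (X.shape[0] - 1)   # estimate of E[(X - mu)(X - mu)']

print(np.round(Sigma_hat, 2))      # close to the covariance matrix used above
print(np.round(np.cov(X.T), 2))    # numpy's built-in estimator agrees
```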
$\mathbf{Example\ 2.6.3.}$
$\mathbf{Thm\ 2.6.3.}$ Let $\mathbf{X} = (X_1, …, X_n)'$ be an $n$-dimensional random vector such that $\sigma_i^2 = \sigma_{ii} = Var(X_i) < \infty$. Let $\mathbf{A}$ be an $m \times n$ matrix of constants. Then
$Cov(\mathbf{X}) = E[\mathbf{X} \mathbf{X}'] - \boldsymbol{\mu} \boldsymbol{\mu}'$
and
$Cov(\mathbf{A} \mathbf{X}) = \mathbf{A}\, Cov(\mathbf{X})\, \mathbf{A}'.$
$\mathbf{Proof.}$ Use Theorem 2.6.2 to derive the first one; i.e.,
$Cov(\mathbf{X}) = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'] = E[\mathbf{X}\mathbf{X}' - \boldsymbol{\mu}\mathbf{X}' - \mathbf{X}\boldsymbol{\mu}' + \boldsymbol{\mu}\boldsymbol{\mu}'] = E[\mathbf{X}\mathbf{X}'] - \boldsymbol{\mu}\boldsymbol{\mu}'.$
The derivation of the second one follows in a similar way. $_\blacksquare$
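The identity $Cov(\mathbf{A}\mathbf{X}) = \mathbf{A}\, Cov(\mathbf{X})\, \mathbf{A}'$ is easy to check numerically; the sketch below (an illustration with arbitrarily chosen $\boldsymbol{\Sigma}$ and $\mathbf{A}$) compares the empirical covariance of $\mathbf{A}\mathbf{X}$ with the theoretical value.

```python
import numpy as np

rng = np.random.default_rng(4)

Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.5],
                  [0.0, 0.5, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 2.0]])

X = rng.multivariate_normal([0.0, 0.0, 0.0], Sigma, size=300_000)
Y = X @ A.T                          # each row of Y is A x for one sampled x

print(np.round(np.cov(Y.T), 2))      # empirical Cov(AX)
print(np.round(A @ Sigma @ A.T, 2))  # theoretical A Cov(X) A'
```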
$\mathbf{Remark.}$ All variance-covariance matrices are positive semi-definite matrices.
That is, $\mathbf{a}'\, Cov(\mathbf{X})\, \mathbf{a} \geq 0$ for all vectors $\mathbf{a} \in \mathbb{R}^n$. To see this, let $\mathbf{X}$ be a random vector and let $\mathbf{a}$ be any $n \times 1$ vector of constants. Then $Y = \mathbf{a}'\mathbf{X}$ is a random variable and, hence, has nonnegative variance; i.e.,
$0 \leq Var(Y) = Var(\mathbf{a}'\mathbf{X}) = \mathbf{a}'\, Cov(\mathbf{X})\, \mathbf{a}.$
Note that both “Cov” and “Var” are used to represent the covariance matrix of the vector $\mathbf{X}$. See for example this Wikipedia remark:
Nomenclatures differ. Some statisticians, following the probabilist William Feller, call the matrix $\mathbf{\Sigma}$ the variance of the random vector $\mathbf{X}$, because it is the natural generalization to higher dimensions of the 1-dimensional variance. Others call it the covariance matrix, because it is the matrix of covariances between the scalar components of the vector $\mathbf{X}$.
2.7 Transformations for Several Random Variables
one-to-one transformation
$\mathbf{Theorem.}$ Consider an integral of the form
$\int \cdots \int_{A} f(x_1, x_2, …, x_n)\, dx_1 dx_2 \cdots dx_n$
taken over a subset $A$ of an $n$-dimensional space $\mathbf{S}$. Let
$y_i = u_i (x_1, x_2, …, x_n)$ for all $i = 1, 2, …, n$,
together with the inverse functions
$x_i = w_i (y_1, y_2, …, y_n)$ for all $i = 1, 2, …, n$,
define a one-to-one transformation that maps $\mathbf{S}$ onto $\mathbf{T}$.
Note that the Jacobian is given by
$J = \begin{vmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_n} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial x_n}{\partial y_1} & \frac{\partial x_n}{\partial y_2} & \cdots & \frac{\partial x_n}{\partial y_n}
\end{vmatrix}.$
We assume that $J$ is not identically zero in $\mathbf{T}$.
Thus, if the pdf of $X_1, …, X_n$ is $f(x_1, …, x_n)$, then the pdf of $Y_1, …, Y_n$ is
$g(y_1, …, y_n) = f[w_1(y_1, …, y_n), …, w_n(y_1, …, y_n)]\, |J|$
for $(y_1, …, y_n) \in \mathbf{T}$, and is zero elsewhere.
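A small numerical sketch of this change-of-variables formula (my own illustration, not one of the textbook examples): take $Y_1 = X_1 + X_2$, $Y_2 = X_1 - X_2$ with $X_1, X_2$ iid $N(0, 1)$, so the inverse is $x_1 = (y_1 + y_2)/2$, $x_2 = (y_1 - y_2)/2$ with $|J| = 1/2$; the helper functions `f_x` and `g_y` below are hypothetical names for the two densities.

```python
import numpy as np

rng = np.random.default_rng(5)

def f_x(x1, x2):              # joint pdf of (X1, X2), iid standard normal
    return np.exp(-(x1**2 + x2**2) / 2) / (2 * np.pi)

def g_y(y1, y2):              # pdf of (Y1, Y2) from the change-of-variables formula
    return f_x((y1 + y2) / 2, (y1 - y2) / 2) * 0.5

# Check P(Y1 <= 1, Y2 <= 0): simulation vs. numerical integration of g.
x1, x2 = rng.standard_normal((2, 1_000_000))
y1, y2 = x1 + x2, x1 - x2
print(np.mean((y1 <= 1.0) & (y2 <= 0.0)))

u = np.linspace(-8.0, 1.0, 400)          # y1 grid, (-inf, 1] truncated at -8
v = np.linspace(-8.0, 0.0, 400)          # y2 grid, (-inf, 0] truncated at -8
U, V = np.meshgrid(u, v)
du, dv = u[1] - u[0], v[1] - v[0]
print(np.sum(g_y(U, V)) * du * dv)       # both prints are ~0.38
```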
$\mathbf{Example\ 2.7.1.}$
$\mathbf{Example\ 2.7.2.}$
not one-to-one transformation
We now consider some other problems that are encountered when transforming variables. Let $X$ have the Cauchy pdf
$f(x) = \frac{1}{\pi (1 + x^2)}, \quad -\infty < x < \infty,$
and let $Y = X^2$. This transformation maps the space of $X$, namely $\mathbf{S} = \{ x : -\infty < x < \infty \}$, onto $\mathbf{T} = \{ y : 0 \leq y < \infty \}$.
But, to each $y \in \mathbf{T}$, with the exception of $y = 0$, there correspond two points $x \in \mathbf{S}$. In such an instance, we represent the space of $X$ as the union of two disjoint sets:
$\mathbf{A_1} = \{ x : -\infty < x < 0 \} \quad \text{and} \quad \mathbf{A_2} = \{ x : 0 \leq x < \infty \}.$
Note that $\mathbf{A_1}$ and $\mathbf{A_2}$ are mapped onto different sets. Our difficulty is caused by the fact that $x = 0$ is an element of $\mathbf{S}$.
Since $X$ is a continuous random variable, no probabilities change if we redefine $f(0)$ as $0$ instead of $1 / \pi$. Then the transformation $y = x^2$, with the inverse $x = -\sqrt{y}$, maps $\mathbf{A_1}$ onto $\mathbf{T}$ and is one-to-one. Likewise, the transformation $y = x^2$, with the inverse $x = \sqrt{y}$, maps $\mathbf{A_2}$ onto $\mathbf{T}$ and is one-to-one as well.
Now, consider the probability $P(Y \in B)$, where $B \subset \mathbf{T}$. Let $\mathbf{A_3} = \{ x : x = -\sqrt{y},\ y \in B \} \subset \mathbf{A_1}$ and $\mathbf{A_4} = \{ x : x = \sqrt{y},\ y \in B \} \subset \mathbf{A_2}$. Then $Y \in B$ when and only when $X \in \mathbf{A_3}$ or $X \in \mathbf{A_4}$;
hence, we can write
$P(Y \in B) = P(X \in \mathbf{A_3}) + P(X \in \mathbf{A_4}) = \int_{B} f(-\sqrt{y}) \left| \frac{d(-\sqrt{y})}{dy} \right| dy + \int_{B} f(\sqrt{y}) \left| \frac{d(\sqrt{y})}{dy} \right| dy.$
Thus, the pdf of $Y$ is given by
$g(y) = f(-\sqrt{y}) \frac{1}{2\sqrt{y}} + f(\sqrt{y}) \frac{1}{2\sqrt{y}} = \frac{1}{\pi \sqrt{y} (1 + y)}, \quad 0 < y < \infty,$
and $g(y) = 0$ elsewhere.
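The derived density is easy to sanity-check by simulation; the sketch below (my own check) samples a standard Cauchy via the inverse-cdf method $X = \tan(\pi(U - 1/2))$ and compares a tail probability of $Y = X^2$ with the integral of $g$.

```python
import numpy as np

rng = np.random.default_rng(6)

# Standard Cauchy via the inverse-cdf method: X = tan(pi*(U - 1/2)).
u = rng.uniform(size=1_000_000)
x = np.tan(np.pi * (u - 0.5))
y = x ** 2

def g(y):                               # pdf of Y = X^2 derived above
    return 1.0 / (np.pi * np.sqrt(y) * (1.0 + y))

# Compare P(Y <= 4) from simulation with the integral of g over (0, 4].
print(np.mean(y <= 4.0))
edges = np.linspace(0.0, 4.0, 200_001)
mids = (edges[:-1] + edges[1:]) / 2     # midpoint rule handles the 1/sqrt(y) singularity
print(np.sum(g(mids)) * (edges[1] - edges[0]))   # both ~0.70
```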
$\mathbf{Note.}$ Inspired by the preceding discussion, we now consider continuous random variables $X_1, X_2, …, X_n$ with joint probability density function $f(x_1, x_2, …, x_n)$. Let $\mathbf{S}$ be the support of $f$ and consider the transformation $y_i = u_i(x_1, x_2, …, x_n)$, which maps $\mathbf{S}$ onto $\mathbf{T}$, where $\mathbf{T}$ is in the $y_1 y_2 \cdots y_n$ space. But the transformation may not be one-to-one.
Suppose that $\mathbf{S}$ can be written as the union of a finite number, say $k$, of mutually disjoint sets $A_1, A_2, …, A_k$, so that $S = \bigcup_{i=1}^{k} A_i$, and the transformation is one-to-one on each $A_i$ and maps each $A_i$ onto $\mathbf{T}$. Thus to each point in $\mathbf{T}$ there corresponds exactly one point in each of $A_1, A_2, …, A_k$ (so the transformation is a $k$-to-one mapping).
Since the transformation is one-to-one on each $A_i$ and onto $\mathbf{T}$, there is an inverse transformation mapping $\mathbf{T}$ onto $A_i$. Say the $i$th inverse transformation is
$x_1 = w_{1i}(y_1, …, y_n),\ x_2 = w_{2i}(y_1, …, y_n),\ \dots,\ x_n = w_{ni}(y_1, …, y_n), \quad i = 1, 2, …, k.$
Suppose the first partial derivatives are continuous, and let each Jacobian
$J_i = \begin{vmatrix}
\frac{\partial w_{1i}}{\partial y_1} & \cdots & \frac{\partial w_{1i}}{\partial y_n} \\
\vdots & & \vdots \\
\frac{\partial w_{ni}}{\partial y_1} & \cdots & \frac{\partial w_{ni}}{\partial y_n}
\end{vmatrix}, \quad i = 1, 2, …, k,$
be not identically equal to zero in $\mathbf{T}$.
Then, similarly, we can see that the pdf of $Y_1, …, Y_n$ is
$g(y_1, …, y_n) = \sum_{i=1}^{k} f[w_{1i}(y_1, …, y_n), …, w_{ni}(y_1, …, y_n)]\, |J_i|,$
provided that $(y_1, …, y_n) \in \mathbf{T}$, and equals $0$ elsewhere.
$\mathbf{Example\ 2.7.3.}$
$\mathbf{Example\ 2.7.4.}$
2.8 Linear Combinations of Random Variables
We considered linear combinations of random variables in Section 2.6 (see Thm 2.6.1 and Corollary 2.6.1). In this section we consider expectations, variances, and covariances of linear combinations of random variables. With $X_1, X_2, …, X_n$ as the random variables, we define the random variable $T = \sum_{i=1}^{n} a_i X_i$, where $a_1, …, a_n$ are specified constants.
$\mathbf{Thm. 2.8.1.}$ Suppose $E(X_i) = \mu_i$ for $i = 1, …, n$. Then $E(T) = \sum_{i = 1}^{n} a_i \mu_i$.
To obtain the variance of $T$, we first state a general result on covariances.
$\mathbf{Thm. 2.8.2.}$ Suppose $T$ is the linear combination above and that $W$ is another linear combination, $W = \sum_{i=1}^{m} b_i Y_i$, for random variables $Y_1, …, Y_m$ and specified constants $b_1, …, b_m$. If $E[X_i^2] < \infty$ and $E[Y_j^2] < \infty$ for $i = 1, …, n$ and $j = 1, …, m$, then
$Cov(T, W) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j Cov(X_i, Y_j).$
$\mathbf{Proof.}$ Using the definition of the covariance and Thm 2.8.1, we have the first equality below, while the second follows from the linearity of $E$:
$Cov(T, W) = E\left[ \left( \sum_{i=1}^{n} a_i (X_i - E(X_i)) \right) \left( \sum_{j=1}^{m} b_j (Y_j - E(Y_j)) \right) \right] = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j E[(X_i - E(X_i))(Y_j - E(Y_j))] = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j Cov(X_i, Y_j). \ _\blacksquare$
$\mathbf{Corollary 2.8.1.}$ Let $T = \sum_{i=1}^{n} a_i X_i$. Provided $E[X_i^2] < \infty$ for $i = 1, …, n$,
$Var(T) = Cov(T, T) = \sum_{i=1}^{n} a_i^2 Var(X_i) + 2 \sum_{i < j} a_i a_j Cov(X_i, X_j).$
$\mathbf{Corollary 2.8.2.}$ If $X_1, …, X_n$ are independent random variables and $Var(X_i) = \sigma_i^2$ for $i = 1, …, n$, then
$Var(T) = \sum_{i=1}^{n} a_i^2 \sigma_i^2.$
Note that we need only $X_i$ and $X_j$ to be uncorrelated for all $i \neq j$ to obtain this result.
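A quick simulation check of Corollary 2.8.2 (my own illustration, with arbitrarily chosen constants and variances):

```python
import numpy as np

rng = np.random.default_rng(7)

# Independent components with different variances (illustrative choice).
sigmas = np.array([1.0, 2.0, 0.5])
a = np.array([3.0, -1.0, 2.0])
X = rng.standard_normal((1_000_000, 3)) * sigmas   # X_i ~ N(0, sigma_i^2), independent

T = X @ a
print(T.var())                      # empirical Var(T)
print(np.sum(a**2 * sigmas**2))     # sum a_i^2 sigma_i^2 = 9 + 4 + 1 = 14
```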
Often we assume, in addition to independence, that the random variables have the same distribution (i.e., they are i.i.d.). We call such a collection of random variables a random sample, which we now state in a formal definition.
$\mathbf{Def\ 2.8.1.}$ (Random Sample). If the random variables $X_1, X_2, …, X_n$ are independent and identically distributed, i.e. each $X_i$ has the same distribution, then we say that these random variables constitute a random sample of size $n$ from that common distribution. We abbreviate independent and identically distributed by iid.
$\mathbf{Definition.}$ (Sample Mean and Sample Variance). Let $X_1, …, X_n$ be independent and identically distributed random variables with common mean $\mu$ and variance $\sigma^2$. The sample mean $\overline{X}$ and the sample variance $S^2$ are defined by
$\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad S^2 = \frac{\sum_{i=1}^n (X_i - \overline{X})^2}{n - 1} = \frac{\sum_{i=1}^n X_i^2 - n\overline{X}^2}{n - 1}.$
$\mathbf{Remark.}$ As $\overline{X}$ is a linear combination of the sample observations with $a_i = n^{-1}$, we have
$E(\overline{X}) = \mu \quad \text{and} \quad Var(\overline{X}) = \frac{\sigma^2}{n}.$
As $E(\overline{X}) = \mu$, we say that $\overline{X}$ is unbiased for $\mu$.
$\mathbf{Remark.}$ One reason for dividing by $n - 1$ instead of $n$ in $S^2$ is that it makes $S^2$ unbiased for $\sigma^2$.
From the fact that $E(X_i^2) = \sigma^2 + \mu^2$ and $E(\overline{X}^2) = (\sigma^2 / n) + \mu^2$, we have
$E(S^2) = \frac{1}{n-1} \left[ \sum_{i=1}^{n} E(X_i^2) - n E(\overline{X}^2) \right] = \frac{1}{n-1} \left[ n(\sigma^2 + \mu^2) - n \left( \frac{\sigma^2}{n} + \mu^2 \right) \right] = \sigma^2.$
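A short simulation (my own illustration; the normal distribution, $\mu$, $\sigma$, and $n$ are arbitrary choices) shows both unbiasedness results, and also how the $1/n$ divisor would bias the variance estimate downward:

```python
import numpy as np

rng = np.random.default_rng(8)

mu, sigma, n, reps = 5.0, 2.0, 10, 200_000
samples = rng.normal(mu, sigma, size=(reps, n))

xbar = samples.mean(axis=1)
s2_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1
s2_biased = samples.var(axis=1, ddof=0)     # divide by n

print(xbar.mean())                # ~mu            (X-bar is unbiased for mu)
print(xbar.var(), sigma**2 / n)   # ~sigma^2 / n
print(s2_unbiased.mean())         # ~sigma^2 = 4   (S^2 is unbiased)
print(s2_biased.mean())           # ~(n-1)/n * sigma^2 = 3.6 (biased low)
```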
Reference
[1] Hogg, R., McKean, J. & Craig, A., Introduction to Mathematical Statistics, Pearson 2019