[Statistics] Multivariate Distributions - Part II
(This post is a summary of Chapter 2 of [1])
2.5 The Correlation Coefficient
Let $(X, Y)$ denote a random vector. In the last section, we discussed the concept of independence between $X$ and $Y$. But what if $X$ and $Y$ are dependent, and if so, how are they related?
$\mathbf{Def\ 2.5.1}$ (Covariance). Let $(X, Y)$ have a joint distribution. Denote the means of $X$ and $Y$ respectively by $\mu_1$ and $\mu_2$ and their respective variances by $\sigma_1^2$ and $\sigma_2^2$. The covariance of $(X, Y)$ is denoted by $\text{Cov}(X, Y)$ and is defined by the expectation
$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_1)(Y - \mu_2)].$
$\mathbf{Remark.}$ By the linearity of expectation,
\(\begin{aligned}
\text{Cov}(X, Y) = \mathbb{E}[XY - \mu_2 X - \mu_1 Y + \mu_1 \mu_2] = \mathbb{E}[XY] - \mu_1 \mu_2.
\end{aligned}\)
$\mathbf{Def\ 2.5.2}$ (Correlation coefficient). If each of $\sigma_1$ and $\sigma_2$ is positive, then the correlation coefficient between $X$ and $Y$ is defined by
$\rho = \frac{\text{Cov}(X, Y)}{\sigma_1 \sigma_2}.$
$\mathbf{Note.}$ $\mathbb{E}[XY] = \mu_1 \mu_2 + \text{Cov}(X, Y) = \mu_1 \mu_2 + \rho \sigma_1 \sigma_2$.
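As a quick numerical sketch of these definitions (my own illustration, not from the text), the following Python snippet estimates $\text{Cov}(X, Y)$ and $\rho$ from simulated data and checks the identity $\mathbb{E}[XY] = \mu_1 \mu_2 + \rho \sigma_1 \sigma_2$; the bivariate normal used here is just an arbitrary test distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a bivariate normal with known correlation (illustrative choice).
rho_true = 0.6
cov = [[1.0, rho_true], [rho_true, 1.0]]
x, y = rng.multivariate_normal([2.0, -1.0], cov, size=200_000).T

mu1, mu2 = x.mean(), y.mean()
cov_xy = ((x - mu1) * (y - mu2)).mean()      # sample version of E[(X - mu1)(Y - mu2)]
rho_hat = cov_xy / (x.std() * y.std())       # Cov(X, Y) / (sigma1 * sigma2)

print(rho_hat)                               # ~0.6
# Check E[XY] = mu1*mu2 + rho*sigma1*sigma2, up to Monte Carlo error.
print((x * y).mean(), mu1 * mu2 + rho_hat * x.std() * y.std())
```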
$\mathbf{Example\ 2.5.1.}$
We next establish that, in general, $| \rho | \leq 1$.
$\mathbf{Thm\ 2.5.1.}$ For all jointly distributed random variables $(X, Y)$ whose correlation coefficient $\rho$ exists, $-1 \leq \rho \leq 1$.
$\mathbf{Proof.}$ Consider the polynomial in $v$ given by
$h(v) = \mathbb{E}\{ [(X - \mu_1) v + (Y - \mu_2)]^2 \} = \sigma_1^2 v^2 + 2 \rho \sigma_1 \sigma_2 v + \sigma_2^2.$
Then $h(v) \geq 0$ for all $v$. Hence, the discriminant of $h(v)$ is less than or equal to $0$; i.e., $4 \rho^2 \sigma_1^2 \sigma_2^2 \leq 4 \sigma_1^2 \sigma_2^2$, or $\rho^2 \leq 1._\blacksquare$
$\mathbf{Thm\ 2.5.2.}$ If $X$ and $Y$ are independent random variables then $\text{Cov}(X, Y) = 0$ and, hence $\rho = 0.$
$\mathbf{Proof.}$ By $\mathbf{Thm\ 2.4.4.}$, $\mathbb{E}[XY] = \mathbb{E}[X] \mathbb{E}[Y] = \mu_1 \mu_2._\blacksquare$
But the converse of this theorem is not true:
$\mathbf{Example\ 2.5.3.}$
This is because the correlation coefficient $\rho$ measures the degree to which a pair of variables is linearly related, not whether they are dependent.
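A minimal numerical sketch of this point (my own illustration, not Example 2.5.3 from the text): take $X$ symmetric about $0$ and $Y = X^2$; then $\text{Cov}(X, Y) = 0$ even though $Y$ is completely determined by $X$.

```python
import numpy as np

rng = np.random.default_rng(1)

# X symmetric about 0; Y = X^2 is a deterministic function of X, yet uncorrelated with it.
x = rng.standard_normal(500_000)
y = x ** 2

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy)                     # ~0: E[X^3] = 0 for symmetric X, so Cov(X, X^2) = 0
print(np.corrcoef(x, y)[0, 1])    # ~0, although Y depends on X perfectly
```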
$\mathbf{Note.}$ By considering the case when the discriminant of the polynomial $h(v)$ is $0$, we can show that if $\rho = 1$ then $Y = (\sigma_2 / \sigma_1) X - (\sigma_2 / \sigma_1) \mu_1 + \mu_2$ with probability $1$, and if $\rho = -1$ then $Y = -(\sigma_2 / \sigma_1) X + (\sigma_2 / \sigma_1) \mu_1 + \mu_2$ with probability $1$. That is, $X$ and $Y$ are linearly related with probability $1$.
More generally, we summarize these thoughts in the next theorem.
$\mathbf{Notation.}$ Let $f(x, y)$ be the joint pdf of two random variables $X$ and $Y$ and let $f_1 (x)$ denote the marginal pdf of $X$. Recall that the conditional pdf of $Y$ given $X = x$ is $f_{2 \mid 1} (y \mid x) = f(x, y) / f_1 (x)$ at points where $f_1 (x) > 0$, and the conditional mean of $Y$ given $X = x$ is given by
$\mathbb{E}[Y \mid x] = \int_{-\infty}^{\infty} y\, f_{2 \mid 1} (y \mid x)\, dy.$
$\mathbf{Thm\ 2.5.3.}$ Suppose $(X, Y)$ have a joint distribution with the variances of $X$ and $Y$ finite and positive. Denote the means and variances of $X$ and $Y$ by $\mu_1, \mu_2$ and $\sigma_1^2, \sigma_2^2$, respectively, and let $\rho$ be the correlation coefficient between $X$ and $Y$. If $\mathbb{E} [Y \mid X]$ is linear in $X$, then
$\mathbb{E}[Y \mid X] = \mu_2 + \rho \frac{\sigma_2}{\sigma_1} (X - \mu_1)$
and
$\mathbb{E}[\text{Var}(Y \mid X)] = \sigma_2^2 (1 - \rho^2).$
$\mathbf{Proof.}$ Let $\mathbb{E}[Y \mid x] = a + bx$. We have
$\int_{-\infty}^{\infty} y \frac{f(x, y)}{f_1 (x)}\, dy = a + bx, \quad \text{i.e.,} \quad \int_{-\infty}^{\infty} y f(x, y)\, dy = (a + bx) f_1 (x).$
If both members of the above equation are “integrated” on $x$, it is seen that
$\mu_2 = a + b \mu_1.$
If both members of the above equation are “multiplied” by $x$, and then “integrated” on $x$, we have
$\mathbb{E}[XY] = a \mu_1 + b \mathbb{E}[X^2], \quad \text{i.e.,} \quad \rho \sigma_1 \sigma_2 + \mu_1 \mu_2 = a \mu_1 + b (\sigma_1^2 + \mu_1^2).$
Solving these two equations for $a$ and $b$ gives
$b = \rho \frac{\sigma_2}{\sigma_1}, \quad a = \mu_2 - \rho \frac{\sigma_2}{\sigma_1} \mu_1,$
which yields the first result.
Next, the conditional variance of $Y$ given $X = x$ is
$\text{Var}(Y \mid x) = \int_{-\infty}^{\infty} \left[ y - \mu_2 - \rho \frac{\sigma_2}{\sigma_1} (x - \mu_1) \right]^2 f_{2 \mid 1} (y \mid x)\, dy.$
It is nonnegative and is at most a function of $x$ alone. If it is multiplied by $f_1 (x)$ and integrated on $x$, the result obtained is also nonnegative.
$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [y - \mu_2 - \rho \frac{\sigma_2}{\sigma_1} (x - \mu_1)]^2 f_{2 | 1} (y | x) f_1 (x)\, dy\, dx$
$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [ (y-\mu_2)^2 - 2\rho \frac{\sigma_2}{\sigma_1} (y-\mu_2)(x-\mu_1) + \rho^2 \frac{\sigma_2^2}{\sigma_1^2} (x - \mu_1)^2] f(x, y) dydx$
$= E[(Y-\mu_2)^2] - 2\rho \frac{\sigma_2}{\sigma_1} E[(X - \mu_1)(Y- \mu_2)] + \rho^2 \frac{\sigma_2^2}{\sigma_1^2} E[(X-\mu_1)^2]$
$= \sigma_2^2 (1 - \rho^2)._\blacksquare$
$\mathbf{Note.}$ Since $M_{X, Y} (t_1, t_2) = E[e^{t_1 X + t_2 Y}] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{t_1 x + t_2 y} f(x, y)\, dy\, dx$, differentiating under the integral sign gives
$\frac{\partial^{k+m} M(t_1, t_2)}{\partial t_1^k \partial t_2^m}$
$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{\partial^{k+m}}{\partial t_1^k \partial t_2^m} [e^{t_1 x + t_2 y} f(x, y)] dy dx$
$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^k y^m e^{t_1 x + t_2 y} f(x, y) dy dx$.
With $t_1 = t_2 = 0$ we have
$\frac{\partial^{k+m} M(0, 0)}{\partial t_1^k \partial t_2^m} = E(X^k Y^m)$
For instance,
$E(X) = \frac{\partial M(0, 0)}{\partial t_1}$
$E(Y) = \frac{\partial M(0, 0)}{\partial t_2}$
$Var(X) = \frac{\partial^2 M(0, 0)}{\partial t_1^2} - [E(X)]^2$
$Cov(X, Y) = \frac{\partial^2 M(0, 0)}{\partial t_1 \partial t_2} - E(X)E(Y)$
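As a hedged symbolic check of these moment formulas (not part of the text), the sketch below differentiates a joint mgf with sympy and recovers the means, variance, and covariance. The mgf used is that of a bivariate normal, a standard result quoted here only as a convenient test case.

```python
import sympy as sp

t1, t2 = sp.symbols('t1 t2')
mu1, mu2, s1, s2, rho = sp.symbols('mu1 mu2 sigma1 sigma2 rho')

# Joint mgf of a bivariate normal (standard result, used here only as a test case).
M = sp.exp(mu1*t1 + mu2*t2
           + sp.Rational(1, 2)*(s1**2*t1**2 + 2*rho*s1*s2*t1*t2 + s2**2*t2**2))

def moment(k, m):
    """E[X^k Y^m] via partial derivatives of M evaluated at (0, 0)."""
    return sp.simplify(sp.diff(M, t1, k, t2, m).subs({t1: 0, t2: 0}))

EX, EY = moment(1, 0), moment(0, 1)
print(EX, EY)                                # mu1, mu2
print(sp.simplify(moment(2, 0) - EX**2))     # Var(X) = sigma1**2
print(sp.simplify(moment(1, 1) - EX*EY))     # Cov(X, Y) = rho*sigma1*sigma2
```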
$\mathbf{Example\ 2.5.6.}$
2.6 Extension to Several Random Variables
The notions about two random variables can be extended immediately to $n$ random variables. We make the following definition of the space of $n$ random variables.
$\mathbf{Def\ 2.6.1}$ (Random Vector). Consider a random experiment with the sample space $\mathbf{C}$. Let the random variable $X_i$ assign to each element $c \in \mathbf{C}$ one and only one real number $X_i (c) = x_i$, $i = 1, 2, …, n$. We say that $(X_1, …, X_n)$ is an $n$-dimensional random vector. The space of this random vector is the set of ordered $n$-tuples $\mathbf{D} = \{ (x_1, …, x_n) : x_1 = X_1 (c), …, x_n = X_n (c),\ c \in \mathbf{C} \}.$ Furthermore, let $A$ be a subset of the space $\mathbf{D}$. Then $P[(X_1, …, X_n) \in A] = P(C)$, where $C = \{ c : c \in \mathbf{C} \text{ and } (X_1 (c), …, X_n (c)) \in A \}.$
$\mathbf{Note.}$ In this section, we often use vector notation. We denote the random vector $(X_1, …, X_n)'$ by the $n$-dimensional column vector $\mathbf{X}$ and its observed value by $\mathbf{x} = (x_1, …, x_n)'$.
The joint cdf is defined to be
$F_{\mathbf{X}}(\mathbf{x}) = P[X_1 \leq x_1, …, X_n \leq x_n].$
We say that the $n$ random variables $X_1, …, X_n$ are of the discrete type or of the continuous type, and have a distribution of that type, according to whether the joint cdf can be expressed as
$F_{\mathbf{X}}(\mathbf{x}) = \sum_{w_1 \leq x_1, \dots, w_n \leq x_n} p(w_1, …, w_n)$
or as
$F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f(w_1, …, w_n)\, dw_1 \cdots dw_n.$
For the continuous case,
$\frac{\partial^n F_{\mathbf{X}}(\mathbf{x})}{\partial x_1 \cdots \partial x_n} = f(x_1, …, x_n),$
except possibly on points that have probability 0.
$\mathbf{Example\ 2.6.1.}$
$\mathbf{Definition}$ (Expected value). Let $(X_1, …, X_n)$ be a random vector and let $Y = u(X_1, …, X_n)$ for some function $u$. If
$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} |u(x_1, …, x_n)|\, f(x_1, …, x_n)\, dx_1 \cdots dx_n < \infty,$
then the expected value of $Y$ is
$E(Y) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_1, …, x_n)\, f(x_1, …, x_n)\, dx_1 \cdots dx_n.$
The definition for the discrete case is analogous: just replace the integrals by sums.
$\mathbf{Note.}$ As in the case of two random variables, the expectation operator is linear here as well.
$\mathbf{Definition}$ (Marginal probability density function). For continuous random variables $X_1, X_2, …, X_n$ with joint pdf $f(x_1, x_2, …, x_n)$, the marginal probability density function of $X_i$ is
$f_i (x_i) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, …, x_n)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n.$
The marginal pmf of $X_i$ is similarly defined for discrete random variables.
$\mathbf{Note.}$ For any group of $k < n$ of the random variables, we can also find their joint pdf by integrating out the other $n - k$ variables. (It is called the marginal pdf of this particular group of $k$ variables.)
For example, take $n = 6$, $k = 3$, and select $X_2, X_4, X_5$. Then their marginal pdf is
$f_{245} (x_2, x_4, x_5) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x_1, x_2, …, x_6)\, dx_1\, dx_3\, dx_6.$
$\mathbf{Definition}$ (Conditional probability density function). For random variables $X_1, X_2, …, X_n$ with joint pdf $f(x_1, x_2, …, x_n)$ and with $f_1 (x_1) > 0$, the joint conditional probability density function of $X_2, …, X_n$ given $X_1 = x_1$ is
$f_{2, \dots, n \mid 1} (x_2, …, x_n \mid x_1) = \frac{f(x_1, x_2, …, x_n)}{f_1 (x_1)}.$
$\mathbf{Note.}$ We can similarly define a joint conditional pdf given the values of several of the variables by dividing the joint pdf $f(x_1, …, x_n)$ by the marginal pdf of the particular given group of variables. For example,
$f_{3, \dots, n \mid 1, 2} (x_3, …, x_n \mid x_1, x_2) = \frac{f(x_1, x_2, …, x_n)}{f_{12} (x_1, x_2)},$
provided $f_{12} (x_1, x_2) > 0$.
Because a conditional pdf is the pdf of a certain number of random variables, the expectation of a function of these random variables has been defined.
$\mathbf{Definition}$ (Conditional expectation). Let $X_1, …, X_n$ be continuous random variables. The conditional expectation of $u(X_2, …, X_n)$ given $X_1 = x_1$ is
$E[u(X_2, …, X_n) \mid x_1] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_2, …, x_n)\, f_{2, \dots, n \mid 1} (x_2, …, x_n \mid x_1)\, dx_2 \cdots dx_n,$
provided $f_1 (x_1) > 0$ and the integral converges (absolutely).
The above discussion of marginal and conditional distributions generalizes to random variables of the discrete type by using pmfs and summations instead of integrals.
$\mathbf{Definition}$ (Mutually independent). Let $X_1, …, X_n$ be random variables with joint pdf $f(x_1, …, x_n)$ and marginal pdfs $f_1 (x_1), …, f_n (x_n)$, respectively. The random variables $X_1, …, X_n$ are said to be mutually independent iff
$f(x_1, …, x_n) \equiv f_1 (x_1) f_2 (x_2) \cdots f_n (x_n)$
for the continuous case. In the discrete case, $X_1, …, X_n$ are said to be mutually independent iff
$p(x_1, …, x_n) \equiv p_1 (x_1) p_2 (x_2) \cdots p_n (x_n).$
$\mathbf{Note.}$ As in Thm 2.4.4, if the random variables $X_1, …, X_n$ are mutually independent and $E[u_i(X_i)]$ exists for $i = 1, …, n$, then
$E[u_1(X_1) u_2(X_2) \cdots u_n(X_n)] = E[u_1(X_1)] E[u_2(X_2)] \cdots E[u_n(X_n)].$
$\mathbf{Definition}$ (moment-generating function). The mgf of the joint distribution of $n$ random variables $X_1, …, X_n$ is defined as the expectation
$E\left[ \exp\left( \sum_{i=1}^{n} t_i X_i \right) \right],$
if it exists for $-h_i < t_i < h_i$, $i = 1, …, n$, where each $h_i$ is positive. This expectation is denoted by $M(t_1, …, t_n)$.
In vector notation, we can write $M_{\mathbf{X}}(\mathbf{t}) = E[e^{\mathbf{t}' \mathbf{X}}]$.
Also, the mgf of the marginal distribution of $X_i$ is $M(0, …, 0, t_i, 0, …, 0)$ for $i = 1, 2, …, n$, the mgf of the marginal distribution of $X_i$ and $X_j$ is $M(0, …, 0, t_i, 0, …, 0, t_j, 0, …, 0)$, and so forth.
$\mathbf{Note.}$ As in the cases of one and two variables, this mgf is unique and uniquely determines the joint distribution of the $n$ variables (and hence all marginal distributions). Thm 2.4.5 can also be generalized: $X_1, …, X_n$ are mutually independent iff
$M(t_1, …, t_n) = \prod_{i=1}^{n} M(0, …, 0, t_i, 0, …, 0).$
We can find the moment generating function of a linear combination of mutually independent random variables, as given in the following theorem, which proves useful in the sequel.
$\mathbf{Thm\ 2.6.1.}$ Suppose $X_1, …, X_n$ are $n$ mutually independent random variables. Suppose the mgf of $X_i$ is $M_i (t)$ for $-h_i < t < h_i$, where $h_i > 0$. Let $T = \sum_{i = 1}^{n} k_i X_i$, where $k_1, …, k_n$ are constants. Then $T$ has the mgf given by
$M_T(t) = \prod_{i=1}^{n} M_i (k_i t).$
$\mathbf{Proof.}$ Assume $t$ is in the interval $(-\min_i \{ h_i \}, \min_i \{ h_i \})$. Then, by independence,
$M_T(t) = E\left[ e^{t \sum_{i=1}^{n} k_i X_i} \right] = E\left[ \prod_{i=1}^{n} e^{t k_i X_i} \right] = \prod_{i=1}^{n} E\left[ e^{t k_i X_i} \right] = \prod_{i=1}^{n} M_i (k_i t). \ _\blacksquare$
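A quick Monte Carlo sanity check of Thm 2.6.1 (my own illustration): take independent $\text{Exp}(1)$ variables, whose mgf $1/(1 - t)$ for $t < 1$ is standard, and compare the empirical mgf of $T = \sum_i k_i X_i$ with the product $\prod_i M_i(k_i t)$. The constants and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Illustrative choice: independent Exp(1) variables, each with mgf 1/(1 - t), t < 1.
x1, x2, x3 = rng.exponential(1.0, (3, n))
k = np.array([0.5, 1.0, 2.0])
T = k[0]*x1 + k[1]*x2 + k[2]*x3

def mgf_exp(t):                        # mgf of an Exp(1) random variable
    return 1.0 / (1.0 - t)

t = 0.2                                # keep each k_i * t inside the mgf's domain
empirical = np.mean(np.exp(t * T))     # Monte Carlo estimate of E[e^{tT}]
product = np.prod([mgf_exp(ki * t) for ki in k])
print(empirical, product)              # both close to 1/(0.9 * 0.8 * 0.6)
```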
$\mathbf{Example\ 2.6.2.}$
$\mathbf{Remark\ 2.6.1.}$ Mutual independence is much stronger than pairwise independence.
In other words, a finite collection of random variables $X_1, …, X_n$ can be pairwise independent (i.e., $X_i$ and $X_j$ are independent for any $i \neq j$) but not mutually independent.
For a concrete counterexample (the classical one), let $(X_1, X_2, X_3)$ take each of the values $(1, 0, 0)$, $(0, 1, 0)$, $(0, 0, 1)$, and $(1, 1, 1)$ with probability $1/4$. If $i \neq j$, we have $p_{ij}(x_i, x_j) = p_i (x_i) p_j (x_j)$, and thus $X_i$ and $X_j$ are independent. However, $p(x_1, x_2, x_3) \neq p_1 (x_1) p_2 (x_2) p_3 (x_3)$; for instance, $p(1, 1, 1) = 1/4 \neq 1/8$.
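The short script below (my own check, not from the text) verifies both claims for this four-point pmf by brute force.

```python
import numpy as np
from itertools import product

# Classical four-point counterexample: each listed outcome has probability 1/4.
support = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
p = {pt: 0.25 for pt in support}

def joint(x1, x2, x3):
    return p.get((x1, x2, x3), 0.0)

def marginal(idx, val):                 # P(X_idx = val)
    return sum(joint(*pt) for pt in product((0, 1), repeat=3) if pt[idx] == val)

def pair(i, j, vi, vj):                 # P(X_i = vi, X_j = vj)
    return sum(joint(*pt) for pt in product((0, 1), repeat=3)
               if pt[i] == vi and pt[j] == vj)

# Pairwise independence holds ...
print(all(np.isclose(pair(i, j, a, b), marginal(i, a) * marginal(j, b))
          for i, j in [(0, 1), (0, 2), (1, 2)] for a in (0, 1) for b in (0, 1)))
# ... but mutual independence fails:
print(joint(1, 1, 1), marginal(0, 1) * marginal(1, 1) * marginal(2, 1))  # 0.25 vs 0.125
```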
$\mathbf{Definition}$ (Independent and identically distributed). If several random variables are mutually independent and have the same distribution, we say that they are independent and identically distributed, which we abbreviate as i.i.d.
The following is a useful corollary to Theorem 2.6.1 for iid random variables.
$\mathbf{Corollary\ 2.6.1.}$ Suppose $X_1, …, X_n$ are iid random variables with the common mgf $M(t)$, for $-h < t < h$, where $h > 0$. Let $T = \sum_{i=1}^{n} X_i$. Then $T$ has the mgf given by
$M_T(t) = [M(t)]^n, \quad -h < t < h.$
2.6.1 Multivariate Variance-Covariance Matrix
In this subsection we extend the discussion of variances and covariances to the $n$-variate case.
$\mathbf{Definition.}$ Let $\mathbf{X} = (X_1, …, X_n)'$ be an $n$-dimensional random vector. We define the expectation of $\mathbf{X}$ as $E[\mathbf{X}] = (E(X_1), E(X_2), …, E(X_n))'$.
For a set of random variables $\{ W_{ij} : 1 \leq i \leq m,\ 1 \leq j \leq n \}$, define the $m \times n$ random matrix $\mathbf{W} = [W_{ij}]$ and the expectation of the random matrix $\mathbf{W}$ as the $m \times n$ matrix $E[\mathbf{W}] = [E(W_{ij})]$.
As the following theorem shows, the linearity of the expectation operator easily
follows from this definition.
$\mathbf{Thm\ 2.6.2.}$ Let $\mathbf{W_1}$ and $\mathbf{W_2}$ be $m \times n$ matrices of random variables, let $\mathbf{A_1}$ and $\mathbf{A_2}$ be $k \times m$ matrices of constants, and let $\mathbf{B}$ be an $n \times l$ matrix of constants. Then
$E[\mathbf{A_1} \mathbf{W_1} + \mathbf{A_2} \mathbf{W_2}] = \mathbf{A_1} E[\mathbf{W_1}] + \mathbf{A_2} E[\mathbf{W_2}]$
and
$E[\mathbf{A_1} \mathbf{W_1} \mathbf{B}] = \mathbf{A_1} E[\mathbf{W_1}] \mathbf{B}.$
$\mathbf{Proof.}$ For the $(i, j)$th component of the first expression, we have
$E\left[ \sum_{s = 1}^{m} a_{1is} W_{1sj} + \sum_{s = 1}^{m} a_{2is} W_{2sj} \right] = \sum_{s = 1}^{m} a_{1is} E[W_{1sj}] + \sum_{s = 1}^{m} a_{2is} E[W_{2sj}] = (\mathbf{A_1} E[\mathbf{W_1}] + \mathbf{A_2} E[\mathbf{W_2}])_{ij}.$ The second expression can be derived in the same manner. $_\blacksquare$
$\mathbf{Definition.}$ Let $X_1, …, X_n$ be random variables, let $\mathbf{X} = (X_1, …, X_n)'$, and suppose $\sigma_i^2 = Var(X_i) < \infty$ for $i = 1, …, n$. The mean of $\mathbf{X}$ is $\boldsymbol{\mu} = E[\mathbf{X}]$. The variance-covariance matrix of $\mathbf{X}$ is
$Cov(\mathbf{X}) = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'] = [\sigma_{ij}],$
where $\sigma_{ii} = \sigma_i^2$ and, for $i \neq j$, $\sigma_{ij} = Cov(X_i, X_j)$.
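As a numerical sketch of this definition (my own illustration), the snippet below estimates $E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})']$ from simulated data; the particular $3 \times 3$ covariance matrix is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw an n = 3 dimensional random vector many times (one row per draw).
X = rng.multivariate_normal(mean=[0.0, 1.0, 2.0],
                            cov=[[2.0, 0.3, 0.0],
                                 [0.3, 1.0, 0.5],
                                 [0.0, 0.5, 1.5]],
                            size=200_000)

mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / (X.shape[0] - 1)   # estimate of E[(X - mu)(X - mu)']

print(np.round(Sigma_hat, 2))      # close to the covariance matrix used above
print(np.round(np.cov(X.T), 2))    # numpy's built-in estimator agrees
```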
$\mathbf{Example\ 2.6.3.}$
$\mathbf{Thm\ 2.6.3.}$ Let $\mathbf{X} = (X_1, …, X_n)'$ be an $n$-dimensional random vector such that $\sigma_i^2 = \sigma_{ii} = Var(X_i) < \infty$. Let $\mathbf{A}$ be an $m \times n$ matrix of constants. Then
$Cov(\mathbf{X}) = E[\mathbf{X} \mathbf{X}'] - \boldsymbol{\mu} \boldsymbol{\mu}'$
and
$Cov(\mathbf{A} \mathbf{X}) = \mathbf{A}\, Cov(\mathbf{X})\, \mathbf{A}'.$
$\mathbf{Proof.}$ Use Theorem 2.6.2 to derive the first one; i.e.,
$Cov(\mathbf{X}) = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'] = E[\mathbf{X}\mathbf{X}' - \boldsymbol{\mu}\mathbf{X}' - \mathbf{X}\boldsymbol{\mu}' + \boldsymbol{\mu}\boldsymbol{\mu}'] = E[\mathbf{X}\mathbf{X}'] - \boldsymbol{\mu}\boldsymbol{\mu}'.$
The derivation of the second one follows in a similar way. $_\blacksquare$
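The identity $Cov(\mathbf{A}\mathbf{X}) = \mathbf{A}\, Cov(\mathbf{X})\, \mathbf{A}'$ is easy to check numerically; the sketch below (an illustration with arbitrarily chosen $\boldsymbol{\Sigma}$ and $\mathbf{A}$) compares the empirical covariance of $\mathbf{A}\mathbf{X}$ with the theoretical value.

```python
import numpy as np

rng = np.random.default_rng(4)

Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.5],
                  [0.0, 0.5, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 2.0]])

X = rng.multivariate_normal([0.0, 0.0, 0.0], Sigma, size=300_000)
Y = X @ A.T                          # each row of Y is A x for one sampled x

print(np.round(np.cov(Y.T), 2))      # empirical Cov(AX)
print(np.round(A @ Sigma @ A.T, 2))  # theoretical A Cov(X) A'
```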
$\mathbf{Remark.}$ All variance-covariance matrices are positive semi-definite matrices.
That is, $\mathbf{a}'\, Cov(\mathbf{X})\, \mathbf{a} \geq 0$ for all vectors $\mathbf{a} \in \mathbb{R}^n$. To see this, let $\mathbf{X}$ be a random vector and let $\mathbf{a}$ be any $n \times 1$ vector of constants. Then $Y = \mathbf{a}'\mathbf{X}$ is a random variable and, hence, has nonnegative variance; i.e.,
$0 \leq Var(Y) = Var(\mathbf{a}'\mathbf{X}) = \mathbf{a}'\, Cov(\mathbf{X})\, \mathbf{a}.$
Note that both “Cov” and “Var” are used to represent the covariance matrix of the vector $\mathbf{X}$. See for example this Wikipedia remark:
Nomenclatures differ. Some statisticians, following the probabilist William Feller, call the matrix $\mathbf{\Sigma}$ the variance of the random vector $\mathbf{X}$, because it is the natural generalization to higher dimensions of the 1-dimensional variance. Others call it the covariance matrix, because it is the matrix of covariances between the scalar components of the vector $\mathbf{X}$.
2.7 Transformations for Several Random Variables
one-to-one transformation
$\mathbf{Theorem.}$ Consider an integral of the form
$\int \cdots \int_{A} f(x_1, x_2, …, x_n)\, dx_1 dx_2 \cdots dx_n$
taken over a subset $A$ of an $n$-dimensional space $\mathbf{S}$. Let
$y_i = u_i (x_1, x_2, …, x_n)$ for all $i = 1, 2, …, n$,
together with the inverse functions
$x_i = w_i (y_1, y_2, …, y_n)$ for all $i = 1, 2, …, n$,
define a one-to-one transformation that maps $\mathbf{S}$ onto $\mathbf{T}$.
Note that the Jacobian is given by
$J = \begin{vmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_n} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial x_n}{\partial y_1} & \frac{\partial x_n}{\partial y_2} & \cdots & \frac{\partial x_n}{\partial y_n}
\end{vmatrix}.$
We assume that $J$ is not identically zero in $\mathbf{T}$.
Thus, if the pdf of $X_1, …, X_n$ is $f(x_1, …, x_n)$, then the pdf of $Y_1, …, Y_n$ is
$g(y_1, …, y_n) = f[w_1(y_1, …, y_n), …, w_n(y_1, …, y_n)]\, |J|$
for $(y_1, …, y_n) \in \mathbf{T}$, and is zero elsewhere.
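A small numerical sketch of this change-of-variables formula (my own illustration, not one of the textbook examples): take $Y_1 = X_1 + X_2$, $Y_2 = X_1 - X_2$ with $X_1, X_2$ iid $N(0, 1)$, so the inverse is $x_1 = (y_1 + y_2)/2$, $x_2 = (y_1 - y_2)/2$ with $|J| = 1/2$; the helper functions `f_x` and `g_y` below are hypothetical names for the two densities.

```python
import numpy as np

rng = np.random.default_rng(5)

def f_x(x1, x2):              # joint pdf of (X1, X2), iid standard normal
    return np.exp(-(x1**2 + x2**2) / 2) / (2 * np.pi)

def g_y(y1, y2):              # pdf of (Y1, Y2) from the change-of-variables formula
    return f_x((y1 + y2) / 2, (y1 - y2) / 2) * 0.5

# Check P(Y1 <= 1, Y2 <= 0): simulation vs. numerical integration of g.
x1, x2 = rng.standard_normal((2, 1_000_000))
y1, y2 = x1 + x2, x1 - x2
print(np.mean((y1 <= 1.0) & (y2 <= 0.0)))

u = np.linspace(-8.0, 1.0, 400)          # y1 grid, (-inf, 1] truncated at -8
v = np.linspace(-8.0, 0.0, 400)          # y2 grid, (-inf, 0] truncated at -8
U, V = np.meshgrid(u, v)
du, dv = u[1] - u[0], v[1] - v[0]
print(np.sum(g_y(U, V)) * du * dv)       # both prints are ~0.38
```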
$\mathbf{Example\ 2.7.1.}$
$\mathbf{Example\ 2.7.2.}$
not one-to-one transformation
We now consider some other problems that are encountered when transforming variables. Let $X$ have the Cauchy pdf
$f(x) = \frac{1}{\pi (1 + x^2)}, \quad -\infty < x < \infty,$
and let $Y = X^2$. This transformation maps the space of $X$, namely $\mathbf{S} = \{ x : -\infty < x < \infty \}$, onto $\mathbf{T} = \{ y : 0 \leq y < \infty \}$.
But, to each $y \in \mathbf{T}$, with the exception of $y = 0$, there correspond two points $x \in \mathbf{S}$. In such an instance, we represent the space of $X$ as the union of two disjoint sets:
$\mathbf{A_1} = \{ x : -\infty < x < 0 \} \quad \text{and} \quad \mathbf{A_2} = \{ x : 0 \leq x < \infty \}.$
Note that $\mathbf{A_1}$ and $\mathbf{A_2}$ are mapped onto different sets. Our difficulty is caused by the fact that $x = 0$ is an element of $\mathbf{S}$.
Since $X$ is a continuous random variable, no probabilities change if we redefine $f(0)$ as $0$ instead of $1 / \pi$. Then the transformation $y = x^2$, with the inverse $x = -\sqrt{y}$, maps $\mathbf{A_1}$ onto $\mathbf{T}$ and is one-to-one. Likewise, the transformation $y = x^2$, with the inverse $x = \sqrt{y}$, maps $\mathbf{A_2}$ onto $\mathbf{T}$ and is one-to-one as well.
Now, consider the probability $P(Y \in B)$, where $B \subset \mathbf{T}$. Let $\mathbf{A_3} = \{ x : x = -\sqrt{y},\ y \in B \} \subset \mathbf{A_1}$ and $\mathbf{A_4} = \{ x : x = \sqrt{y},\ y \in B \} \subset \mathbf{A_2}$. Then $Y \in B$ when and only when $X \in \mathbf{A_3}$ or $X \in \mathbf{A_4}$;
hence, we can write
$P(Y \in B) = P(X \in \mathbf{A_3}) + P(X \in \mathbf{A_4}) = \int_{B} f(-\sqrt{y}) \left| \frac{d(-\sqrt{y})}{dy} \right| dy + \int_{B} f(\sqrt{y}) \left| \frac{d(\sqrt{y})}{dy} \right| dy.$
Thus, the pdf of $Y$ is given by
$g(y) = f(-\sqrt{y}) \frac{1}{2\sqrt{y}} + f(\sqrt{y}) \frac{1}{2\sqrt{y}} = \frac{1}{\pi \sqrt{y} (1 + y)}, \quad 0 < y < \infty,$
and $g(y) = 0$ elsewhere.
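The derived density is easy to sanity-check by simulation; the sketch below (my own check) samples a standard Cauchy via the inverse-cdf method $X = \tan(\pi(U - 1/2))$ and compares a tail probability of $Y = X^2$ with the integral of $g$.

```python
import numpy as np

rng = np.random.default_rng(6)

# Standard Cauchy via the inverse-cdf method: X = tan(pi*(U - 1/2)).
u = rng.uniform(size=1_000_000)
x = np.tan(np.pi * (u - 0.5))
y = x ** 2

def g(y):                               # pdf of Y = X^2 derived above
    return 1.0 / (np.pi * np.sqrt(y) * (1.0 + y))

# Compare P(Y <= 4) from simulation with the integral of g over (0, 4].
print(np.mean(y <= 4.0))
edges = np.linspace(0.0, 4.0, 200_001)
mids = (edges[:-1] + edges[1:]) / 2     # midpoint rule handles the 1/sqrt(y) singularity
print(np.sum(g(mids)) * (edges[1] - edges[0]))   # both ~0.70
```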
$\mathbf{Note.}$ Inspired by the preceding discussion, we now consider continuous random variables $X_1, X_2, …, X_n$ with joint probability density function $f(x_1, x_2, …, x_n)$. Let $\mathbf{S}$ be the support of $f$ and consider the transformation $y_i = u_i(x_1, x_2, …, x_n)$, which maps $\mathbf{S}$ onto $\mathbf{T}$, where $\mathbf{T}$ is in the $y_1 y_2 \cdots y_n$ space. But the transformation may not be one-to-one.
Suppose that $\mathbf{S}$ can be written as the union of a finite number, say $k$, of mutually disjoint sets $A_1, A_2, …, A_k$, so that $S = \bigcup_{i=1}^{k} A_i$, and the transformation is one-to-one on each $A_i$ and maps each $A_i$ onto $\mathbf{T}$. Thus to each point in $\mathbf{T}$ there corresponds exactly one point in each of $A_1, A_2, …, A_k$ (so the transformation is a $k$-to-one mapping).
Since the transformation is one-to-one on each $A_i$ and onto $\mathbf{T}$, there is an inverse transformation mapping $\mathbf{T}$ onto $A_i$. Say the $i$th inverse transformation is
$x_1 = w_{1i}(y_1, …, y_n),\ x_2 = w_{2i}(y_1, …, y_n),\ \dots,\ x_n = w_{ni}(y_1, …, y_n), \quad i = 1, 2, …, k.$
Suppose the first partial derivatives are continuous, and let each Jacobian
$J_i = \begin{vmatrix}
\frac{\partial w_{1i}}{\partial y_1} & \cdots & \frac{\partial w_{1i}}{\partial y_n} \\
\vdots & & \vdots \\
\frac{\partial w_{ni}}{\partial y_1} & \cdots & \frac{\partial w_{ni}}{\partial y_n}
\end{vmatrix}, \quad i = 1, 2, …, k,$
be not identically equal to zero in $\mathbf{T}$.
Then, similarly, we can see that the pdf of $Y_1, …, Y_n$ is
$g(y_1, …, y_n) = \sum_{i=1}^{k} f[w_{1i}(y_1, …, y_n), …, w_{ni}(y_1, …, y_n)]\, |J_i|,$
provided that $(y_1, …, y_n) \in \mathbf{T}$, and equals $0$ elsewhere.
$\mathbf{Example\ 2.7.3.}$
$\mathbf{Example\ 2.7.4.}$
2.8 Linear Combinations of Random Variables
We considered linear combinations of random variables in Section 2.6 (see Thm 2.6.1 and Corollary 2.6.1). In this section we consider expectations, variances, and covariances of linear combinations of random variables. With $X_1, X_2, …, X_n$ as the random variables, we define the random variable $T = \sum_{i=1}^{n} a_i X_i$, where $a_1, …, a_n$ are specified constants.
$\mathbf{Thm. 2.8.1.}$ Suppose $E(X_i) = \mu_i$ for $i = 1, …, n$. Then $E(T) = \sum_{i = 1}^{n} a_i \mu_i$.
To obtain the variance of $T$, we first state a general result on covariances.
$\mathbf{Thm. 2.8.2.}$ Suppose $T$ is the linear combination above and that $W$ is another linear combination, $W = \sum_{i=1}^{m} b_i Y_i$, for random variables $Y_1, …, Y_m$ and specified constants $b_1, …, b_m$. If $E[X_i^2] < \infty$ and $E[Y_j^2] < \infty$ for $i = 1, …, n$ and $j = 1, …, m$, then
$Cov(T, W) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j Cov(X_i, Y_j).$
$\mathbf{Proof.}$ Using the definition of the covariance and Thm 2.8.1, we have the first equality below, while the second follows from the linearity of $E$:
$Cov(T, W) = E\left[ \left( \sum_{i=1}^{n} a_i (X_i - E(X_i)) \right) \left( \sum_{j=1}^{m} b_j (Y_j - E(Y_j)) \right) \right] = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j E[(X_i - E(X_i))(Y_j - E(Y_j))] = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j Cov(X_i, Y_j). \ _\blacksquare$
$\mathbf{Corollary 2.8.1.}$ Let $T = \sum_{i=1}^{n} a_i X_i$. Provided $E[X_i^2] < \infty$ for $i = 1, …, n$,
$Var(T) = Cov(T, T) = \sum_{i=1}^{n} a_i^2 Var(X_i) + 2 \sum_{i < j} a_i a_j Cov(X_i, X_j).$
$\mathbf{Corollary 2.8.2.}$ If $X_1, …, X_n$ are independent random variables and $Var(X_i) = \sigma_i^2$ for $i = 1, …, n$, then
$Var(T) = \sum_{i=1}^{n} a_i^2 \sigma_i^2.$
Note that we need only $X_i$ and $X_j$ to be uncorrelated for all $i \neq j$ to obtain this result.
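A quick simulation check of Corollary 2.8.2 (my own illustration, with arbitrarily chosen constants and variances):

```python
import numpy as np

rng = np.random.default_rng(7)

# Independent components with different variances (illustrative choice).
sigmas = np.array([1.0, 2.0, 0.5])
a = np.array([3.0, -1.0, 2.0])
X = rng.standard_normal((1_000_000, 3)) * sigmas   # X_i ~ N(0, sigma_i^2), independent

T = X @ a
print(T.var())                      # empirical Var(T)
print(np.sum(a**2 * sigmas**2))     # sum a_i^2 sigma_i^2 = 9 + 4 + 1 = 14
```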
Often we assume, in addition to independence, that the random variables have the same distribution (i.e., they are i.i.d.). We call such a collection of random variables a random sample, which we now state in a formal definition.
$\mathbf{Def\ 2.8.1.}$ (Random Sample). If the random variables $X_1, X_2, …, X_n$ are independent and identically distributed, i.e. each $X_i$ has the same distribution, then we say that these random variables constitute a random sample of size $n$ from that common distribution. We abbreviate independent and identically distributed by iid.
$\mathbf{Definition.}$ (Sample Mean and Sample Variance). Let $X_1, …, X_n$ be independent and identically distributed random variables with common mean $\mu$ and variance $\sigma^2$. The sample mean $\overline{X}$ and the sample variance $S^2$ are defined by
$\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad S^2 = \frac{\sum_{i=1}^n (X_i - \overline{X})^2}{n - 1} = \frac{\sum_{i=1}^n X_i^2 - n\overline{X}^2}{n - 1}.$
$\mathbf{Remark.}$ As $\overline{X}$ is a linear combination of the sample observations with $a_i = n^{-1}$, we have
$E(\overline{X}) = \mu \quad \text{and} \quad Var(\overline{X}) = \frac{\sigma^2}{n}.$
As $E(\overline{X}) = \mu$, we say that $\overline{X}$ is unbiased for $\mu$.
$\mathbf{Remark.}$ One reason for dividing by $n - 1$ instead of $n$ in $S^2$ is that it makes $S^2$ unbiased for $\sigma^2$.
From the fact that $E(X_i^2) = \sigma^2 + \mu^2$ and $E(\overline{X}^2) = (\sigma^2 / n) + \mu^2$, we have
$E(S^2) = \frac{1}{n-1} \left[ \sum_{i=1}^{n} E(X_i^2) - n E(\overline{X}^2) \right] = \frac{1}{n-1} \left[ n(\sigma^2 + \mu^2) - n \left( \frac{\sigma^2}{n} + \mu^2 \right) \right] = \sigma^2.$
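A short simulation (my own illustration; the normal distribution, $\mu$, $\sigma$, and $n$ are arbitrary choices) shows both unbiasedness results, and also how the $1/n$ divisor would bias the variance estimate downward:

```python
import numpy as np

rng = np.random.default_rng(8)

mu, sigma, n, reps = 5.0, 2.0, 10, 200_000
samples = rng.normal(mu, sigma, size=(reps, n))

xbar = samples.mean(axis=1)
s2_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1
s2_biased = samples.var(axis=1, ddof=0)     # divide by n

print(xbar.mean())                # ~mu            (X-bar is unbiased for mu)
print(xbar.var(), sigma**2 / n)   # ~sigma^2 / n
print(s2_unbiased.mean())         # ~sigma^2 = 4   (S^2 is unbiased)
print(s2_biased.mean())           # ~(n-1)/n * sigma^2 = 3.6 (biased low)
```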
Reference
[1] Hogg, R., McKean, J. & Craig, A., Introduction to Mathematical Statistics, Pearson 2019