
Maximum Likelihood Method - Part I

For point estimation, one of the most widely used procedures is maximum likelihood estimation (MLE). But one may wonder whether the MLE can really estimate the true value well. In this post, I will show that, under certain conditions (often called regularity conditions), these likelihood procedures are asymptotically optimal.


The true parameter asymptotically maximizes likelihood

Consider a random variable $X$ whose pdf $f(x; \theta)$ depends on an unknown parameter $\theta$ in a set $\Omega$. Suppose that we have an i.i.d. random sample $X_1, \cdots, X_n$. In this post, I will assume that $f$ is continuous and $\theta$ is a scalar, but the results extend to the discrete case and to the vector-parameter case.

The likelihood function is given by

$L(\theta; \mathbf{x}) = \prod_{i=1}^n f(x_i; \theta), \quad \theta \in \Omega,$

where $\mathbf{x} = (x_1, \cdots, x_n)'$. Note that there is no information loss in using the log-likelihood $l(\theta) = \log L(\theta)$, as the logarithm is one-to-one. Recall that a value $\widehat{\theta}$ that maximizes $L(\theta)$ is called a maximum likelihood estimator (MLE) of $\theta$.
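
As a concrete illustration (my own sketch, not part of the development above), the log-likelihood can be evaluated numerically. Here I assume an $N(\theta, 1)$ model; the peak of $l(\theta)$ lands near the sample mean, which is the MLE for that model.

```python
import numpy as np

# Minimal sketch: evaluate l(theta) = sum_i log f(x_i; theta) on a grid,
# assuming an N(theta, 1) model (an illustrative choice).
def log_likelihood(theta, x):
    """Log-likelihood of theta for i.i.d. N(theta, 1) observations x."""
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)   # sample with true theta_0 = 2

grid = np.linspace(0.0, 4.0, 401)
values = [log_likelihood(t, x) for t in grid]
print("grid argmax:", grid[int(np.argmax(values))], " sample mean:", round(x.mean(), 3))
```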

Let $\theta_0$ be the true value of $\theta$. The next theorem shows that the likelihood is asymptotically maximized at the true value $\theta_0$. In other words, the maximum of $L(\theta)$ asymptotically separates the true model at $\theta_0$ from models at $\theta \neq \theta_0$. To prove this, we need some assumptions.

$\mathbf{Assumption\ 1.1}$ (Regularity Conditions).
(R0) The cdfs are distinct; i.e., $\theta_1 \neq \theta_2 \to F(x_i; \theta_1) \neq F(x_i; \theta_2)$.
(R1) The pdfs have common support for all $\theta \in \Omega$.
(R2) The point $\theta_0$ is an interior point in $\Omega$.

(R0) states that the parameter identifies the pdf. (R1) states that the support of $X_i$ does not depend on $\theta$. This assumption is quite restrictive: for instance, the uniform distribution on $(0, \theta)$ does not satisfy (R1).

Now, we are ready to state and prove the theorem.

$\mathbf{Thm\ 1.1.}$ Suppose $E_{\theta_0}[f(X_i; \theta)/f(X_i; \theta_0)]$ exists for all $\theta$. Then, under (R0) and (R1),

$\displaystyle \lim_{n \to \infty} P_{\theta_0} [L(\theta_0; \mathbf{X}) > L(\theta; \mathbf{X})] = 1 \quad \text{for all } \theta \neq \theta_0.$


$Proof$. Note that $L(\theta_0; \mathbf{X}) > L(\theta; \mathbf{X}) \iff \frac{1}{n} \sum_{i=1}^n \log \frac{f(X_i; \theta)}{f(X_i; \theta_0)} < 0$.

By the Law of Large Numbers,

$\displaystyle \frac{1}{n} \sum_{i=1}^n \log \frac{f(X_i; \theta)}{f(X_i; \theta_0)} \overset{P}{\to} E_{\theta_0} \left[\log \frac{f(X_1; \theta)}{f(X_1; \theta_0)}\right].$


Since the logarithm is strictly concave, Jensen's inequality gives

$E_{\theta_0} \left[\log \frac{f(X_1; \theta)}{f(X_1; \theta_0)}\right] < \log E_{\theta_0} \left[\frac{f(X_1; \theta)}{f(X_1; \theta_0)}\right],$

where the inequality is strict because, by (R0), the ratio $f(X_1; \theta)/f(X_1; \theta_0)$ is not a degenerate (constant) random variable.


But $E_{\theta_0} \left[\frac{f(X_1; \theta)}{f(X_1; \theta_0)}\right] = \int \frac{f(x; \theta)}{f(x; \theta_0)} f(x; \theta_0)\, dx = \int f(x; \theta)\, dx = 1$, where the last equality uses (R1): because the pdfs have common support, $f(x; \theta)$ integrates to 1 over the same region. Hence $E_{\theta_0} \left[\log \frac{f(X_1; \theta)}{f(X_1; \theta_0)}\right] < \log 1 = 0$, and by the Law of Large Numbers the probability that the sample average above is negative converges to 1, which proves the statement.$_\blacksquare$

In summary, the likelihood function is asymptotically maximized at the true parameter $\theta_0$. So, in estimating $\theta_0$, it seems natural to consider the value of $\theta$ that maximizes the likelihood.
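
To see Theorem 1.1 in action, here is a small Monte Carlo sketch of my own (not from the reference), assuming an $N(\theta, 1)$ model with true value $\theta_0 = 0$ and a fixed alternative $\theta = 0.5$. The fraction of samples with $L(\theta_0; \mathbf{X}) > L(\theta; \mathbf{X})$ approaches 1 as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def loglik(theta, x):
    # log-likelihood under an assumed N(theta, 1) model; additive constants
    # cancel when comparing two values of theta, so they are dropped
    return -0.5 * np.sum((x - theta) ** 2)

theta0, theta_alt = 0.0, 0.5   # true value and a fixed alternative (illustrative)
reps = 2000
for n in [5, 20, 100, 500]:
    wins = 0
    for _ in range(reps):
        x = rng.normal(loc=theta0, scale=1.0, size=n)  # sample drawn under theta_0
        wins += loglik(theta0, x) > loglik(theta_alt, x)
    print(f"n={n:4d}  P[L(theta0) > L(theta)] ~ {wins / reps:.3f}")
```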


Invariance property of MLE

In the previous section, we saw that maximizing the likelihood is indeed a reasonable way to estimate the true parameter. From now on, we will derive some appealing properties of the MLE, which support maximum likelihood estimation as a useful point estimation procedure.

$\mathbf{Thm\ 1.2.}$ Let $\eta = g(\theta)$ be a parameter of interest. Suppose $\widehat{\theta}$ is the MLE of $\theta$. Then $g(\widehat{\theta})$ is the MLE of $\eta$; i.e., $\widehat{\eta} = g(\widehat{\theta})$.

$Proof$: First, consider the simple case where $g$ is one-to-one. The likelihood of interest is then $L(g^{-1}(\eta))$, and since $g$ is one-to-one,

$\underset{\theta}{\text{max }} L(\theta) = \underset{\eta = g(\theta)}{\text{max }} L(g^{-1}(\eta)) = \underset{\eta}{\text{max }} L(g^{-1}(\eta)).$

The maximum on the right occurs when $g^{-1}(\eta) = \widehat{\theta}$, i.e., at $\widehat{\eta} = g(\widehat{\theta})$.

In general, suppose $g$ is not one-to-one. For a fixed $\eta$, the largest likelihood attainable under the constraint $g(\theta) = \eta$ is the induced likelihood $M(\eta) = \sup_{\{\theta:\, g(\theta) = \eta\}} L(\theta)$, and the MLE of $\eta$ is defined as the maximizer of $M$. Since $\max_\eta M(\eta) = \max_\theta L(\theta) = L(\widehat{\theta})$, the maximum is attained by choosing $\widehat{\eta}$ so that the preimage $g^{-1}(\widehat{\eta})$ contains $\widehat{\theta}$, that is, $\widehat{\eta} = g(\widehat{\theta})._\blacksquare$
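
As a quick numerical check of the invariance property (my own sketch, assuming an exponential model with rate $\theta$, so the MLE of $\theta$ is $1/\overline{X}$ and the mean $\eta = g(\theta) = 1/\theta$ is the parameter of interest), maximizing the reparameterized likelihood directly gives the same answer as plugging $\widehat{\theta}$ into $g$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch of invariance for an assumed Exponential(rate = theta) model:
# the MLE of theta is 1/xbar, and the mean eta = g(theta) = 1/theta
# should have MLE g(theta_hat) = xbar.
rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=200)   # simulated data, true mean 2 (theta_0 = 0.5)

def neg_loglik_theta(theta):
    # -l(theta) for f(x; theta) = theta * exp(-theta * x)
    return -(len(x) * np.log(theta) - theta * np.sum(x))

def neg_loglik_eta(eta):
    # reparameterize through theta = g^{-1}(eta) = 1/eta
    return neg_loglik_theta(1.0 / eta)

theta_hat = minimize_scalar(neg_loglik_theta, bounds=(1e-6, 50.0), method="bounded").x
eta_hat = minimize_scalar(neg_loglik_eta, bounds=(1e-6, 50.0), method="bounded").x

print("g(theta_hat):", 1.0 / theta_hat, " direct eta_hat:", eta_hat, " xbar:", x.mean())
```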

$\mathbf{Example.}$ Large-sample confidence interval for the Bernoulli distribution

For example, consider Bernoulli random variables with probability of success $p$. It is clear that the MLE of $p$ is $\widehat{p} = \overline{X}$. The large-sample confidence interval for $p$ requires an estimate of $\sqrt{p(1-p)}$, and by the previous theorem the MLE of this quantity is $\sqrt{\widehat{p}(1-\widehat{p})}$. By the CLT, an approximate $(1-\alpha)$ confidence interval for $p$ is $\overline{X} \pm z_{\alpha/2} \sqrt{\overline{X}(1-\overline{X})/n}$.
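
As a quick numerical sketch (my own illustration with simulated data and an assumed true $p = 0.3$), the interval can be computed as follows:

```python
import numpy as np

# Sketch of the large-sample CI for p, using simulated Bernoulli data
# with an assumed true p = 0.3 (illustrative only).
rng = np.random.default_rng(3)
x = rng.binomial(n=1, p=0.3, size=400)

n = len(x)
p_hat = x.mean()                            # MLE of p
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)   # plug-in (MLE) standard error
z = 1.96                                    # approximate 97.5% standard normal quantile

lower, upper = p_hat - z * se_hat, p_hat + z * se_hat
print(f"p_hat = {p_hat:.3f},  approx. 95% CI = ({lower:.3f}, {upper:.3f})")
```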

Reference

[1] Hogg, R., McKean, J. & Craig, A., Introduction to Mathematical Statistics, Pearson 2019
