The ASTA team
Idea: choose \(\hat{p}\) to be the value of \(p\) that makes our observations as likely as possible.
Suppose we have observed \(Y_1=y_1, \ldots, Y_n=y_n\). Since the observations are independent, the probability of observing this is \[P(Y_1=y_1, \ldots, Y_n=y_n) = P(Y_1=y_1) \cdot \dotsm \cdot P(Y_n = y_n).\]
Note that \[P(Y_i = y_i) = \begin{cases} p, & y_i=1,\\ (1-p), & y_i = 0. \end{cases}\]
Therefore, if we let \(x = \sum_i y_i\) be the number of 1’s in our sample, \[P(Y_1=y_1, \ldots, Y_n=y_n) = p^x \cdot (1-p)^{(n-x)}.\]
This probability depends on the value of \(p\). We may think of it as a function of \(p\), called the likelihood function: \[L(p)= P(Y_1=y_1, \ldots, Y_n=y_n) = p^x \cdot (1-p)^{(n-x)}.\]
The maximum likelihood estimate \(\hat{p}\) is the value of \(p\) that maximizes the likelihood function.
Example: Suppose we take a sample of \(n=15\) observations and observe 5 ones and 10 zeros. The likelihood function becomes \[L(p)= p^5 (1-p)^{10}.\]
We plot the graph of \(L(p)\):
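A minimal sketch of how this curve can be drawn in R (illustrative code, not taken from the slides):
curve(x^5 * (1 - x)^10, from = 0, to = 1, xlab = "p", ylab = "L(p)")  # likelihood for x = 5, n = 15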
We seek the value of \(p\) that maximizes the likelihood function \[L(p)= p^x \cdot (1-p)^{(n-x)}.\]
Recall that \(\ln(x)\) is a strictly increasing function.
The value of \(p\) that maximizes \(L(p)\) also maximizes \(\ln(L(p))\).
This is the log-likelihood function
\[l(p) = \ln(L(p)) = x\ln(p) + (n-x)\ln(1-p).\]
In order to maximize \[l(p) = x\ln(p) + (n-x)\ln(1-p),\] we differentiate \[l'(p) = \frac{x}{p} - \frac{n-x}{1-p}.\]
The maximum must occur at a point where \(l'(p)=0\). Thus, we solve \[l'(p) = \frac{x}{p} - \frac{n-x}{1-p}=0.\]
Multiply by \(p(1-p)\) to get \[x(1-p) - (n-x)p=0\] \[x-xp - np +xp =0\] \[ x=np\] \[p = \frac{x}{n}.\]
Note that this must indeed be a maximum point since \[\lim_{p\to 0} l(p) = \lim_{p\to 1} l(p) = -\infty.\]
Our maximum likelihood estimate of \(p\) is \(\hat{p}= \frac{x}{n}.\)
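For the example above with \(x=5\) ones among \(n=15\) observations, this gives \(\hat{p} = 5/15 \approx 0.33\), which is exactly where the plotted likelihood \(L(p)=p^5(1-p)^{10}\) attains its maximum.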
Estimating a single probability is a simple application of maximum likelihood estimation, which could easily have been handled by more direct methods.
Logistic regression is a more complex case, where we want to model a probability \(p(x)\) that depends on a predictor variable \(x\).
In logistic regression, \(p(x)\) is modelled by a logistic function \[p(x)=\frac{1}{1+e^{-(\alpha + \beta x)}}.\]
Graph of \(p(x)\) when \(\alpha=0\) and \(\beta=1\).
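A minimal sketch of how this graph can be produced in R (illustrative code, not from the slides):
curve(1/(1 + exp(-(0 + 1*x))), from = -6, to = 6, ylab = "p(x)")  # logistic function with alpha = 0, beta = 1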
How to estimate \(\alpha\) and \(\beta\)?
A sample consists of \((x_1,y_1),\ldots,(x_n,y_n)\), where \(x_i\) is the predictor and \(y_i\) is the response, which is either 0 or 1.
Since the observations are independent, the probability of our observations is \[P(Y_1=y_1,\ldots, Y_n=y_n) = \prod_i P(Y_i=y_i) = \prod_i p(x_i)^{y_i} (1-p(x_i))^{1-y_i}\] since \[p(x_i)^{y_i} (1-p(x_i))^{1-y_i} = \begin{cases} p(x_i), & y_i=1,\\ 1-p(x_i), & y_i=0.\end{cases}\]
Inserting what \(p(x_i)\) is, we obtain a function of the unknown parameters \(\alpha\) and \(\beta\): \[L(\alpha,\beta) = P(Y_1=y_1,\ldots, Y_n=y_n) = \prod_i \Big(\frac{1}{1+e^{-(\alpha + \beta x_i)}}\Big)^{y_i} \Big(1-\frac{1}{1+e^{-(\alpha + \beta x_i)}}\Big)^{1-y_i}\]
Again, it is easier to maximize the log-likelihood \[l(\alpha,\beta) = \sum_i \Big( y_i\ln(p(x_i)) + (1-y_i)\ln(1-p(x_i))\Big). \]
However, this maximum can only be found using numerical methods.
library(ISLR)
x<-Default$income/10000 # Annual income in 10000 dollars
y<-as.numeric(Default$default=="Yes") # Loan default, 1 means "Yes"
loglik <- function(theta) {
  alpha <- theta[1]
  beta <- theta[2]
  px <- 1/(1 + exp(-alpha - beta*x))        # p(x_i) for each observation
  logpy <- y*log(px) + (1 - y)*log(1 - px)  # log-likelihood contribution of each observation
  sum(logpy)                                # log-likelihood l(alpha, beta)
}
loglik(c(2,2))
## [1] -84250
The maximum can be found numerically with the optim() function in R. We pass the log-likelihood function loglik along with the starting value c(2,2); the argument control=list(fnscale=-1) ensures that we maximize rather than minimize.
## $par
## [1] -3.099 -0.081
##
## $value
## [1] -1458
##
## $counts
## function gradient
## 69 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
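The output above is of the kind produced by a call of the following form (a sketch; the exact call does not appear in the text):
optim(c(2, 2), loglik, control = list(fnscale = -1))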
We can plot the estimated logistic function \[ \hat{p}(x) = \frac{1}{1+e^{3.099 + 0.081x}}.\]
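A sketch of how this curve can be drawn in R (the plotting range is chosen for illustration):
curve(1/(1 + exp(3.099 + 0.081*x)), from = 0, to = 8,
      xlab = "x (income in 10000 dollars)", ylab = "estimated p(x)")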
The maximum likelihood estimates of \(\alpha\) and \(\beta\) can be found directly using R:
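The summary shown below corresponds to a call of this form (the summary() wrapper is an assumption; only the output appears in the text):
summary(glm(y ~ x, family = "binomial"))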
##
## Call:
## glm(formula = y ~ x, family = "binomial")
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0941 0.1463 -21.16 <2e-16 ***
## x -0.0835 0.0421 -1.99 0.047 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 2916.7 on 9998 degrees of freedom
## AIC: 2921
##
## Number of Fisher Scoring iterations: 6
Now suppose \(Y_1, \ldots, Y_n\) are independent observations from a normal distribution, \[ Y_i \sim N(\mu, \sigma). \]
We would like to estimate the unknown parameters \(\mu\) and \(\sigma\).
For a continuous variable \(Y\) we have \(P(Y=y)=0\) for all \(y\).
Instead we consider the probability density function
\[ f(y ) = {\frac {1}{\sigma {\sqrt {2\pi }}}} \exp \left [ {-{\frac {1}{2}}\left({\frac {y-\mu }{\sigma }}\right)^{2}} \right ] \]
The most likely values are the ones where \(f(y)\) is large.
Thus we will use \(f(y)\) as a measure of how likely it is to observe \(Y=y\).
Since \(Y_1,\ldots, Y_n\) are independent observations, the joint density function becomes a product of marginal densities: \[f_{(Y_1,\ldots,Y_n)}(y_1,\ldots, y_n) = \prod_i f_{Y_i}(y_i). \]
If we have observed a sample \(Y_1=y_1,\ldots, Y_n=y_n\), our likelihood function is defined as
\[\begin{align} L(\mu, \sigma) =f_{(Y_1,\ldots,Y_n)}(y_1,\ldots, y_n) = \prod_i f_{Y_i}(y_i) = \prod_i \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(y_i-\mu)^2}{2\sigma^2}}. \end{align}\]
Again, it is easier to work with the log-likelihood: \[ l(\mu,\sigma) = \ln( L(\mu, \sigma) ) = \sum_i \ln\Big(\frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(y_i-\mu)^2}{2\sigma^2}} \Big) \] \[= \sum_i\Big( -\ln(\sigma \sqrt{2\pi}) -\frac{(y_i-\mu)^2}{2\sigma^2}\Big) = -n\ln(\sigma \sqrt{2\pi}) -\sum_i\frac{(y_i-\mu)^2}{2\sigma^2}. \]
We first differentiate with respect to \(\mu\) and set the derivative equal to zero: \[\frac{\partial}{\partial \mu } l(\mu,\sigma) = \sum_i\frac{2(y_i-\mu)}{2\sigma^2}= \frac{1}{\sigma^2}\sum_i (y_i-\mu) = \frac{1}{\sigma^2}\Big(\sum_i y_i-n\mu \Big)= 0. \]
We get that \(n\mu = \sum_i y_i\), so \(\mu = \frac{1}{n}\sum_i y_i = \bar{y}\).
Then with respect to \(\sigma\):
\[\frac{\partial}{\partial \sigma } l(\mu,\sigma) = -\frac{n}{\sigma} + \sum_i\frac{(y_i-\mu)^2}{\sigma^3} = 0 \]
Inserting \(\mu=\bar{y}\) and multiplying by \(\sigma\) gives \[ -{n} + \sum_i\frac{(y_i-\bar{y})^2}{\sigma^2} = 0 \] \[n=\frac{1}{\sigma^2} \sum_i {(y_i-\bar{y})^2} \] \[ \sigma^2 = \frac{1}{n}\sum_i(y_i-\bar{y})^2.\]
The maximum likelihood estimates are thus \[\hat{\mu} = \frac{1}{n} \sum_i y_i = \bar{y}\] \[ \hat{\sigma}^2 = \frac{1}{n}\sum_i(y_i-\bar{y})^2.\]
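A minimal sketch of computing these estimates directly in R, using a simulated sample for illustration (note that sd() divides by \(n-1\) and therefore differs slightly from the maximum likelihood estimate):
set.seed(1)                                 # reproducible simulated example
obs <- rnorm(100, mean = 5, sd = 2)         # illustration only, not data from the text
n <- length(obs)
mu_hat <- mean(obs)                         # MLE of mu: the sample mean
sigma_hat <- sqrt(sum((obs - mu_hat)^2)/n)  # MLE of sigma: divides by n, unlike sd()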
Example: We consider R's built-in trees data:
## Girth Height Volume
## 1 8.3 70 10
## 2 8.6 65 10
## 3 8.8 63 10
## 4 10.5 72 16
## 5 10.7 81 19
## 6 10.8 83 20
We assume that Height is normally distributed. dnorm(y, mean = mu, sd = sigma) gives the normal density \(f(y)\) with mean \(\mu\) and standard deviation \(\sigma\) evaluated at \(y\).
loglik_normal <- function(theta) {
  mu <- theta[1]
  sigma <- theta[2]
  y <- trees$Height                       # the observed heights
  fy <- dnorm(y, mean = mu, sd = sigma)   # density of each observation
  sum(log(fy))                            # log-likelihood l(mu, sigma)
}
loglik_normal(c(1,5))
## [1] -3590
We find the maximum numerically with optim():
## $par
## [1] 76.0 6.3
##
## $value
## [1] -101
##
## $counts
## function gradient
## 103 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
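As above, output of this kind can be produced by a call of the following form (a sketch; the starting value c(1, 5) is taken from the evaluation above, and the exact call does not appear in the text):
optim(c(1, 5), loglik_normal, control = list(fnscale = -1))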
For comparison, the sample mean \(\bar{y}\), the usual sample standard deviation (which divides by \(n-1\)), and the maximum likelihood estimate \(\hat{\sigma}\) (which divides by \(n\)) are:
## [1] 76
## [1] 6.4
## [1] 6.3
Suppose \(\theta \in \mathbb{R}\) is a parameter that we estimate by \(\hat{\theta}\) using maximum likelihood estimation. Then, under suitable regularity conditions, the following results can be shown mathematically.
Consistency: For all \(\varepsilon >0\), \[\lim_{n\to \infty} P(|\theta - \hat \theta|>\varepsilon ) = 0\]
Central limit theorem: When \(n\to \infty\), \[ \sqrt{n} (\hat \theta - \theta) \to N(0, \sigma_\theta^2).\] That is, for large \(n\), \[ \sqrt{n} (\hat \theta - \theta) \approx N(0, \sigma_\theta^2),\] or equivalently, \[ \hat \theta \approx N\Big(\theta, \frac{\sigma_\theta^2}{n}\Big).\]
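For example, in the Bernoulli case from the beginning of this section, \(\hat{\theta} = \hat{p} = x/n\) and \(\sigma_\theta^2 = p(1-p)\), so the result specializes to \[\hat{p} \approx N\Big(p, \frac{p(1-p)}{n}\Big)\] for large \(n\).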