---
title: "Likelihood and maximum likelihood estimation"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: false
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: false
    keep_tex: true
    highlight: tango
    number_sections: true
    toc: true

---

# Maximum likelihood estimation of a probability

----

## Estimating a probability

```{r, include = FALSE}
options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic","ISLR"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

```{r, include=FALSE}
library(mosaic)
```

* Assume that we want to estimate a probability $p$ of a certain event, e.g.\
  * the probability that a bank customer will default on their loan
  * the probability that a customer will buy a certain product

* We take a sample of $n$ observations $Y_1, \ldots, Y_n$, where
  * $Y_i=1$ if the event happens,
  * $Y_i=0$ if the event does not happen,
  * The $Y_i$, $i=1,\ldots ,n$, are independent random variables with $P(Y_i = 1)=p$.
  
* Let $X=\sum_i Y_i$ be the number of ones in our sample. The natural estimate for $p$ is
   $$\hat{P} = \frac{X}{n}.$$
  * Theoretical justification?
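
* As a quick illustration (a simulated sketch with an assumed true $p=0.3$, not data used elsewhere in these slides), we can generate a Bernoulli sample in R and compute $X/n$:

```{r}
# Sketch: simulate n = 100 independent 0/1 observations with an assumed
# true probability p = 0.3 and compute the estimate X/n
set.seed(1)
y_sim <- rbinom(100, size = 1, prob = 0.3)
sum(y_sim) / length(y_sim)
```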
  
----
  
## The likelihood function

* Idea: choose $\hat{P}$ to be the value of $p$ that makes our observations *as likely as possible*.

* Suppose we have observed $Y_1=y_1, \ldots, Y_n=y_n$. Since the observations are independent, the probability of observing this is $$P(Y_1=y_1, \ldots, Y_n=y_n) = P(Y_1=y_1) \cdot \dotsm \cdot P(Y_n = y_n).$$

* Note that
$$P(Y_i = y_i) = \begin{cases} p, & y_i=1,\\
(1-p), & y_i = 0.
\end{cases}$$

* Therefore, if we let $x = \sum_i y_i$ be the number of 1's in our sample, 
 $$P(Y_1=y_1, \ldots, Y_n=y_n) = p^x \cdot (1-p)^{(n-x)}.$$

 
* This probability depends on the value of $p$. We may think of it as a function
$$L(p)= P(Y_1=y_1, \ldots, Y_n=y_n) = p^x \cdot (1-p)^{(n-x)}.$$
  * This is called the **likelihood function**.

* The **maximum likelihood estimate**  $\hat{p}$ is the value of $p$ that maximizes the likelihood function.

----

## Likelihood function - example

* **Example:** Suppose we take a sample of $n=15$ observations. We observe 5 ones and 10 zeros. The likelihood function becomes 
$$L(p)= p^5 (1-p)^{10}$$

* We plot the graph of $L(p)$:

```{r, echo=FALSE, fig.height=6, fig.width=8}
p<-1:100/100
plot(p,p^5*(1-p)^10,type="l",ylab="L(p)")
```

* The probability of our observations seems to be largest when $p$ is around $1/3$.
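
* We can also locate the maximum numerically (a small sketch using base R's `optimize()`; the analytical solution is derived below):

```{r}
# Sketch: numerically maximize L(p) = p^5 * (1 - p)^10 over the interval (0, 1)
L <- function(p) p^5 * (1 - p)^10
optimize(L, interval = c(0, 1), maximum = TRUE)$maximum
```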

----

## The log-likelihood function

* We seek the value of $p$ that maximizes the likelihood function
$$L(p)= p^x \cdot (1-p)^{(n-x)}.$$

* Recall that the natural logarithm $\ln$ is a strictly increasing function. 

* The value of $p$ that maximizes $L(p)$ also maximizes $\ln(L(p))$.

* This is the **log-likelihood function**

$$l(p) = \ln(L(p)) = x\ln(p) + (n-x)\ln(1-p).$$

* It is often easier to maximize the log-likelihood function.
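
* As a small check of this (a sketch reusing the example with $x=5$ and $n=15$ from before), numerically maximizing the log-likelihood gives the same value of $p$:

```{r}
# Sketch: l(p) = 5*log(p) + 10*log(1 - p) peaks at the same p (about 1/3) as L(p)
l <- function(p) 5 * log(p) + 10 * log(1 - p)
optimize(l, interval = c(0, 1), maximum = TRUE)$maximum
```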

----

## Maximum likelihood estimation

* In order to maximize 
$$l(p) =  x\ln(p) + (n-x)\ln(1-p),$$
we differentiate 
$$l'(p) = \frac{x}{p} - \frac{n-x}{1-p}.$$

* The maximum must be attained at a point where $l'(p)=0$. Thus, we solve
$$l'(p) = \frac{x}{p} - \frac{n-x}{1-p}=0.$$

* Multiply by $p(1-p)$ to get
$$x(1-p) - (n-x)p=0$$
$$x-xp - np +xp =0$$
$$ x=np$$
$$p = \frac{x}{n}.$$
* Note that this must indeed be a maximum point since $$\lim_{p\to 0} l(p) = \lim_{p\to 1} l(p) = -\infty.$$

* Our **maximum likelihood estimate** of $p$ is $\hat{p}= \frac{x}{n}.$


# Maximum likelihood for logistic regression

----

## The logistic regression model

* Estimation of a probability was a simple use of maximum likelihood estimation, which could easily have been treated by more direct methods.

* **Logistic regression** is a more complex case, where we want to model a probability $p(x)$ that depends on a predictor variable $x$.
  * E.g. the probability of a customer buying a certain product as a function of their monthly income.
  
* In logistic regression, $p(x)$ is modelled by a logistic function
$$p(x)=\frac{1}{1+e^{-(\alpha + \beta x)}}.$$

* Graph of $p(x)$ when $\alpha=0$ and $\beta=1$.
```{r, echo=FALSE}
x<-(-100):100/20
y<-1/(1+exp(-x))
plot(x,y, type="l",ylab="p(x)")
```
  * $\alpha$ shifts the graph along the $x$-axis.
  * $\beta$ determines how steep the graph is (see the sketch below).
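
* A small sketch illustrating the roles of the two parameters (the grid `xs` and the helper `p_logistic` are just made up for this illustration):

```{r}
# Sketch: changing alpha shifts the logistic curve, changing beta makes it steeper
xs <- seq(-5, 5, by = 0.05)
p_logistic <- function(x, alpha, beta) 1 / (1 + exp(-(alpha + beta * x)))
plot(xs, p_logistic(xs, alpha = 0, beta = 1), type = "l",
     ylab = "p(x)", xlab = "x", ylim = c(0, 1))
lines(xs, p_logistic(xs, alpha = 2, beta = 1), lty = 2)  # alpha = 2: curve shifted left
lines(xs, p_logistic(xs, alpha = 0, beta = 3), lty = 3)  # beta = 3: steeper curve
```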

* How to estimate $\alpha$ and $\beta$?

----

## Maximum likelihood estimation for logistic regression

* A sample consists of $(x_1,y_1),\ldots,(x_n,y_n)$, where $x_i$ is the predictor and $y_i$ is the response, which is either 0 or 1. 

* The probability of our observations is
$$P(Y_1=y_1,\ldots, Y_n=y_n) = \prod_i P(Y_i=y_i) = \prod_i p(x_i)^{y_i} (1-p(x_i))^{1-y_i}$$
since
$$p(x_i)^{y_i} (1-p(x_i))^{1-y_i} = \begin{cases}
p(x_i), & y_i=1,\\
1-p(x_i), & y_i=0.\end{cases}$$

* Inserting the expression for $p(x_i)$, we obtain a function of the unknown parameters $\alpha$ and $\beta$:
  $$L(\alpha,\beta) = P(Y_1=y_1,\ldots, Y_n=y_n) = \prod_i \Big(\frac{1}{1+e^{-(\alpha + \beta x_i)}}\Big)^{y_i} \Big(1-\frac{1}{1+e^{-(\alpha + \beta x_i)}}\Big)^{1-y_i}$$ 
  
* Again, it is easier to maximize the log-likelihood
$$l(\alpha,\beta) = \sum_i \Big( y_i\ln(p(x_i)) + (1-y_i)\ln(1-p(x_i))\Big). $$
  
* However, this maximum can only be found using numerical methods.

----

## Logistic regression - example

* We consider a dataset from the ISLR package on whether or not 10,000 bank customers default on their loans.
  * Response: default (1=yes, 0=no)
  * Predictor: income

```{r, warning = FALSE}
library(ISLR)
x <- Default$income / 10000               # annual income in units of 10,000 dollars
y <- as.numeric(Default$default == "Yes") # loan default, 1 means "Yes"
```

* We want to model the probability of default as a logistic function of income
$$p(x)=\frac{1}{1+e^{-(\alpha + \beta x)}}.$$

----

## Logistic regression - example continued

* We write a function in R that computes the log-likelihood as a function of the vector $\theta=(\alpha,\beta)^T$. 
  * We first compute a vector `px` that contains all the probabilities $p(x_i)$. 
  * Then we compute the vector `logpy`, which contains all the $\ln(P(Y_i=y_i))=y_i\ln(p(x_i))+(1-y_i)\ln(1-p(x_i))$.
  * Finally, we compute the log-likelihood with the formula
$$l(\alpha,\beta) = \sum_i \Big(y_i\ln(p(x_i))+(1-y_i)\ln(1-p(x_i))\Big)$$

```{r}
loglik <- function(theta) {
  alpha <- theta[1]
  beta <- theta[2]
  px <- 1 / (1 + exp(-alpha - beta * x))        # p(x_i) for all observations
  logpy <- y * log(px) + (1 - y) * log(1 - px)  # ln(P(Y_i = y_i))
  sum(logpy)                                    # the log-likelihood
}
loglik(c(2, 2))
```

* We maximize the log-likelihood using the `optim()` function in R.
  * It needs an initial guess of $\theta$. Here we use `c(2,2)`.
  * The option `control=list(fnscale=-1)` ensures that we maximize rather than minimize.

```{r}  
optim(c(2, 2), loglik, control = list(fnscale = -1))
``` 

* We obtain the maximum likelihood estimates $\hat{\alpha}= -3.099$ and $\hat{\beta}=-0.081$.
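
* In practice it is convenient to store the result of `optim()` and read the estimates off its `$par` component (a small sketch repeating the call above; `fit` is just a name chosen here):

```{r}
fit <- optim(c(2, 2), loglik, control = list(fnscale = -1))
fit$par    # maximum likelihood estimates of (alpha, beta)
fit$value  # the maximized log-likelihood
```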

----

## Logistic regression - example continued

* We can plot the estimated logistic function 
$$ \hat{p}(x) = \frac{1}{1+e^{3.099 + 0.081x}}.$$
```{r, echo=FALSE}
xx<-(1:100)/10
yy<-1/(1+exp(3.099+0.081*xx))
plot(xx,yy,type="l",ylab="p(x)",xlab="x")
```
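
* For example, plugging an annual income of 50,000 dollars ($x=5$) into the estimated function gives a small estimated default probability (a quick sketch using the estimates above):

```{r}
# Estimated probability of default at x = 5, i.e. an annual income of 50,000 dollars
1 / (1 + exp(3.099 + 0.081 * 5))
```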

* The maximum likelihood estimates of $\alpha$ and $\beta$ can be found directly using R:

```{r}
model <- glm(y ~ x, family = "binomial")
summary(model)
```

# Maximum likelihood estimation with continuous variables

----

## The probability density function

* Suppose we have a sample $Y_1,\ldots,Y_n$ of independent variables with

$$
Y_i \sim N(\mu, \sigma)
$$

* We would like to estimate the unknown parameters $\mu$ and $\sigma$.

* For a continuous variable $Y$ we have $P(Y=y)=0$ for all $y$. 
  * We cannot use the probability of observing a given outcome to define the likelihood function.

* Instead we consider the probability density function

$$
f(y ) = {\frac {1}{\sigma {\sqrt {2\pi }}}} \exp \left [ {-{\frac {1}{2}}\left({\frac {y-\mu }{\sigma }}\right)^{2}} \right ]
$$

* E.g. for $\mu = 0$ and $\sigma = 1$:

```{r, echo=FALSE}
plotFun(dnorm(x, mean = 0, sd = 1) ~ x, xlim = c(-3, 3),xlab="y",ylab="f(y)")
```

* The most likely values are the ones where $f(y)$ is large.

* Thus we will use $f(y)$ as a measure of how likely it is to observe $Y=y$.

----

## The likelihood function for $n$ observations

* Since $Y_1,\ldots, Y_n$ are independent observations, the joint density function becomes a product of marginal densities:
$$f_{(Y_1,\ldots,Y_n)}(y_1,\ldots, y_n) = \prod_i f_{Y_i}(y_i). $$

* If we have observed a sample $Y_1=y_1,\ldots, Y_n=y_n$, our likelihood function is defined as

\begin{align}
  L(\mu, \sigma) =f_{(Y_1,\ldots,Y_n)}(y_1,\ldots, y_n) = \prod_i f_{Y_i}(y_i) = \prod_i \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(y_i-\mu)^2}{2\sigma^2}}.
\end{align}

* The maximum likelihood estimate $(\hat{\mu},\hat{\sigma})$ is the value of $(\mu,\sigma)$ that maximizes the likelihood function. 
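
* A small sketch (with a made-up sample, purely for illustration): the likelihood is a product of normal densities, which can be evaluated with `dnorm()`:

```{r}
# Sketch: likelihood and log-likelihood of a made-up sample at mu = 1, sigma = 1
ys <- c(1.2, 0.7, 2.1, 1.5)
prod(dnorm(ys, mean = 1, sd = 1))      # L(mu = 1, sigma = 1)
sum(log(dnorm(ys, mean = 1, sd = 1)))  # l(mu = 1, sigma = 1)
```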

----

## Log-likelihood function in the normal case

* We found the likelihood function

\begin{align}
  L(\mu, \sigma) = \prod_i \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(y_i-\mu)^2}{2\sigma^2}}.
\end{align}

* Again it is easier to maximize the log-likelihood function.

$$
 l(\mu,\sigma) = \ln( L(\mu, \sigma) )  = \sum_i \ln\Big(\frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(y_i-\mu)^2}{2\sigma^2}} \Big)
$$
$$=  \sum_i\Big( -\ln(\sigma \sqrt{2\pi})  -\frac{(y_i-\mu)^2}{2\sigma^2}\Big) = -n\ln(\sigma \sqrt{2\pi})  -\sum_i\frac{(y_i-\mu)^2}{2\sigma^2} $$

* We find the partial derivatives and set them equal to 0. First with respect to $\mu$:  

$$\frac{\partial}{\partial \mu } l(\mu,\sigma) = \sum_i\frac{2(y_i-\mu)}{2\sigma^2}= \frac{1}{\sigma^2}\sum_i (y_i-\mu) = \frac{1}{\sigma^2}\Big(\sum_i y_i-n\mu \Big)= 0 $$

* We get that $n\mu = \sum_i y_i$, so $\mu = \frac{1}{n}\sum_i y_i = \bar{y}$.

* Then with respect to $\sigma$:

$$\frac{\partial}{\partial \sigma } l(\mu,\sigma) = -\frac{n}{\sigma}  + \sum_i\frac{(y_i-\mu)^2}{\sigma^3} = 0 $$

* We multiply by $\sigma$ and insert $\mu = \bar{y}$:

$$ -{n}  + \sum_i\frac{(y_i-\bar{y})^2}{\sigma^2} = 0 $$
$$n=\frac{1}{\sigma^2} \sum_i {(y_i-\bar{y})^2} $$
$$ \sigma^2 = \frac{1}{n}\sum_i(y_i-\bar{y})^2$$

* In total we get the maximum likelihood estimates:

$$\hat{\mu} = \frac{1}{n} \sum_i y_i = \bar{y}$$
$$ \hat{\sigma}^2 = \frac{1}{n}\sum_i(y_i-\bar{y})^2$$

----

## Numerical solution - normal distribution

* The maximum likelihood estimates can also be found numerically. We consider again the `trees` data.

```{r}
trees <- read.delim("https://asta.math.aau.dk/datasets?file=trees.txt")
head(trees)
```

* We will assume that the variable `Height` is normally distributed.

```{r}
qqnorm(trees$Height)
qqline(trees$Height)
```

----

## Numerical solution - normal distribution

* We define the log-likelihood as a function of the parameter vector $\theta=(\mu,\sigma)^T$. 
  * `dnorm(y, mean = mu, sd = sigma)` gives the normal density $f(y)$ with mean $\mu$ and standard deviation $\sigma$ evaluated at $y$.
```{r, warning=TRUE}
loglik_normal <- function(theta) {
  mu <- theta[1]
  sigma <- theta[2]
  y <- trees$Height
  fy <- dnorm(y, mean = mu, sd = sigma)  # density f(y_i) for each observation
  sum(log(fy))                           # the log-likelihood
}
loglik_normal(c(1, 5))
```

* We maximize again using `optim()`:

```{r, warning=TRUE}
optim(c(1, 5), loglik_normal,control=list(fnscale=-1))
```

* We can compare this to the theoretical formulas for the maximum likelihood estimates:
$$\hat{\mu}=\bar{y},$$
$$\hat{\sigma}= \sqrt{\frac{1}{n}\sum_i (y_i-\bar{y})^2} = \sqrt{\frac{n-1}{n}} s. $$

```{r}
mean(trees$Height)
sd(trees$Height)
n <- length(trees$Height)
sd(trees$Height)*sqrt((n-1)/n)
```


# Properties of maximum likelihood estimators

* Suppose $\theta \in \mathbb{R}$ is a parameter that we estimate by $\hat{\theta}$ using maximum likelihood estimation. Then (under suitable conditions) one may show the following mathematically.
  
* **Consistency:** For all $\varepsilon >0$, $$\lim_{n\to \infty} P(|\theta - \hat \theta|>\varepsilon ) = 0$$

* **Central limit theorem:** As $n\to \infty$, $\sqrt{n} (\hat \theta - \theta)$ converges in distribution to $N(0, \sigma_\theta^2)$.
That is, for large $n$,
$$ \sqrt{n} (\hat \theta - \theta) \approx N(0, \sigma_\theta^2),$$
or equivalently,
$$ \hat \theta  \approx N\Big(\theta, \frac{\sigma_\theta^2}{n}\Big).$$
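
* A small simulation sketch (not part of the slides) illustrating this for the probability example, with an assumed true $p=0.3$ and $n=200$: the distribution of $\hat{p}=X/n$ over many simulated samples is approximately normal.

```{r}
# Sketch: simulate 5000 samples of n = 200 Bernoulli trials with true p = 0.3
# and look at the distribution of the estimate p-hat = X/n
set.seed(1)
phat <- replicate(5000, mean(rbinom(200, size = 1, prob = 0.3)))
hist(phat, breaks = 30, main = "", xlab = expression(hat(p)))
c(mean(phat), sd(phat), sqrt(0.3 * 0.7 / 200))  # empirical mean, sd, and theoretical sd
```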





