---
title: "Module 3: Binomial model"
author:
- Ege Rubak and Jesper Møller
- Based on material by Søren Højsgaard
date: ""
output:
  slidy_presentation: default
  pdf_document: default
fontsize: 14pt
---

## Rolling a six

Consider a simple experiment: Roll a die $n$ times; record the number of times six comes up, and denote it $y$. Suppose e.g. $n=10$ and $y=3$.

We let $\theta$ denote the true (but to us unknown) probability of rolling a six (success), so $1-\theta$ is the probability of a roll less than six (failure):
$$
Pr(S)=\theta, \quad Pr(F)=1-\theta.
$$

## The binomial model

The binomial distribution could be a model for these data: $y=3$ is a realization of a binomial random variable $Y\sim bin(n, \theta)$. The density for $y$ is
$$
\text{Pr}(Y=y;\theta) = \frac{n!}{y!(n-y)!} \theta^y (1-\theta)^{n-y}.
$$

```{r out.width="30%", fig.height=5, fig.width=6}
yval <- 0:10
barplot(dbinom(yval, size=10, prob=1/6), names.arg=yval)
```

---

Thus, if we know $\theta$, we can make all sorts of interesting computations based on the binomial model, e.g. the mean and variance, which are given by
$$
\text{E}(Y)=n\theta, \quad \text{Var}(Y) = n\theta(1-\theta).
$$
Or we can calculate the probability of seeing $0$ sixes, or the probability of seeing $5$ or more sixes, when rolling a die $10$ times. For example, if the die is fair and $\theta=1/6$ we get

```{r}
dbinom(0, size=10, prob=1/6)     # Pr(0 sixes)
1 - pbinom(4, size=10, prob=1/6) # Pr(5 or more sixes)
```

## A moment estimate

In practice $\theta$ is unknown and must be estimated from data. Intuition says that $\theta$ should be estimated as
$$
\hat\theta = y / n = 3 / 10 = 0.3.
$$
It is useful to write $\hat\theta(y)=y/n$ to emphasize the dependence on data. For the corresponding random variable $\hat\theta(Y)=Y/n$,
$$
\text{E}(Y/n)=\theta, \quad \text{Var}(Y/n)= \frac{1}{n^2} n \theta (1-\theta) = \theta(1-\theta)/n.
$$
Hence $\hat\theta(Y)$ has the correct mean value (unbiased), and the variance of $\hat\theta(Y)$ goes to $0$ as $n\rightarrow \infty$ (consistent).

To estimate this variance, we plug in the estimate and find
$$
\text{Var}(Y/n) \approx \frac{y}{n}\Big(1-\frac{y}{n}\Big)/n = 0.3 \times 0.7 / 10 = 0.021.
$$
I.e. the estimated standard deviation of the estimate (called the standard error) is approximately $\sqrt{0.021}=0.14$.

## The likelihood

When $y=3$ is observed, the binomial density becomes a function of $\theta$, and it is called the likelihood function:
$$
L(\theta)= Pr_\theta(Y=y) = \frac{n!}{y!(n-y)!} \theta^y (1-\theta)^{n-y} \propto \theta^y (1-\theta)^{n-y}.
$$
Hence (dropping the constant term from the binomial coefficient), the log-likelihood is
$$
l(\theta) = \log L(\theta) = y \log \theta + (n-y) \log(1-\theta).
$$

---

```{r out.width="90%", fig.height=5, fig.width=12, echo=-3}
lik <- function(parm, y, n){parm^y * (1 - parm)^(n - y)}
loglik <- function(parm, y, n){y * log(parm) + (n - y) * log(1 - parm)}
par(mfrow=c(1, 2), mar = c(3,3,3,1))
n <- 10; y <- 3
curve(lik(x, y, n), main = "Likelihood")
abline(v = y/n, col = "red", lty = 2)
curve(loglik(x, y, n), main = "Log-likelihood", ylim = c(-20, -5))
abline(v = y/n, col = "red", lty = 2)
```

## The frequentist approach

From a frequentist perspective we want to find the "best" estimate of $\theta$ given data, where "best" here means the value of $\theta$ that maximizes $L()$. This estimate is called the maximum likelihood estimate (MLE).

In practice it is usually easier to maximize $l(\theta)=\log L(\theta)$ than $L(\theta)$, because the $\log$ turns a product into a sum, and sums are easier to differentiate than products.
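As a quick numerical check (a sketch added here, not part of the original slides), we can also maximize the log-likelihood with R's `optimize()`, reusing `loglik()`, `y` and `n` from the chunk above:

```{r}
# Numerically maximize the log-likelihood over (0, 1); the maximiser
# should agree with the analytical MLE y/n = 0.3.
opt <- optimize(loglik, interval = c(0, 1), y = y, n = n, maximum = TRUE)
opt$maximum
```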
The usual approach to maximizing $l(\theta)$ is to first differentiate $l(\theta)$ to obtain
$$
S(\theta) = l'(\theta),
$$
where $S(\theta)$ is called the **score function**. Next we solve the **score equation** $S(\theta)=0$ to obtain $\hat\theta$.

It is not hard to spot that the maximum of $L()$ (and of $l()$) is at $\theta=y/n=0.3$, but it is informative to look at the plots for different choices of $n$ and $y$. This is left as an exercise.

## Summary of the frequentist approach

* When we have a statistical model (a probability distribution) and data, we have the likelihood (and the log-likelihood) function.
* The maximum likelihood estimate (MLE) $\hat\theta$ is the value of $\theta$ that maximizes the likelihood function.
* The variance of the corresponding estimator is approximately minus the inverse of the second derivative of the log-likelihood evaluated at the MLE.

## The Bayesian approach

If we take a Bayesian perspective, things change as explained in the main slides for this module:

> The parameter $\theta$ is a random quantity on equal footing with $Y$, and we have to specify the prior
> $$
> \pi(\theta).
> $$

This is our belief about $\theta$ before seeing any data. Given data $y$, we then get the posterior from the likelihood via Bayes' rule
$$
\pi(\theta|y) = \frac{\pi(y|\theta)\pi(\theta)}{\pi(y)},
$$
where $\pi(y|\theta)$ is (proportional to) the likelihood we have specified previously and $\pi(y)$ is the marginal probability of the data $y$.

When the data $y$ is observed, $\pi(y)$ above is just a number, which ensures that $\pi(\theta|y)$ is a density, i.e. that $\int \pi(\theta|y) d\theta = 1$. We often do not calculate $\pi(y)$ directly: we just use that $\pi(\theta|y) \propto \pi(y|\theta) \pi(\theta)$.

## A discrete prior

Example: Assume (to ease computations) that the only valid choices for $\theta$ are now $.1, .3, .5, .7$ and $.9$. Before rolling the die we think the die has been rigged somehow, and we take the prior to be

```{r}
theta <- c(.1, .3, .5, .7, .9)
prior <- c(0.10, 0.15, 0.25, 0.30, 0.20)
```

Then we can calculate the likelihood and the posterior

```{r}
n <- 10; y <- 3
likval <- lik(theta, y, n)
posterior <- likval * prior
posterior <- posterior / sum(posterior)
round(100*posterior, 3)
```

---

The plot shows it all:

```{r fig.height=2.5, fig.width=7.5, out.width="80%"}
par(mfrow=c(1,3), mar = c(3, 3, 3, 0.5))
barplot(prior, main="prior", names.arg=theta, ylim = c(0, .55))
barplot(likval, main="likelihood", names.arg=theta)
barplot(posterior, main="posterior", names.arg=theta, ylim = c(0, .55))
```

---

We can compute e.g. the prior and posterior means:

```{r}
sum(theta * prior)
sum(theta * posterior)
```

And the corresponding variances (uncertainties before and after seeing data):

```{r}
sum(theta^2 * prior) - sum(theta * prior)^2
sum(theta^2 * posterior) - sum(theta * posterior)^2
```
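As one more illustration (an added sketch, not part of the original slides), the discrete prior and posterior also let us compute the probability that the die favours sixes, say $\theta \ge 0.5$, before and after seeing the data:

```{r}
# Pr(theta >= 0.5) under the prior and under the posterior
sum(prior[theta >= 0.5])
sum(posterior[theta >= 0.5])
```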