---
title: "Estimation"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_sections: yes
    toc: yes
---

```{r  include = FALSE}
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "VennDiagram"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# Point and interval estimates

```{r echo=FALSE, results="hide", message=FALSE}
library(mosaic)
library(VennDiagram)
```
----

## Point and interval estimates

* Suppose we study a population and we are interested in certain parameters of the population  distribution, e.g.\ the mean $\mu$ and the standard deviation $\sigma$.

* Based on a sample we can make a **point estimate** of the parameter.  We have already seen the following examples: 
  * $\bar{x}$ is a point estimate of $\mu$ 
  * $s$ is a point estimate  of $\sigma$
      
* We often supplement the point estimate
  with an **interval estimate** (also called a **confidence interval**). 
  This is an interval around the
  point estimate, in which we are confident (to a certain degree)
  that the population parameter is located.
 
----

## Point estimators: Bias

* If we want to estimate the population mean $\mu$ we have several
  possibilities e.g.
  * the sample mean $\bar{X}$
  * the average $X_T$ of the sample upper and lower quartiles

* Advantage of $X_T$: Very large/small observations have little
  effect, i.e. it has practically no effect if there are a few errors
  in the data set.

* Disadvantage of $X_T$: If the distribution of the population is
  skewed, i.e. asymmetrical, then $X_T$ is **biased**, i.e.\ $E(X_T)\neq \mu$. This means that in the long run
  this estimator systematically over- or underestimates the value of
  $\mu$.

* Generally we prefer that an estimator is **unbiased**, i.e. its expected value equals the true parameter value.

* Recall that for a sample from a population with mean $\mu$, the sample mean   $\bar{X}$ also has mean $\mu$. That is, $\bar{X}$ is an unbiased estimate of the population mean $\mu$.  
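The bias of $X_T$ under skewness can be illustrated with a small simulation (a sketch, not part of the original slides; it assumes a right-skewed exponential population with mean $\mu=1$):

```{r}
# Simulation sketch: compare the bias of the sample mean and of X_T
# (the average of the sample quartiles) for a right-skewed population.
set.seed(1)
mu <- 1                                  # mean of an exponential with rate 1
xbar_est <- replicate(10000, mean(rexp(50, rate = 1)))
xt_est   <- replicate(10000, mean(quantile(rexp(50, rate = 1), c(0.25, 0.75))))
mean(xbar_est) - mu                      # approximately 0: unbiased
mean(xt_est) - mu                        # clearly negative: biased
```

Here the average of the population quartiles is $(\ln(4/3)+\ln 4)/2 \approx 0.84 < 1$, so $X_T$ systematically underestimates $\mu$.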

----

## Point estimators: Consistency
* From previous lectures we know that the standard error of $\bar{X}$ is 
$\frac{\sigma}{\sqrt{n}}$,  so
  * the standard error decreases when the sample size increases.
* In general an estimator with this property is called **consistent**.
* $X_T$ is also a consistent estimator, but its variance is greater than that of $\bar{X}$.

----
  
## Point estimators: Efficiency
* Since the variance of $X_T$ is greater than the variance of $\bar{X}$, we prefer $\bar{X}$.
* In general, we prefer the estimator with the smallest possible variance. This estimator is said to be **efficient**.
* $\bar{X}$ is an efficient estimator.

----

## Notation
* The symbol $\hat{\ }$ above a parameter denotes a (point)
  estimate of the parameter. We have looked at an
    * estimate of the population mean $\mu$, namely $\hat{\mu}=\bar{x}$.
    * estimate of the population standard deviation $\sigma$, namely $\hat{\sigma}=s$
    * estimate of the population proportion $p$, namely the sample proportion $\hat{p}$.

# Confidence intervals

----

## Confidence Interval

* A **confidence interval** for a parameter is constructed as
an interval in which we expect the population parameter to lie.
* The probability that this construction yields an interval which
      includes the population parameter is called the **confidence level**. 
  * We write the confidence level as $100(1-\alpha)\%$.
  * The confidence level is typically chosen to be $95\%$. 
  * $\alpha$ is called the **error probability**. 
  * For a $95\%$ confidence level $\alpha=1-0.95=0.05$.
* In practice the interval is often constructed as a symmetric interval
  around a point estimate:
    * **point estimate$\pm$margin of error**
    * Rule of thumb: With a margin of error of 2 times
      the standard error you get a confidence interval, where the
      confidence level is approximately $95\%$.
    * I.e: **point estimate** $\pm$ **2 x standard error** has
      confidence level of approximately $95\%$.
* Interpretation: We say that we are $100(1-\alpha)\%$ confident that the population parameter lies in the confidence interval.
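The rule of thumb can be sketched directly in R (using the built-in `mtcars` data, which also appears later in these slides):

```{r}
# Rule-of-thumb sketch: point estimate +/- 2 standard errors gives an
# approximate 95% confidence interval for the mean.
x  <- mtcars$mpg
n  <- length(x)
se <- sd(x) / sqrt(n)                  # estimated standard error of the mean
mean(x) + c(-2, 2) * se                # approximate 95% confidence interval
```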

----

## Confidence interval for the mean (known standard deviation)

* Consider a population with population mean $\mu$ and standard deviation $\sigma$. We would like to make a $100(1-\alpha)\%$ confidence interval for $\mu$. 

* Suppose we draw a random sample $X_1,\ldots,X_n$. As a point estimate for $\mu$ we use $\bar{X}$.

* If the population follows a normal distribution or if $n\geq 30$, we may assume $\bar{X}\sim \texttt{norm}(\mu,\tfrac{\sigma}{\sqrt{n}})$. 

* The $z$-score of $\bar{X}$ follows a standard normal distribution:

$$Z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim \texttt{norm(0,1)}.$$

* We determine the **critical $z$-value** $z_{\alpha/2}$ such that $P(Z>z_{\alpha/2}) = \alpha/2$. This implies by symmetry that
$$ P(-z_{\alpha/2}\leq Z \leq z_{\alpha/2})=1-\alpha.$$

```{r ,echo=FALSE,fig.width=3,fig.height=2}
x <- (-70:70)/20
par(mar=c(3,0,0,0))
plot(x, dnorm(x), axes=F, xlab="", ylab="", type="l", ylim=c(-.01,.4), main="")
abline(h=0)
lines(c(-2,-2),c(0,dnorm(2)))
lines(c(2,2 ),c(0,dnorm(2)))
text(0, 0.15, expression(1-alpha))
axis(1,at=2,labels=expression(z[alpha/2]),pos=0,cex.axis=1.5)
axis(1,at=-2,labels=expression(-z[alpha/2]),pos=0,cex.axis=1.5)
axis(1,at=0,labels=0,pos=0,cex.axis=1.5)

```


* Inserting what $Z$ is, we get
$$ P\left(-z_{\alpha/2}\leq \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \leq z_{\alpha/2}\right)=1-\alpha.$$
* Isolating $\mu$ in both inequalities, we get
$$ P\left(\bar{X}-z_{\alpha/2}\tfrac{\sigma}{\sqrt{n}}\leq \mu  \leq \bar{X} + z_{\alpha/2} \tfrac{\sigma}{\sqrt{n}}\right)=1-\alpha.$$
* That is, for $100(1-\alpha)\%$ of all samples, the population mean $\mu$ lies in the interval $$\left[\bar{X}-z_{\alpha/2}\tfrac{\sigma}{\sqrt{n}}; \bar{X} + z_{\alpha/2} \tfrac{\sigma}{\sqrt{n}}\right].$$
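A sketch of this computation in R (the values $\sigma = 2$, $n = 25$, $\bar{x} = 10$ are illustrative only; base R's `qnorm` gives the same quantile as `qdist("norm", ...)`):

```{r}
# Sketch: 95% confidence interval for mu when sigma is known.
sigma <- 2; n <- 25; xbar <- 10        # illustrative values only
z <- qnorm(1 - 0.05/2)                 # critical z-value, approx. 1.96
xbar + c(-1, 1) * z * sigma / sqrt(n)
```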

----



## Unknown standard deviation


* The confidence interval was derived using the $z$-score $Z= \tfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}$.

* Problem: In practice $\sigma$ is typically unknown.

* If we replace $\sigma$ by the sample standard deviation $S$, we get the **$t$-score** $$T= \tfrac{\bar{X}-\mu}{S/\sqrt{n}}.$$

* Since $S$ is random with a certain variance, this causes $T$ to vary more than $Z$.

* As a consequence, $T$ no longer follows a normal distribution, but a **$t$-distribution** with $n-1$ degrees of freedom.
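The heavier tails can be seen by comparing critical values (a quick check, not in the original slides):

```{r}
# Critical values of the t-distribution exceed the normal ones and
# approach them as the degrees of freedom grow.
qnorm(0.975)                           # approx. 1.96
qt(0.975, df = c(4, 9, 29, 99))        # decreasing towards 1.96
```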

----



## Confidence interval (unknown standard deviation)
 

* In the situation where $\sigma$ is unknown, we use that 
$$ T=\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim \texttt{t}(n-1).$$

* We determine the **critical $t$-value** $t_{\alpha/2}$ such that $P(T>t_{\alpha/2}) = \alpha/2$. This implies by symmetry that
$$ P(-t_{\alpha/2}\leq T \leq t_{\alpha/2})=1-\alpha.$$

```{r ,echo=FALSE,fig.width=3,fig.height=2}
x <- (-70:70)/20
par(mar=c(3,0,0,0))
plot(x, dt(x,df=5), axes=F, xlab="", ylab="", type="l", ylim=c(-.01,.4), main="")
abline(h=0)
lines(c(-2,-2),c(0,dt(2,df=5)))
lines(c(2,2 ),c(0,dt(2,df=5)))
text(0, 0.15, expression(1-alpha))
axis(1,at=2,labels=expression(t[alpha/2]),pos=0,cex.axis=1.5)
axis(1,at=-2,labels=expression(-t[alpha/2]),pos=0,cex.axis=1.5)
axis(1,at=0,labels=0,pos=0,cex.axis=1.5)

```

* By exactly the same computations as before, we find that for $100(1-\alpha)\%$ of all samples, $\mu$ lies in 
$$\left[\bar{X}-t_{\alpha/2}\tfrac{S}{\sqrt{n}}; \bar{X} + t_{\alpha/2} \tfrac{S}{\sqrt{n}}\right].$$

* This interval is what we call the **$100(1-\alpha)\%$ confidence interval** for $\mu$.

----

## Calculation of critical $t$-value in **R**

* To apply the formula, we need to be able to compute the critical $t$-value $t_{\alpha/2}$, i.e. the value satisfying $P(T>t_{\alpha/2})=\alpha/2$.

* This can be done in R via the function `qdist`.

* Note that we need the point with **right tail** probability $\alpha/2$, while R works with **left tail** probabilities. The right tail probability $\alpha/2$ corresponds to the left tail probability $1-\alpha/2$.

* So to find $t_{\alpha/2}$ with $\alpha=0.05$ (corresponding to a $95\%$ confidence level)  in a $t$-distribution with $4$ degrees of freedom, we type:

```{r fig.height=5, fig.align="center"} 
qdist("t", p = 1 - 0.025, df = 4)
```



----

### Example: Confidence interval for mean
* We return to the dataset `mtcars`. We want to construct
a $95\%$ confidence interval for the population mean $\mu$ of the fuel consumption.
```{r}
stats <- favstats( ~ mpg, data = mtcars)
stats
qdist("t", 1 - 0.025, df = 32 - 1, plot = FALSE)
```
* I.e. we have
    * $\bar{x} = `r round(stats$mean, 1)`$
    * $s = `r round(stats$sd, 1)`$
    * $n = `r stats$n`$
    * $df = n-1 = `r stats$n-1`$
    * $t_{crit} = `r round(qt( 1 - 0.025, df = 32 - 1), 3)`$.
* The confidence interval is $\bar{x}\pm t_{crit}\frac{s}{\sqrt{n}} =
        [`r round( confint( t.test( ~ mpg, data = mtcars))$lower, 1)` ,
        `r round( confint( t.test( ~ mpg, data = mtcars))$upper, 1)`]$
* All these calculations can be done automatically by R:
```{r}
t.test( ~ mpg, data = mtcars, conf.level = 0.95)
```

----

### Example: Plotting several confidence intervals in **R**

* We shall look at a built-in **R** dataset `chickwts`.
* `?chickwts` yields a page with the following information

  > An experiment was conducted to measure and compare the effectiveness
  > of various feed supplements on the growth rate of chickens.
  > Newly hatched chicks were randomly allocated into six groups, and each
  > group was given a different feed supplement. Their weights in grams
  > after six weeks are given along with feed types.
  
* `chickwts` is a data frame with 71 observations on 2 variables:
    * `weight`:\ a numeric variable giving the chick weight.
    * `feed`:\ a factor giving the feed type.
* Calculate a confidence interval for the mean weight for each feed separately;
the confidence interval is from `lower` to `upper` given by `mean`$\pm$`tscore * se`:
```{r}
cwei <- favstats( weight ~ feed, data = chickwts)
se <- cwei$sd / sqrt(cwei$n) # Standard errors
tscore <- qt(p = .975, df = cwei$n - 1) # t-scores for 2.5% right tail probability (qdist cannot take multiple df's)
cwei$lower <- cwei$mean - tscore * se
cwei$upper <- cwei$mean + tscore * se
cwei[, c("feed", "mean", "lower", "upper")]
```
* We can plot the confidence intervals as horizontal line segments using `gf_errorbarh`:
```{r message=FALSE}
gf_errorbarh(feed ~ lower + upper, data = cwei) %>% 
  gf_point(feed ~ mean)
```

----

## Confidence interval for proportion

* Consider a population with a distribution that can only take the values $0$ and $1$. The interesting parameter of this distribution is the proportion $p$ of the population that has the value 1.  
* Given a random sample $X_1,\ldots,X_n$, recall that we estimate $p$ by
$$
\hat{P} = \frac{\sum_{i=1}^n X_i}{n} = \bar{X}.
$$
* Based on the central limit theorem we have:
  $$\hat{P}\approx \texttt{norm} \left (p,\frac{\sigma}{\sqrt{n}} \right)$$
  *if both $n\hat{p}$ and $n(1-\hat{p})$ are at least $15$.*
* It can be shown that a random variable that takes the value $1$ with probability $p$ and $0$ with probability $(1-p)$ has standard deviation
  $$
  \sigma=\sqrt{p(1-p)}.
  $$
  That is, the standard deviation is not a "free" parameter for a $0/1$ variable
  as it is determined by the probability $p$.
* With a sample size of $n$, the standard error of $\hat{P}$ will be:
  $$
    \frac{\sigma}{\sqrt{n}} =\sqrt{\frac{p(1-p)}{n}}.
  $$
* We do not know $p$ but we insert the estimate $\hat{P}$ and get the 
  **estimated standard error** of $\hat{P}$:
  $$
  se=\sqrt{\frac{\hat{P}(1-\hat{P})}{n}}.
  $$
* By similar calculations as in the case of confidence intervals for the mean, we find the limits of the confidence interval to be
$$
  \hat{P} \pm z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}}.
  $$
* Here $z_{\alpha/2}$ (or $z_{crit}$) is the critical value for which the upper tail probability in the standard normal distribution is $\alpha/2$. (E.g. we have $z=1.96$ when $\alpha=5\%$.)
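A sketch of the computation in R (the counts $x = 137$ successes out of $n = 250$ are made up for illustration):

```{r}
# Sketch: approximate 95% confidence interval for a proportion.
x <- 137; n <- 250                     # illustrative counts
phat <- x / n
se <- sqrt(phat * (1 - phat) / n)      # estimated standard error
phat + c(-1, 1) * qnorm(0.975) * se
```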



----

### Example: Point and interval estimate for proportion

* We consider again the data set concerning votes in Chile.  
```{r, echo=FALSE}
Chile <- read.delim("https://asta.math.aau.dk/datasets?file=Chile.txt")

```
* We are interested in the unknown proportion $p$ of females in the population of Chile. 

* The gender distribution in the sample is:
```{r message=FALSE}
library(mosaic)
tally( ~ sex, data = Chile)
tally( ~ sex, data = Chile, format = "prop")
```
* Estimate of $p$ (sample proportion): $\quad \hat{p} = \frac{1379}{1379+1321} = 0.5107$
* An approximate 95% confidence interval for $p$ is: 
$$\quad \hat{p} \pm z_{crit} \times se = 0.5107 \pm 1.96  \sqrt{\frac{0.5107(1-0.5107)}{1379 + 1321}} = (0.49, 0.53)$$  
* Interpretation: We are $95\%$ confident that the proportion of females in Chile is between $49\%$ and $53\%$.

----

### Example: Confidence intervals for proportion in  R 
* R automatically calculates the confidence interval for the
  proportion of females when we do a so-called hypothesis test (we
  will get back to that later):
```{r}
prop.test( ~ sex, data = Chile, correct = FALSE)
```
* The argument `correct = FALSE` is needed to make R use the
"normal" formulas as on the slides and in the book. When
`correct = TRUE` (the default) a mathematical correction which
you have not learned about is applied and slightly different results
are obtained.

----


### Example: Chile data

We could also have computed a 99% confidence interval for the proportion of females in Chile:

* For a  $99\%$-confidence level we have $\alpha=1\%$ and
    1) $z_{crit}$ = `qdist("norm", 1 - 0.01/2)` = 2.576.
    2) We still have $\hat{p}=0.5107$ and $n=2700$, so $se = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.0096$.
    3) Thereby, a 99%-confidence interval is: $\hat{p}\pm z_{crit}\times se=[0.49, 0.54]$.
* Note that the confidence interval becomes wider, when we want to be more confident that the interval contains the population parameter.
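The 99% interval can be checked by hand in R (using $\hat{p} = 1379/2700$ and $n=2700$ from the example):

```{r}
# Check of the 99% confidence interval for the proportion of females.
phat <- 1379 / 2700
n <- 2700
z <- qnorm(1 - 0.01/2)                 # critical z-value, approx. 2.576
ci <- phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)
round(ci, 2)
```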

# Confidence interval interpretation

```{r, include=FALSE}
set.seed(20230226)
n <- 20
true_p <- 0.5
samples_to_draw <- 100
conf_level <- 0.95
samples <- replicate(samples_to_draw, {
  x <- rbinom(1, size = n, prob = true_p)
  test_res <- prop.test(x, n, correct = FALSE, conf.level = conf_level)
  c(test_res$conf.int, test_res$estimate, x)
  })
```

* Assume a population parameter $p=0.5$ (e.g. a fair coin)
* Draw `r samples_to_draw` samples of size $n = `r n`$, and for each compute the estimate $\hat{p}$ and a `r 100*conf_level`% confidence interval

```{r, echo=FALSE, fig.width=12, fig.height=6}
cis_contains <- 0
plot(1:samples_to_draw, samples[3, ], ylim = c(0, 1), xlab = NA, ylab = NA)
for (i in 1:samples_to_draw) {
  in_ci <- samples[1, i] <= true_p & samples[2, i] >= true_p
  cis_contains <- cis_contains + in_ci
  clr <- ifelse(in_ci, "darkgreen", "darkred")
  lwd <- ifelse(in_ci, 1, 2)
  lines(c(i, i), c(samples[1, i], samples[2, i]), col = clr, lwd = lwd)
  #text(i, samples[(i %% 2) + 1, i], labels = samples[4, i], pos = 2*(i %% 2) + 1, cex = 0.6)
  text(i, samples[2, i], labels = samples[4, i], pos = 3, cex = 0.6)
}
abline(h = true_p, lty = 2)
title(main = paste0(cis_contains, " out of ", samples_to_draw, " CIs contain the true parameter"))
```

* Long run interpretation (repeating the experiment many, many times) 
  + Each sample gives new CI limits
  + Before collecting data, the procedure for calculating a CI will 
    provide a CI that with `r 100*conf_level`% probability will contain the true parameter value
  + Once data collected: Contains the true value or not

* CI contains values of the parameter that are compatible with the observed data
  + Either the CI contains the true parameter, or something "extraordinary" has happened (not impossible, just improbable/surprising)

* We are 95% confident that the proportion of females in Chile is between 49% and 53%

# Confidence interval for the variance



## The sample variance

* Suppose we are interested in the variance $\sigma^2$ of a population. 

* We draw a random sample $X_1,\ldots,X_n$ and use the sample variance $S^2$ as a point estimate of $\sigma^2$.

* When the population distribution is normal, or $n\geq 30$,
  $$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1),$$ where $\chi^2(n-1)$ is the **chi-square distribution** with $(n-1)$ degrees of freedom (see next slide).

  * As a rule of thumb, the degrees of freedom are found as the number of observations ($n$) minus the number of unknown parameters describing the mean (one, namely $\mu$). 
  
----

## The $\chi^2$-distribution

* The distribution $\chi^2(k)$ is called the **chi-square** distribution.

  * It is a continuous distribution on $(0,\infty)$.

  * It depends on the parameter $k$, called the **degrees of freedom**.

  * The degrees of freedom determine the shape of the distribution.

  * The mean value is $k$.
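A quick simulation check of the mean (a sketch, not in the original slides):

```{r}
# The average of many chi-square draws is close to the degrees of freedom.
set.seed(1)
m <- mean(rchisq(100000, df = 5))
m                                      # close to 5
```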

```{r ,echo=FALSE, fig.width=12, fig.height=4}
x <- (1:100)/20
par(mfrow=c(1,3))
par(xpd=TRUE)
plot(x,dchisq(x, 1),axes=F,xlab="",ylab="",type="l", main="1 degree of freedom", cex.main = 1.5,ylim=c(0,0.5),xlim=c(0,5))
axis(1,at=c(0,1,2,3,4,5),c(0,1,2,3,4,5),pos=0)
axis(2,at=c(0,1,2,3),c(0,1,2,3),pos=0)

plot(x,dchisq(x, 3),axes=F,xlab="",ylab="",type="l", main="3 degrees of freedom", cex.main = 1.5,ylim=c(0,0.5),xlim=c(0,5))
axis(1,at=c(0,1,2,3,4,5),c(0,1,2,3,4,5),pos=0)
axis(2,at=c(0,1,2,3,4,5),c(0,1,2,3,4,5),pos=0)

plot(x,dchisq(x, 5),axes=F,xlab="",ylab="",type="l", main="5 degrees of freedom", cex.main = 1.5,ylim=c(0,0.5),xlim=c(0,5))
axis(1,at=c(0,1,2,3,4,5),c(0,1,2,3,4,5),pos=0)
axis(2,at=c(0,1,2,3,4,5),c(0,1,2,3,4,5),pos=0)

```

----

## Confidence interval for variance 

* To make a confidence interval for $\sigma^2$, we draw a random sample $X_1,\ldots,X_n$, compute $S^2$ and recall that    
 $$\frac{(n-1)S^2}{\sigma^2}\sim \chi^2(n-1).$$ 


* Let $\chi^2_{\alpha/2}$ and $\chi_{1-\alpha/2}^2$ be the critical values in a $\chi^2$-distribution with $(n-1)$ degrees of freedom such that the right tail probabilities are $\alpha/2$ and $1-\alpha/2$, respectively.

```{r ,echo=FALSE,fig.width=3,fig.height=2}
x <- (0:140)/20
par(mar=c(3,0,0,0))
plot(x, dchisq(x,df=4), axes=F, xlab="", ylab="", type="l", ylim=c(0,.2),xlim=c(0,7), main="")
abline(h=0)
lines(c(0.8,0.8),c(0,dchisq(0.8,df=4)))
lines(c(6,6 ),c(0,dchisq(6,df=4)))
axis(1,at=6,labels=expression(chi[alpha/2]^2),pos=0,cex.axis=1.5)
axis(1,at=0.8,labels=expression(chi[1-alpha/2]^2),pos=0,cex.axis=1.5)
text(3,0.1,expression(1-alpha))

```

 * Then $$P\left(\chi^2_{1-\alpha/2} \leq \frac{(n-1)S^2}{\sigma^2} \leq\chi_{\alpha/2}^2\right)=1-\alpha.$$ 

 * Isolating $\sigma^2$, this is equivalent to  
 $$P\left(\frac{(n-1)S^2}{\chi^2_{\alpha/2}} \leq \sigma^2 \leq \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right)=1-\alpha.$$ 

 * So we get the confidence interval for $\sigma^2$: 
 $$\left[\frac{(n-1)S^2}{\chi^2_{\alpha/2}} ; \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right].$$ 

 * A confidence interval for $\sigma$ can be found by taking square roots: 
 $$\left[\sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha/2}} };\sqrt{ \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}}\right].$$ 

* Note that these confidence intervals are not symmetric around the point estimate.

----

### Example: confidence interval for a variance 

* We return to the dataset `mtcars` and construct a $95\%$ confidence interval for the population variance $\sigma^2$ of the fuel consumption. We find the sample variance to be $6.026948^2\approx 36.3$ using `favstats`.
```{r}
stats <- favstats( ~ mpg, data = mtcars)
stats
```
* The critical $\chi^2$-values are found using `qdist`. The degrees of freedom are $n-1$. The $\chi^2$-distribution is not symmetric, so we need to find both $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$.
```{r}
qdist("chisq", 1 - 0.025, df = 32 - 1 )
qdist("chisq",  0.025, df = 32 -1)
```
* I.e. we have
    * $\chi^2_{\alpha/2}=48.23189$
    * $\chi^2_{1-\alpha/2}=17.53874$
* So we get the confidence interval for $\sigma^2$: 
 $$\left[\frac{(n-1)s^2}{\chi^2_{\alpha/2}} ; \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right] = \left[\frac{31\cdot 6.026948^2}{48.23189} ; \frac{31\cdot  6.026948^2}{17.53874}\right] =[23.3, 64.2].$$ 
* A confidence interval for $\sigma$ is $[\sqrt{23.3}, \sqrt{64.2}] = [4.83,8.01]$.
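The computation can be wrapped in a small helper function (a sketch; `var_ci` is a hypothetical name, not a built-in):

```{r}
# Sketch: confidence interval for a population variance.
var_ci <- function(x, conf.level = 0.95) {
  n <- length(x)
  a <- 1 - conf.level
  (n - 1) * var(x) / qchisq(c(1 - a/2, a/2), df = n - 1)
}
var_ci(mtcars$mpg)
```

The square roots of these limits give the interval for $\sigma$.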

<!-- # Two populations -->

<!-- ---- -->

<!-- ## Two populations  -->

<!-- * We now imagine a scenario where we have two populations with mean values $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$, respectively. -->

<!-- ```{r, echo = FALSE, fig.height = 3.5, fig.width = 4.5} -->
<!-- venn.plot <- draw.pairwise.venn(area1 = 70, area2 = 10, cross.area = 10,  -->
<!--                                 category = c( "Population 1", "Sample 1"), cex = 0, cat.cex = 1) -->
<!-- grid.draw(venn.plot) -->
<!-- grid.newpage() -->
<!-- venn.plot <- draw.pairwise.venn(area1 = 70, area2 = 10, cross.area = 10,  -->
<!--                                 category = c( "Population 2", "Sample 2"), cex = 0, cat.cex = 1) -->
<!-- grid.draw(venn.plot) -->
<!-- grid.newpage() -->
<!-- ```  -->

<!-- * We are interested in comparing the means of the two populations, which we do by considering $\mu_1-\mu_2$. -->

<!-- * We draw a sample from each, with sample sizes $n_1$ and $n_2$. -->

<!-- * From the samples we compute the sample means $\bar{X}_1$ and $\bar{X_1}$ and sample standard deviations $S_1^2$ and $S_2^2$. -->

<!-- ---- -->

<!-- ## The difference between means (known variance) -->

<!-- * We estimate the difference $\mu_1-\mu_2$ by $\bar{X}_1 - \bar{X}_2.$ -->

<!-- * One can show that the standard error of $\bar{X}_1 - \bar{X}_2$ is $$\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}.$$ -->

<!-- * Moreover, when the two populations follow a normal distribution, or if $n_1$ and $n_2$ are both at least 30, -->
<!-- $$\frac{(\bar{X}_1 - \bar{X}_2)-(\mu_1-\mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \sim \texttt{norm}(0,1).$$ -->

<!-- * If we know $\sigma_1$ and $\sigma_2$, we can find the limits of the $100(1-\alpha)\%$ confidence interval to be -->
<!-- $$(\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}.$$ -->

<!-- ---- -->

<!-- ## The difference between means (unknown variance) -->

<!-- * In practice, $\sigma_1$ and $\sigma_2$ are not known. -->

<!-- * We distinguish between two cases:  -->

<!--   1. Equal variances: $\sigma_1^2=\sigma_2^2$. -->
<!--   2. Unequal variances: $\sigma_1^2\neq \sigma_2^2$. -->

<!-- * We will only cover the Case 1.\ of equal variances in this lecture. We get back to Case 2.\ in Module 2. -->

<!-- ---- -->

<!-- ## The difference between means (unknown equal variance) -->

<!-- * When $\sigma_1^2=\sigma_2^2$, we estimate the **pooled variance** by $$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1 + n_2 -2}.$$ -->

<!-- * We may substitute $S_p^2$ for $\sigma_1^2$ and $\sigma_2^2$ on the previous slide. Then we have that  -->
<!-- $$\frac{(\bar{X}_1 - \bar{X}_2)-(\mu_1-\mu_2)}{S_p^2\sqrt{1/n_1 + 1/n_2}} \sim \texttt{t}(n_1+n_2-2).$$ -->

<!-- * The limits of the $100(1-\alpha)\%$ confidence interval are thus -->
<!-- $$(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2}S_p\sqrt{1/n_1 + 1/n_2},$$ -->
<!-- where the degrees of freedom in the $t$-distribution is $df=n_1+n_2-2$. -->

<!-- ---- -->

<!-- ### Example: Confidence interval for difference between means -->

<!-- * We use `favstats` to find all the necessary values. -->

<!-- ```{r} -->
<!-- stats <- favstats( mpg ~ vs , data = mtcars) -->
<!-- stats -->
<!-- qdist("t", 1 - 0.025, df = 18+14-2, plot = FALSE) -->

<!-- ``` -->

<!-- * From the output we find the estimated difference to be $\bar{x}_1-\bar{x_2} = 16.61667 - 24.55714 = -7.94047.$ -->

<!-- * The pooled variance is  -->
<!-- $$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 -2} = \\ -->
<!-- \frac{17\cdot 3.860699^2 + 13\cdot 5.378978^2}{18 + 14 -2} = 20.98397.$$ -->

<!-- *  $t_{crit} = 2.039513$. -->

<!-- * We can now find the limits of the $95\%$ confidence interval by: -->
<!-- $$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\cdot s_p\cdot \sqrt{1/n_1 + 1/n_2} = 16.61667 - 24.55714 \pm 2.039513 \cdot \sqrt{20.98397}\cdot \sqrt{1/18 + 1/14}. $$ -->

<!-- * This gives the $95\%$ confidence interval $[-11.3;-4.61]$. -->

<!-- * Note that since the sample sizes are small, the confidence level may not be correct if the populations are not normally distributed. -->



# Determining sample size

----

## Determining sample size

* When planning an experiment, one has to decide how large the sample size should be.

  * If the sample size is too small, the parameter estimates will have high variance and hence confidence intervals will be large.
  * A sample size that is too large is costly in terms of time, money, etc.

----


## Sample size for proportion

* The confidence interval is of the form point estimate$\pm$estimated margin of error.
* Imagine that we want to plan an experiment, where we
  **want to achieve a certain margin of error $M$** (and thus a specific width of the      associated confidence interval).
* When we estimate a proportion the margin of error is
    $$
    M=z_{crit}\sqrt{\frac{p(1-p)}{n}},
    $$
   where the critical $z$-score, $z_{crit}$, is determined by the specified confidence level.
* If we solve the equation above we see that if we choose sample size
        $$n=p(1-p)\left(\frac{z_{crit}}{M}\right)^2,$$ then we obtain an estimate of $p$ with
        margin of error $M$.
* If we do not have a good guess for the value of $p$ we can use the worst case value
$p=50\%$. The corresponding sample size $n=\left(\frac{z_{crit}}{2M}\right)^2$ ensures that we 
obtain an estimate with a margin of error, which is at the *most* $M$.
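The formula can be sketched as a small function (`n_prop` is a hypothetical helper name):

```{r}
# Sketch: required sample size for estimating a proportion with margin
# of error M; p = 0.5 is the worst case.
n_prop <- function(M, p = 0.5, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level)/2)
  ceiling(p * (1 - p) * (z / M)^2)
}
n_prop(M = 0.01)                       # worst case: 9604
```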

----

### Example
* We want to make a survey to determine the proportion of the Danish population that will vote for a certain party at the next election. How many voters should we ask to get a margin of error of 1\%?
* We set the confidence level to be 95\%, which means that $z_{crit}=1.96$.
* Worst case is $p=0.5$, yielding:
  $$
  n=p(1-p)\left(\frac{z_{crit}}{M}\right)^2=\frac{1}{4}\left(\frac{1.96}{0.01}\right)^2 = 9604.
  $$
* If we are interested in the proportion of voters that vote for 
  "socialdemokratiet", a good guess is $p=0.23$, yielding
  $$
  n=p(1-p)\left(\frac{z_{crit}}{M}\right)^2=0.23(1-0.23)\left(\frac{1.96}{0.01}\right)^2 = 6804.
  $$
* If we instead are interested in "liberal alliance", a good guess is $p=0.05$, 
  yielding
  $$ 
  n=p(1-p)\left(\frac{z_{crit}}{M}\right)^2 = 0.05(1-0.05)\left(\frac{1.96}{0.01}\right)^2 = 1825.
  $$

----

## Sample size for mean
* The confidence interval is of the form point estimate$\pm$estimated margin of error.
* Imagine that we want to plan an experiment, where we
  **want to achieve a certain margin of error $M$**.
* When we estimate a mean the margin of error is 
  $$
  M=z_{crit}\frac{\sigma}{\sqrt{n}},
  $$
  where the critical $z$-score, $z_{crit}$, is determined by the specified confidence level.
* If we solve the equation above we see:
    * If we choose sample size
        $n=\left(\frac{z_{crit}\sigma}{M}\right)^2$, then we obtain an estimate with
        margin of error $M$.
* Problem: We usually do not know $\sigma$. Possible solutions:
    * Based on similar studies conducted previously, we make a
        qualified guess at $\sigma$.
    * Based on a pilot study a value of $\sigma$ is estimated.
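As with proportions, the formula can be sketched as a function (`n_mean` is a hypothetical name; the values of $M$ and $\sigma$ below are illustrative only):

```{r}
# Sketch: required sample size for estimating a mean with margin of
# error M, given a guess for sigma.
n_mean <- function(M, sigma, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level)/2)
  ceiling((z * sigma / M)^2)
}
n_mean(M = 0.5, sigma = 2)
```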
