---
title: "Estimation"
author: "The ASTA team"
output:
  slidy_presentation:
    css: "https://asta.math.aau.dk/course/asta/2018-1/?file=lecture_style.css"
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_section: yes
    toc: yes
---

```{r, include = FALSE}
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# Point and interval estimates

## Point and interval estimates

* We want to study hypotheses for population parameters, e.g.
  the mean $\mu$ and the standard deviation $\sigma$.
    * If $\mu$ is e.g. the mean waiting time in a queue, then it might
      be relevant to investigate whether it exceeds 2 minutes.
* Based on a sample we make a **point estimate** which is
  a guess of the parameter value.
    * For instance we have used $\bar{y}$ as an estimate of $\mu$ and
      $s$ as an estimate of $\sigma$.
* We often want to supplement the point estimate
  with an **interval estimate** (also called a **confidence interval**). 
  This is an interval around the
  point estimate, in which we are confident (to a certain degree)
  that the population parameter is located.
* The parameter estimate can then be used to investigate our hypothesis. 


## Point estimators: Bias

* If we want to estimate the population mean $\mu$ we have several
  possibilities e.g.
    * the sample mean $\bar{y}$
    * the average $y_T$ of the sample upper and lower quartiles
* Advantage of $y_T$: Very large/small observations have little
  effect, i.e. it has practically no effect if there are a few errors
  in the data set.
* Disadvantage of $y_T$: If the distribution of the population is
  skewed, i.e. asymmetrical, then $y_T$ is **biased**, meaning that in the long run
  this estimator systematically over or under estimates the value of
  $\mu$.
* Generally we prefer that an estimator is **unbiased**, i.e. its
  distribution is centered around the true parameter value.
* Recall that for a sample from a population with mean $\mu$, the sample mean   $\bar{y}$ also has mean $\mu$. That is, $\bar{y}$ is an unbiased estimate    of the population mean $\mu$.  


## Point estimators: Efficiency

* From previous results we know that the standard error of $\bar{y}$ is 
$\frac{\sigma}{\sqrt{n}}$,  i.e. the standard error approaches zero when the sample size increases.
* It is difficult to express the standard error of $y_T$, but it is
  greater than the standard error of $\bar{y}$, which is why we usually prefer $\bar{y}$ as
  estimator.
* In general we prefer that an estimator is **efficient**, which means that when the
  sample size grows, the standard error of the estimator approaches
  zero. And the rate at which it approaches zero is as fast or
  faster than other estimators.
  This means that for most populations $y_T$ is inefficient.



## Notation
* The symbol $\hat{\ }$ above a parameter is often used to denote a (point)
  estimate of the parameter. We have looked at an
    * estimate of the population mean:\ $\hat{\mu}=\bar{y}$
    * estimate of the population standard deviation:\ $\hat{\sigma}=s$
* When we observe a $0/1$ variable, which e.g. is used to denote
  yes/no or male/female, then we will use the notation
  $$\pi=P(Y=1)$$ 
  for the proportion of the population with the
  property $Y=1$.
* The estimate $\hat{\pi}=(y_1+y_2+\ldots+y_n)/n$ is the
  relative frequency of the property $Y=1$ in the sample.


## Confidence Interval

* The general definition of a confidence interval for a population
parameter is as follows:
    * A **confidence interval** for a parameter is constructed as
      an interval, where we expect the parameter to be.
    * The probability that this construction yields an interval which
      includes the parameter is called the **confidence level** and
      it is typically chosen to be $95\%$. 
    * (1-confidence level) is called the
      **error probability** (in this case $1-0.95=0.05$,\ i.e.\ $5\%$).
* In practice the interval is often constructed as a symmetric interval
  around a point estimate:
    * **point estimate$\pm$margin of error**
    * Rule of thumb: With a margin of error of 2 times
      the standard error you get a confidence interval, where the
      confidence level is approximately $95\%$.
    * I.e: **point estimate** $\pm$ **2 x standard error** has
      confidence level of approximately $95\%$.


## Confidence interval for proportion

* Consider a population with a distribution where the probability of having a given
  characteristic is $\pi$ and the probability of not having it is $1-\pi$.
* When $no/yes$ to the characteristic is denoted $0/1$, i.e. $y$ is $0$
  or $1$, the distribution of $y$ have a standard deviation of: 
  $$
  \quad \sigma=\sqrt{\pi(1-\pi)}.
  $$
  That is, the standard deviation is not a "free" parameter for a $0/1$ variable
  as it is directly linked to the probability $\pi$.
* With a sample size of $n$ the standard error of $\hat{\pi}$ will be (since $\hat{\pi} = \frac{\sum_{i=1}^n y_i}{n}$): 
  $$
  \quad \sigma_{\hat{\pi}}=
  \frac{\sigma}{\sqrt{n}} =\sqrt{\frac{\pi(1-\pi)}{n}}.
  $$
* We do not know $\pi$ but we insert the estimate and get the 
  **estimated standard error** of $\hat{\pi}$:
  $$
  se=\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}.
  $$
* The rule of thumb gives that the interval
  $$\hat{\pi}\pm 2\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$$
  has confidence level of approximately $95\%$. I.e., before the data is known the
  random interval given by the formula above has approximately 95% probability of
  covering the true value $\pi$.

----

### Example: Point and interval estimate for proportion

* Now we will have a look at a data set concerning votes in Chile. 
  Information about the data can be found [here](https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/Chile).
```{r}
Chile <- read.delim("https://asta.math.aau.dk/datasets?file=Chile.txt")
```
* We focus on the variable `sex`,\ i.e. the gender distribution in the sample.
```{r message=FALSE}
library(mosaic)
tally( ~ sex, data = Chile)
tally( ~ sex, data = Chile, format = "prop")
```
* Unknown population proportion of females (F),\ $\pi$.
* Estimate of $\pi$: $\quad \hat{\pi} = \frac{1379}{1379+1321} = 0.5107$
* Rule of thumb : $\quad \hat{\pi} \pm 2 \times se = 0.5107 \pm 2  \sqrt{\frac{0.5107(1-0.5107)}{1379 + 1321}} = (0.49, 0.53)$ is an approximate 95% confidence interval for $\pi$.  

----

### Example: Confidence intervals for proportion in  **R** 
* **R** automatically calculates the confidence interval for the
  proportion of females when we do a so-called hypothesis test (we
  will get back to that later):
```{r}
prop.test( ~ sex, data = Chile, correct = FALSE)
```
* The argument `correct = FALSE` is needed to make **R** use the
"normal" formulas as on the slides and in the book. When
`correct = TRUE` (the default) a mathematical correction which
you have not learned about is applied and slightly different results
are obtained.

----

## General confidence intervals for proportion

* Based on the central limit theorem we have:
  $$\hat{\pi}\approx N \left (\pi,\sqrt{\frac{\pi(1-\pi)}{n}} \right)$$
  **if** $n\hat{\pi}$ and $n(1-\hat{\pi})$ both are at least $15$.
* To construct a confidence interval with (approximate) confidence level $1-\alpha$:
    1) Find the critical value $z_{crit}$ for which the upper tail probability in the standard normal distribution is $\alpha/2$.  <!-- E.g. we have $z=1.96$ when $\alpha=5\%$.-->
    2) Calculate $se=\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$
    3) Then $\hat{\pi}\pm z_{crit}\times se$ is a confidence interval with
      confidence level $1-\alpha$.

----

### Example: Chile data
Compute for the `Chile` data set the 99% and 95%-confidence intervals for the probability that a person is female:

* For a  $99\%$-confidence level we have $\alpha=1\%$ and
    1) $z_{crit}$=`qdist("norm", 1 - 0.01/2)`=2.576.
    2) We know that $\hat{\pi}=0.5107$ and $n=2700$, so $se = \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}} = 0.0096$.
    3) Thereby, a 99%-confidence interval is: $\hat{\pi}\pm z_{crit}\times se=(0.4859, 0.5355)$.
* For a  $95\%$-confidence level we have $\alpha=5\%$ and
    1) $z_{crit}$=`qdist("norm", 1 - 0.05/2)`=1.96.
    2) Again, $\hat{\pi}=0.5107$ and $n=2700$ and so $se=0.0096$.
    3) Thereby, we find as 95%-confidence interval $\hat{\pi}\pm z_{crit}\times se=(0.4918, 0.5295)$ (as the result of `prop.test`).



## Confidence Interval for mean - normally distributed sample
* When it is reasonable to assume that the population distribution is
  normal we have the **exact** result
  $$
  \bar{y}\sim \texttt{norm}(\mu,\frac{\sigma}{\sqrt{n}}), 
  $$
  i.e.\ $\bar{y}\pm z_{crit}\times \frac{\sigma}{\sqrt{n}}$ is not only an
  approximate but rather an exact confidence interval for the population mean, $\mu$.
* In practice we **do not know** $\sigma$ and instead we are forced to apply the sample standard deviation $s$
  to find the **estimated standard error** $se=\frac{s}{\sqrt{n}}.$ 
* This extra uncertainty, however, implies that an exact confidence interval for the population mean $\mu$ cannot be constructed using the $z$-score.
* Luckily, an exact interval can still be constructed by using the so-called **$t$-score**,
  which apart from the confidence level depends on the **degrees of freedom**, 
  which are $df = n-1$. That is the confidence interval now takes the form
  $$
  \bar{y}\pm t_{crit}\times se.
  $$


## $t$-distribution and $t$-score
 * Calculation of $t$-score is based on the **$t$-distribution**, which is very similar to the standard normal $z$-distribution:
    * it is symmetric around zero and bell shaped, but  
    * it has "heavier" tails and thereby 
    * a slightly larger standard deviation than the standard normal distribution.
    * Further, the  $t$-distribution's standard deviation decays as a function of its 
      **degrees of freedom**, which we denote $df$. 
    * and when $df$ grows the $t$-distribution approaches the standard normal distribution.  
  
The expression of the density function is of slightly complicated form and will not be stated here, instead the $t$-distribution is plotted below for $df =1,2,10$ and $\infty$.
    
```{r echo=FALSE, fig.keep="last", fig.width=12}
cols <- c("black", "gray40", "gray60", "gray80")
figkey <- list(lines = list(col=rev(cols), lwd = 3),
               space = "right",
               text = list(c("t(df=1)", "t(df=2)", "t(df=10)", expression(paste("t(df=", infinity,")=N(0,1)")))))
plotDist("norm", mean = 0, sd = 1, key = figkey, col = cols[1], xlim = c(-5, 5), lwd = 3)
plotDist("t", df = 10, add = TRUE, col = cols[2], lwd = 3)
plotDist("t", df = 2, add = TRUE, col = cols[3], lwd = 3)
plotDist("t", df = 1, add = TRUE, col = cols[4], lwd = 3)
```

----

### Calculation of $t$-score in **R**

```{r}
qdist("t", p = 1 - 0.025, df = 4)
```
* A $t$-score is the quantile (i.e. value on the x-axis) such that we have a given **right tail** probability. 
* To get e.g. the $t$-score corresponding to a right tail probability of 2.5 % we have to look up the 97.5 % quantile using `qdist` with `p = 1 - 0.025` since `qdist` looks at the area to the **left hand side**.
* The degrees of freedom are determined by the sample size; in this example we just used df = 4 for illustration.
* As the $t$-score given a right probability of 2.5 % is 2.776 and the $t$-distribution is symmetric around 0, we have that an observation falls between  -2.776 and 2.776 with 
probability 1 - 2 $\cdot$ 0.025 = 95 % for a $t$-distribution with 4 degrees of freedom.


## Example: Confidence interval for mean
* We return to the dataset `Ericksen` and want to construct
a $95\%$ confidence interval for the population mean $\mu$ of the variable `crime`.
```{r}
Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
stats <- favstats( ~ crime, data = Ericksen)
stats
qdist("t", 1 - 0.025, df = 66 - 1, plot = FALSE)
```
* I.e. we have
    * $\bar{y} = `r round(stats$mean, 3)`$
    * $s = `r round(stats$sd, 3)`$
    * $n = `r stats$n`$
    * $df = n-1 = `r stats$n-1`$
    * $t_{crit} = `r round(qt( 1 - 0.025, df = 66 - 1), 3)`$.
* The confidence interval is $\bar{y}\pm t_{crit}\frac{s}{\sqrt{n}} =
        (`r round( confint( t.test( ~ crime, data = Ericksen))$lower, 3)` ,
        `r round( confint( t.test( ~ crime, data = Ericksen))$upper, 3)`)$
* All these calculations can be done automatically by **R**:
```{r}
t.test( ~ crime, data = Ericksen, conf.level = 0.95)
```


## Example: Plotting several confidence intervals in **R**

* We shall look at a built-in **R** dataset `chickwts`.
* `?chickwts` yields a page with the following information

  > An experiment was conducted to measure and compare the effectiveness
  > of various feed supplements on the growth rate of chickens.
  > Newly hatched chicks were randomly allocated into six groups, and each
  > group was given a different feed supplement. Their weights in grams
  > after six weeks are given along with feed types.
  
* `chickwts` is a data frame with 71 observations on 2 variables:
    * `weight`:\ a numeric variable giving the chick weight.
    * `feed`:\ a factor giving the feed type.
* Calculate a confidence interval for the mean weight for each feed separately;
the confidence interval is from `lower` to `upper` given by `mean`$\pm$`tscore * se`:
```{r}
cwei <- favstats( weight ~ feed, data = chickwts)
se <- cwei$sd / sqrt(cwei$n) # Standard errors
tscore <- qdist("t", p = .975, df = cwei$n - 1, plot = FALSE) # t-scores for 2.5% right tail probability
cwei$lower <- cwei$mean - tscore * se
cwei$upper <- cwei$mean + tscore * se
cwei[, c("feed", "mean", "lower", "upper")]
```
* We can plot the confidence intervals as horizontal line segments using `gf_errorbarh`:
```{r message=FALSE}
gf_errorbarh(feed ~ mean + lower + upper, data = cwei) %>% 
  gf_point(feed ~ mean)
```

# Determining sample size

## Sample size for proportion

* The confidence interval is of the form point estimate$\pm$estimated margin of error.
* When we estimate a proportion the margin of error is
    $$
    M=z_{crit}\sqrt{\frac{\pi(1-\pi)}{n}},
    $$
   where the critical $z$-score, $z_{crit}$, is determined by the specified confidence level.
* Imagine that we want to plan an experiment, where we
  **want to achieve a certain margin of error $M$** (and thus a specific width of the      associated confidence interval).
* If we solve the equation above we see:
    * If we choose sample size
        $n=\pi(1-\pi)(\frac{z_{crit}}{M})^2$, then we obtain an estimate of $\pi$ with
        margin of error $M$.
* If we do not have a good guess for the value of $\pi$ we can use the worst case value
$\pi=50\%$. The corresponding sample size $n=(\frac{z_{crit}}{2M})^2$ ensures that we 
obtain an estimate with a margin of error, which is at the *most* $M$.

----

### Example
* Let us choose $z_{crit}=1.96$, i.e\ the confidence level is 95\%.
* How many voters should we ask to get a margin of error, which equals 1\%?
* Worst case is $\pi=0.5$, yielding:
  $$
  n=\pi(1-\pi)\left(\frac{z_{crit}}{M}\right)^2=\frac{1}{4}\left(\frac{1.96}{0.01}\right)^2 = 9604.
  $$
* If we are interested in the proportion of voters that vote for 
  "socialdemokratiet'' a good guess is $\pi=0.23$, yielding
  $$
  n=\pi(1-\pi)\left(\frac{z_{crit}}{M}\right)^2=0.23(1-0.23)\left(\frac{1.96}{0.01}\right)^2 = 6804.
  $$
* If we instead are interested in "liberal alliance'' a good guess is $\pi=0.05$, 
  yielding
  $$ 
  n=\pi(1-\pi)\left(\frac{z_{crit}}{M}\right)^2 = 0.05(1-0.05)\left(\frac{1.96}{0.01}\right)^2 = 1825.
  $$



## Sample size for mean
* The confidence interval is of the form point estimate$\pm$estimated margin of error.
* When we estimate a mean the margin of error is 
  $$
  M=z_{crit}\frac{\sigma}{\sqrt{n}},
  $$
  where the critical $z$-score, $z_{crit}$, is determined by the specified confidence level.
* Imagine that we want to plan an experiment, where we
  **want to achieve a certain margin of error $M$**.
* If we solve the equation above we see:
    * If we choose sample size
        $n=(\frac{z_{crit}\sigma}{M})^2$, then we obtain an estimate with
        margin of error $M$.
* Problem: We usually do not know $\sigma$. Possible solutions:
    * Based on similar studies conducted previously, we make a
        qualified guess at $\sigma$.
    * Based on a pilot study a value of $\sigma$ is estimated.
