---
output: html_document
---

# Exam module 1

It is highly recommended that you answer the exam using R Markdown
(you can simply use the exam R Markdown file as a starting point).

# Part I: Estimating probabilities

Remember to load the `mosaic` package first:
```{r message=FALSE}
library(mosaic)
options(digits = 4)
```

## Chile referendum data

In this part we will use the dataset `Chile`. Remember to read the [description](http://www.rdocumentation.org/packages/car/functions/Chile) of the dataset as well as the [Wikipedia](https://en.wikipedia.org/wiki/Chilean_national_plebiscite,_1988) entry about the background.

```{r}
Chile <- read.delim("https://asta.math.aau.dk/datasets?file=Chile.txt")
```

NB: This dataset has several missing values (`NA`). To exclude these when you use `tally`,
add the argument `useNA = "no"`.

-   Do a cross tabulation of the variables `vote` and `sex`.

-   Estimate the probability of voting "no" (`vote=N`).

-   Make a 95% confidence interval for the probability of voting "no".

-   Estimate the probability of voting "no" given that you are female (`sex=F`).

-   What relation would the true population probabilities satisfy if `vote` and `sex` were
    statistically independent? Based on your results, do you think they are independent?
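
As a hedged illustration of the kind of calculation asked for above, the following sketch uses base R on a small made-up data frame `toy` (hypothetical data, not the `Chile` dataset). The large-sample interval of the form estimate ± z · standard error is one common choice and is an assumption here. With `mosaic`, a cross tabulation of the real data would look something like `tally(~ vote + sex, data = Chile, useNA = "no")`.

```{r}
# Hypothetical toy data (NOT the Chile dataset); one vote is missing (NA)
toy <- data.frame(
  vote = c("N", "Y", "N", "A", "Y", "N", NA, "U", "N", "Y"),
  sex  = c("F", "M", "F", "F", "M", "M", "F", "M", "F", "F")
)

# Cross tabulation; useNA = "no" leaves out rows with a missing value
tab <- table(vote = toy$vote, sex = toy$sex, useNA = "no")

# Estimated probability of vote "N" among the non-missing votes
n_obs <- sum(tab)
p_hat <- sum(tab["N", ]) / n_obs

# Approximate 95% confidence interval: estimate +/- z * standard error
se_hat <- sqrt(p_hat * (1 - p_hat) / n_obs)
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se_hat

# Conditional probability of "N" given sex "F": (N and F) count / F count
p_N_given_F <- unname(tab["N", "F"] / sum(tab[, "F"]))

c(p_hat = p_hat, lower = ci[1], upper = ci[2], p_N_given_F = p_N_given_F)
```

With only nine usable observations the interval is very wide; the toy data are meant to show the mechanics, not to give a realistic result.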


# Part II: Sampling distributions and the central limit theorem

This is a purely theoretical exercise where we investigate the sampling
distribution of the mean for random samples from a known population.

## House prices in Denmark

The Danish real estate agency HOME has a database containing approximately
80,000 house prices for one-family houses under DKK 10 million for the period
2004-2016. The house prices (without additional information such as
house size, address, etc.) are available as an **R** data file `Home.RData` on the
course webpage. If you download it, you can load it using `load("Home.RData")`,
assuming you have saved it in the same directory as this R Markdown document.
This will add the vector `price` to your workspace. Alternatively,
you can load it directly from the course website (this will download it every time
you run the R Markdown document, so make sure you have a decent internet connection):

```{r}
load(url("https://asta.math.aau.dk/datasets?file=Home.RData"))
```

Make a histogram of all the house prices using a command like
`gf_histogram(~price, bins = 30)` inserted in a new code chunk
(experiment with the number of bins):

- Explain how a histogram is constructed.
- Does this histogram look like a normal distribution?
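
The binning behind a histogram can be sketched in base R on a small made-up vector `toy_x` with hand-picked bin edges `breaks` (both hypothetical): each value is assigned to an interval, the values in each interval are counted, and the bar heights are either these counts or densities (count divided by the total count times the bin width).

```{r}
# Hypothetical toy values and bin edges (NOT the house prices)
toy_x  <- c(1.2, 1.9, 2.1, 2.4, 2.6, 3.3, 3.8, 4.5)
breaks <- seq(1, 5, by = 1)

# Assign each value to a bin (a, b] and count the values per bin
bin_counts <- table(cut(toy_x, breaks = breaks))

# Density scale: count / (total count * bin width), so the bar areas sum to 1
bin_dens <- as.numeric(bin_counts) / (length(toy_x) * diff(breaks))

bin_counts
sum(bin_dens * diff(breaks))
```

Changing `breaks` changes the counts per bin, which is why the number of bins can affect the visual impression of the same data.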

In this database (our population) the mean price is `r mean(price)` and the
standard deviation is `r sd(price)`.

In many cases access to such databases is restricted, and in the following we
imagine that we are only allowed access to a random sample of 40 prices. The
mean of this sample will be denoted `y_bar`.

Before obtaining this sample we will use the Central Limit Theorem (CLT) to
predict the distribution of `y_bar`:

- What is the expected value of `y_bar`?

- What is the standard deviation of `y_bar` (also called the standard error)?

- What is the approximate distribution of `y_bar`?
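
Recall that, by the CLT, the mean of $n$ independent observations from a population with mean $\mu$ and standard deviation $\sigma$ is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The following is a minimal sketch that checks this by simulation on a made-up skewed population; `toy_pop`, `n_toy` and the exponential shape are assumptions chosen for illustration and have nothing to do with the house prices.

```{r}
# Hypothetical skewed population (NOT the house prices)
toy_pop <- rexp(10000, rate = 0.5)
n_toy   <- 25

# CLT prediction for the mean of n_toy observations
mu_toy <- mean(toy_pop)
se_toy <- sd(toy_pop) / sqrt(n_toy)

# Empirical check: means of 2000 random samples of size n_toy
toy_means <- replicate(2000, mean(sample(toy_pop, n_toy)))

c(predicted_mean = mu_toy, simulated_mean = mean(toy_means),
  predicted_se = se_toy, simulated_se = sd(toy_means))
```

No seed is set, so the simulated values change slightly from run to run, but they should stay close to the predicted ones.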

Now make a random sample of 40 house prices and calculate the sample mean:
```{r}
y <- sample(price, 40)
mean(y)
```

Repeat this command a few times. Is each mean price close to what you expected?

Use `replicate` to repeat the sampling 500 times and save each mean value in the
vector `y_bar`:

```{r}
y_bar <- replicate(500, mean(sample(price, 40)))
```

Calculate the mean and standard deviation of the values in `y_bar`.

- How do they match with what you expected?

- Make a histogram of the values in `y_bar` and use `gf_dist` to add the density
  curve of the approximate distribution you predicted previously.
  For example, if you predicted a normal distribution with
  mean 2 and standard deviation 0.25:

```{r}
# Density-scaled histogram of the sample means with the predicted normal curve on top
gf_histogram(..density.. ~ y_bar, bins = 15) %>%
  gf_dist("norm", mean = 2, sd = 0.25, col = "red")
```

- Make a boxplot of `y_bar` and explain how a boxplot is constructed.
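
The ingredients of a boxplot can be sketched in base R on a small made-up vector `toy_y` (hypothetical values, not the sample means). Note that `boxplot.stats()` uses hinges, which in general can differ slightly from the quartiles returned by `quantile()`; for these toy values the two agree.

```{r}
# Hypothetical toy values (NOT the simulated means in y_bar)
toy_y <- c(2, 3, 4, 5, 6, 7, 8, 9, 30)

# Box: lower quartile, median and upper quartile
q   <- unname(quantile(toy_y, c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Fences: 1.5 * IQR beyond the quartiles
fences <- c(q[1] - 1.5 * iqr, q[3] + 1.5 * iqr)

# Whiskers end at the most extreme observations inside the fences;
# observations outside the fences are drawn as individual points (outliers)
inside   <- toy_y[toy_y >= fences[1] & toy_y <= fences[2]]
whiskers <- range(inside)
outliers <- toy_y[toy_y < fences[1] | toy_y > fences[2]]

# boxplot.stats() gives the same five numbers and outliers for these data
boxplot.stats(toy_y)
```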

# Part III: Theoretical boxplot for a normal distribution
 
Finally, consider the theoretical boxplot of a general normal distribution with
mean $\mu$ and standard deviation $\sigma$, and find the probability of being an
outlier according to the $1.5 \cdot \text{IQR}$ criterion:

- First find the $z$-score of the lower/upper quartile, i.e., the value of $z$ such that
  $\mu \pm z\sigma$ is the lower/upper quartile.

- Use this to find the IQR (expressed in terms of $\sigma$).

- Now find the $z$-score of the maximal extent of the whisker, i.e., the value of $z$ such that
  $\mu \pm z\sigma$ is the endpoint of the lower/upper whisker.

- Find the probability of being an outlier.
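
The two base R functions that are useful here are `qnorm` (from a probability to a $z$-score) and `pnorm` (from a $z$-score to a probability). The sketch below illustrates them on a different quantile (90%) and a made-up cutoff of 2 standard deviations, so the numbers are not the ones asked for above; `mu_ex`, `sigma_ex` and `z_cut` are hypothetical values chosen for illustration.

```{r}
# Hypothetical parameters, used only to show that z-scores do not depend on them
mu_ex    <- 10
sigma_ex <- 3

# z-score of the 90% quantile: the z with P(Z <= z) = 0.9 for a standard normal Z
z90 <- qnorm(0.9)

# The same quantile of a normal with mean mu_ex and sd sigma_ex is mu_ex + z90 * sigma_ex
all.equal(qnorm(0.9, mean = mu_ex, sd = sigma_ex), mu_ex + z90 * sigma_ex)

# Probability of falling more than z_cut standard deviations from the mean
z_cut  <- 2  # hypothetical cutoff
p_tail <- 2 * pnorm(-z_cut)

c(z90 = z90, p_tail = p_tail)
```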