---
title: "Data collection and data wrangling"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_section: yes
    toc: yes
---

```{r, include = FALSE}
options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "palmerpenguins"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
knitr::opts_chunk$set(warning = FALSE)
library(mosaic)
```

# Data 

## Data example

We use data about pengiuns from the R package [palmerpenguins](https://github.com/allisonhorst/palmerpenguins)
```{r}
pingviner <- palmerpenguins::penguins
pingviner
```

```{r include=FALSE}
# Ignore this code. It's just used to write correct numbers later in the text.
y <- pingviner %>% filter(species == "Gentoo") %>% pull(bill_length_mm) %>% na.omit()
nn <- length(y)
xx <- sort(y)
```

# Summaries and plots of qualitative variables

## Tables of qualitative variables

- The main function to make tables from a data frame of observations is `tally()` which tallies (counts up) the number of observations within a given category. E.g:
```{r}
tally(~species, data = pingviner)
```
```{r}
tally(species ~ island, data = pingviner)
```

## Plots of qualitative variables

- The main plotting functions for qualitative variables are `gf_percents()` and `gf_bar()`. 
E.g:
```{r}
gf_percents(~species, data = pingviner)
```

----

```{r}
gf_percents(~species, fill = ~sex, data = pingviner)
```

----

```{r}
gf_percents(~species, fill = ~sex, data = pingviner, position = position_dodge())
```

# Summaries of quantitative variables

## Percentiles

* **The $p$th percentile** is a value such that about $p$% of the sample lies below or at this value and about $(100-p)$% of the sample lies above it.

* To calculate a percentile, first sort data in increasing order. For the `bill_length_mm` of `Gentoo` penguins it is:
    $$
      y_{(1)}=`r xx[1]`, y_{(2)}=`r xx[2]`, y_{(3)}=`r xx[3]`, \ldots, y_{(n)} = `r xx[nn]`.
    $$
Here the number of observations is $n=`r nn`$ (omitting any `NA`s).
```{r echo=FALSE}
p <- 5
N <- nn*p/100
```
* Find the $`r p`$th percentile (i. e.\ $p = `r p`$):\ 
    * The observation number corresponding to the 5-percentile is $N = \frac{ `r nn` \cdot `r p`}{100} = `r N`$. 
    * So the 5-percentile lies between the observations with observation number $k=`r (k <- floor(N))`$ and $k+1=`r k+1`$. That is, its value lies somewhere in the interval between $y_{`r k`}= `r xx[k]`$ and $y_{`r k+1`}=`r xx[k+1]`$.
    * One of several methods for estimating the 5-percentile from the value of N is defined as:
    $$
      y_{(k)} + (N - k)(y_{(k+1)} - y_{(k)}) 
    $$
which in this case states
$$
    y_{`r k`} + (`r N` - `r k`)(y_{`r k+1`} - y_{`r k`}) = `r xx[k]` + `r N-k` \cdot (`r xx[k+1]`-`r xx[k]`) = `r xx[k]+(N-k)*(xx[k+1]-xx[k])`
$$    

## Measures of center of data: Mean and median

* A number of numerical summaries can be retrieved using the `favstats` command:
```{r}
favstats(bill_length_mm ~ species, data = pingviner)
```
* The observed values of `bill_length_mm` are $y_1=`r y[1]`$, $y_2=`r y[2]`,\ldots,y_n=`r y[nn]`$, where there are a total of $n=`r nn`$ values (omitting any `NA`s).
As previously defined this constitutes a **sample**.
* **mean** = `r round(mean(y))` is the **average** of the sample, which is calculated by
    $$
      \bar{y}=\frac{1}{n}\sum_{i=1}^n y_i.
    $$
    We may also call $\bar{y}$ the **(empirical) mean** or the **sample mean**.
    It is calculated using `mean()` in **R**.
* **median** = `r round(median(y))` is the 50-percentile,
    i.e. the value that splits the sample in 2 groups of equal size.
    It is calculated using `median()` in **R**.
* An important property of the **mean** and the **median** is that they have the same unit as the observations (e.g. millimeter).


## Measures of variability of data: range, standard deviation and variance

* The **range** is the difference of the largest and smallest observation (`range()` in **R**).
* The **(empirical) variance** (`var()` in **R**) is the average of the squared deviations from the mean:
      $$
        s^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2.
      $$
* **sd** $=$ **standard deviation** $= s=\sqrt{s^2}$ (`sd()` in **R**).
* Note:\ If the observations are measured in mm,\ the **variance** has unit $\text{mm}^2$ which is hard to interpret. The **standard deviation** on the other hand has the same unit as the observations.
* The standard deviation describes how much data varies around the (empirical) mean.

----

### The empirical rule

```{r empRule,echo=FALSE,fig.width=12,fig.height=6}
set.seed(5)
x <- rnorm(900)
x <- (x-mean(x))/sd(x)
hist(x,axes=F,xlab="",breaks="FD",ylab="",main="",cex.main = 1.5,ylim=c(-.21,.4),probability=TRUE)
axis(1,at=-3:3,labels=F,pos=0)
axis(1,at=-3,labels=substitute(bar(y)-3*s),pos=0)
axis(1,at=-2,labels=substitute(bar(y)-2*s),pos=0)
axis(1,at=-1,labels=substitute(bar(y)-s),pos=0)
axis(1,at=0,labels=substitute(bar(y)),pos=0)
axis(1,at=1,labels=substitute(bar(y)+s),pos=0)
axis(1,at=2,labels=substitute(bar(y)+2*s),pos=0)
axis(1,at=3,labels=substitute(bar(y)+3*s),pos=0)
arrows(-1,-.1,1,-.1,col="red",code=3,length=.1)
text(-.01,-.115,"About 68% of measurements",col="red")
arrows(-2,-.15,2,-.15,col="blue",code=3,length=.1)
text(-.01,-.1655,"About 95% of measurements",col="blue")
arrows(-3,-.2,3,-.2,code=3,length=.1)
text(-.01,-.215,"All or nearly all measurements")
```

If the histogram of the sample looks like a bell shaped curve, then

* about 68% of the observations lie between $\bar{y}-s$ and $\bar{y}+s$.
* about 95% of the observations lie between $\bar{y}-2s$ and $\bar{y}+2s$.
* All or almost all (99.7%) of the observations lie between $\bar{y}-3s$ and $\bar{y}+3s$.


# Target population and random sampling

## Population parameters

* When the sample size grows, then e.g. the mean of the sample, $\overline{y}$, will stabilize around a fixed value, $\mu$, which is usually unknown. The value $\mu$ is called the **population mean**.
* Correspondingly, the standard deviation of the sample, $s$, will stabilize around a fixed value, $\sigma$, which is usually unknown. The value $\sigma$ is called the **population standard deviation**.
* Notation:
	+ $\mu$ (mu) denotes the population mean.
	+ $\sigma$ (sigma) denotes the population standard deviation.

| Population | Sample       |
|:----------:|:------------:|
|$\mu$       |$\overline{y}$|
|$\sigma$    |$s$           |

### A word about terminology

* **Standard deviation**:\ a measure of variability of a population or a sample.
* **Standard error**:\ a measure of variability of an estimate. For example, a measure of variability of the sample mean.

## Aim of statistics

* Statistics is all about "saying something" about a population.
* Typically, this is done by taking a random sample from the population.
* The sample is then analysed and a statement about the population can be made.
* The process of making conclusions about a population from analysing a sample is called **statistical inference**.

## Random sampling schemes

Possible strategies for obtaining a random sample from the target population are explained in Agresti section 2.4:

+ **Simple sampling: each possible sample of equal size equally probable**
+ Systematic sampling
+ Stratified sampling
+ Cluster sampling
+ Multistage sampling
+ ...

# Biases

## Types of biases

Agresti section 2.3:

* Sampling/selection bias
    + Probability sampling: each sample of size $n$ has same probability of being sampled
        + Still problems: undercoverage, groups not represented (inmates, homeless, hospitalized, ...)
    + Non-probability sampling: probability of sample not possible to determine
        + E.g. volunteer sampling
* Response bias
    + E.g. poorly worded, confusing or even order of questions
    + Lying if think socially unacceptable
* Non-response bias
    + Non-response rate high; systematic in non-responses (age, health, believes)

## Example of sample bias: United States presidential election, 1936

(Based on Agresti, [this](https://en.wikipedia.org/wiki/United_States_presidential_election,_1936) and [this](https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case1.html).)

* Current president: Franklin D. Roosevelt
* Election: Franklin D. Roosevelt vs Alfred Landon (Republican governor of Kansas)
* Literary Digest: magazine with history of accurately predicting winner of past 5 presidential elections

----

### Results

* Literary Digest poll: Landon: 57%; Roosevelt: 43%
* Actual results: Landon: 38%; Roosevelt: 62%
* Sampling error: 57%-38% = 19%
    + Practically all of the sampling error was the result of **sample bias**
    + Poll size of > 2 mio. individuals participated -- extremely large poll

----

### Problems (biases)

* Mailing list of about 10 mio. names was created
    + Based on every telephone directory, lists of magasine subscribers, rosters of clubs and associations, and other sources
    + Each one of 10 mio. received a mock ballot and asked to return the marked ballot to the magazine
* "respondents who returned their questionnaires represented only that subset of the population with a relatively intense interest in the subject at hand, and as such constitute in no sense a random sample ... it seems clear that the minority of anti-Roosevelt voters felt more strongly about the election than did the pro-Roosevelt majority" (*The American Statistician*, 1976)
* Biases:
    + Sample bias
        - List generated towards middle- and upper-class voters (e.g. 1936 and telephones)
        - Many unemployed (club memberships and magazine subscribers)
    + Non-response bias
        - Only responses from 2.3/2.4 mio out of 10 million people

## Example of response bias: Wording matters

New York Times/CBS News poll on attitude to increased fuel taxes

* "Are you in favour of a new gasoline tax?" - 12% said yes.
* "Are you in favour of a new gasoline tax to decrease US dependency on
  foreign oil?" - 55% said yes.
* "Do you think a new gas tax would help to reduce global warming?" -
  59% said yes.
  
## Example of response bias: Order of questions matter

US study during cold war asked two questions:

1 "Do you think that US should let Russian newspaper reporters come
  here and sent back whatever they want?"
  
2 "Do you think that Russia should let American newspaper reporters come
  in and sent back whatever they want?"

The percentage of yes to question 1 was 36%, if it was asked
first and 73%, when it was asked last.

## Example of survivior bias: Bullet holes of honor

(Based on [this](https://www.motherjones.com/kevin-drum/2010/09/counterintuitive-world/).)

* World War II
* Royal Air Force (RAF), UK
    + Lost many planes to German anti-aircraft fire
* Armor up!
    + Where?
    + Count up all the bullet holes in planes that returned from missions
        - Put extra armor in the areas that attracted the most fire

----

* Hungarian-born mathematician Abraham Wald:
    + If a plane makes it back safely with a bunch of bullet holes in its wings: holes in the wings aren't very dangerous
        - **Survivorship bias**
    + Armor up the areas that (on average) don't have any bullet holes
        - They never make it back, apparently dangerous
        
    |Section of plane$\quad$| Bullet holes per square foot |
    |:----------------------|:----------------------------:|
    |Engine                 |                        $1.11$|
    |Fuselage               |                        $1.73$|
    |Fuel system            |                        $1.55$|
    |Rest of the plane      |                        $1.80$|
        
```{r, fig.width=10, echo=FALSE, fig.align='center', eval = FALSE}
# was ![](https://asta.math.aau.dk/static-files/asta/img/bulletHoles.jpeg)
url <- "https://asta.math.aau.dk/static-files/asta/img/bulletHoles.jpeg"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(jpeg::readJPEG(z))
unlink(z)
```

(See also [this xkcd](https://xkcd.com/1827/))