---
title: "Data collection and data wrangling"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_section: yes
    toc: yes
---

```{r, include = FALSE}
options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "palmerpenguins"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
knitr::opts_chunk$set(warning = FALSE)
library(mosaic)
```

# Data 

## Data example

We use data about pengiuns from the R package [palmerpenguins](https://github.com/allisonhorst/palmerpenguins)
```{r}
pingviner <- palmerpenguins::penguins
pingviner
```

```{r include=FALSE}
# Ignore this code. It's just used to write correct numbers later in the text.
y <- pingviner %>% filter(species == "Gentoo") %>% pull(bill_length_mm) %>% na.omit()
nn <- length(y)
xx <- sort(y)
```

# Summaries and plots of qualitative variables

## Tables of qualitative variables

- The main function to make tables from a data frame of observations is `tally()` which tallies (counts up) the number of observations within a given category. E.g:
```{r}
tally(~species, data = pingviner)
```
```{r}
tally(species ~ island, data = pingviner)
```

## Plots of qualitative variables

- The main plotting functions for qualitative variables are `gf_percents()` and `gf_bar()`. 
E.g:
```{r}
gf_percents(~species, data = pingviner)
```

----

```{r}
gf_percents(~species, fill = ~sex, data = pingviner)
```

----

```{r}
gf_percents(~species, fill = ~sex, data = pingviner, position = position_dodge())
```

# Target population and random sampling

## Population parameters

* When the sample size grows, then e.g. the mean of the sample, $\overline{y}$, will stabilize around a fixed value, $\mu$, which is usually unknown. The value $\mu$ is called the **population mean**.
* Correspondingly, the standard deviation of the sample, $s$, will stabilize around a fixed value, $\sigma$, which is usually unknown. The value $\sigma$ is called the **population standard deviation**.
* Notation:
	+ $\mu$ (mu) denotes the population mean.
	+ $\sigma$ (sigma) denotes the population standard deviation.

| Population | Sample       |
|:----------:|:------------:|
|$\mu$       |$\overline{y}$|
|$\sigma$    |$s$           |

### A word about terminology

* **Standard deviation**:\ a measure of variability of a population or a sample.
* **Standard error**:\ a measure of variability of an estimate. For example, a measure of variability of the sample mean.

## Aim of statistics

* Statistics is all about "saying something" about a population.
* Typically, this is done by taking a random sample from the population.
* The sample is then analysed and a statement about the population can be made.
* The process of making conclusions about a population from analysing a sample is called **statistical inference**.

## Random sampling schemes

Possible strategies for obtaining a random sample from the target population are explained in Agresti section 2.4:

+ **Simple sampling: each possible sample of equal size equally probable**
+ Systematic sampling
+ Stratified sampling
+ Cluster sampling
+ Multistage sampling
+ ...

# Biases

## Types of biases

Agresti section 2.3:

* Sampling/selection bias
    + Probability sampling: each sample of size $n$ has same probability of being sampled
        + Still problems: undercoverage, groups not represented (inmates, homeless, hospitalized, ...)
    + Non-probability sampling: probability of sample not possible to determine
        + E.g. volunteer sampling
* Response bias
    + E.g. poorly worded, confusing or even order of questions
    + Lying if think socially unacceptable
* Non-response bias
    + Non-response rate high; systematic in non-responses (age, health, believes)

## Example of sample bias: United States presidential election, 1936

(Based on Agresti, [this](https://en.wikipedia.org/wiki/United_States_presidential_election,_1936) and [this](https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case1.html).)

* Current president: Franklin D. Roosevelt
* Election: Franklin D. Roosevelt vs Alfred Landon (Republican governor of Kansas)
* Literary Digest: magazine with history of accurately predicting winner of past 5 presidential elections

----

### Results

* Literary Digest poll: Landon: 57%; Roosevelt: 43%
* Actual results: Landon: 38%; Roosevelt: 62%
* Sampling error: 57%-38% = 19%
    + Practically all of the sampling error was the result of **sample bias**
    + Poll size of > 2 mio. individuals participated -- extremely large poll

----

### Problems (biases)

* Mailing list of about 10 mio. names was created
    + Based on every telephone directory, lists of magasine subscribers, rosters of clubs and associations, and other sources
    + Each one of 10 mio. received a mock ballot and asked to return the marked ballot to the magazine
* "respondents who returned their questionnaires represented only that subset of the population with a relatively intense interest in the subject at hand, and as such constitute in no sense a random sample ... it seems clear that the minority of anti-Roosevelt voters felt more strongly about the election than did the pro-Roosevelt majority" (*The American Statistician*, 1976)
* Biases:
    + Sample bias
        - List generated towards middle- and upper-class voters (e.g. 1936 and telephones)
        - Many unemployed (club memberships and magazine subscribers)
    + Non-response bias
        - Only responses from 2.3/2.4 mio out of 10 million people

## Example of response bias: Wording matters

New York Times/CBS News poll on attitude to increased fuel taxes

* "Are you in favour of a new gasoline tax?" - 12% said yes.
* "Are you in favour of a new gasoline tax to decrease US dependency on
  foreign oil?" - 55% said yes.
* "Do you think a new gas tax would help to reduce global warming?" -
  59% said yes.
  
## Example of response bias: Order of questions matter

US study during cold war asked two questions:

1 "Do you think that US should let Russian newspaper reporters come
  here and sent back whatever they want?"
  
2 "Do you think that Russia should let American newspaper reporters come
  in and sent back whatever they want?"

The percentage of yes to question 1 was 36%, if it was asked
first and 73%, when it was asked last.

## Example of survivior bias: Bullet holes of honor

(Based on [this](https://www.motherjones.com/kevin-drum/2010/09/counterintuitive-world/).)

* World War II
* Royal Air Force (RAF), UK
    + Lost many planes to German anti-aircraft fire
* Armor up!
    + Where?
    + Count up all the bullet holes in planes that returned from missions
        - Put extra armor in the areas that attracted the most fire

----

* Hungarian-born mathematician Abraham Wald:
    + If a plane makes it back safely with a bunch of bullet holes in its wings: holes in the wings aren't very dangerous
        - **Survivorship bias**
    + Armor up the areas that (on average) don't have any bullet holes
        - They never make it back, apparently dangerous
        
    |Section of plane$\quad$| Bullet holes per square foot |
    |:----------------------|:----------------------------:|
    |Engine                 |                        $1.11$|
    |Fuselage               |                        $1.73$|
    |Fuel system            |                        $1.55$|
    |Rest of the plane      |                        $1.80$|
        
```{r, fig.width=10, echo=FALSE, fig.align='center', eval = FALSE}
# was ![](https://asta.math.aau.dk/static-files/asta/img/bulletHoles.jpeg)
url <- "https://asta.math.aau.dk/static-files/asta/img/bulletHoles.jpeg"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(jpeg::readJPEG(z))
unlink(z)
```

(See also [this xkcd](https://xkcd.com/1827/))