---
title: "Data collection and wrangling"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_sections: yes
    toc: yes
---

```{r, include = FALSE}
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "png"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# Data collection

## Data collection

* Getting numbers to report is easy
* Getting sensible and trustworthy numbers to report is orders of magnitude more difficult

Ronald Fisher (1890-1962):

<div style="font-size: 120%">
> To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
</div>

Said about Fisher:

  * Anders Hald (1913-2007), Danish statistician: "*a genius who almost single-handedly created the foundations for modern statistical science*"
  * Bradley Efron (b. 1938): "*the single most important figure in 20th century statistics*"

## Data collection

* Competences, ideally:
    + Statistics, both concepts and analysis
    + Data wrangling (loading data; right format for analyses, tables, figures; ...)
    + Visualizations
    + Knowledge about subject (best with access to experts)
* Not just downloading a spreadsheet!
    + Population vs sample
    + Descriptives of the sample (e.g. mean)
    + Statistical inference about population (how close is the sample mean to the population mean?)
* Do collect and analyze data, but know about pitfalls and limitations in generalisability!

# Population and sample

## Population and sample

```{r, fig.width=10, echo=FALSE, fig.align='center'}
url <- "https://asta.math.aau.dk/static-files/asta/img/fig-H1-001.png"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(png::readPNG(z))
unlink(z)
```

Sample 3 of size $n = 30$:


|shape |color | n_sample| p_sample| p_pop| p_diff|
|:-----|:------|--------:|--------:|-----:|------:|
|baby  |black  |        2|     0.07|  0.04|  -0.02|
|baby  |blue   |        1|     0.03|  0.04|   0.01|
|baby  |red    |        0|     0.00|  0.01|   0.01|
|man   |black  |        5|     0.17|  0.12|  -0.04|
|man   |blue   |        8|     0.27|  0.22|  -0.04|
|man   |red    |        3|     0.10|  0.08|  -0.02|
|woman |black  |        3|     0.10|  0.23|   0.13|
|woman |blue   |        8|     0.27|  0.22|  -0.05|
|woman |red    |        0|     0.00|  0.02|   0.02|

* Descriptive statistics vs statistical inference.
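
The comparison in the table can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the population proportions are the rounded `p_pop` values from the table (renormalised, since the rounded values sum to 0.98), the seed is arbitrary, and the draw does not reproduce Sample 3 itself.

```{r}
set.seed(1)  # arbitrary seed, only for reproducibility

groups <- c("baby/black", "baby/blue", "baby/red",
            "man/black", "man/blue", "man/red",
            "woman/black", "woman/blue", "woman/red")
# Assumption: rounded p_pop values copied from the table
p_pop <- c(0.04, 0.04, 0.01, 0.12, 0.22, 0.08, 0.23, 0.22, 0.02)
p_pop <- p_pop / sum(p_pop)  # rounded values sum to 0.98, so renormalise

n <- 30
# Sampling with replacement approximates drawing from a large population
draws <- sample(groups, size = n, replace = TRUE, prob = p_pop)
n_sample <- as.vector(table(factor(draws, levels = groups)))

result <- data.frame(group = groups,
                     n_sample = n_sample,
                     p_sample = n_sample / n,  # descriptive: what the sample shows
                     p_pop = p_pop)            # what we want to infer
result$p_diff <- result$p_pop - result$p_sample
print(result, digits = 2)
```

Rerunning with another seed gives a different sample and different `p_diff` values; that variation is what statistical inference has to account for.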

# Example: United States presidential election, 1936

## Example: United States presidential election, 1936

(Based on Agresti, [this](https://en.wikipedia.org/wiki/United_States_presidential_election,_1936) and [this](https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case1.html).)

* Current president: Franklin D. Roosevelt
* Election: Franklin D. Roosevelt vs Alfred Landon (Republican governor of Kansas)
* Literary Digest: magazine that had accurately predicted the winner of the past 5 presidential elections

## Example: United States presidential election, 1936

* Literary Digest poll ($\hat{\pi}$ and $1-\hat{\pi}$): Landon: 57%; Roosevelt: 43%
* Actual results ($\pi$ and $1-\pi$): Landon: 38%; Roosevelt: 62%
* Sampling error: 57% - 38% = 19 percentage points
    + Practically all of the sampling error was the result of **sample bias**
    + More than 2 mio. individuals participated, so the poll was extremely large
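
A rough calculation shows why random variation alone cannot explain the miss. The sketch assumes $n$ = 2 mio. responses (the slide only says "more than 2 mio.") and uses the standard error of a sample proportion under simple random sampling, $\sqrt{\pi(1-\pi)/n}$, which the poll did not actually satisfy.

```{r}
pi_true <- 0.38  # actual share of votes for Landon (from the slide)
pi_hat  <- 0.57  # Literary Digest estimate for Landon (from the slide)
n       <- 2e6   # assumption: number of responses ("more than 2 mio.")

# Standard error of a sample proportion if this had been a simple random sample
se <- sqrt(pi_true * (1 - pi_true) / n)
se

# Observed error measured in standard errors
(pi_hat - pi_true) / se
```

With a sample this large the standard error is well below 0.1 percentage points, so an error of 19 percentage points is hundreds of standard errors away from what chance would produce.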

## Example: United States presidential election, 1936

* Mailing list of about 10 mio. names was created
    + Based on every telephone directory, lists of magazine subscribers, rosters of clubs and associations, and other sources
    + Each of the 10 mio. received a mock ballot and was asked to return the marked ballot to the magazine
* "respondents who returned their questionnaires represented only that subset of the population with a relatively intense interest in the subject at hand, and as such constitute in no sense a random sample ... it seems clear that the minority of anti-Roosevelt voters felt more strongly about the election than did the pro-Roosevelt majority" (*The American Statistician*, 1976)
* Biases:
    + Selection bias
        - List skewed towards middle- and upper-class voters (e.g. owning a telephone in 1936)
        - Many unemployed at the time, underrepresented among club members and magazine subscribers
    + Non-response bias
        - Responses from only 2.3-2.4 mio. of the 10 mio. people
        - Cannot force people to participate, and mail may be treated as junk (alternatives: phone, interviews, online, paid participation, ...)
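
The effect of non-response bias can be illustrated with a small simulation. Everything except the 38% share for Landon is an assumption made for illustration: the mailing list is scaled down to 1 mio. names and taken to be representative (so selection bias is ignored), and the two response rates are hypothetical values chosen so that about 23% respond overall.

```{r}
set.seed(1936)  # arbitrary seed, only for reproducibility
N <- 1e6        # scaled-down mailing list (the real one had about 10 mio. names)

# TRUE = Landon voter; 38% as in the actual result
landon <- rbinom(N, size = 1, prob = 0.38) == 1

# Hypothetical response rates: anti-Roosevelt voters respond more often
p_respond <- ifelse(landon, 0.35, 0.16)
responded <- rbinom(N, size = 1, prob = p_respond) == 1

response_rate <- mean(responded)
pi_list <- mean(landon)             # share for Landon on the whole list
pi_hat  <- mean(landon[responded])  # share for Landon among respondents
c(response_rate = response_rate, pi_list = pi_list, pi_hat = pi_hat)
```

Under these assumed rates the respondents favour Landon even though the list as a whole does not, and increasing `N` does not remove the gap.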

# Example: Bullet holes of honor

## Example: Bullet holes of honor

(Based on [this](https://www.motherjones.com/kevin-drum/2010/09/counterintuitive-world/).)

* World War II
* Royal Air Force (RAF), UK
    + Lost many planes to German anti-aircraft fire
* Armor up!
    + Where?
    + Count up all the bullet holes in planes that returned from missions
        - Put extra armor in the areas that attracted the most fire

## Example: Bullet holes of honor

* Hungarian-born mathematician Abraham Wald:
    + If a plane makes it back safely with a bunch of bullet holes in its wings: holes in the wings aren't very dangerous
        - **Survivorship bias**
    + Armor up the areas that (on average) don't have any bullet holes
        - Planes hit there never make it back, so those areas are apparently the dangerous ones
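
Wald's argument can be mimicked with a toy simulation. The plane sections, the loss probabilities, and the rule that every plane takes exactly one uniformly placed hit are all hypothetical simplifications, not historical data.

```{r}
set.seed(42)  # arbitrary seed, only for reproducibility

sections <- c("engine", "cockpit", "fuselage", "wings")
# Hypothetical probability that a hit in a section brings the plane down
p_loss <- c(engine = 0.8, cockpit = 0.6, fuselage = 0.1, wings = 0.05)

n_planes <- 10000
# Simplification: every plane takes exactly one hit, uniformly over sections
hit  <- sample(sections, n_planes, replace = TRUE)
lost <- rbinom(n_planes, size = 1, prob = p_loss[hit]) == 1

# Share of hits per section: all planes vs only the planes that returned
all_hits  <- sapply(sections, function(s) mean(hit == s))
seen_hits <- sapply(sections, function(s) mean(hit[!lost] == s))
round(rbind(all_planes = all_hits, returned_only = seen_hits), 2)
```

Although hits are spread evenly over all planes, the returning planes show few engine hits and many wing hits, which is the pattern that misled the original armor plan.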

# Theory: Biases / sampling

## Biases

Agresti section 2.3:

* Sampling/selection bias
    + Probability sampling: the probability of each possible sample is known (e.g. simple random sampling, where each sample of size $n$ has the same probability of being sampled)
        - Still problems: undercoverage, groups not represented (inmates, homeless, hospitalized, ...)
    + Non-probability sampling: probability of sample not possible to determine
        - E.g. volunteer sampling
* Response bias
    + E.g. poorly worded or confusing questions, or even the order of questions
    + Respondents may lie if they think the true answer is socially unacceptable
* Non-response bias
    + High non-response rate; non-response may be systematic (age, health, beliefs)


## Sampling

Agresti section 2.4:

* Random sampling schemes:
    + Simple random sampling: each possible sample of a given size is equally probable
    + Systematic sampling
    + Stratified sampling
    + Cluster sampling
    + Multistage sampling
    + ...
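
The first four schemes can be sketched in base R. The sampling frame below (1000 units, 2 strata, 10 clusters) is hypothetical, as are the sample sizes; multistage sampling is left out, but it would combine the cluster step with a further sample inside each chosen cluster.

```{r}
set.seed(2)  # arbitrary seed, only for reproducibility

# Hypothetical sampling frame: 1000 units in 2 strata and 10 clusters
frame <- data.frame(id = 1:1000,
                    stratum = rep(c("A", "B"), times = c(800, 200)),
                    cluster = rep(1:10, each = 100))
n <- 50

# Simple random sampling: every subset of n units is equally likely
srs <- frame[sample(nrow(frame), n), ]

# Systematic sampling: random start, then every k-th unit
k <- nrow(frame) / n
start <- sample(k, 1)
sys <- frame[seq(start, nrow(frame), by = k), ]

# Stratified sampling: a simple random sample within each stratum
# (proportional allocation: 40 from A and 10 from B)
idx <- unlist(lapply(split(frame$id, frame$stratum),
                     function(ids) sample(ids, length(ids) * n / nrow(frame))))
strat <- frame[idx, ]

# Cluster sampling: sample whole clusters and keep all their units
chosen <- sample(unique(frame$cluster), 2)
clus <- frame[frame$cluster %in% chosen, ]

sapply(list(srs = srs, sys = sys, strat = strat, clus = clus), nrow)
```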

# Data wrangling

## Data wrangling

This will be illustrated with two specific cases.

The material is on Moodle.
