---
title: "Data collection 1/2"
author: "The ASTA team"
output:
  slidy_presentation:
    css: "https://asta.math.aau.dk/course/asta/2018-2/?file=lecture_style.css"
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_section: yes
    toc: yes
---

```{r, include = FALSE}
options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "grid", "png"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```


# Data collection

## Motivation

Case

## Data collection

* Getting numbers to report is easy
* Getting sensible and trustworthy numbers to report is orders of magnitude more difficult
* Why important?
    + Difference between meaningless analysis and useful analysis
        + Effect of drugs
        + Economy
        + Sales
        + Climate
        + Energy consumption

## Data collection

Ronald Fisher (1890-1962):

<div style="font-size: 120%">
> To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
</div>

Said about Fisher:

  * Anders Hald (1913-2007), Danish statistician: "*a genius who almost single-handedly created the foundations for modern statistical science*"
  * Bradley Efron (b. 1938): "*the single most important figure in 20th century statistics*"

## Data collection

* Competences, ideally:
    + Statistics, both conceptually and analyses
    + Data wrangling (loading data; right format for analyses, tables, figures; ...)
    + Visualizations
    + Knowledge about subject (best with access to experts)
* Not just downloading a spreadsheet!
    + Population vs sample
    + Descriptives of the sample (e.g. mean)
    + Statistical inference about population (how close is sample's mean to population's mean)
* Do collect and analyze data, but know about pitfalls and limitations in generalisability!

# Population and sample

##  Population and sample

```{r, fig.width=10, echo=FALSE, fig.align='center'}
url <- "https://asta.math.aau.dk/static-files/asta/img/fig-H1-001.png"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(png::readPNG(z))
invisible(file.remove(z))
```

Sample 3 of size $n = 30$:


|shape |color | n_sample| p_sample| p_pop| p_diff|
|:-----|:------|--------:|--------:|-----:|------:|
|baby  |black  |        2|     0.07|  0.04|  -0.02|
|baby  |blue   |        1|     0.03|  0.04|   0.01|
|baby  |red    |        0|     0.00|  0.01|   0.01|
|man   |black  |        5|     0.17|  0.12|  -0.04|
|man   |blue   |        8|     0.27|  0.22|  -0.04|
|man   |red    |        3|     0.10|  0.08|  -0.02|
|woman |black  |        3|     0.10|  0.23|   0.13|
|woman |blue   |        8|     0.27|  0.22|  -0.05|
|woman |red    |        0|     0.00|  0.02|   0.02|

* Descriptive vs statistical inference.


##  Population and sample

```{r, fig.width=10, echo=FALSE, fig.align='center'}
url <- "https://asta.math.aau.dk/static-files/asta/img/fig-H1-002.png"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(png::readPNG(z))
invisible(file.remove(z))
```


# Example: United States presidential election, 1936

## Example: United States presidential election, 1936

(Based on Agresti, [this](https://en.wikipedia.org/wiki/United_States_presidential_election,_1936) and [this](https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case1.html).)

* Current president: Franklin D. Roosevelt
* Election: Franklin D. Roosevelt vs Alfred Landon (Republican governor of Kansas)
* Literary Digest: magazine with history of accurately predicting winner of past 5 presidential elections

## Example: United States presidential election, 1936

* Literary Digest poll ($\hat{\pi}$ and $1-\hat{\pi}$): Landon: 57%; Roosevelt: 43%
* Actual results ($\pi$ and $1-\pi$): Landon: 38%; Roosevelt: 62%
* Sampling error: 57%-38% = 19%
    + Practically all of the sampling error was the result of **sample bias**
    + Poll size of > 2 mio. individuals participated -- extremely large poll

## Example: United States presidential election, 1936

* Mailing list of about 10 mio. names was created
    + Based on every telephone directory, lists of magasine subscribers, rosters of clubs and associations, and other sources
    + Each one of 10 mio. received a mock ballot and asked to return the marked ballot to the magazine
* "respondents who returned their questionnaires represented only that subset of the population with a relatively intense interest in the subject at hand, and as such constitute in no sense a random sample ... it seems clear that the minority of anti-Roosevelt voters felt more strongly about the election than did the pro-Roosevelt majority" (*The American Statistician*, 1976)
* Biases:
    + Selection bias
        - List generated towards middle- and upper-class voters (e.g. 1936 and telephones)
        - Many unemployed (club memberships and magazine subscribers)
    + Non-response bias
        - Only responses from 2.3/2.4 mio out of 10 million people
        - Cannot force people to participate: but mail may be junk (phone, interviews, online, pay/paid, ...)

# Example: Bullet holes of honor

## Example: Bullet holes of honor

(Based on [this](https://www.motherjones.com/kevin-drum/2010/09/counterintuitive-world/).)

* World War II
* Royal Air Force (RAF), UK
    + Lost many planes to German anti-aircraft fire
* Armor up!
    + Where?
    + Count up all the bullet holes in planes that returned from missions
        - Put extra armor in the areas that attracted the most fire

## Example: Bullet holes of honor

* Hungarian-born mathematician Abraham Wald:
    + If a plane makes it back safely with a bunch of bullet holes in its wings: holes in the wings aren't very dangerous
        - **Survivorship bias**
    + Armor up the areas that (on average) don't have any bullet holes
        - They never make it back, apparently dangerous

# Theory: Biases / sampling

## Biases

Agresti section 2.3:

* Sampling/selection bias
    + Probability sampling: each sample of size $n$ has same probability of being sampled
        + Still problems: undercoverage, groups not represented (inmates, homeless, hospitalized, ...)
    + Non-probability sampling: probability of sample not possible to determine
        + E.g. volunteer sampling
* Response bias
    + E.g. poorly worded, confusing or even order of questions
    + Lying if think socially unacceptable
+ Non-response bias
    + Non-response rate high; systematic in non-responses (age, health, believes)


## Sampling

Agresti section 2.4:

* Random sampling schemes:
    + Simple sampling: each possible sample equally probable
    + Systematic sampling
    + Stratified sampling
    + Cluster sampling
    + Multistage sampling
    + ...

# Theory: Contingency tables

## A contingency table

* We return to the dataset `popularKids`, where  we study **association** between 2 **factors**: `Goals` and `Urban.Rural`.
* Based on a sample we make a cross tabulation of the factors and we get a so-called **contingency table** (*krydstabel*).

```{r message=FALSE}
popKids <- read.delim("https://asta.math.aau.dk/datasets?file=PopularKids.txt")
library(mosaic)
tab <- tally(~Urban.Rural + Goals, data = popKids, margins = TRUE)
tab
``` 

## A conditional distribution

* Another representation of data is the percent-wise distribution of `Goals` for each level of `Urban.Rural`, i.e.\ the sum in each row of the table is $100$ (up to rounding):
```{r}
tab <- tally(~Urban.Rural + Goals, data = popKids)
addmargins(round(100 * prop.table(tab, 1)),margin = 1:2)
``` 
* Here we will talk about the **conditional distribution** of `Goals` given `Urban.Rural`.
* An important question could be:
    + Are the goals of the kids different when they come from urban, suburban or rural areas? I.e.\ are the rows in the table significantly different?
* There is (almost) no difference between urban and suburban, but it looks like rural is different.

# Independence
## Independence

* Recall, that two factors are **independent**, when there is no difference between the population's distributions of one factor given the levels of the other factor.
* Otherwise the factors are said to be **dependent**.
* If we e.g. have the following conditional **population distributions** of `Goals` given `Urban.Rural`:

```{r echo = FALSE}
dat <- data.frame(Goals = rep(c(rep("Grades", 5), rep("Popular", 3), rep("Sports", 2)), 3))
dat$Urban.Rural <- rep(c("Rural", "Suburban", "Urban"), each = 10)
100 * tally(~Urban.Rural + Goals, data = dat)
``` 

* Then the factors `Goals` and `Urban.Rural` are independent.
* We take a sample and "measure" the factors $F_1$ and $F_2$. E.g.  `Goals` and `Urban.Rural` for a random child. 
* The hypothesis of interest today is: $$H_0: F_1 \mbox{ and } F_2 \mbox{ are independent, } \quad  H_a: F_1 \mbox{ and } F_2 \mbox{ are dependent.}$$

## The Chi-squared test for independence

* Our best guess of the distribution of `Goals` is the relative frequencies in the sample:
```{r}
n <- margin.table(tab)
pctGoals <- round(100 * margin.table(tab, 2)/n, 1)
pctGoals
``` 

* If we assume independence, then this is also a guess of the conditional distributions of `Goals` given `Urban.Rural`. 
* The corresponding expected counts in the sample are then:
```{r echo=FALSE}
test <- chisq.test(tab, correct = FALSE)
exptab <- as.table(format(round(addmargins(test$expected), 1)))
pctvec <- paste("(",c(rep(paste(pctGoals),each = 4), rep("100", 4)), "%)", sep = "")
pctexptab <- as.table(matrix(paste(exptab, pctvec), 4, 4, dimnames = dimnames(exptab)))
pctexptab
``` 

## Calculation of expected table

```{r}
pctexptab
``` 

* We note that
    + The relative frequency for a given column is columnTotal divided by tableTotal. For example `Grades`, which is $\frac{247}{478}=51.7\%$.
    + The expected value in a given cell in the table is then the cell's relative column frequency multiplied by the cell's rowTotal. For example `Rural` and `Grades`: $149\times 51.7\%=77.0$.

* This can be summarized to:
    + The expected value in a cell is the product of the cell's rowTotal and columnTotal divided by tableTotal.

## Chi-squared ($\chi^2$) test statistic

* We have an **observed table**:
```{r}
tab
``` 

* And an **expected table**, if $H_0$ is true:

```{r echo=FALSE,size="footnotesize"}
exptab
``` 

* If these tables are "far from each other", then we reject $H_0$.  We want to measure the distance via the Chi-squared test statistic:
    + $X^2=\sum\frac{(f_o-f_e)^2}{f_e}$: Sum over all cells in the table
    + $f_o$ is the frequency in a cell in the observed table
    + $f_e$ is the corresponding frequency in the expected table.
* We have: $$X^2_{obs} = \frac{(57-77)^2}{77} + \ldots + \frac{(26-33.5)^2}{33.5}=18.8$$
* Is this a large distance??

## $\chi^2$-test template.

* We want to test the hypothesis $H_0$ of independence in a table with $r$ rows and $c$ columns:
    + We take a sample and calculate $X^2_{obs}$ - the observed value of the test statistic.
    + p-value: Assume $H_0$ is true. What is then the chance of obtaining a larger $X^2$ than $X^2_{obs}$, if we repeat the experiment?

* This can be approximated by the $\mathbf{\chi^2}$-**distribution** with $df=(r-1)(c-1)$ degrees of freedom.
* For `Goals` and `Urban.Rural` we have $r=c=3$, i.e.\ $df=4$ and $X^2_{obs}=18.8$, so the p-value is:
```{r}
1 - pdist("chisq", 18.8, df = 4)
``` 

* There is clearly a significant association between `Goals` and `Urban.Rural`.

## The function `chisq.test`.

* All of the above calculations can be obtained by the function `chisq.test`.
```{r}
tab <- tally(~ Urban.Rural + Goals, data = popKids)
testStat <- chisq.test(tab, correct = FALSE)
testStat
testStat$expected
```

----

* The frequency data can also be put directly into a matrix.
```{r}
data <- c(57, 87, 103, 50, 42, 49, 42, 22, 26)
tab <- matrix(data, nrow = 3, ncol = 3)
row.names(tab) <- c("Rural", "Suburban", "Urban")
colnames(tab) <- c("Grades", "Popular", "Sports")
tab
chisq.test(tab)
```

# The $\chi^2$-distribution
## The $\chi^2$-distribution

* The $\chi^2$-distribution with $df$ degrees of freedom:
    + Is never negative. And $X^2=0$ only happens if $f_e = f_o$.
    + Has mean $\mu=df$
    + Has standard deviation $\sigma=\sqrt{2df}$
    + Is skewed to the right, but approaches a normal distribution when $df$ grows.

```{r dchisq,echo=FALSE}
x <- seq(0, 40, length = 1000)
dfs <- c(1, 5, 10, 20)
plot(x, dchisq(x, df = dfs[2]), col = 2, type = "l", lwd = 2, xlab = "", ylab = "")
lines(x, dchisq(x, df = dfs[1]), col = 1, lwd = 2)
lines(x, dchisq(x, df = dfs[3]), col = 3, lwd = 2)
lines(x, dchisq(x, df = dfs[4]), col = 4, lwd = 2)
legend("topright", legend = paste("df =", dfs), col = 1:4, lwd = 2)
``` 

# Agresti - Summary
## Summary

* For the the Chi-squared statistic, $X^2$, to be appropriate we require that the expected values have to be $f_e\geq 5$. 
* Now we can summarize the ingredients in the Chi-squared test for independence. 

```{r, fig.width=10, echo=FALSE, fig.align='center'}
# was ![](https://asta.math.aau.dk/static-files/asta/img/AGR.jpg)
url <- "https://asta.math.aau.dk/static-files/asta/img/AGR.jpg"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(jpeg::readJPEG(z))
invisible(file.remove(z))
```


# Standardized residuals
## Residual analysis

* If we reject the hypothesis of independence it can be of interest to identify the significant deviations.
* In a given cell in the table, $f_o-f_e$ is the deviation between data and the expected values under the null hypothesis.
* We assume that $f_e\geq 5$.
* If $H_0$ is true, then the standard error of $f_o - f_e$ is given by $$se=\sqrt{f_e (1-\mbox{rowProportion}) (1-\mbox{columnProportion})}$$
* The corresponding $z$-score $$z=\frac{f_o-f_e}{se}$$ should in 95\% of the cells be between $\pm 2$. Values above 3 or below -3 should not appear. 
* In popKids table cell `Rural and Grade` we got $f_e = 77.0$ and $f_o = 57$. Here columnProportion$=51.7\%$ and rowProportion$=149/478=31.2\%$.
* We can then calculate $$z = \frac{57-77}{\sqrt{77(1-0.517)(1-0.312)}} = -3.95$$. 
* Compared to the null hypothesis there are way too few rural kids who find grades important.
* In summary: The standardized residuals allow for cell-by-cell ($f_e$ vs $f_o$) comparison.

## Residual analysis in `R`

* In `R` we can extract the standardized residuals from the output of `chisq.test`:
```{r}
tab <- tally(~ Urban.Rural + Goals, data = popKids)
testStat <- chisq.test(tab, correct = FALSE)
testStat$stdres
```

# Collecting data

## Sources

* Open data
* Questionnaires
    + Google Analyse
    + SurveyXact? 
* User panels (often online)
* ...

# Important take-home messages

## Important take-home messages

* Population vs sample:
    + What is the population?
    + Is the entire population known -- is statistics at all needed?
* Sampling
    + Sampling strategy must ensure random sampling
        - Difficult to investigate it afterwards
    + Convenience sampling often used, dangerous!
    + Be honest with yourself, describe problems: Is the sample representative for the target group/population/market segment/...?
* Badly chosen big sample is much worse than a well-chosen small sample
* Watch out for biases
    + Sample/selection bias
    + Response bias
    + Non-response bias
    + (Survivorship bias)
* Data collection
    + Privacy vs necessary information (< 50 or >= 50, age in years, birth date)
