--- title: "Data collection 1/2" author: "The ASTA team" output: slidy_presentation: css: "https://asta.math.aau.dk/course/asta/2018-2/?file=lecture_style.css" fig_caption: no highlight: tango theme: cerulean pdf_document: fig_caption: no highlight: tango number_section: yes toc: yes --- ```{r, include = FALSE} options(digits = 2) ## Remember to add all packages used in the code below! missing_pkgs <- setdiff(c("mosaic", "grid", "png"), rownames(installed.packages())) if(length(missing_pkgs)>0) install.packages(missing_pkgs) ``` # Data collection ## Motivation Case ## Data collection * Getting numbers to report is easy * Getting sensible and trustworthy numbers to report is orders of magnitude more difficult * Why important? + Difference between meaningless analysis and useful analysis + Effect of drugs + Economy + Sales + Climate + Energy consumption ## Data collection Ronald Fisher (1890-1962):

> To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

Said about Fisher: * Anders Hald (1913-2007), Danish statistician: "*a genius who almost single-handedly created the foundations for modern statistical science*" * Bradley Efron (b. 1938): "*the single most important figure in 20th century statistics*" ## Data collection * Competences, ideally: + Statistics, both conceptually and analyses + Data wrangling (loading data; right format for analyses, tables, figures; ...) + Visualizations + Knowledge about subject (best with access to experts) * Not just downloading a spreadsheet! + Population vs sample + Descriptives of the sample (e.g. mean) + Statistical inference about population (how close is sample's mean to population's mean) * Do collect and analyze data, but know about pitfalls and limitations in generalisability! # Population and sample ## Population and sample ```{r, fig.width=10, echo=FALSE, fig.align='center'} url <- "https://asta.math.aau.dk/static-files/asta/img/fig-H1-001.png" z <- tempfile() download.file(url, z, mode = "wb") grid::grid.raster(png::readPNG(z)) invisible(file.remove(z)) ``` Sample 3 of size $n = 30$: |shape |color | n_sample| p_sample| p_pop| p_diff| |:-----|:------|--------:|--------:|-----:|------:| |baby |black | 2| 0.07| 0.04| -0.02| |baby |blue | 1| 0.03| 0.04| 0.01| |baby |red | 0| 0.00| 0.01| 0.01| |man |black | 5| 0.17| 0.12| -0.04| |man |blue | 8| 0.27| 0.22| -0.04| |man |red | 3| 0.10| 0.08| -0.02| |woman |black | 3| 0.10| 0.23| 0.13| |woman |blue | 8| 0.27| 0.22| -0.05| |woman |red | 0| 0.00| 0.02| 0.02| * Descriptive vs statistical inference. ## Population and sample ```{r, fig.width=10, echo=FALSE, fig.align='center'} url <- "https://asta.math.aau.dk/static-files/asta/img/fig-H1-002.png" z <- tempfile() download.file(url, z, mode = "wb") grid::grid.raster(png::readPNG(z)) invisible(file.remove(z)) ``` # Example: United States presidential election, 1936 ## Example: United States presidential election, 1936 (Based on Agresti, [this](https://en.wikipedia.org/wiki/United_States_presidential_election,_1936) and [this](https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case1.html).) * Current president: Franklin D. Roosevelt * Election: Franklin D. Roosevelt vs Alfred Landon (Republican governor of Kansas) * Literary Digest: magazine with history of accurately predicting winner of past 5 presidential elections ## Example: United States presidential election, 1936 * Literary Digest poll ($\hat{\pi}$ and $1-\hat{\pi}$): Landon: 57%; Roosevelt: 43% * Actual results ($\pi$ and $1-\pi$): Landon: 38%; Roosevelt: 62% * Sampling error: 57%-38% = 19% + Practically all of the sampling error was the result of **sample bias** + Poll size of > 2 mio. individuals participated -- extremely large poll ## Example: United States presidential election, 1936 * Mailing list of about 10 mio. names was created + Based on every telephone directory, lists of magasine subscribers, rosters of clubs and associations, and other sources + Each one of 10 mio. received a mock ballot and asked to return the marked ballot to the magazine * "respondents who returned their questionnaires represented only that subset of the population with a relatively intense interest in the subject at hand, and as such constitute in no sense a random sample ... it seems clear that the minority of anti-Roosevelt voters felt more strongly about the election than did the pro-Roosevelt majority" (*The American Statistician*, 1976) * Biases: + Selection bias - List generated towards middle- and upper-class voters (e.g. 1936 and telephones) - Many unemployed (club memberships and magazine subscribers) + Non-response bias - Only responses from 2.3/2.4 mio out of 10 million people - Cannot force people to participate: but mail may be junk (phone, interviews, online, pay/paid, ...) # Example: Bullet holes of honor ## Example: Bullet holes of honor (Based on [this](https://www.motherjones.com/kevin-drum/2010/09/counterintuitive-world/).) * World War II * Royal Air Force (RAF), UK + Lost many planes to German anti-aircraft fire * Armor up! + Where? + Count up all the bullet holes in planes that returned from missions - Put extra armor in the areas that attracted the most fire ## Example: Bullet holes of honor * Hungarian-born mathematician Abraham Wald: + If a plane makes it back safely with a bunch of bullet holes in its wings: holes in the wings aren't very dangerous - **Survivorship bias** + Armor up the areas that (on average) don't have any bullet holes - They never make it back, apparently dangerous # Theory: Biases / sampling ## Biases Agresti section 2.3: * Sampling/selection bias + Probability sampling: each sample of size $n$ has same probability of being sampled + Still problems: undercoverage, groups not represented (inmates, homeless, hospitalized, ...) + Non-probability sampling: probability of sample not possible to determine + E.g. volunteer sampling * Response bias + E.g. poorly worded, confusing or even order of questions + Lying if think socially unacceptable + Non-response bias + Non-response rate high; systematic in non-responses (age, health, believes) ## Sampling Agresti section 2.4: * Random sampling schemes: + Simple sampling: each possible sample equally probable + Systematic sampling + Stratified sampling + Cluster sampling + Multistage sampling + ... # Theory: Contingency tables ## A contingency table * We return to the dataset `popularKids`, where we study **association** between 2 **factors**: `Goals` and `Urban.Rural`. * Based on a sample we make a cross tabulation of the factors and we get a so-called **contingency table** (*krydstabel*). ```{r message=FALSE} popKids <- read.delim("https://asta.math.aau.dk/datasets?file=PopularKids.txt") library(mosaic) tab <- tally(~Urban.Rural + Goals, data = popKids, margins = TRUE) tab ``` ## A conditional distribution * Another representation of data is the percent-wise distribution of `Goals` for each level of `Urban.Rural`, i.e.\ the sum in each row of the table is $100$ (up to rounding): ```{r} tab <- tally(~Urban.Rural + Goals, data = popKids) addmargins(round(100 * prop.table(tab, 1)),margin = 1:2) ``` * Here we will talk about the **conditional distribution** of `Goals` given `Urban.Rural`. * An important question could be: + Are the goals of the kids different when they come from urban, suburban or rural areas? I.e.\ are the rows in the table significantly different? * There is (almost) no difference between urban and suburban, but it looks like rural is different. # Independence ## Independence * Recall, that two factors are **independent**, when there is no difference between the population's distributions of one factor given the levels of the other factor. * Otherwise the factors are said to be **dependent**. * If we e.g. have the following conditional **population distributions** of `Goals` given `Urban.Rural`: ```{r echo = FALSE} dat <- data.frame(Goals = rep(c(rep("Grades", 5), rep("Popular", 3), rep("Sports", 2)), 3)) dat$Urban.Rural <- rep(c("Rural", "Suburban", "Urban"), each = 10) 100 * tally(~Urban.Rural + Goals, data = dat) ``` * Then the factors `Goals` and `Urban.Rural` are independent. * We take a sample and "measure" the factors $F_1$ and $F_2$. E.g. `Goals` and `Urban.Rural` for a random child. * The hypothesis of interest today is: $$H_0: F_1 \mbox{ and } F_2 \mbox{ are independent, } \quad H_a: F_1 \mbox{ and } F_2 \mbox{ are dependent.}$$ ## The Chi-squared test for independence * Our best guess of the distribution of `Goals` is the relative frequencies in the sample: ```{r} n <- margin.table(tab) pctGoals <- round(100 * margin.table(tab, 2)/n, 1) pctGoals ``` * If we assume independence, then this is also a guess of the conditional distributions of `Goals` given `Urban.Rural`. * The corresponding expected counts in the sample are then: ```{r echo=FALSE} test <- chisq.test(tab, correct = FALSE) exptab <- as.table(format(round(addmargins(test$expected), 1))) pctvec <- paste("(",c(rep(paste(pctGoals),each = 4), rep("100", 4)), "%)", sep = "") pctexptab <- as.table(matrix(paste(exptab, pctvec), 4, 4, dimnames = dimnames(exptab))) pctexptab ``` ## Calculation of expected table ```{r} pctexptab ``` * We note that + The relative frequency for a given column is columnTotal divided by tableTotal. For example `Grades`, which is $\frac{247}{478}=51.7\%$. + The expected value in a given cell in the table is then the cell's relative column frequency multiplied by the cell's rowTotal. For example `Rural` and `Grades`: $149\times 51.7\%=77.0$. * This can be summarized to: + The expected value in a cell is the product of the cell's rowTotal and columnTotal divided by tableTotal. ## Chi-squared ($\chi^2$) test statistic * We have an **observed table**: ```{r} tab ``` * And an **expected table**, if $H_0$ is true: ```{r echo=FALSE,size="footnotesize"} exptab ``` * If these tables are "far from each other", then we reject $H_0$. We want to measure the distance via the Chi-squared test statistic: + $X^2=\sum\frac{(f_o-f_e)^2}{f_e}$: Sum over all cells in the table + $f_o$ is the frequency in a cell in the observed table + $f_e$ is the corresponding frequency in the expected table. * We have: $$X^2_{obs} = \frac{(57-77)^2}{77} + \ldots + \frac{(26-33.5)^2}{33.5}=18.8$$ * Is this a large distance?? ## $\chi^2$-test template. * We want to test the hypothesis $H_0$ of independence in a table with $r$ rows and $c$ columns: + We take a sample and calculate $X^2_{obs}$ - the observed value of the test statistic. + p-value: Assume $H_0$ is true. What is then the chance of obtaining a larger $X^2$ than $X^2_{obs}$, if we repeat the experiment? * This can be approximated by the $\mathbf{\chi^2}$-**distribution** with $df=(r-1)(c-1)$ degrees of freedom. * For `Goals` and `Urban.Rural` we have $r=c=3$, i.e.\ $df=4$ and $X^2_{obs}=18.8$, so the p-value is: ```{r} 1 - pdist("chisq", 18.8, df = 4) ``` * There is clearly a significant association between `Goals` and `Urban.Rural`. ## The function `chisq.test`. * All of the above calculations can be obtained by the function `chisq.test`. ```{r} tab <- tally(~ Urban.Rural + Goals, data = popKids) testStat <- chisq.test(tab, correct = FALSE) testStat testStat$expected ``` ---- * The frequency data can also be put directly into a matrix. ```{r} data <- c(57, 87, 103, 50, 42, 49, 42, 22, 26) tab <- matrix(data, nrow = 3, ncol = 3) row.names(tab) <- c("Rural", "Suburban", "Urban") colnames(tab) <- c("Grades", "Popular", "Sports") tab chisq.test(tab) ``` # The $\chi^2$-distribution ## The $\chi^2$-distribution * The $\chi^2$-distribution with $df$ degrees of freedom: + Is never negative. And $X^2=0$ only happens if $f_e = f_o$. + Has mean $\mu=df$ + Has standard deviation $\sigma=\sqrt{2df}$ + Is skewed to the right, but approaches a normal distribution when $df$ grows. ```{r dchisq,echo=FALSE} x <- seq(0, 40, length = 1000) dfs <- c(1, 5, 10, 20) plot(x, dchisq(x, df = dfs[2]), col = 2, type = "l", lwd = 2, xlab = "", ylab = "") lines(x, dchisq(x, df = dfs[1]), col = 1, lwd = 2) lines(x, dchisq(x, df = dfs[3]), col = 3, lwd = 2) lines(x, dchisq(x, df = dfs[4]), col = 4, lwd = 2) legend("topright", legend = paste("df =", dfs), col = 1:4, lwd = 2) ``` # Agresti - Summary ## Summary * For the the Chi-squared statistic, $X^2$, to be appropriate we require that the expected values have to be $f_e\geq 5$. * Now we can summarize the ingredients in the Chi-squared test for independence. ```{r, fig.width=10, echo=FALSE, fig.align='center'} # was ![](https://asta.math.aau.dk/static-files/asta/img/AGR.jpg) url <- "https://asta.math.aau.dk/static-files/asta/img/AGR.jpg" z <- tempfile() download.file(url, z, mode = "wb") grid::grid.raster(jpeg::readJPEG(z)) invisible(file.remove(z)) ``` # Standardized residuals ## Residual analysis * If we reject the hypothesis of independence it can be of interest to identify the significant deviations. * In a given cell in the table, $f_o-f_e$ is the deviation between data and the expected values under the null hypothesis. * We assume that $f_e\geq 5$. * If $H_0$ is true, then the standard error of $f_o - f_e$ is given by $$se=\sqrt{f_e (1-\mbox{rowProportion}) (1-\mbox{columnProportion})}$$ * The corresponding $z$-score $$z=\frac{f_o-f_e}{se}$$ should in 95\% of the cells be between $\pm 2$. Values above 3 or below -3 should not appear. * In popKids table cell `Rural and Grade` we got $f_e = 77.0$ and $f_o = 57$. Here columnProportion$=51.7\%$ and rowProportion$=149/478=31.2\%$. * We can then calculate $$z = \frac{57-77}{\sqrt{77(1-0.517)(1-0.312)}} = -3.95$$. * Compared to the null hypothesis there are way too few rural kids who find grades important. * In summary: The standardized residuals allow for cell-by-cell ($f_e$ vs $f_o$) comparison. ## Residual analysis in `R` * In `R` we can extract the standardized residuals from the output of `chisq.test`: ```{r} tab <- tally(~ Urban.Rural + Goals, data = popKids) testStat <- chisq.test(tab, correct = FALSE) testStat$stdres ``` # Collecting data ## Sources * Open data * Questionnaires + Google Analyse + SurveyXact? * User panels (often online) * ... # Important take-home messages ## Important take-home messages * Population vs sample: + What is the population? + Is the entire population known -- is statistics at all needed? * Sampling + Sampling strategy must ensure random sampling - Difficult to investigate it afterwards + Convenience sampling often used, dangerous! + Be honest with yourself, describe problems: Is the sample representative for the target group/population/market segment/...? * Badly chosen big sample is much worse than a well-chosen small sample * Watch out for biases + Sample/selection bias + Response bias + Non-response bias + (Survivorship bias) * Data collection + Privacy vs necessary information (< 50 or >= 50, age in years, birth date)