Data collection 1/2

The ASTA team

Data collection

Motivation

Case

Data collection

Ronald Fisher (1890-1962):

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

Said about Fisher:

Population and sample

Sample 3 of size \(n = 30\):

shape  color  n_sample  p_sample  p_pop  p_diff
baby   black         2      0.07   0.04   -0.02
baby   blue          1      0.03   0.04    0.01
baby   red           0      0.00   0.01    0.01
man    black         5      0.17   0.12   -0.04
man    blue          8      0.27   0.22   -0.04
man    red           3      0.10   0.08   -0.02
woman  black         3      0.10   0.23    0.13
woman  blue          8      0.27   0.22   -0.05
woman  red           0      0.00   0.02    0.02
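
A sample like this one can be imitated in R. The sketch below is not the code behind the table. It assumes that the rounded p_pop values can stand in for the population proportions (rescaled so they sum to 1) and that drawing with replacement is an acceptable approximation for a large population. The names groups, draw and comparison, and the seed, are made up for illustration.

```r
# Population proportions copied from the p_pop column of the table.
# They are rounded and sum to 0.98, so they are rescaled (an assumption).
groups <- c("baby black", "baby blue", "baby red",
            "man black", "man blue", "man red",
            "woman black", "woman blue", "woman red")
p_pop <- c(0.04, 0.04, 0.01, 0.12, 0.22, 0.08, 0.23, 0.22, 0.02)
p_pop <- p_pop / sum(p_pop)

set.seed(1)  # arbitrary seed, only to make the draw reproducible
n <- 30

# Draw n individuals at random; each group is picked with its population share
draw <- sample(groups, size = n, replace = TRUE, prob = p_pop)

# Count each group (keeping groups that were never drawn) and compare shares
n_sample <- as.numeric(table(factor(draw, levels = groups)))
p_sample <- n_sample / n
comparison <- data.frame(group = groups,
                         n_sample = n_sample,
                         p_sample = round(p_sample, 2),
                         p_pop = round(p_pop, 2),
                         p_diff = round(p_pop - p_sample, 2))
comparison
```

Changing the seed gives a different sample and different values of p_diff, which is the sample-to-sample variation the table illustrates.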

Example: United States presidential election, 1936

(Based on Agresti, this and this.)

Example: Bullet holes of honor

(Based on this.)

Theory: Biases / sampling

Biases

Agresti section 2.3:

Sampling

Agresti section 2.4:

Theory: Contingency tables

A contingency table

popKids <- read.delim("https://asta.math.aau.dk/datasets?file=PopularKids.txt")
library(mosaic)
tab <- tally(~Urban.Rural + Goals, data = popKids, margins = TRUE)
tab
##            Goals
## Urban.Rural Grades Popular Sports Total
##    Rural        57      50     42   149
##    Suburban     87      42     22   151
##    Urban       103      49     26   178
##    Total       247     141     90   478

A conditional distribution

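# Row percentages: the distribution of Goals within each Urban.Rural group.
# Values are rounded, so a row can sum to 101; the bottom Sum row adds
# percentages across groups and has no interpretation.
tab <- tally(~Urban.Rural + Goals, data = popKids)
addmargins(round(100 * prop.table(tab, 1)), margin = 1:2)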
##            Goals
## Urban.Rural Grades Popular Sports Sum
##    Rural        38      34     28 100
##    Suburban     58      28     15 101
##    Urban        58      28     15 101
##    Sum         154      90     58 302

Independence

##            Goals
## Urban.Rural Grades Popular Sports
##    Rural       500     300    200
##    Suburban    500     300    200
##    Urban       500     300    200
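
The table above can be checked for independence by comparing the conditional distributions of Goals across the rows. A minimal sketch, where the object names tab_indep and cond are chosen only for illustration:

```r
# Rebuild the table above: every row is 500, 300, 200
tab_indep <- matrix(rep(c(500, 300, 200), each = 3), nrow = 3,
                    dimnames = list(Urban.Rural = c("Rural", "Suburban", "Urban"),
                                    Goals = c("Grades", "Popular", "Sports")))

# Conditional distribution of Goals within each row
cond <- prop.table(tab_indep, 1)
cond
```

Every row gives the same proportions (0.5, 0.3, 0.2), so knowing the type of area says nothing about the goal.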

The Chi-squared test for independence

n <- margin.table(tab)
pctGoals <- round(100 * margin.table(tab, 2)/n, 1)
pctGoals
## Goals
##  Grades Popular  Sports 
##    51.7    29.5    18.8 

##            Goals
## Urban.Rural Grades      Popular     Sports      Sum       
##    Rural     77 (51.7%)  44 (29.5%)  28 (18.8%) 149 (100%)
##    Suburban  78 (51.7%)  44 (29.5%)  28 (18.8%) 151 (100%)
##    Urban     92 (51.7%)  52 (29.5%)  34 (18.8%) 178 (100%)
##    Sum      247 (51.7%) 141 (29.5%)  90 (18.8%) 478 (100%)

Calculation of expected table

pctexptab
##            Goals
## Urban.Rural Grades      Popular     Sports      Sum       
##    Rural     77 (51.7%)  44 (29.5%)  28 (18.8%) 149 (100%)
##    Suburban  78 (51.7%)  44 (29.5%)  28 (18.8%) 151 (100%)
##    Urban     92 (51.7%)  52 (29.5%)  34 (18.8%) 178 (100%)
##    Sum      247 (51.7%) 141 (29.5%)  90 (18.8%) 478 (100%)
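
The object pctexptab is not defined in the code shown here. The expected counts it displays can be reproduced from the margins of the observed table: under independence each cell is \(f_e = \frac{\text{row total} \times \text{column total}}{n}\). A minimal sketch, with the counts typed in by hand and the names obs and expected chosen for illustration:

```r
# Observed counts from the contingency table (filled column by column)
obs <- matrix(c(57, 87, 103, 50, 42, 49, 42, 22, 26), nrow = 3,
              dimnames = list(Urban.Rural = c("Rural", "Suburban", "Urban"),
                              Goals = c("Grades", "Popular", "Sports")))
n <- sum(obs)

# Expected count in each cell: row total * column total / n
expected <- outer(rowSums(obs), colSums(obs)) / n
round(expected, 1)

# Each row of the expected table has the same percentages as the Goals margin
round(100 * prop.table(expected, 1), 1)
```

The expected table keeps the observed row and column totals; only the cell counts inside are redistributed.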

Chi-squared (\(\chi^2\)) test statistic

tab
##            Goals
## Urban.Rural Grades Popular Sports
##    Rural        57      50     42
##    Suburban     87      42     22
##    Urban       103      49     26

##            Goals
## Urban.Rural Grades Popular Sports Sum
##    Rural        77      44     28 149
##    Suburban     78      44     28 151
##    Urban        92      52     34 178
##    Sum         247     141     90 478
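
The two tables above are the ingredients of the test statistic \(X^2 = \sum \frac{(f_o - f_e)^2}{f_e}\), where \(f_o\) is an observed and \(f_e\) an expected count (Agresti's notation, assumed here). A minimal sketch that computes it from hand-entered counts; the names obs, expected, X2 and df are illustrative:

```r
# Observed counts (filled column by column)
obs <- matrix(c(57, 87, 103, 50, 42, 49, 42, 22, 26), nrow = 3,
              dimnames = list(Urban.Rural = c("Rural", "Suburban", "Urban"),
                              Goals = c("Grades", "Popular", "Sports")))

# Expected counts under independence, unrounded
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Sum the squared deviations, each scaled by its expected count
X2 <- sum((obs - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

c(X2 = X2, df = df)
```

The unrounded expected counts are used in the sum; working from the rounded table would give a slightly different value.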

\(\chi^2\)-test template.

1 - pdist("chisq", 18.8, df = 4)

## [1] 0.00086
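
The same upper-tail probability can be obtained with base R, without the plot that pdist draws. A minimal sketch, assuming the statistic 18.8 and df = 4 from above; the variable names are illustrative:

```r
x2 <- 18.8  # observed test statistic
df <- 4     # (3 - 1) * (3 - 1) for a 3 x 3 table

# p-value: probability of a value at least as large as x2
# under a chi-squared distribution with df degrees of freedom
p_value <- pchisq(x2, df = df, lower.tail = FALSE)
p_value
```

Using lower.tail = FALSE is equivalent to 1 - pchisq(x2, df = df) and avoids the subtraction when the p-value is very small.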

The function chisq.test.

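# Pearson chi-squared test of independence on the Urban.Rural x Goals table
tab <- tally(~ Urban.Rural + Goals, data = popKids)
# correct = FALSE turns off the continuity correction (it only applies to 2x2 tables)
testStat <- chisq.test(tab, correct = FALSE)
# The printout below shows 2 significant digits; the statistic is about 18.8
testStat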
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 20, df = 4, p-value = 8e-04
testStat$expected
##            Goals
## Urban.Rural Grades Popular Sports
##    Rural        77      44     28
##    Suburban     78      45     28
##    Urban        92      53     34

# The same test from counts typed in by hand (matrix() fills column by column)
data <- c(57, 87, 103, 50, 42, 49, 42, 22, 26)
tab <- matrix(data, nrow = 3, ncol = 3)
rownames(tab) <- c("Rural", "Suburban", "Urban")
colnames(tab) <- c("Grades", "Popular", "Sports")
tab
##          Grades Popular Sports
## Rural        57      50     42
## Suburban     87      42     22
## Urban       103      49     26
chisq.test(tab)
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 20, df = 4, p-value = 8e-04

The \(\chi^2\)-distribution

Agresti - Summary

Summary

Standardized residuals

Residual analysis

Residual analysis in R

tab <- tally(~ Urban.Rural + Goals, data = popKids)
testStat <- chisq.test(tab, correct = FALSE)
testStat$stdres
##            Goals
## Urban.Rural Grades Popular Sports
##    Rural     -3.95    1.31   3.52
##    Suburban   1.77   -0.55  -1.62
##    Urban      2.09   -0.73  -1.82
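
The values in testStat$stdres can be reproduced by hand. Each standardized residual is \(\frac{f_o - f_e}{\sqrt{f_e (1 - \text{row proportion})(1 - \text{column proportion})}}\). A minimal sketch with hand-entered counts, where all object names are chosen for illustration:

```r
# Observed counts (filled column by column)
obs <- matrix(c(57, 87, 103, 50, 42, 49, 42, 22, 26), nrow = 3,
              dimnames = list(Urban.Rural = c("Rural", "Suburban", "Urban"),
                              Goals = c("Grades", "Popular", "Sports")))
n <- sum(obs)
row_prop <- rowSums(obs) / n  # share of each Urban.Rural group
col_prop <- colSums(obs) / n  # share of each goal

# Expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / n

# Standard error of (observed - expected) in each cell
se <- sqrt(expected * outer(1 - row_prop, 1 - col_prop))

# Standardized residuals
stdres <- (obs - expected) / se
round(stdres, 2)
```

The largest residuals in absolute value are in the Rural row (Grades and Sports), matching the output above.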

Collecting data

Sources

Important take-home messages
