Data collection

Getting numbers to report is easy
Getting sensible and trustworthy numbers to report is orders of magnitude more difficult

Ronald Fisher (1890-1962):

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

Said about Fisher:

Anders Hald (1913-2007), Danish statistician: “a genius who almost single-handedly created the foundations for modern statistical science”
Bradley Efron (b. 1938): “the single most important figure in 20th century statistics”

Data collection

Competences, ideally:
- Statistics, both the concepts and the analyses
- Data wrangling (loading data; right format for analyses, tables, figures; …)
- Visualizations
- Knowledge about subject (best with access to experts)
Not just downloading a spreadsheet!
- Population vs sample
- Descriptives of the sample (e.g. mean)
- Statistical inference about population (how close is sample’s mean to population’s mean)
Do collect and analyze data, but know about pitfalls and limitations in generalisability!

Population and sample

Sample 3 of size \(n = 30\):

shape	color	n_sample	p_sample	p_pop	p_diff
baby	black	2	0.07	0.04	-0.02
baby	blue	1	0.03	0.04	0.01
baby	red	0	0.00	0.01	0.01
man	black	5	0.17	0.12	-0.04
man	blue	8	0.27	0.22	-0.04
man	red	3	0.10	0.08	-0.02
woman	black	3	0.10	0.23	0.13
woman	blue	8	0.27	0.22	-0.05
woman	red	0	0.00	0.02	0.02

Descriptive vs statistical inference.
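A minimal sketch of how a table like the one above could be produced, assuming the population is large enough that drawing with replacement is a reasonable approximation. The function name `draw_sample_table` and the seed are hypothetical. The `p_pop` values are copied from the table (they are rounded, so they sum to 0.98), and `p_diff` is taken as `p_pop - p_sample`, which matches the sign convention in the table.

```python
import random

# Population proportions copied from the p_pop column above. They are rounded
# and sum to 0.98; random.choices only needs relative weights.
P_POP = {
    ("baby", "black"): 0.04, ("baby", "blue"): 0.04, ("baby", "red"): 0.01,
    ("man", "black"): 0.12, ("man", "blue"): 0.22, ("man", "red"): 0.08,
    ("woman", "black"): 0.23, ("woman", "blue"): 0.22, ("woman", "red"): 0.02,
}

def draw_sample_table(n, rng):
    """Draw n units at random; return rows (shape, color, n_sample, p_sample, p_pop, p_diff)."""
    categories = list(P_POP)
    weights = [P_POP[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=n)  # with replacement
    rows = []
    for cat in categories:
        count = draws.count(cat)
        p_sample = count / n                      # descriptive: what the sample shows
        p_diff = P_POP[cat] - p_sample            # how far the sample is from the population
        rows.append((*cat, count, p_sample, P_POP[cat], p_diff))
    return rows

rng = random.Random(1)  # hypothetical seed, only for reproducibility
for shape, color, count, p_sample, p_pop, p_diff in draw_sample_table(30, rng):
    print(f"{shape}\t{color}\t{count}\t{p_sample:.2f}\t{p_pop:.2f}\t{p_diff:+.2f}")
```

Re-running with a different seed gives a different sample and different `p_diff` values. That variation from sample to sample is what statistical inference has to account for.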

Example: United States presidential election, 1936

(Based on Agresti, this and this.)

Current president: Franklin D. Roosevelt
Election: Franklin D. Roosevelt vs Alfred Landon (Republican governor of Kansas)
Literary Digest: magazine with history of accurately predicting winner of past 5 presidential elections

Example: United States presidential election, 1936

Literary Digest poll (\(\hat{\pi}\) and \(1-\hat{\pi}\)): Landon: 57%; Roosevelt: 43%
Actual results (\(\pi\) and \(1-\pi\)): Landon: 38%; Roosevelt: 62%
Sampling error: 57% - 38% = 19 percentage points
- Practically all of the sampling error was the result of sample bias
- More than 2 million individuals participated: an extremely large poll
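To see why chance alone cannot explain an error of this size, here is a rough check using the usual standard error of a sample proportion under simple random sampling, \(\sqrt{\pi(1-\pi)/n}\). This is a sketch, not part of the original analysis: the helper `standard_error` is a hypothetical name, and \(n\) is set to 2 million as a lower bound on the number of respondents.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

pi_landon = 0.38        # actual share for Landon
pi_hat_landon = 0.57    # Literary Digest estimate for Landon
n = 2_000_000           # assumed lower bound on the number of respondents

se = standard_error(pi_landon, n)
error = pi_hat_landon - pi_landon

print(f"standard error under random sampling: {se:.5f}")
print(f"observed error: {error:.2f} ({error / se:.0f} standard errors)")
```

Under random sampling, a poll of this size would typically miss by a small fraction of a percentage point. An error of hundreds of standard errors points to bias in how the sample was obtained, not to bad luck.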

Example: United States presidential election, 1936

A mailing list of about 10 million names was created
- Based on every telephone directory, lists of magazine subscribers, rosters of clubs and associations, and other sources
- Each of the 10 million received a mock ballot and was asked to return the marked ballot to the magazine
“respondents who returned their questionnaires represented only that subset of the population with a relatively intense interest in the subject at hand, and as such constitute in no sense a random sample … it seems clear that the minority of anti-Roosevelt voters felt more strongly about the election than did the pro-Roosevelt majority” (The American Statistician, 1976)
Biases:
- Selection bias
  - List skewed towards middle- and upper-class voters (e.g. who owned a telephone in 1936)
  - Many people were unemployed (and unlikely to hold club memberships or magazine subscriptions)
- Non-response bias
  - Only about 2.3 to 2.4 million of the 10 million people responded
  - Cannot force people to participate, and mail may be discarded as junk (other modes: phone, interviews, online, paid participation, …)
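The two biases can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the probabilities of being on the mailing list (`P_ON_LIST`) and of returning the ballot (`P_RESPOND`) are hypothetical and are not estimates from 1936. Only the 62% support for Roosevelt is taken from the actual result.

```python
import random

# Hypothetical probabilities, chosen only to illustrate the mechanism.
P_ROOSEVELT = 0.62                                 # actual result
P_ON_LIST = {"Roosevelt": 0.15, "Landon": 0.35}    # selection bias (assumed)
P_RESPOND = {"Roosevelt": 0.20, "Landon": 0.30}    # non-response bias (assumed)

def share_roosevelt(votes):
    """Proportion of votes for Roosevelt."""
    return votes.count("Roosevelt") / len(votes)

rng = random.Random(1936)  # seed only for reproducibility
population = ["Roosevelt" if rng.random() < P_ROOSEVELT else "Landon"
              for _ in range(200_000)]

# Mail poll: a voter is counted only if on the list AND returning the ballot.
mail_poll = [v for v in population
             if rng.random() < P_ON_LIST[v] and rng.random() < P_RESPOND[v]]

# Simple random sample: much smaller, but every voter is equally likely.
random_sample = rng.sample(population, 1_000)

print(f"population:    {share_roosevelt(population):.3f}")
print(f"mail poll:     {share_roosevelt(mail_poll):.3f} (n = {len(mail_poll)})")
print(f"random sample: {share_roosevelt(random_sample):.3f} (n = {len(random_sample)})")
```

With these assumed numbers the mail poll has far more responses than the random sample, yet it should put Roosevelt well below one half, while the small random sample should land close to the true share. A larger sample does not repair a biased selection mechanism.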

Example: Bullet holes of honor

(Based on this.)

World War II
Royal Air Force (RAF), UK
- Lost many planes to German anti-aircraft fire
Armor up!
- Where?
- Count up all the bullet holes in planes that returned from missions
  - Put extra armor in the areas that attracted the most fire

Example: Bullet holes of honor

Hungarian-born mathematician Abraham Wald:
- If a plane makes it back safely with a bunch of bullet holes in its wings: holes in the wings aren’t very dangerous
  - Survivorship bias
- Armor up the areas that (on average) don’t have any bullet holes
  - Planes hit in those areas never make it back, so those areas are apparently the dangerous ones
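Wald's argument can be sketched as a simulation. Everything numeric here is an assumption for illustration: the three areas, the loss probabilities in `P_LOSS_PER_HIT`, and three hits per plane are hypothetical. Hits are spread evenly over the areas, so any imbalance among returned planes comes from survival alone.

```python
import random

# Hypothetical areas and probabilities that a single hit there downs the plane.
AREAS = ["wings", "fuselage", "engine"]
P_LOSS_PER_HIT = {"wings": 0.05, "fuselage": 0.10, "engine": 0.60}

def fly_missions(n_planes, hits_per_plane, rng):
    """Return (hits on all planes, hits on returned planes), counted per area."""
    all_hits = {a: 0 for a in AREAS}
    returned_hits = {a: 0 for a in AREAS}
    for _ in range(n_planes):
        # Every area is equally likely to be hit.
        hits = [rng.choice(AREAS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every hit.
        survived = all(rng.random() >= P_LOSS_PER_HIT[a] for a in hits)
        for a in hits:
            all_hits[a] += 1
            if survived:
                returned_hits[a] += 1
    return all_hits, returned_hits

rng = random.Random(42)  # seed only for reproducibility
all_hits, returned_hits = fly_missions(10_000, 3, rng)
for a in AREAS:
    print(f"{a:9s} all planes: {all_hits[a]:6d}   returned planes: {returned_hits[a]:6d}")
```

Counting holes only on returned planes makes the most dangerous area look like the least hit one, which is the survivorship bias described above.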

Theory: Biases / sampling

Biases

Agresti section 2.3:

Sampling/selection bias
- Probability sampling: the probability of each possible sample is known (in simple random sampling, each sample of size \(n\) has the same probability of being selected)
  - Still problems: undercoverage, groups not represented (inmates, homeless, hospitalized, …)
- Non-probability sampling: probability of sample not possible to determine
  - E.g. volunteer sampling
Response bias
- E.g. poorly worded or confusing questions, or even the order of the questions
- Respondents may lie if they think their true answer is socially unacceptable
Non-response bias
- High non-response rate; non-response may be systematic (age, health, beliefs)

Sampling

Agresti section 2.4:

Random sampling schemes:
- Simple random sampling: every possible sample of a given size is equally probable
- Systematic sampling
- Stratified sampling
- Cluster sampling
- Multistage sampling
- …
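The schemes listed above can be sketched in a few lines each. This is an illustration under simple assumptions, not Agresti's definitions verbatim: the sampling frame is a list of 100 hypothetical unit IDs, the groups in `groups` serve as both strata and clusters, and all function names are made up for this example.

```python
import random

def simple_random_sample(units, n, rng):
    """Every subset of size n is equally likely."""
    return rng.sample(units, n)

def systematic_sample(units, n, rng):
    """Random start, then every k-th unit, with k = len(units) // n."""
    k = len(units) // n
    start = rng.randrange(k)
    return units[start::k][:n]

def stratified_sample(strata, n_per_stratum, rng):
    """Simple random sample of fixed size within every stratum."""
    return [u for units in strata.values() for u in rng.sample(units, n_per_stratum)]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for c in chosen for u in clusters[c]]

def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly pick clusters, then a simple random sample within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for c in chosen for u in rng.sample(clusters[c], n_per_cluster)]

# Hypothetical frame: 100 units in 10 groups of 10 (e.g. districts).
frame = list(range(100))
groups = {g: frame[10 * g:10 * (g + 1)] for g in range(10)}

rng = random.Random(7)  # seed only for reproducibility
print("simple:    ", sorted(simple_random_sample(frame, 10, rng)))
print("systematic:", systematic_sample(frame, 10, rng))
print("stratified:", stratified_sample(groups, 2, rng))
print("cluster:   ", cluster_sample(groups, 3, rng))
print("multistage:", multistage_sample(groups, 3, 4, rng))
```

Stratified sampling takes some units from every group, while cluster sampling takes all units from a few groups. Which one is appropriate depends on cost and on how similar units are within groups.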
