Data collection and wrangling
The ASTA team
Data collection
- Getting numbers to report is easy
- Getting sensible and trustworthy numbers to report is orders of magnitude more difficult
Ronald Fisher (1890-1962):
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
Said about Fisher:
- Anders Hald (1913-2007), Danish statistician: “a genius who almost single-handedly created the foundations for modern statistical science”
- Bradley Efron (b. 1938): “the single most important figure in 20th century statistics”
Data collection
- Competences, ideally:
  - Statistics, both concepts and analyses
  - Data wrangling (loading data; getting it into the right format for analyses, tables, figures; …)
  - Visualizations
  - Knowledge about the subject matter (best with access to experts)
- Not just downloading a spreadsheet!
- Population vs sample
  - Descriptives of the sample (e.g. mean)
  - Statistical inference about the population (how close is the sample’s mean to the population’s mean?)
- Do collect and analyze data, but know about pitfalls and limitations in generalisability!
Population and sample
Sample 3 of size \(n = 30\):

| Group | Colour | Count | Sample proportion | Population proportion | Difference (population − sample) |
|---|---|---|---|---|---|
| baby | black | 2 | 0.07 | 0.04 | -0.02 |
| baby | blue | 1 | 0.03 | 0.04 | 0.01 |
| baby | red | 0 | 0.00 | 0.01 | 0.01 |
| man | black | 5 | 0.17 | 0.12 | -0.04 |
| man | blue | 8 | 0.27 | 0.22 | -0.04 |
| man | red | 3 | 0.10 | 0.08 | -0.02 |
| woman | black | 3 | 0.10 | 0.23 | 0.13 |
| woman | blue | 8 | 0.27 | 0.22 | -0.05 |
| woman | red | 0 | 0.00 | 0.02 | 0.02 |
- Descriptive statistics vs statistical inference.
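The comparison in the sample above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the three numeric columns after the count are the sample proportion, the population proportion and their difference; the dictionary names are made up for illustration. Because the population proportions are entered as printed (rounded to two decimals), the computed differences can deviate from the printed ones in the last digit.

```python
# Counts from the sample above (n = 30), keyed by (group, colour).
sample_counts = {
    ("baby", "black"): 2, ("baby", "blue"): 1, ("baby", "red"): 0,
    ("man", "black"): 5, ("man", "blue"): 8, ("man", "red"): 3,
    ("woman", "black"): 3, ("woman", "blue"): 8, ("woman", "red"): 0,
}

# Population proportions as printed above (assumed meaning, rounded values).
population_prop = {
    ("baby", "black"): 0.04, ("baby", "blue"): 0.04, ("baby", "red"): 0.01,
    ("man", "black"): 0.12, ("man", "blue"): 0.22, ("man", "red"): 0.08,
    ("woman", "black"): 0.23, ("woman", "blue"): 0.22, ("woman", "red"): 0.02,
}

n = sum(sample_counts.values())

# Descriptive step: proportions observed in the sample.
sample_prop = {key: count / n for key, count in sample_counts.items()}

# How far the sample is from the population for each combination.
diff = {key: population_prop[key] - sample_prop[key] for key in sample_counts}

for (group, colour), count in sample_counts.items():
    key = (group, colour)
    print(f"{group:6s} {colour:6s} {count:2d} "
          f"{sample_prop[key]:.2f} {population_prop[key]:.2f} {diff[key]:+.2f}")
```

In practice the population proportions are unknown; statistical inference is about what the sample proportions alone allow us to say about them.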
Example: United States presidential election, 1936
(Based on Agresti, this and this.)
- Current president: Franklin D. Roosevelt
- Election: Franklin D. Roosevelt vs Alfred Landon (Republican governor of Kansas)
- Literary Digest: magazine with a history of accurately predicting the winner of the past 5 presidential elections
- Literary Digest poll (\(\hat{\pi}\) and \(1-\hat{\pi}\)): Landon: 57%; Roosevelt: 43%
- Actual results (\(\pi\) and \(1-\pi\)): Landon: 38%; Roosevelt: 62%
- Sampling error: 57% - 38% = 19 percentage points
- Practically all of the sampling error was the result of sampling bias
- More than 2 million individuals participated: an extremely large poll
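A quick calculation suggests why chance alone cannot explain an error of this size. Under simple random sampling the standard error of a sample proportion is \(\sqrt{\pi(1-\pi)/n}\). The sketch below assumes \(n = 2{,}000{,}000\) (the slide only says "more than 2 million") and, hypothetically, that the poll had been a simple random sample.

```python
import math


def standard_error(pi, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(pi * (1 - pi) / n)


pi_hat = 0.57   # Literary Digest estimate for Landon
pi = 0.38       # actual share for Landon
n = 2_000_000   # assumption: lower bound of "more than 2 million"

se = standard_error(pi, n)
error = pi_hat - pi

print(f"standard error if the poll were a random sample: {se:.5f}")
print(f"observed error: {error:.2f}")
print(f"observed error in standard errors: {error / se:.0f}")
```

With a standard error well below 0.1 percentage points, an error of 19 percentage points is hundreds of standard errors away from what random sampling would produce, so it has to come from how the sample was obtained.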
- Mailing list of about 10 million names was created
- Based on every telephone directory, lists of magazine subscribers, rosters of clubs and associations, and other sources
- Each of the 10 million received a mock ballot and was asked to return the marked ballot to the magazine
- “respondents who returned their questionnaires represented only that subset of the population with a relatively intense interest in the subject at hand, and as such constitute in no sense a random sample … it seems clear that the minority of anti-Roosevelt voters felt more strongly about the election than did the pro-Roosevelt majority” (The American Statistician, 1976)
- Biases:
  - Selection bias
    - List skewed towards middle- and upper-class voters (e.g. telephone ownership in 1936)
    - Many were unemployed (and so less likely to have club memberships or magazine subscriptions)
  - Non-response bias
    - Responses from only about 2.3-2.4 million out of 10 million people
    - Cannot force people to participate, and mail may be treated as junk (other modes: phone, interviews, online, paid participation, …)
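The combined effect of selection bias and non-response bias can be illustrated with a small simulation. This is a sketch under made-up assumptions: the probabilities of being on the mailing list and of responding (`P_ON_LIST`, `P_RESPOND`) are hypothetical and only chosen so that Landon supporters are over-represented at both steps; they are not estimates from the 1936 poll.

```python
import random

rng = random.Random(1936)

TRUE_LANDON = 0.38  # actual Landon share from the election result

# Hypothetical probabilities (illustration only, not historical estimates).
P_ON_LIST = {"Landon": 0.40, "Roosevelt": 0.20}   # selection bias
P_RESPOND = {"Landon": 0.30, "Roosevelt": 0.20}   # non-response bias

# Simulated population of 100,000 voters.
population = [
    "Landon" if rng.random() < TRUE_LANDON else "Roosevelt"
    for _ in range(100_000)
]


def landon_share(votes):
    """Proportion of Landon votes in a list of votes."""
    return sum(v == "Landon" for v in votes) / len(votes)


# Step 1: who ends up on the mailing list. Step 2: who returns the ballot.
mailing_list = [v for v in population if rng.random() < P_ON_LIST[v]]
respondents = [v for v in mailing_list if rng.random() < P_RESPOND[v]]

# For comparison: a much smaller simple random sample.
srs = rng.sample(population, 1000)

biased_estimate = landon_share(respondents)
srs_estimate = landon_share(srs)

print(f"biased poll  (n = {len(respondents)}): {biased_estimate:.2f}")
print(f"random sample (n = {len(srs)}): {srs_estimate:.2f}")
```

Under these assumptions the large biased poll points to the wrong winner, while the small random sample lands close to the true share: sample size does not compensate for bias.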
Example: Bullet holes of honor
(Based on this.)
- World War II
- Royal Air Force (RAF), UK
- Lost many planes to German anti-aircraft fire
- Armor up!
- Where?
- Count up all the bullet holes in planes that returned from missions
- Put extra armor in the areas that attracted the most fire
- Hungarian-born mathematician Abraham Wald:
  - If a plane makes it back safely with a bunch of bullet holes in its wings: holes in the wings aren’t very dangerous
  - Armor up the areas that (on average) don’t have any bullet holes
  - Planes hit there never make it back, so hits in those areas are apparently the dangerous ones
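Wald's argument can be illustrated with a toy simulation. The sketch assumes, purely hypothetically, that each plane is hit in exactly one of three areas with equal probability and that the chance of losing the plane depends on the area (`P_LOST_IF_HIT`); none of these numbers come from real data.

```python
import random
from collections import Counter

rng = random.Random(42)

# Hypothetical probability that a plane is lost, given where it was hit.
P_LOST_IF_HIT = {"engine": 0.8, "fuselage": 0.3, "wings": 0.1}
areas = list(P_LOST_IF_HIT)

all_hits = Counter()        # hits on every plane (not observable in practice)
returned_hits = Counter()   # hits on planes that made it back (observable)

for _ in range(30_000):
    area = rng.choice(areas)            # every area is equally likely to be hit
    all_hits[area] += 1
    if rng.random() >= P_LOST_IF_HIT[area]:
        returned_hits[area] += 1        # plane survived and can be inspected

for area in areas:
    print(f"{area:9s} all hits: {all_hits[area]:6d}   "
          f"hits on returned planes: {returned_hits[area]:6d}")
```

Although all areas are hit equally often, the returned planes show the fewest holes in the most dangerous area, because the sample of returned planes is selected on the outcome.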
Theory: Biases / sampling
Biases
Agresti section 2.3:
- Sampling/selection bias
  - Probability sampling: the probability of each possible sample is known (e.g. simple random sampling, where each sample of size \(n\) has the same probability of being sampled)
  - Still problems: undercoverage, groups not represented (inmates, homeless, hospitalized, …)
  - Non-probability sampling: probability of a sample not possible to determine
- Response bias
  - E.g. poorly worded or confusing questions, or even the order of the questions
  - Respondents may lie if they think their true answer is socially unacceptable
- Non-response bias
  - High non-response rate; systematic patterns in non-response (age, health, beliefs)
Sampling
Agresti section 2.4:
- Random sampling schemes:
  - Simple random sampling: each possible sample of a given size is equally probable
  - Systematic sampling
  - Stratified sampling
  - Cluster sampling
  - Multistage sampling
  - …
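The schemes listed above can be sketched with Python's standard library. The slide only names the schemes, so the code follows the usual textbook definitions; the population, its strata and its clusters are hypothetical and only serve as an illustration.

```python
import random

rng = random.Random(0)

# Hypothetical population: 100 units, two strata (60/40), ten clusters of 10.
population = [
    {"id": i, "stratum": "A" if i < 60 else "B", "cluster": i // 10}
    for i in range(100)
]
n = 10

# Simple random sampling: every subset of size n is equally likely.
simple = rng.sample(population, n)

# Systematic sampling: random start, then every step-th unit.
step = len(population) // n
start = rng.randrange(step)
systematic = population[start::step]

# Stratified sampling (proportional allocation): simple random sample per stratum.
stratified = []
for label in ("A", "B"):
    stratum = [u for u in population if u["stratum"] == label]
    k = round(n * len(stratum) / len(population))
    stratified += rng.sample(stratum, k)

# Cluster sampling: pick whole clusters at random and keep all their units.
chosen_clusters = rng.sample(range(10), 2)
cluster = [u for u in population if u["cluster"] in chosen_clusters]

# Multistage sampling: pick clusters at random, then sample within each.
multistage = []
for c in rng.sample(range(10), 5):
    members = [u for u in population if u["cluster"] == c]
    multistage += rng.sample(members, 2)

print(sorted(u["id"] for u in simple))
print([u["id"] for u in systematic])
```

Only simple random sampling gives every possible sample of size \(n\) the same probability; the other schemes trade this for practicality or for guaranteed coverage of subgroups.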
Data wrangling
This will be illustrated with two specific cases.
The material is on Moodle.