Data collection and data wrangling

The ASTA team

Data

Data example

We use data about penguins from the R package palmerpenguins.

pingviner <- palmerpenguins::penguins
pingviner
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Summaries and plots of qualitative variables

Tables of qualitative variables

library(mosaic)  # provides tally(); also attaches ggformula for the gf_*() plots
tally(~species, data = pingviner)
## species
##    Adelie Chinstrap    Gentoo 
##       152        68       124
tally(species ~ island, data = pingviner)
##            island
## species     Biscoe Dream Torgersen
##   Adelie        44    56        52
##   Chinstrap      0    68         0
##   Gentoo       124     0         0
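
Counts are often easier to compare as proportions. The following is a minimal sketch in base R, assuming the counts are typed in by hand from the table above rather than recomputed from the data; the object names `counts`, `within_island` and `species_totals` are illustrative only.

```r
# Counts copied from the species-by-island table above
counts <- matrix(
  c( 44, 56, 52,
      0, 68,  0,
    124,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

# Share of each species within an island (each column sums to 1)
within_island <- prop.table(counts, margin = 2)
round(within_island, 2)

# Row sums recover the species counts from the first table
species_totals <- rowSums(counts)
species_totals
```

Using `margin = 1` instead would give the distribution over islands within each species.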

Plots of qualitative variables

gf_percents(~species, data = pingviner)

gf_percents(~species, fill = ~sex, data = pingviner)

gf_percents(~species, fill = ~sex, data = pingviner, position = position_dodge())

Target population and random sampling

Population parameters

Quantity            Population   Sample
Mean                \(\mu\)      \(\overline{y}\)
Standard deviation  \(\sigma\)   \(s\)
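
To illustrate how the sample quantities relate to the population parameters, here is a minimal simulation sketch. The population values (`mu = 200`, `sigma = 14`), the normal distribution and the sample size of 50 are assumptions chosen for illustration; they are not taken from the penguin data.

```r
set.seed(42)                      # arbitrary seed, for reproducibility only

mu    <- 200                      # hypothetical population mean
sigma <- 14                       # hypothetical population standard deviation

# Draw a random sample of size 50 from the (assumed normal) population
y <- rnorm(50, mean = mu, sd = sigma)

y_bar <- mean(y)                  # sample mean, an estimate of mu
s     <- sd(y)                    # sample standard deviation, an estimate of sigma

c(y_bar = y_bar, s = s)
```

The estimates are close to, but not equal to, the population values, and they change from sample to sample.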

A word about terminology

Aim of statistics

Random sampling schemes

Possible strategies for obtaining a random sample from the target population are explained in Agresti section 2.4.
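
As a rough illustration of what a sampling scheme looks like in practice, the sketch below draws a simple random sample and a proportional stratified sample in base R. These two schemes are chosen here as common examples, not as a summary of the cited section, and the sampling frame (1000 units in strata A, B and C) is entirely hypothetical.

```r
set.seed(2024)                               # arbitrary seed

# Hypothetical sampling frame: 1000 units in three strata
frame <- data.frame(
  id      = 1:1000,
  stratum = rep(c("A", "B", "C"), times = c(600, 300, 100))
)

# Simple random sample: every unit has the same chance of selection
srs <- frame[sample(nrow(frame), size = 100), ]

# Proportional stratified sample: draw 10% of the units within each stratum
rows_by_stratum <- split(seq_len(nrow(frame)), frame$stratum)
picked <- unlist(lapply(rows_by_stratum, function(rows) {
  sample(rows, size = round(0.1 * length(rows)))
}))
stratified <- frame[picked, ]

table(srs$stratum)         # stratum sizes vary from sample to sample
table(stratified$stratum)  # stratum sizes are fixed by design
```

The stratified sample guarantees that small strata are represented, whereas a simple random sample may by chance contain very few units from stratum C.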

Biases

Types of biases

See Agresti section 2.3.

Example of sample bias: United States presidential election, 1936

(Based on Agresti, this and this.)

Results

Problems (biases)
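
The mechanism behind sample bias can be sketched with a small simulation. All numbers below are hypothetical and are not the 1936 figures: we assume a population in which 30% of voters appear on the lists used to select the sample, and that support for a candidate A differs between listed and unlisted voters.

```r
set.seed(1936)                               # arbitrary seed

n_pop <- 100000
# Hypothetical population: 30% of voters are in the "listed" group
listed <- runif(n_pop) < 0.30

# Hypothetical support for candidate A: 35% among listed, 70% among others
votes_a <- ifelse(listed, runif(n_pop) < 0.35, runif(n_pop) < 0.70)

true_share <- mean(votes_a)

# Biased poll: a very large sample, but drawn only from the listed group
biased_poll <- sample(votes_a[listed], size = 10000)

# Simple random sample: much smaller, but drawn from the whole population
random_poll <- sample(votes_a, size = 1000)

c(truth = true_share, biased = mean(biased_poll), random = mean(random_poll))
```

In this sketch the large biased poll misses the true share by a wide margin, while the small random sample lands close to it: a large sample size does not compensate for a non-representative sampling frame.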

Example of response bias: Wording matters

New York Times/CBS News poll on attitude to increased fuel taxes

Example of response bias: Order of questions matters

A US study during the Cold War asked two questions:

1 “Do you think that the US should let Russian newspaper reporters come here and send back whatever they want?”

2 “Do you think that Russia should let American newspaper reporters come in and send back whatever they want?”

The percentage answering yes to question 1 was 36% when it was asked first and 73% when it was asked last.

Example of survivor bias: Bullet holes of honor

(Based on this.)

(See also this xkcd)

Example of selection bias

All restaurants (food quality and location quality are simulated independently):

set.seed(1)
n <- 1000
food <- rnorm(n, mean = 0, sd = 1)
location <- rnorm(n, mean = 0, sd = 1)
gf_point(food ~ location) 

gf_point(food ~ location) %>% gf_lm()

cor.test(food, location)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 0.2, df = 998, p-value = 0.8
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.056  0.068
## sample estimates:
##    cor 
## 0.0064

Total score = food + location

Good review if score > 2

score <- food + location
good <- score > 2
gf_point(food ~ location, color = ~ good)

Focusing on “good” restaurants

gf_point(food[good] ~ location[good])

gf_point(food[good] ~ location[good]) %>% 
  gf_lm()

cor.test(food[good], location[good])
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -6, df = 79, p-value = 4e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.67 -0.35
## sample estimates:
##   cor 
## -0.53
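
The negative correlation above comes from a single seed. To check that it is not a fluke, the following sketch repeats the simulation for many seeds and records the correlation among the “good” restaurants each time. The helper name `cor_among_good` and the choice of 200 repetitions are assumptions made for illustration.

```r
# Correlation between food and location among "good" restaurants,
# for one simulated set of n restaurants
cor_among_good <- function(seed, n = 1000, cutoff = 2) {
  set.seed(seed)
  food     <- rnorm(n, mean = 0, sd = 1)
  location <- rnorm(n, mean = 0, sd = 1)
  good     <- food + location > cutoff
  cor(food[good], location[good])
}

# Repeat the simulation for 200 different seeds (200 is arbitrary)
cors <- sapply(1:200, cor_among_good)

summary(cors)
mean(cors < 0)   # share of simulations with a negative correlation
```

Because a restaurant can only pass the cutoff if a low value of one variable is offset by a high value of the other, selecting on the total score induces a negative association between two variables that are independent in the full population.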