Data collection and data wrangling

The ASTA team

Data

Data example

We use data about penguins from the R package palmerpenguins:

pingviner <- palmerpenguins::penguins
pingviner
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Summaries and plots of qualitative variables

Tables of qualitative variables

library(mosaic)  # provides tally() and, via ggformula, the gf_*() plotting functions
tally(~species, data = pingviner)
## species
##    Adelie Chinstrap    Gentoo 
##       152        68       124
tally(species ~ island, data = pingviner)
##            island
## species     Biscoe Dream Torgersen
##   Adelie        44    56        52
##   Chinstrap      0    68         0
##   Gentoo       124     0         0
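
Counts are often easier to compare as proportions. The following is a minimal sketch in base R that re-enters the counts from the two tables above by hand (so it runs without the penguin data) and converts them to proportions. The object names `counts`, `species_prop` and `prop_by_island` are illustrative and not part of the original slides.

```r
# Counts copied by hand from the species-by-island table above.
counts <- matrix(
  c( 44, 56, 52,
      0, 68,  0,
    124,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

# Row sums give the species counts from the first table.
species_counts <- rowSums(counts)

# Share of each species among all penguins in the data set.
species_prop <- species_counts / sum(species_counts)

# Share of each species within each island (every column sums to 1).
prop_by_island <- prop.table(counts, margin = 2)

print(round(species_prop, 3))
print(round(prop_by_island, 3))
```

Using `margin = 1` instead would give the distribution over islands within each species.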

Plots of qualitative variables

gf_percents(~species, data = pingviner)

gf_percents(~species, fill = ~sex, data = pingviner)

gf_percents(~species, fill = ~sex, data = pingviner, position = position_dodge())

Target population and random sampling

Population parameters

Quantity             Population    Sample
Mean                 \(\mu\)       \(\overline{y}\)
Standard deviation   \(\sigma\)    \(s\)
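
The relation between the population parameters and their sample counterparts can be illustrated with a small simulation. This is only a sketch: the population is hypothetical (normally distributed values with an invented mean and standard deviation), and in practice \(\mu\) and \(\sigma\) are unknown, which is why they are estimated by \(\overline{y}\) and \(s\).

```r
set.seed(1)  # make the illustration reproducible

# Hypothetical population of 100,000 measurements; the numbers are invented.
population <- rnorm(100000, mean = 4200, sd = 800)

# Population parameters (normally unknown in practice).
mu    <- mean(population)
sigma <- sqrt(mean((population - mu)^2))  # divides by N, not N - 1

# A simple random sample of n = 50 and the corresponding sample statistics.
y     <- sample(population, size = 50)
y_bar <- mean(y)
s     <- sd(y)  # sd() divides by n - 1

cat("mu    =", round(mu, 1),    "  y_bar =", round(y_bar, 1), "\n")
cat("sigma =", round(sigma, 1), "  s     =", round(s, 1), "\n")
```

Rerunning the sampling step with a different seed gives different values of `y_bar` and `s`, while `mu` and `sigma` stay fixed.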

A word about terminology

Aim of statistics

Random sampling schemes

Possible strategies for obtaining a random sample from the target population are explained in Agresti, section 2.4.
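
As a rough illustration, two common textbook schemes, simple random sampling and stratified random sampling, can be sketched in base R. That these are among the schemes in the referenced section is an assumption; the sampling frame, the strata and their sizes are hypothetical.

```r
set.seed(2)

# Hypothetical sampling frame: 1,000 units in three strata (sizes invented).
frame <- data.frame(
  id      = 1:1000,
  stratum = rep(c("A", "B", "C"), times = c(600, 300, 100))
)

# Simple random sample: every subset of 100 units is equally likely.
srs <- frame[sample(nrow(frame), size = 100), ]

# Stratified random sample: draw 10% of the units within each stratum separately.
pick_within <- function(rows) {
  rows[sample(length(rows), size = length(rows) %/% 10)]
}
rows_by_stratum <- split(seq_len(nrow(frame)), frame$stratum)
idx <- unlist(lapply(rows_by_stratum, pick_within), use.names = FALSE)
stratified <- frame[idx, ]

print(table(srs$stratum))         # stratum counts vary from sample to sample
print(table(stratified$stratum))  # stratum counts are fixed by the design
```

The stratified design guarantees that every stratum is represented in proportion to its size, whereas the simple random sample only achieves this on average.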

Biases

Types of biases

See Agresti, section 2.3.

Example of sample bias: United States presidential election, 1936

(Based on Agresti, this and this.)

Results

Problems (biases)
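
A lesson commonly drawn from examples of this kind is that a large sample does not compensate for a biased sampling frame. The simulation below is a sketch of that idea only: the population, the group sizes, the support rates and the selection weights are all invented and are not the figures from the 1936 poll.

```r
set.seed(3)

# Hypothetical population of 20,000 voters; all numbers are invented.
n_pop   <- 20000
wealthy <- rbinom(n_pop, 1, 0.3) == 1
# Assumed support for candidate A: 35% among the wealthy, 70% among the rest.
votes_a <- rbinom(n_pop, 1, ifelse(wealthy, 0.35, 0.70)) == 1

true_share <- mean(votes_a)

# Biased selection: wealthy voters are given 10 times the selection weight.
biased <- sample(n_pop, size = 2000, prob = ifelse(wealthy, 10, 1))
# Simple random sample that is 10 times smaller.
srs    <- sample(n_pop, size = 200)

biased_share <- mean(votes_a[biased])
srs_share    <- mean(votes_a[srs])

cat("true share:          ", round(true_share, 3), "\n")
cat("large biased sample: ", round(biased_share, 3), "\n")
cat("small random sample: ", round(srs_share, 3), "\n")
```

Under these assumptions the large biased sample underestimates the support for candidate A by a wide margin, while the much smaller random sample is only off by sampling error.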

Example of response bias: Wording matters

New York Times/CBS News poll on attitudes toward increased fuel taxes

Example of response bias: Order of questions matters

A US study during the Cold War asked two questions:

1 “Do you think that the US should let Russian newspaper reporters come here and send back whatever they want?”

2 “Do you think that Russia should let American newspaper reporters come in and send back whatever they want?”

The percentage answering yes to question 1 was 36% when it was asked first and 73% when it was asked last.

Example of survivor bias: Bullet holes of honor

(Based on this.)

(See also this xkcd)