Data collection and data wrangling

The ASTA team

Data

Data example

We use data about penguins from the R package palmerpenguins:

pingviner <- palmerpenguins::penguins
pingviner
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipp… body… sex    year
##    <fctr>  <fctr>             <dbl>         <dbl>  <int> <int> <fct> <int>
##  1 Adelie  Torgersen           39.1          18.7    181  3750 male   2007
##  2 Adelie  Torgersen           39.5          17.4    186  3800 fema…  2007
##  3 Adelie  Torgersen           40.3          18.0    195  3250 fema…  2007
##  4 Adelie  Torgersen           NA            NA       NA    NA <NA>   2007
##  5 Adelie  Torgersen           36.7          19.3    193  3450 fema…  2007
##  6 Adelie  Torgersen           39.3          20.6    190  3650 male   2007
##  7 Adelie  Torgersen           38.9          17.8    181  3625 fema…  2007
##  8 Adelie  Torgersen           39.2          19.6    195  4675 male   2007
##  9 Adelie  Torgersen           34.1          18.1    193  3475 <NA>   2007
## 10 Adelie  Torgersen           42.0          20.2    190  4250 <NA>   2007
## # ... with 334 more rows

Summaries and plots of qualitative variables

Tables of qualitative variables

# mosaic provides tally() and favstats() and attaches ggformula (gf_percents())
library(mosaic)
tally(~species, data = pingviner)
## species
##    Adelie Chinstrap    Gentoo 
##       152        68       124
tally(species ~ island, data = pingviner)
##            island
## species     Biscoe Dream Torgersen
##   Adelie        44    56        52
##   Chinstrap      0    68         0
##   Gentoo       124     0         0
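Counts are often easier to compare as proportions. A minimal base R sketch, re-entering the species-by-island counts printed above by hand (the object names are illustrative assumptions, not part of the original analysis):

```r
# Species-by-island counts copied from the tally() output above
counts <- matrix(
  c( 44, 56, 52,
      0, 68,  0,
    124,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

# Row totals reproduce the one-way table of species
species_totals <- rowSums(counts)

# Distribution of species within each island (each column sums to 1)
island_share <- prop.table(counts, margin = 2)
round(island_share, 2)
```

Using `margin = 1` instead would give the distribution of islands within each species.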

Plots of qualitative variables

gf_percents(~species, data = pingviner)

gf_percents(~species, fill = ~sex, data = pingviner)

gf_percents(~species, fill = ~sex, data = pingviner, position = position_dodge())

Summaries of quantitative variables

Percentiles

Measures of center of data: Mean and median

favstats(bill_length_mm ~ species, data = pingviner)
##     species min Q1 median Q3 max mean  sd   n missing
## 1    Adelie  32 37     39 41  46   39 2.7 151       1
## 2 Chinstrap  41 46     50 51  58   49 3.3  68       0
## 3    Gentoo  41 45     47 50  60   48 3.1 123       1
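As a rough cross-check of what favstats reports, the same kinds of summaries can be computed with base R. The sketch below uses only the ten bill lengths visible in the printed tibble above, so the numbers differ from the per-species table; the variable names are illustrative.

```r
# First ten bill lengths (mm) from the printed tibble; row 4 is missing
bill <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0)

# Missing values must be dropped explicitly, otherwise the result is NA
bill_mean   <- mean(bill, na.rm = TRUE)
bill_median <- median(bill, na.rm = TRUE)   # 39.2

# Percentiles: the 25th, 50th and 75th are the quartiles Q1, median, Q3
bill_quart <- quantile(bill, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)
bill_quart                                   # 38.9, 39.2, 39.5
```

The mean (about 38.8) lies below the median here because the single low value 34.1 pulls it down.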

Measures of variability of data: range, standard deviation and variance
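A minimal base R sketch of the three measures, assuming the ten bill lengths shown in the printed tibble earlier as a toy sample (variable names are illustrative):

```r
# First ten bill lengths (mm) from the printed tibble; row 4 is missing
bill <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0)

# Range: smallest and largest observation, and the distance between them
bill_range       <- range(bill, na.rm = TRUE)
bill_range_width <- diff(bill_range)

# Variance: sum of squared deviations from the mean, divided by n - 1
bill_var <- var(bill, na.rm = TRUE)

# Standard deviation: square root of the variance, in the same unit (mm)
bill_sd <- sd(bill, na.rm = TRUE)

c(width = bill_range_width, variance = bill_var, sd = bill_sd)
```

The standard deviation is usually the easiest of the three to interpret because it is measured in the same unit as the data.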

The empirical rule

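If the histogram of the sample looks like a bell-shaped curve, then about 68% of the observations lie within one standard deviation of the mean, about 95% within two standard deviations, and nearly all (about 99.7%) within three.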
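To check the rule numerically, the sketch below simulates a bell-shaped sample and computes the share of observations within one, two and three standard deviations of the mean, which the empirical rule puts at roughly 68%, 95% and nearly all (99.7%). The simulated data and its parameters are hypothetical and are not taken from the penguin data.

```r
set.seed(1)

# Hypothetical bell-shaped sample (normal with mean 40 and sd 5)
y <- rnorm(10000, mean = 40, sd = 5)

ybar <- mean(y)
s    <- sd(y)

# Share of observations with |y - ybar| <= k * s for k = 1, 2, 3
within_k <- sapply(1:3, function(k) mean(abs(y - ybar) <= k * s))
names(within_k) <- c("1 sd", "2 sd", "3 sd")
round(within_k, 3)
```

For a clearly skewed sample the shares can deviate noticeably from these values, which is why the rule is stated only for bell-shaped histograms.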

Target population and random sampling

Population parameters

                     Population    Sample
Mean                 \(\mu\)       \(\overline{y}\)
Standard deviation   \(\sigma\)    \(s\)
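The distinction between population parameters and their sample counterparts can be illustrated by simulation. In the sketch below the population values `mu` and `sigma` are hypothetical and known only because we choose them; in practice only the sample values are available.

```r
set.seed(42)

# Hypothetical population parameters (unknown in a real study)
mu    <- 40
sigma <- 5

# A random sample of 200 observations from that population
y <- rnorm(200, mean = mu, sd = sigma)

# Sample counterparts used as estimates of mu and sigma
ybar <- mean(y)
s    <- sd(y)

c(ybar = ybar, s = s)
```

Rerunning with a different seed gives slightly different values of `ybar` and `s`, while `mu` and `sigma` stay fixed.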

A word about terminology

Aim of statistics

Random sampling schemes

Possible strategies for obtaining a random sample from the target population are explained in Agresti section 2.4:
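As an illustration of two common schemes, simple random sampling and proportionate stratified random sampling, here is a minimal base R sketch. The population, its three strata and the sample sizes are hypothetical.

```r
set.seed(2024)

# Hypothetical target population of 1000 units in three strata
population <- data.frame(
  id      = 1:1000,
  stratum = rep(c("A", "B", "C"), times = c(500, 300, 200))
)

# Simple random sample: every unit has the same chance of selection
srs <- population[sample(nrow(population), size = 100), ]

# Stratified sample: draw 10% at random within each stratum separately
by_stratum <- split(population, population$stratum)
strat <- do.call(rbind, lapply(by_stratum, function(d) {
  d[sample(nrow(d), size = nrow(d) %/% 10), ]
}))

table(srs$stratum)    # stratum sizes vary from sample to sample
table(strat$stratum)  # stratum sizes are fixed by design
```

Both samples contain 100 units, but only the stratified one guarantees that each stratum is represented in proportion to its size.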

Biases

Types of biases

Agresti section 2.3:

Example of sample bias: United States presidential election, 1936

(Based on Agresti, this and this.)

Results

Problems (biases)

Example of response bias: Wording matters

New York Times/CBS News poll on attitude to increased fuel taxes

Example of response bias: Order of questions matters

A US study during the Cold War asked two questions:

1. “Do you think that the US should let Russian newspaper reporters come here and send back whatever they want?”

2. “Do you think that Russia should let American newspaper reporters come in and send back whatever they want?”

The percentage answering yes to question 1 was 36% when it was asked first and 73% when it was asked last.

Example of survivor bias: Bullet holes of honor

(Based on this.)

(See also this xkcd)