Intro and descriptive statistics

The ASTA team

Software

RStudio

R extensions

install.packages("mosaic")  # one-time installation of the package
library(mosaic)             # load the package in each new R session

R help

?sum                 # open the help page for the function sum()
help.search("plot")  # search the installed help pages for "plot"

Data

Data example

We use data about penguins from the R package palmerpenguins.

pingviner <- palmerpenguins::penguins
pingviner
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Data types

Quantitative variables

Categorical/qualitative variables

Graphics for quantitative variables

Scatterplot

gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner)

gf_point(bill_length_mm ~ bill_depth_mm | species, color = ~ species, data = pingviner)

gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner) %>% 
  gf_lm()

Histogram

gf_histogram( ~ bill_length_mm, data = pingviner)

Summaries of quantitative variables

Percentiles

Q <- quantile(bill_length_mm ~ species, data = pingviner, na.rm = TRUE)
Q
##     species 0% 25% 50% 75% 100%
## 1    Adelie 32  37  39  41   46
## 2 Chinstrap 41  46  50  51   58
## 3    Gentoo 41  45  47  50   60
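As a minimal sketch of what these percentiles mean, the base R function quantile() can be applied to the bill lengths of the first ten penguins in the printout above. The values are typed in by hand so the snippet runs without any packages, and the name bill is only illustrative.

```r
# Bill lengths (mm) of the first ten penguins in the printout; row 4 is missing
bill <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42)

# na.rm = TRUE drops the missing value before the percentiles are computed
q <- quantile(bill, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
q
```

The 50% percentile is the median, and the 25% and 75% percentiles are the quartiles reported as Q1 and Q3 by favstats. R's default interpolation rule is used here; other definitions of percentiles can give slightly different values.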

Boxplot

Boxplots can be good for comparing groups (note that we put the values on the y-axis here, as is conventional for boxplots):

gf_boxplot(bill_length_mm ~ species, color = ~ species, data = pingviner)

How to draw a box plot

Note: The whiskers extend to the minimum and maximum of the observations that are not deemed to be outliers.
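The note above does not say how outliers are identified. A common convention, assumed here, is to flag observations lying more than 1.5 times the interquartile range outside the box. The base R function boxplot.stats() follows this convention by default, and the sketch below applies it to the seven non-missing body masses among the first eight penguins in the printout (the name mass is illustrative).

```r
# Body masses (g) of the first eight penguins in the printout, missing value removed
mass <- c(3750, 3800, 3250, NA, 3450, 3650, 3625, 4675)
mass <- mass[!is.na(mass)]

# coef = 1.5 is the assumed outlier rule (also the default of boxplot.stats)
b <- boxplot.stats(mass, coef = 1.5)

b$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out    # observations flagged as outliers
```

In this tiny subset the value 4675 is flagged, so the upper whisker stops at 3800, the largest observation that is not an outlier. boxplot.stats() uses hinges, which can differ slightly from the quartiles returned by quantile().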

Measures of center of data: Mean and median

favstats(bill_length_mm ~ species, data = pingviner)
##     species min Q1 median Q3 max mean  sd   n missing
## 1    Adelie  32 37     39 41  46   39 2.7 151       1
## 2 Chinstrap  41 46     50 51  58   49 3.3  68       0
## 3    Gentoo  41 45     47 50  60   48 3.1 123       1

As previously defined, these data constitute a sample.
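To illustrate the difference between the two measures of center, here is a small sketch in base R using the body masses of the first eight penguins in the printout (typed in by hand; the names mass, m and med are illustrative).

```r
# Body masses (g) of the first eight penguins in the printout; row 4 is missing
mass <- c(3750, 3800, 3250, NA, 3450, 3650, 3625, 4675)

mean(mass)                        # NA: a missing value makes the mean missing
m   <- mean(mass, na.rm = TRUE)   # average of the 7 observed values
med <- median(mass, na.rm = TRUE) # middle value of the 7 sorted values

c(mean = m, median = med)
```

The single heavy penguin (4675 g) pulls the mean above the median, whereas the median only depends on the ordering of the observations.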

Measures of variability of data: range, standard deviation and variance
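As a rough sketch of how these three measures are computed, the snippet below uses the body masses of the first eight penguins in the printout and compares a hand-written sample variance (dividing by n - 1) with the built-in functions var() and sd(). The variable names are illustrative.

```r
# Body masses (g) of the first eight penguins in the printout, missing value removed
mass <- c(3750, 3800, 3250, NA, 3450, 3650, 3625, 4675)
mass <- mass[!is.na(mass)]
n <- length(mass)

r <- diff(range(mass))                     # range: maximum minus minimum
v <- sum((mass - mean(mass))^2) / (n - 1)  # sample variance
s <- sqrt(v)                               # standard deviation

# The hand-written versions agree with the built-in functions
all.equal(v, var(mass))
all.equal(s, sd(mass))
```

The standard deviation is in the same unit as the data (grams), while the variance is in squared units, which is why the standard deviation is usually the one reported.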

The empirical rule

If the histogram of the sample looks like a bell shaped curve, then