Intro and descriptive statistics

The ASTA team

Software

RStudio

R extensions

install.packages("mosaic")
library(mosaic)

R help

?sum
help.search("plot")

Data

Data example

We use data about penguins from the R package palmerpenguins:

pingviner <- palmerpenguins::penguins
pingviner
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipp… body… sex    year
##    <fct>   <fct>              <dbl>         <dbl>  <int> <int> <fct> <int>
##  1 Adelie  Torgersen           39.1          18.7    181  3750 male   2007
##  2 Adelie  Torgersen           39.5          17.4    186  3800 fema…  2007
##  3 Adelie  Torgersen           40.3          18.0    195  3250 fema…  2007
##  4 Adelie  Torgersen           NA            NA       NA    NA <NA>   2007
##  5 Adelie  Torgersen           36.7          19.3    193  3450 fema…  2007
##  6 Adelie  Torgersen           39.3          20.6    190  3650 male   2007
##  7 Adelie  Torgersen           38.9          17.8    181  3625 fema…  2007
##  8 Adelie  Torgersen           39.2          19.6    195  4675 male   2007
##  9 Adelie  Torgersen           34.1          18.1    193  3475 <NA>   2007
## 10 Adelie  Torgersen           42.0          20.2    190  4250 <NA>   2007
## # ... with 334 more rows

Data types

Quantitative variables

Categorical/qualitative variables
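As a minimal sketch of how the two kinds of variable typically appear in R: quantitative variables are stored as numeric vectors and categorical variables as factors. The bill lengths are copied from the first three rows of the printout above, and the species names are the three species in the data set. The two vectors are illustrative only and are not paired row by row.

```r
# Quantitative: measured on a numeric scale, so arithmetic is meaningful.
bill_length_mm <- c(39.1, 39.5, 40.3)

# Categorical: a fixed set of labels (levels), so we count rather than average.
species <- factor(c("Adelie", "Chinstrap", "Gentoo"))

is.numeric(bill_length_mm)   # quantitative variables are numeric
is.factor(species)           # categorical variables are factors
levels(species)              # the possible categories

mean(bill_length_mm)         # a numeric summary makes sense here
table(species)               # for a factor we tabulate counts instead
```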

Graphics for quantitative variables

Scatterplot

gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner)

gf_point(bill_length_mm ~ bill_depth_mm | species, color = ~ species, data = pingviner)

gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner) %>% 
  gf_lm()

Histogram

gf_histogram( ~ bill_length_mm, data = pingviner)

Summaries of quantitative variables

Percentiles

Q <- quantile(bill_length_mm ~ species, data = pingviner, na.rm = TRUE)
Q
##     species 0% 25% 50% 75% 100%
## 1    Adelie 32  37  39  41   46
## 2 Chinstrap 41  46  50  51   58
## 3    Gentoo 41  45  47  50   60

Boxplot

Boxplots can be good for comparing groups (note that we put the values on the y-axis here, as is conventional for boxplots):

gf_boxplot(bill_length_mm ~ species, color = ~ species, data = pingviner)

How to draw a box plot

Note: The whiskers extend to the minimum and maximum of the observations that are not deemed to be outliers.
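The ingredients of a box plot can be computed by hand, as in the rough sketch below. It assumes the common convention that an observation is an outlier if it lies more than 1.5 times the interquartile range beyond a quartile; the text above does not state which rule is used, so treat that threshold as an assumption. The data are the bill lengths from the first ten rows of the printout, and all variable names are illustrative.

```r
# Bill lengths (mm) from the first ten rows of the printout; row 4 is missing.
x <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0)
x <- x[!is.na(x)]                               # drop the missing value

# The box: lower quartile, median and upper quartile.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]                  # interquartile range

# Assumed outlier rule: more than 1.5 * IQR outside the box.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
is_outlier <- x < lower_fence | x > upper_fence

outliers <- x[is_outlier]                       # drawn as individual points
whisker_low <- min(x[!is_outlier])              # end of the lower whisker
whisker_high <- max(x[!is_outlier])             # end of the upper whisker
```

For these nine values the whiskers end at 38.9 and 40.3, while 34.1, 36.7 and 42.0 are flagged as outliers. That many outliers is a result of using such a small subset.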

Measures of center of data: Mean and median

favstats(bill_length_mm ~ species, data = pingviner)
##     species min Q1 median Q3 max mean  sd   n missing
## 1    Adelie  32 37     39 41  46   39 2.7 151       1
## 2 Chinstrap  41 46     50 51  58   49 3.3  68       0
## 3    Gentoo  41 45     47 50  60   48 3.1 123       1

As previously defined, a set of measurements such as these constitutes a sample.
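To make the two measures of center concrete, the sketch below computes the mean and median by hand in base R and compares them with the built-in functions. It uses the bill lengths from the first ten rows of the printout (one of which is missing); the variable names are illustrative.

```r
# Bill lengths (mm) from the first ten rows of the printout; row 4 is missing.
x <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0)

mean(x)                       # NA: one missing value makes the result missing

y <- x[!is.na(x)]             # keep only the observed values
n <- length(y)

# Mean: the sum of the observations divided by the number of observations.
mean_manual <- sum(y) / n

# Median: the middle value of the sorted observations
# (the average of the two middle values when n is even).
sorted <- sort(y)
median_manual <- if (n %% 2 == 1) {
  sorted[(n + 1) / 2]
} else {
  mean(sorted[n / 2 + 0:1])
}

# The built-in functions agree once missing values are removed.
all.equal(mean_manual, mean(x, na.rm = TRUE))
all.equal(median_manual, median(x, na.rm = TRUE))
```

The mean is pulled towards extreme observations such as 34.1, whereas the median depends only on the ordering of the values.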

Measures of variability of data: range, standard deviation and variance
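The three measures named in this heading can be sketched in base R as follows. The code assumes the usual sample definitions, with the variance dividing by n - 1, which is what R's var() and sd() use. The data are again the bill lengths from the first ten rows of the printout, and the variable names are illustrative.

```r
# Bill lengths (mm) from the first ten rows of the printout; row 4 is missing.
x <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0)
y <- x[!is.na(x)]             # keep only the observed values
n <- length(y)

# Range: the distance between the largest and smallest observation.
# Note that base R's range() returns the pair (min, max), not the difference.
range_width <- max(y) - min(y)

# Sample variance: the sum of squared deviations from the mean, divided by n - 1.
variance_manual <- sum((y - mean(y))^2) / (n - 1)

# Standard deviation: the square root of the variance, in the same unit as the data (mm).
sd_manual <- sqrt(variance_manual)

# The built-in functions give the same results.
all.equal(variance_manual, var(y))
all.equal(sd_manual, sd(y))
```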

The empirical rule

If the histogram of the sample looks like a bell-shaped curve, then