Intro and descriptive statistics

The ASTA team

Software

RStudio

R extensions

install.packages("mosaic")
library(mosaic)

R help

?sum
help.search("plot")

Data

Data example

We use data about penguins from the R package palmerpenguins:

pingviner <- palmerpenguins::penguins
pingviner
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipp… body… sex    year
##    <fct>   <fct>              <dbl>         <dbl>  <int> <int> <fct> <int>
##  1 Adelie  Torgersen           39.1          18.7    181  3750 male   2007
##  2 Adelie  Torgersen           39.5          17.4    186  3800 fema…  2007
##  3 Adelie  Torgersen           40.3          18.0    195  3250 fema…  2007
##  4 Adelie  Torgersen           NA            NA       NA    NA <NA>   2007
##  5 Adelie  Torgersen           36.7          19.3    193  3450 fema…  2007
##  6 Adelie  Torgersen           39.3          20.6    190  3650 male   2007
##  7 Adelie  Torgersen           38.9          17.8    181  3625 fema…  2007
##  8 Adelie  Torgersen           39.2          19.6    195  4675 male   2007
##  9 Adelie  Torgersen           34.1          18.1    193  3475 <NA>   2007
## 10 Adelie  Torgersen           42.0          20.2    190  4250 <NA>   2007
## # ... with 334 more rows
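
The printout contains missing values (NA), for instance in row 4. A minimal base-R sketch for counting them follows. The data frame `first_rows` is an illustrative name for a hand-copied excerpt of rows 1, 4, 6 and 9 of the printout, used only so the example runs without the package.

```r
# Hand-copied excerpt of rows 1, 4, 6 and 9 from the printout
# (`first_rows` is an illustrative name, not part of the data package).
first_rows <- data.frame(
  bill_length_mm = c(39.1, NA, 39.3, 34.1),
  body_mass_g    = c(3750L, NA, 3650L, 3475L),
  sex            = factor(c("male", NA, "male", NA))
)

# Number of missing values in each column
na_counts <- colSums(is.na(first_rows))
na_counts

# Summary functions return NA unless missing values are removed first
mean(first_rows$bill_length_mm)
mean_bill <- mean(first_rows$bill_length_mm, na.rm = TRUE)
mean_bill
```

This is why `na.rm = TRUE` tends to appear in summary calls on this data set.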

Data types

Quantitative variables

Categorical/qualitative variables
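
As a small sketch of how the two data types typically show up in R: quantitative variables are usually stored as numeric vectors and categorical variables as factors. The data frame `toy` is made up for illustration and only mimics two columns of the penguin data.

```r
# `toy` is a made-up data frame for illustration
toy <- data.frame(
  bill_length_mm = c(39.1, 50.0, 46.9),                 # quantitative
  species = factor(c("Adelie", "Gentoo", "Chinstrap"))  # categorical
)

# Storage class of each column
var_types <- sapply(toy, class)
var_types

# A factor has a fixed set of categories (levels), sorted alphabetically by default
species_levels <- levels(toy$species)
species_levels

# Counting cases per category makes sense for categorical variables
table(toy$species)
```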

Graphics for quantitative variables

Scatterplot

gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner)

gf_point(bill_length_mm ~ bill_depth_mm | species, color = ~ species, data = pingviner)

gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner) %>% 
  gf_lm()

gf_point(1 ~ bill_length_mm, color = ~ species, data = pingviner)

gf_point(species ~ bill_length_mm, color = ~ species, data = pingviner)

Histogram

gf_histogram( ~ bill_length_mm, data = pingviner)

gf_histogram( ~ bill_length_mm, color = ~species, data = pingviner)

Boxplot

Boxplots are good for comparing groups (note that we put the values on the y-axis here, as is conventional for boxplots):

gf_boxplot(bill_length_mm ~ species, color = ~ species, data = pingviner)

To understand the details of the boxplot, we need to introduce percentiles/quantiles, and in particular the quartiles (the 25%, 50% and 75% quantiles), which can be seen here:

Q <- quantile(bill_length_mm ~ species, data = pingviner, na.rm = TRUE)
Q
##     species 0% 25% 50% 75% 100%
## 1    Adelie 32  37  39  41   46
## 2 Chinstrap 41  46  50  51   58
## 3    Gentoo 41  45  47  50   60
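
To make the quantile output concrete, here is a minimal sketch on a tiny made-up vector (`x` is illustrative, not penguin data). It uses base R's `quantile()` with its default interpolation rule, without the formula interface.

```r
# Nine made-up observations, already sorted
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Default probabilities are 0, 0.25, 0.5, 0.75 and 1:
# minimum, lower quartile, median, upper quartile, maximum
q <- quantile(x)
q

# Interquartile range: upper quartile minus lower quartile
iqr <- IQR(x)
iqr
```

Half of the observations lie between the lower and upper quartile, which is the range the box in a boxplot covers.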

How to draw a box plot

Note: The whiskers extend to the minimum and maximum of the observations that are not deemed to be outliers.
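
The note can be sketched in code. The sketch assumes the common convention that an observation is deemed an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The factor 1.5 is an assumption here, not stated in the text, although it is the default in ggplot2's `geom_boxplot()`. The vector `x` is made up.

```r
# Made-up observations with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)

# The box spans the lower to the upper quartile
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Assumed outlier rule: more than 1.5 * IQR outside the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
is_outlier  <- x < lower_fence | x > upper_fence
outliers    <- x[is_outlier]

# Whiskers end at the most extreme observations that are not outliers
whiskers <- range(x[!is_outlier])

outliers
whiskers
```

Note that the whiskers end at actual observations, not at the fences themselves.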

Data wrangling

Selecting columns/variables

# Assigning the data subset to an object
bill_data <- select(pingviner, bill_length_mm, bill_depth_mm,
                    species, sex, island, year)

# The same columns can be selected more compactly by dropping the unwanted ones
# (this overwrites the object and keeps the original column order):
bill_data <- select(pingviner, -flipper_length_mm, -body_mass_g)
bill_data <- pingviner %>% select(-flipper_length_mm, -body_mass_g)
names(bill_data)
## [1] "species"        "island"         "bill_length_mm" "bill_depth_mm" 
## [5] "sex"            "year"
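
Selection helpers can shorten a selection further. A minimal sketch, assuming dplyr is installed and using its `starts_with()` helper on a made-up tibble `toy` that mimics a few of the penguin columns:

```r
library(dplyr)

# `toy` is a made-up tibble for illustration
toy <- tibble(
  species        = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 50.0),
  bill_depth_mm  = c(18.7, 16.3),
  body_mass_g    = c(3750, 5700)
)

# Keep species plus every column whose name starts with "bill"
bills <- toy %>% select(species, starts_with("bill"))
selected_names <- names(bills)
selected_names
```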

Filtering rows/cases

pingviner %>% filter(island == "Biscoe")
## # A tibble: 168 x 8
##    species island bill_length_mm bill_depth_mm flipper… body_… sex    year
##    <fct>   <fct>           <dbl>         <dbl>    <int>  <int> <fct> <int>
##  1 Adelie  Biscoe           37.8          18.3      174   3400 fema…  2007
##  2 Adelie  Biscoe           37.7          18.7      180   3600 male   2007
##  3 Adelie  Biscoe           35.9          19.2      189   3800 fema…  2007
##  4 Adelie  Biscoe           38.2          18.1      185   3950 male   2007
##  5 Adelie  Biscoe           38.8          17.2      180   3800 male   2007
##  6 Adelie  Biscoe           35.3          18.9      187   3800 fema…  2007
##  7 Adelie  Biscoe           40.6          18.6      183   3550 male   2007
##  8 Adelie  Biscoe           40.5          17.9      187   3200 fema…  2007
##  9 Adelie  Biscoe           37.9          18.6      172   3150 fema…  2007
## 10 Adelie  Biscoe           40.5          18.9      180   3950 male   2007
## # ... with 158 more rows
pingviner %>% filter(sex == "male") %>% 
  filter(species == "Gentoo") %>% 
  filter(flipper_length_mm > 220)
## # A tibble: 34 x 8
##    species island bill_length_mm bill_depth_mm flipper… body_… sex    year
##    <fct>   <fct>           <dbl>         <dbl>    <int>  <int> <fct> <int>
##  1 Gentoo  Biscoe           50.0          16.3      230   5700 male   2007
##  2 Gentoo  Biscoe           49.2          15.2      221   6300 male   2007
##  3 Gentoo  Biscoe           48.7          15.1      222   5350 male   2007
##  4 Gentoo  Biscoe           47.3          15.3      222   5250 male   2007
##  5 Gentoo  Biscoe           59.6          17.0      230   6050 male   2007
##  6 Gentoo  Biscoe           49.6          16.0      225   5700 male   2008
##  7 Gentoo  Biscoe           50.5          15.9      222   5550 male   2008
##  8 Gentoo  Biscoe           50.5          15.9      225   5400 male   2008
##  9 Gentoo  Biscoe           50.1          15.0      225   5000 male   2008
## 10 Gentoo  Biscoe           50.4          15.3      224   5550 male   2008
## # ... with 24 more rows
# The same three conditions combined in a single filter using & (and):
pingviner %>% filter(sex == "male" & species == "Gentoo" & flipper_length_mm > 220)
pingviner %>% filter(species == "Gentoo" | species == "Adelie")
## # A tibble: 276 x 8
##    species island    bill_length_mm bill_depth_mm flipp… body… sex    year
##    <fct>   <fct>              <dbl>         <dbl>  <int> <int> <fct> <int>
##  1 Adelie  Torgersen           39.1          18.7    181  3750 male   2007
##  2 Adelie  Torgersen           39.5          17.4    186  3800 fema…  2007
##  3 Adelie  Torgersen           40.3          18.0    195  3250 fema…  2007
##  4 Adelie  Torgersen           NA            NA       NA    NA <NA>   2007
##  5 Adelie  Torgersen           36.7          19.3    193  3450 fema…  2007
##  6 Adelie  Torgersen           39.3          20.6    190  3650 male   2007
##  7 Adelie  Torgersen           38.9          17.8    181  3625 fema…  2007
##  8 Adelie  Torgersen           39.2          19.6    195  4675 male   2007
##  9 Adelie  Torgersen           34.1          18.1    193  3475 <NA>   2007
## 10 Adelie  Torgersen           42.0          20.2    190  4250 <NA>   2007
## # ... with 266 more rows
# Using != (not equal) to exclude a single species:
pingviner %>% filter(species != "Chinstrap")
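
Two details of `filter()` are easy to miss: `%in%` can replace several `==` comparisons joined by `|`, and rows where the condition evaluates to NA are dropped. A minimal sketch, assuming dplyr is installed and loaded directly, on a made-up tibble `toy`:

```r
library(dplyr)

# `toy` is a made-up tibble for illustration
toy <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  sex     = c("male", NA, "female", "female")
)

# %in% is a compact alternative to species == "Gentoo" | species == "Adelie"
two_species <- toy %>% filter(species %in% c("Gentoo", "Adelie"))

# filter() keeps only rows where the condition is TRUE,
# so the row with missing sex is dropped by both of these
males     <- toy %>% filter(sex == "male")
not_males <- toy %>% filter(sex != "male")

# Rows with missing values must be asked for explicitly
unknown_sex <- toy %>% filter(is.na(sex))

nrow(two_species)
nrow(males)
nrow(not_males)
nrow(unknown_sex)
```

As a consequence, the rows matching `sex == "male"` and `sex != "male"` need not add up to the full data set when some values are missing.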

Arranging rows/cases

pingviner %>% filter(sex == "female") %>% arrange(body_mass_g)
## # A tibble: 165 x 8
##    species   island    bill_length_mm bill_de… flipper… body_… sex    year
##    <fct>     <fct>              <dbl>    <dbl>    <int>  <int> <fct> <int>
##  1 Chinstrap Dream               46.9     16.6      192   2700 fema…  2008
##  2 Adelie    Biscoe              36.5     16.6      181   2850 fema…  2008
##  3 Adelie    Biscoe              36.4     17.1      184   2850 fema…  2008
##  4 Adelie    Biscoe              34.5     18.1      187   2900 fema…  2008
##  5 Adelie    Dream               33.1     16.1      178   2900 fema…  2008
##  6 Adelie    Torgersen           38.6     17.0      188   2900 fema…  2009
##  7 Chinstrap Dream               43.2     16.6      187   2900 fema…  2007
##  8 Adelie    Biscoe              37.9     18.6      193   2925 fema…  2009
##  9 Adelie    Dream               37.0     16.9      185   3000 fema…  2007
## 10 Adelie    Dream               37.3     16.8      192   3000 fema…  2009
## # ... with 155 more rows
pingviner %>% filter(sex == "female") %>% arrange(desc(body_mass_g))
## # A tibble: 165 x 8
##    species island bill_length_mm bill_depth_mm flipper… body_… sex    year
##    <fct>   <fct>           <dbl>         <dbl>    <int>  <int> <fct> <int>
##  1 Gentoo  Biscoe           46.5          14.8      217   5200 fema…  2008
##  2 Gentoo  Biscoe           45.2          14.8      212   5200 fema…  2009
##  3 Gentoo  Biscoe           49.1          14.8      220   5150 fema…  2008
##  4 Gentoo  Biscoe           44.9          13.3      213   5100 fema…  2008
##  5 Gentoo  Biscoe           45.1          14.5      207   5050 fema…  2007
##  6 Gentoo  Biscoe           45.1          14.5      215   5000 fema…  2007
##  7 Gentoo  Biscoe           42.9          13.1      215   5000 fema…  2007
##  8 Gentoo  Biscoe           50.5          15.2      216   5000 fema…  2009
##  9 Gentoo  Biscoe           47.2          15.5      215   4975 fema…  2009
## 10 Gentoo  Biscoe           42.6          13.7      213   4950 fema…  2008
## # ... with 155 more rows