--- title: "Intro and descriptive statistics" author: "The ASTA team" output: slidy_presentation: fig_caption: no highlight: tango theme: cerulean pdf_document: fig_caption: no highlight: tango number_section: yes toc: yes --- ```{r, include = FALSE} options(digits = 2) ## Remember to add all packages used in the code below! missing_pkgs <- setdiff(c("mosaic", "palmerpenguins"), rownames(installed.packages())) if(length(missing_pkgs)>0) install.packages(missing_pkgs) knitr::opts_chunk$set(warning = FALSE) ``` # Software ## **Rstudio** * Make a folder on your computer where you want to keep files to use in **Rstudio**. **Do NOT use Danish characters æ, ø, å** in the folder name (or anywhere in the path to the folder). * Set the working directory to this folder: `Session -> Set Working Directory -> Choose Directory` (shortcut: Ctrl+Shift+H). * Make the change permanent by setting the default directory in: `Tools -> Global Options -> Choose Directory`. ## **R** extensions * The functionality of **R** can be extended through libraries or packages (much like plugins in browsers etc.). Some are installed by default in **R** and you just need to load them. * To install a new package in **Rstudio** use the menu: `Tools -> Install Packages` * You need to know the name of the package you want to install. You can also do it through a command: ```{r eval=FALSE} install.packages("mosaic") ``` * When it is installed you can load it through the `library` command: ```{r results="hide", message=FALSE} library(mosaic) ``` * This loads the `mosaic` package which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the `mosaic` package, but you can safely ignore that. ## **R** help * You get help via `?`: ```{r eval=FALSE} ?sum ``` * Use `tab` to make **Rstudio** guess what you have started typing. * Search for help: ```{r eval=FALSE} help.search("plot") ``` * You can find a cheat sheet with the **R** functions we use for this course [here](https://asta.math.aau.dk/course/asta/2021-2/?file=cheatsheet.pdf). ```{r, echo = FALSE, eval = FALSE} # OLD VERSION OF LINK SHOWN ABOVE -- Short term fix implemented, but more permanent solution should be found... # [here](https://asta.math.aau.dk/course/asta/`r gsub("^.*asta/([^/]+)/.*$", "\\1", readLines("../../build-tools/SERVER_SEMESTER_DIR"))`/?file=cheatsheet.pdf). ``` # Data ## Data example We use data about pengiuns from the R package [palmerpenguins](https://github.com/allisonhorst/palmerpenguins) ```{r} pingviner <- palmerpenguins::penguins pingviner ``` * What is fundamentally different about the the variables (columns) `species` and `body_mass_g`? ## Data types ### Quantitative variables * The measurements have numerical values. * Quantative data often comes about in one of the following ways: * **Continuous variables**: measurements of time, length, size, age, mass, etc. * **Discrete variables**: counts of e.g. words in a text, hits on a webpage, number of arrivals to a queue in one hour, etc. * Measurements like this have a well-defined scale and in **R** they are stored as the type **numeric**. * It is important to be able to distinguish between discrete count variables and continuous variables, since this often determines how we describe the uncertainty of a measurement. * Are any of the measurements in our data set quantitative? ---- ### Categorical/qualitative variables * The measurement is one of a set of given categories, e.g. sex (male/female), social status, satisfaction score (low/medium/high), etc. * The measurement is usually stored (which is also recommended) as a **factor** in **R**. The possible categories are called **levels**. Example: the levels of the factor "sex" is male/female. * Factors have two so-called scales: + **Nominal scale**: There is no natural ordering of the factor levels, e.g. sex and hair color. + **Ordinal scale**: There is a natural ordering of the factor levels, e.g. social status and satisfaction score. A factor in **R** can have a so-called **attribute** assigned, which tells if it is ordinal. * Are any of the measurements in our data set categorical/qualitative? # Graphics for quantitative variables ## Scatterplot - To study the relation between two quantitative variables a scatterplot is used: ```{r} gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner) ``` ---- - We could also draw the graph for each species: ```{r} gf_point(bill_length_mm ~ bill_depth_mm | species, color = ~ species, data = pingviner) ``` ---- - If we want a regression line along with the points we can do: ```{r} gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner) %>% gf_lm() ``` ---- - A scatterplot is not very good for a single quantitative variable: ```{r} gf_point(1 ~ bill_length_mm, color = ~ species, data = pingviner) ``` ```{r} gf_point(species ~ bill_length_mm, color = ~ species, data = pingviner) ``` ## Histogram - For a single quantitative variable a histogram offers more details: ```{r} gf_histogram( ~ bill_length_mm, data = pingviner) ``` - How to make a histogram for some variable `x`: * Divide the interval from the minimum value of `x` to the maximum value of `x` in an appropriate number of equal sized sub-intervals. * Draw a box over each sub-interval with the height being proportional to the number of observations in the sub-interval. ---- - Not great for comparing groups: ```{r} gf_histogram( ~ bill_length_mm, color = ~species, data = pingviner) ``` ## Boxplot Boxplot can be good for comparing groups (notice we put the values on the y-axis here as it is more conventional for boxplots): ```{r bill length vs depth} gf_boxplot(bill_length_mm ~ species, color = ~ species, data = pingviner) ``` To understand the details of the boxplot we need to introduce **percentiles**/quantiles and in particular quartiles which can be seen here: ```{r} Q <- quantile(bill_length_mm ~ species, data = pingviner, na.rm = TRUE) Q ``` * 50-percentile is the **median** and it is a measure of the center of data. * 0-percentile is the **minimum** value. * 25-percentile is called the **lower quartile** (Q1).\ Median of lower 50% of data. * 75-percentile is called the **upper quartile** (Q3).\ Median of upper 50% of data. * 100-percentil is the **maximum** value. * **Interquartile Range (IQR)**: a measure of variability given by the difference of the upper and lower quartiles: * Details about how to find/calculate percentiles are postponed to later. ---- ### How to draw a box plot * Box: * Calculate the median, lower and upper quartiles. * Plot a line by the median and draw a box between the upper and lower quartiles. * Whiskers: * Calculate interquartile range and call it IQR. * Calculate the following values: * L = lower quartile - 1.5*IQR * U = upper quartile + 1.5*IQR * Draw a line from lower quartile to the smallest measurement, which is larger than *L*. * Similarly, draw a line from upper quartile to the largest measurement which is smaller than *U*. * Outliers: Measurements smaller than *L* or larger than *U* are drawn as circles. *Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.* ```{r echo = FALSE} xx <- pingviner %>% filter(species == "Gentoo") %>% arrange(bill_length_mm) %>% pull(bill_length_mm) %>% na.omit() nn <- length(xx) dat <- data.frame(plottype = factor("boxplot", levels = c("points", "boxplot", "limits")), y = xx) b <- boxplot(y ~ plottype, data = dat, ylim = c(34, 60), xlab = "", ylab = "", pch = 25, main = "Gentoo bill length") points(rep(1, length(dat$y)), dat$y) Q1 <- Q$`25%`[3] Q3 <- Q$`75%`[3] IQR <- Q3-Q1 arrows(3, Q1, 3, Q1-1.5*IQR) text(x = 3, y = c(Q1-1.5*IQR,Q1), pos = c(1, 3) , labels = c("Q1-1.5IQR", "Q1"), cex = 1) arrows(3, Q3, 3, Q3+1.5*IQR) text(x = 3, y = c(Q3, Q3+1.5*IQR), pos = c(1,3), labels = c("Q3", "Q3+1.5IQR"), cex = 1) abline(h = c(Q1-1.5*IQR, Q1, Q3, Q3+1.5*IQR), lty = 2) ``` # Data wrangling ## Selecting columns/variables - To subset columns of data use `select()` (automatically loaded from `dplyr` package by `mosaic`): ```{r subsetting data} # Assigning the data subset to an object bill_data <- select(pingviner, bill_length_mm, bill_depth_mm, species, sex, island, year) # This particular subset can be written shorter by (overwriting the object): bill_data <- select(pingviner, -flipper_length_mm, -body_mass_g) ``` - Special role of first argument gives rise to this "pipe" (`%>%`) syntax: ```{r} bill_data <- pingviner %>% select(-flipper_length_mm, -body_mass_g) ``` - We read this as: first take the dataset `pingviner` **and then** select all columns except `flipper_length_mm` and `body_mass_g`. - The resulting dataset doesn't have flipper length and body mass: ```{r} names(bill_data) ``` ## Filtering rows/cases - We use `filter()` to subset rows/cases. E.g. all penguins from Biscoe islands: ```{r} pingviner %>% filter(island == "Biscoe") ``` ---- - All male Gentoo penguins with over 220 mm flippers: ```{r} pingviner %>% filter(sex == "male") %>% filter(species == "Gentoo") %>% filter(flipper_length_mm>220) ``` - This could also have been done with a single `filter()` command (output not shown): ```{r results='hide'} pingviner %>% filter(sex == "male" & species == "Gentoo" & flipper_length_mm>220) ``` ---- - All penguins of species Gentoo or Adelie: ```{r } pingviner %>% filter(species == "Gentoo" | species == "Adelie") ``` - This would be the same as penguins which are not Chinstrap (output not shown): ```{r results='hide'} pingviner %>% filter(species != "Chinstrap") ``` ## Arranging rows/cases - We use `arrange()` to arrange the order of the rows/cases: ```{r} pingviner %>% filter(sex == "female") %>% arrange(body_mass_g) ``` - Use `arrange(desc())` for descending values: ```{r} pingviner %>% filter(sex == "female") %>% arrange(desc(body_mass_g)) ```