--- title: "Intro and descriptive statistics" author: "The ASTA team" output: slidy_presentation: fig_caption: no highlight: tango theme: cerulean pdf_document: fig_caption: no highlight: tango number_section: yes toc: yes --- ```{r, include = FALSE} options(digits = 2) ## Remember to add all packages used in the code below! missing_pkgs <- setdiff(c("mosaic", "palmerpenguins"), rownames(installed.packages())) if(length(missing_pkgs)>0) install.packages(missing_pkgs) knitr::opts_chunk$set(warning = FALSE) ``` # Software ## **Rstudio** * Make a folder on your computer where you want to keep files to use in **Rstudio**. **Do NOT use Danish characters æ, ø, å** in the folder name (or anywhere in the path to the folder). * Set the working directory to this folder: `Session -> Set Working Directory -> Choose Directory` (shortcut: Ctrl+Shift+H). * Make the change permanent by setting the default directory in: `Tools -> Global Options -> Choose Directory`. ## **R** extensions * The functionality of **R** can be extended through libraries or packages (much like plugins in browsers etc.). Some are installed by default in **R** and you just need to load them. * To install a new package in **Rstudio** use the menu: `Tools -> Install Packages` * You need to know the name of the package you want to install. You can also do it through a command: ```{r eval=FALSE} install.packages("mosaic") ``` * When it is installed you can load it through the `library` command: ```{r results="hide", message=FALSE} library(mosaic) ``` * This loads the `mosaic` package which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the `mosaic` package, but you can safely ignore that. ## **R** help * You get help via `?`: ```{r eval=FALSE} ?sum ``` * Use `tab` to make **Rstudio** guess what you have started typing. * Search for help: ```{r eval=FALSE} help.search("plot") ``` * You can find a cheat sheet with the **R** functions we use for this course [here](https://asta.math.aau.dk/course/asta/2022-2/?file=cheatsheet.pdf). ```{r, echo = FALSE, eval = FALSE} # OLD VERSION OF LINK SHOWN ABOVE -- Short term fix implemented, but more permanent solution should be found... # [here](https://asta.math.aau.dk/course/asta/`r gsub("^.*asta/([^/]+)/.*$", "\\1", readLines("../../build-tools/SERVER_SEMESTER_DIR"))`/?file=cheatsheet.pdf). ``` # Data ## Data example We use data about pengiuns from the R package [palmerpenguins](https://github.com/allisonhorst/palmerpenguins) ```{r} pingviner <- palmerpenguins::penguins pingviner ``` * What is fundamentally different about the the variables (columns) `species` and `body_mass_g`? ## Data types ### Quantitative variables * The measurements have numerical values. * Quantative data often comes about in one of the following ways: * **Continuous variables**: measurements of time, length, size, age, mass, etc. * **Discrete variables**: counts of e.g. words in a text, hits on a webpage, number of arrivals to a queue in one hour, etc. * Measurements like this have a well-defined scale and in **R** they are stored as the type **numeric**. * It is important to be able to distinguish between discrete count variables and continuous variables, since this often determines how we describe the uncertainty of a measurement. * Are any of the measurements in our data set quantitative? ---- ### Categorical/qualitative variables * The measurement is one of a set of given categories, e.g. sex (male/female), social status, satisfaction score (low/medium/high), etc. * The measurement is usually stored (which is also recommended) as a **factor** in **R**. The possible categories are called **levels**. Example: the levels of the factor "sex" is male/female. * Factors have two so-called scales: + **Nominal scale**: There is no natural ordering of the factor levels, e.g. sex and hair color. + **Ordinal scale**: There is a natural ordering of the factor levels, e.g. social status and satisfaction score. A factor in **R** can have a so-called **attribute** assigned, which tells if it is ordinal. * Are any of the measurements in our data set categorical/qualitative? # Graphics for quantitative variables ## Scatterplot - To study the relation between two quantitative variables a scatterplot is used: ```{r} gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner) ``` ---- - We could also draw the graph for each species: ```{r} gf_point(bill_length_mm ~ bill_depth_mm | species, color = ~ species, data = pingviner) ``` ---- - If we want a regression line along with the points we can do: ```{r} gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner) %>% gf_lm() ``` ---- ## Histogram - For a single quantitative variable a histogram offers more details: ```{r} gf_histogram( ~ bill_length_mm, data = pingviner) ``` - How to make a histogram for some variable `x`: * Divide the interval from the minimum value of `x` to the maximum value of `x` in an appropriate number of equal sized sub-intervals. * Draw a box over each sub-interval with the height being proportional to the number of observations in the sub-interval. ```{r include=FALSE} # Ignore this code. It's just used to write correct numbers later in the text. y <- pingviner %>% filter(species == "Gentoo") %>% pull(bill_length_mm) %>% na.omit() nn <- length(y) xx <- sort(y) ``` # Summaries of quantitative variables ## Percentiles * **The $p$th percentile** is a value such that at least $p$% of the sample lies below or at this value and at least $(100-p)$% of the sample lies above or at the value. ```{r} Q <- quantile(bill_length_mm ~ species, data = pingviner, na.rm = TRUE) Q ``` * 50-percentile is the **median** and it is a measure of the center of data as the number of data points below the median is the samme as the number above the median. * 0-percentile is the **minimum** value. * 25-percentile is called the **lower quartile** (Q1).\ Median of lower 50% of data. * 75-percentile is called the **upper quartile** (Q3).\ Median of upper 50% of data. * 100-percentil is the **maximum** value. * **Interquartile Range (IQR)**: a measure of variability given by the difference of the upper and lower quartiles. ## Boxplot Boxplot can be good for comparing groups (notice we put the values on the y-axis here as it is more conventional for boxplots): ```{r bill length vs depth} gf_boxplot(bill_length_mm ~ species, color = ~ species, data = pingviner) ``` ---- ### How to draw a box plot * Box: * Calculate the median, lower and upper quartiles. * Plot a line by the median and draw a box between the upper and lower quartiles. * Whiskers: * Calculate interquartile range and call it IQR. * Calculate the following values: * L = lower quartile - 1.5*IQR * U = upper quartile + 1.5*IQR * Draw a line from lower quartile to the smallest measurement, which is larger than *L*. * Similarly, draw a line from upper quartile to the largest measurement which is smaller than *U*. * Outliers: Measurements smaller than *L* or larger than *U* are drawn as circles. *Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.* ```{r echo = FALSE} xx <- pingviner %>% filter(species == "Gentoo") %>% arrange(bill_length_mm) %>% pull(bill_length_mm) %>% na.omit() nn <- length(xx) dat <- data.frame(plottype = factor("boxplot", levels = c("points", "boxplot", "limits")), y = xx) b <- boxplot(y ~ plottype, data = dat, ylim = c(34, 60), xlab = "", ylab = "", pch = 25, main = "Gentoo bill length") points(rep(1, length(dat$y)), dat$y) Q1 <- Q$`25%`[3] Q3 <- Q$`75%`[3] IQR <- Q3-Q1 arrows(3, Q1, 3, Q1-1.5*IQR) text(x = 3, y = c(Q1-1.5*IQR,Q1), pos = c(1, 3) , labels = c("Q1-1.5IQR", "Q1"), cex = 1) arrows(3, Q3, 3, Q3+1.5*IQR) text(x = 3, y = c(Q3, Q3+1.5*IQR), pos = c(1,3), labels = c("Q3", "Q3+1.5IQR"), cex = 1) abline(h = c(Q1-1.5*IQR, Q1, Q3, Q3+1.5*IQR), lty = 2) ``` ## Measures of center of data: Mean and median * A number of numerical summaries can be retrieved using the `favstats` command: ```{r} favstats(bill_length_mm ~ species, data = pingviner) ``` * The observed values of `bill_length_mm` are $y_1=`r y[1]`$, $y_2=`r y[2]`,\ldots,y_n=`r y[nn]`$, where there are a total of $n=`r nn`$ values. As previously defined this constitutes a **sample**. * **mean** = `r round(mean(y))` is the **average** of the sample, which is calculated by $$ \bar{y}=\frac{1}{n}\sum_{i=1}^n y_i. $$ We may also call $\bar{y}$ the **(empirical) mean** or the **sample mean**. It is calculated using `mean()` in **R**. * **median** = `r round(median(y))` is calculated using `median()` in **R**. * An important property of the **mean** and the **median** is that they have the same unit as the observations (e.g. millimeter). ## Measures of variability of data: range, standard deviation and variance * The **range** is the difference of the largest and smallest observation (`range()` in **R**). * The **(empirical) variance** (`var()` in **R**) is the average of the squared deviations from the mean: $$ s^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2. $$ * **sd** $=$ **standard deviation** $= s=\sqrt{s^2}$ (`sd()` in **R**). * Note:\ If the observations are measured in mm,\ the **variance** has unit $\text{mm}^2$ which is hard to interpret. The **standard deviation** on the other hand has the same unit as the observations. * The standard deviation describes how much data varies around the (empirical) mean. ---- ### The empirical rule ```{r empRule,echo=FALSE,fig.width=12,fig.height=6} set.seed(5) x <- rnorm(900) x <- (x-mean(x))/sd(x) hist(x,axes=F,xlab="",breaks="FD",ylab="",main="",cex.main = 1.5,ylim=c(-.21,.4),probability=TRUE) axis(1,at=-3:3,labels=F,pos=0) axis(1,at=-3,labels=substitute(bar(y)-3*s),pos=0) axis(1,at=-2,labels=substitute(bar(y)-2*s),pos=0) axis(1,at=-1,labels=substitute(bar(y)-s),pos=0) axis(1,at=0,labels=substitute(bar(y)),pos=0) axis(1,at=1,labels=substitute(bar(y)+s),pos=0) axis(1,at=2,labels=substitute(bar(y)+2*s),pos=0) axis(1,at=3,labels=substitute(bar(y)+3*s),pos=0) arrows(-1,-.1,1,-.1,col="red",code=3,length=.1) text(-.01,-.115,"About 68% of measurements",col="red") arrows(-2,-.15,2,-.15,col="blue",code=3,length=.1) text(-.01,-.1655,"About 95% of measurements",col="blue") arrows(-3,-.2,3,-.2,code=3,length=.1) text(-.01,-.215,"All or nearly all measurements") ``` If the histogram of the sample looks like a bell shaped curve, then * about 68% of the observations lie between $\bar{y}-s$ and $\bar{y}+s$. * about 95% of the observations lie between $\bar{y}-2s$ and $\bar{y}+2s$. * All or almost all (99.7%) of the observations lie between $\bar{y}-3s$ and $\bar{y}+3s$.