---
title: "Introduction to R and descriptive statistics"
author: "The ASTA team (Modified by Søren Højsgaard, September 2025)"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_section: yes
    toc: yes
---

```{r, include = FALSE}
options(digits = 3)
knitr::opts_chunk$set(#echo = !TRUE,
                      cache=!TRUE,
                      size="footnotesize",
                      #warning = FALSE, message = FALSE,
                      fig.height=3.5)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# Introduction to R

----

## **Rstudio**

* Make a folder on your computer where you want to keep files to use in **Rstudio**. **Do NOT use special characters like æ, ø, å** in the folder name (or anywhere in the path to the folder).
* Set the working directory to this folder: `Session -> Set Working Directory -> Choose Directory` (shortcut: Ctrl+Shift+H).
* Make the change permanent by setting the default directory in: `Tools -> Global Options -> Choose Directory`.

----

## **R** basics

* Ordinary calculations:

```{r}
4.6 * (2 + 3)^4
```

* Make a (scalar) object and print it:

```{r}
a <- 4
a
```

* Make a (vector) object and print it:

```{r}
b <- c(2, 5, 7)
b
```

* Make a sequence of numbers and print it:

```{r}
s <- 1:4
s
```

* Note: A more flexible command for sequences:

```{r}
s <- seq(1, 4, by = 1)
```

* **R** does elementwise calculations:

```{r}
a * b
a + b
b ^ 2
```

* Sum and product of elements:

```{r}
sum(b)
prod(b)
```

----

## **R** markdown

* The slides and all exercises in R (including the exam questions) are made in the special Rmarkdown format.
* This allows you to combine text and R code.
* You can write formulas using standard LaTeX commands.

----

## **R** extensions

* The functionality of **R** can be extended through libraries or packages (much like plugins in browsers etc.).
  Some are installed by default in **R** and you just need to load them.
* To install a new package in **Rstudio** use the menu: `Tools -> Install Packages`
* You need to know the name of the package you want to install. You can also do it through a command:

```{r eval=FALSE}
install.packages("mosaic")
```

* When it is installed you can load it through the `library` command:

```{r results="hide", message=FALSE}
library(mosaic)
```

* This loads the `mosaic` package which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the `mosaic` package, but you can safely ignore that.

----

## **R** help

* You get help via `?`:

```{r eval=FALSE}
?sum
```

* Use `tab` to make **Rstudio** guess what you have started typing.
* Search for help:

```{r eval=FALSE}
help.search("plot")
```

* You can find a cheat sheet with the **R** functions we use for this course [here](https://asta.math.aau.dk/course/asta/2022-2/?file=cheatsheet.pdf).
* Save your commands in a file for later usage:
    + Select the history tab in the top right pane in **Rstudio**.
    + Mark the commands you want to save.
    + Press the `To Source` button.

```{r, echo = FALSE, eval = FALSE}
# OLD VERSION OF LINK SHOWN ABOVE -- Short term fix implemented, but more permanent solution should be found...
# [here](https://asta.math.aau.dk/course/asta/`r gsub("^.*asta/([^/]+)/.*$", "\\1", readLines("../../build-tools/SERVER_SEMESTER_DIR"))`/?file=cheatsheet.pdf).
```

# Data in **R**

----

## Data example

* The `Chile` dataset is a data frame with 2700 rows and 8 columns concerning the 1988 plebiscite in Chile for or against Pinochet continuing for another eight years as leader. The sample consists of voting intentions for voters from the Chilean population. There are missing values in the dataset.
* The data set contains the variables:
    * `region`: The region in Chile where the voter lives.
    * `population`: Population of the region.
    * `sex`: The gender of the voter.
    * `age`: The age of the voter.
    * `education`: Education level of the voter (primary, secondary, post-secondary).
    * `income`: Monthly income of the voter.
    * `statusquo`: To which degree the voter supports the status quo (numbers ranging from about -2 to 2).
    * `vote`: Should Pinochet continue? `Y` = yes, `N` = no, `U` = undecided, `A` = will abstain from voting.

Note: The referendum was held in Chile on 5 October 1988. The "No" side won with 56% of the vote. Democratic elections were held in 1989, leading to the establishment of a new government in 1990.

* More information about the data set may be found [here](https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/Chile).

```{r}
Chile <- read.delim("https://asta.math.aau.dk/datasets?file=Chile.txt")
head(Chile)
```

----

## Data types

### Quantitative variables

* The measurements have numerical values.
* Quantitative data often comes about in one of the following ways:
    * **Continuous variables**: measurements of e.g. speed, temperature, etc.
    * **Discrete variables**: counts of e.g. number of household members, hits on a webpage, cars passing on a road in one hour, etc.
* Measurements like this have a well-defined scale and in **R** they are stored as the type **numeric**.
* It is important to be able to distinguish between discrete count variables and continuous variables, since this often determines how we describe the uncertainty of a measurement.

### Categorical/qualitative variables

* The measurement is one of a set of given categories, e.g. sex (male/female), education level, satisfaction score (low/medium/high), etc.
* Factors have two so-called scales:
    + **Nominal scale**: There is no natural ordering of the factor levels, e.g. sex and hair color.
    + **Ordinal scale**: There is a natural ordering of the factor levels, e.g. education level and satisfaction score.
* The measurement is usually stored (which is also recommended) as a **factor** in **R**.
  The possible categories are called **levels**. Example: the levels of the factor "sex" are male/female. A factor in **R** can have a so-called **attribute** assigned, which tells if it is ordinal.

----

## Variables in the data set

```{r}
head(Chile)
```

* Quantitative variables in the `Chile` data set:
    * `population`, `age`, `income`, `statusquo`
* Categorical variables:
    * `region`, `sex`, `education`, `vote`
* All the categorical variables are nominal except `education`, which has three ordered categories (primary, secondary, post-secondary).

# Descriptive statistics of categorical data

## Tables

* To summarize the variable `vote` we can use the function `tally` from the `mosaic` package (remember the package **must be loaded** via `library(mosaic)` if you did not do so yet):

```{r}
tally( ~ vote, data = Chile)
```

* In percent:

```{r}
tally( ~ vote, data = Chile, format = "percent")
```

* Here we use an **R** `formula` (characterized by the "tilde" sign `~`) to indicate that we want this variable from the dataset `Chile` (without the tilde it would look for a global variable called `vote` and use that rather than the one in the dataset).

----

## 2 factors: Cross tabulation

* To get an overview of the relation between two categorical variables, we can make a cross tabulation.
* To make a table of all combinations of the two factors `vote` and `sex`, we use `tally` again:

```{r size="small"}
tally( ~ vote + sex, data = Chile)
```

* We can also get the relative frequencies (in percent) columnwise:

```{r size="small"}
tally( ~ vote | sex, data = Chile, format = "percent")
```

* For instance we see that $34.8\%$ of the women said they would vote yes, while this holds for only $29.4\%$ of the men.

----

## Visualizing categorical data: Bar graph

* To create a bar graph of table data we use the function `gf_bar` from `mosaic`. For each level of the factor, a box is drawn with the height proportional to the frequency (count) of the level.

```{r bargraph}
gf_bar( ~ vote, data = Chile)
```

* The bar graph can also be split by gender:

```{r bargraph_grouped}
gf_bar( ~ vote | sex, data = Chile)
```

# Descriptive statistics of quantitative variables

----

## Data example: Fuel consumption of cars

* The data was extracted from the 1974 _Motor Trend_ US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
* The data set is built into **R** under the name `mtcars`, so it does not need to be loaded before use.
* A description of the variables can be found here: https://rstudio-pubs-static.s3.amazonaws.com/61800_faea93548c6b49cc91cd0c5ef5059894.html
* In particular:
    * `vs`: cylinder configuration, V-shaped (`vs=0`) or straight line (`vs=1`).
    * `am`: automatic (`am=0`) or manual (`am=1`) transmission.

```{r}
head(mtcars)
```

## Visualizing quantitative data: Histogram

* A good way to get a first impression of a quantitative variable is to draw a histogram.
* The histogram of a variable `x` is made as follows:
    * Divide the interval from the minimum value of `x` to the maximum value of `x` into an appropriate number of equal-sized sub-intervals.
    * Draw a box over each sub-interval with the height being proportional to the number of observations in the sub-interval.
* Histogram of `mpg` for the `mtcars` data. The `bins` option sets the number of sub-intervals to 10.

```{r hist}
gf_histogram( ~ mpg, data=mtcars, bins=10)
```

----

## Relation between histogram and density function

* Suppose a sample comes from a population having a continuous distribution with density function $f$.
* Draw a histogram where the $y$-axis is scaled such that the total area of the bars is 1.
* When the number of observations (the sample size) increases we can make a finer interval division and get a smoother histogram.
* When the number of observations tends to infinity, we obtain a nice smooth curve, where the area below the curve is $1$.
  This curve is exactly the probability density function $f$.

```{r histToPop,echo=FALSE,results='hide',fig.width=10,fig.height=4}
par(mfrow=c(1,3),cex.main = 2,cex.lab = 2,mar=c(5,5,4,1))
set.seed(100)
# Small sample: the histogram is a rough picture of the density
varValue <- rnorm(50,10,2)
hist(varValue,breaks="FD",ylab="Density",xlab = "y",ylim=c(0,.25),freq=F,main="Histogram of 50 obs.")
text(7,.22,bquote(bar(y) == .(round(mean(varValue),2))),cex=1.5)
text(14,.22,bquote(s == .(round(sd(varValue),2))),cex=1.5)
# Larger sample: the title now matches the 1000 simulated observations
varValue <- rnorm(1000,10,2)
hist(varValue,breaks="FD",freq=F,ylab="Density",xlab = "y",ylim=c(0,.25),main="Histogram of 1000 obs.")
text(7,.22,bquote(bar(y) == .(round(mean(varValue),2))),cex=1.5)
text(14,.22,bquote(s == .(round(sd(varValue),2))),cex=1.5)
# Limit: the density of the normal population with mean 10 and sd 2
varValue <- (20:180)/10
plot(varValue,dnorm(varValue,10,2),ylab="Density",xlab = "y",ylim=c(0,.25),type="l",main="Histogram of population")
text(7,.22,bquote(mu == 10),cex=1.5)
text(14,.22,bquote(sigma == 2),cex=1.5)
```

* If the histogram looks bell-shaped this may suggest a normal distribution.

----

## Summary statistics for quantitative data

* We return to the `mtcars` example. A summary of the fuel consumption `mpg` can be retrieved using the `favstats` function:

```{r}
favstats( ~ mpg, data = mtcars)
```

* The output contains the following information:
    * **min** The minimal value in the sample is $10.4$.
    * **max** The maximal value in the sample is $33.9$.
    * **n** The sample size (number of observations) is 32.
    * **mean** The sample mean is $20.1$. Recall that this is the average of all observations $x_1,\ldots,x_n$, i.e.
      $$
      \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i
      $$
    * **sd** The sample standard deviation is $6.03$. Recall that this is given by
      $$
      s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}.
      $$
    * **missing** There are no missing values.
    * **median** The median (or 50-percentile) is the value such that half of the sample has lower values than the median and half the sample has larger values.
    * **Q1** and **Q3** will be introduced on later slides.
* Both the mean and the median can be considered the center of a distribution. In a symmetric distribution (such as the normal distribution) they are equal, while in a skewed distribution, they tend to be different.

----

## Calculation of mean, median and standard deviation using **R**

* The mean, median and standard deviation are just some of the summaries that can be read off the `favstats` output (shown on the previous page). They may also be calculated separately in the following way:
* Sample size of `mpg`:

```{r}
length(mtcars$mpg)
```

* Mean of `mpg`:

```{r}
mean( ~ mpg, data = mtcars)
```

* Median of `mpg`:

```{r}
median( ~ mpg, data = mtcars)
```

* Standard deviation for `mpg`:

```{r}
sd( ~ mpg, data = mtcars)
```

* We may also calculate the summaries within groups. For instance, for each engine type (variable `vs`) the sample mean is:

```{r}
mean( ~ mpg | factor(vs), data = mtcars)
```

----

## Interpretation of summary statistics: The empirical rule

```{r empRule,echo=FALSE,fig.width=12,fig.height=6}
set.seed(5)
x <- rnorm(900)
x <- (x-mean(x))/sd(x)
hist(x,axes=F,xlab="",breaks="FD",ylab="",main="",cex.main = 1.5,ylim=c(-.21,.4),probability=TRUE)
axis(1,at=-3:3,labels=F,pos=0)
axis(1,at=-3,labels=substitute(bar(y)-3*s),pos=0)
axis(1,at=-2,labels=substitute(bar(y)-2*s),pos=0)
axis(1,at=-1,labels=substitute(bar(y)-s),pos=0)
axis(1,at=0,labels=substitute(bar(y)),pos=0)
axis(1,at=1,labels=substitute(bar(y)+s),pos=0)
axis(1,at=2,labels=substitute(bar(y)+2*s),pos=0)
axis(1,at=3,labels=substitute(bar(y)+3*s),pos=0)
arrows(-1,-.1,1,-.1,col="red",code=3,length=.1)
text(-.01,-.115,"About 68% of measurements",col="red")
arrows(-2,-.15,2,-.15,col="blue",code=3,length=.1)
text(-.01,-.1655,"About 95% of measurements",col="blue")
arrows(-3,-.2,3,-.2,code=3,length=.1)
text(-.01,-.215,"All or nearly all measurements")
```

----

## Very practical rules of thumb

If the histogram of the sample is unimodal and approximately bell-shaped, then

* The mean and median are approximately
  equal.

```{r}
mean(mtcars$mpg)
median(mtcars$mpg)
```

And the median is easy to find: Sort the data and locate the middle observation.

* About 95% of the observations lie between $\bar{y}-2s$ and $\bar{y}+2s$.

If we say that 95% of the observations are "all observations" then we get the very practical rule of thumb: The range of all or nearly all observations is approximately $4s$. That is a very useful interpretation of $s$.

```{r}
4*sd(mtcars$mpg)
diff(range(mtcars$mpg))  # maximum minus minimum
```

---

## Percentiles

* **The $p$th percentile** is a value such that about $p$% of the population (or sample) lies below or at this value and about $(100-p)$% of the population (or sample) lies above it.

### Percentile calculation for a sample:

```{r echo=FALSE}
xx<- sort(mtcars$mpg)
nn<-length(xx)
```

* First, sort the data from smallest to largest. For the `mpg` variable:
$$
x_{(1)}=`r xx[1]`, x_{(2)}=`r xx[2]`, x_{(3)}=`r xx[3]`, \ldots, x_{(n)} = `r xx[nn]`.
$$
Here the number of observations is $n=`r nn`$.

```{r echo=FALSE}
p <- 10
N <- nn*p/100
```

* Find the $10$th percentile (i.e.\ $p = `r p`$):
    * The observation number corresponding to the 10-percentile is $N = \frac{ `r nn` \cdot `r p`}{100} = `r N`$.
    * So the 10-percentile lies between the observations with observation number $k=`r floor(N)`$ and $k+1=`r floor(N)+1`$. That is, its value lies somewhere in the interval between $x_{(3)}=13.3$ and $x_{(4)}=14.3$.
    * One of several methods for estimating the 10-percentile from the value of $N$ is:
$$
x_{(k)} + (N - k)(x_{(k+1)} - x_{(k)})
$$
which in this case gives
$$
x_{(3)} + (3.2 - 3)(x_{(4)} - x_{(3)}) = 13.3 + 0.2 \cdot (14.3-13.3) = 13.5.
$$

## Median, quartiles and interquartile range

Recall

```{r}
favstats( ~ mpg, data = mtcars)
```

* 0-percentile = 10.4 is the **minimum** value.
* 50-percentile = 19.2 is the **median** and it is a measure of the center of data.
* 25-percentile = 15.4 is called the **lower quartile** (Q1).\
  Median of lower 50% of data.
* 75-percentile = 22.8 is called the **upper quartile** (Q3).\
  Median of upper 50% of data.
* 100-percentile = 33.9 is the **maximum** value.
* **Interquartile Range (IQR)**:\
  a measure of variability given by the difference of the upper and lower quartiles: $22.8 - 15.4 = 7.4$.

## Box-and-whiskers plots (or simply box plots)

How to draw a box-and-whiskers plot:

* Box:
    * Calculate the median, lower and upper quartiles.
    * Draw a line at the median and draw a box between the upper and lower quartiles.
* Whiskers:
    * Calculate the interquartile range and call it IQR.
    * Calculate the following values:
        * L = lower quartile - 1.5*IQR
        * U = upper quartile + 1.5*IQR
    * Draw a line from the lower quartile to the smallest measurement which is larger than *L*.
    * Similarly, draw a line from the upper quartile to the largest measurement which is smaller than *U*.
* Outliers: Measurements smaller than *L* or larger than *U* are drawn as circles.
* Note: The whiskers end at the minimum and maximum of the observations that are not deemed to be outliers.

```{r echo = FALSE}
set.seed(42)
dat <- data.frame(plottype = factor("boxplot", levels = c("points", "boxplot", "whiskers")),
                  y = c(7, runif(20,12,19), 22))
b <- boxplot(y ~ plottype, data = dat, ylim = c(6, 25), xlab = "", ylab = "", pch = 25)
points(rep(1, length(dat$y)), dat$y)
Q1 <- b$stats[2,2]
Q3 <- b$stats[4,2]
IQR <- Q3-Q1
arrows(3, Q1, 3, Q1-1.5*IQR)
text(x = 3, y = c(Q1-1.5*IQR,Q1), pos = c(1, 3) , labels = c("Q1-1.5IQR", "Q1"), cex = 1)
arrows(3, Q3, 3, Q3+1.5*IQR)
text(x = 3, y = c(Q3, Q3+1.5*IQR), pos = c(1,3), labels = c("Q3", "Q3+1.5IQR"), cex = 1)
abline(h = c(Q1-1.5*IQR, Q1, Q3, Q3+1.5*IQR), lty = 2)
```

----

## Boxplot for fuel consumption

* Boxplot of the fuel consumption separately for each engine type:

```{r boxplot}
favstats(mpg ~ vs, data = mtcars)
gf_boxplot(mpg ~ factor(vs), data = mtcars)
```

* Cars with engine type 1 tend to have higher `mpg` values, i.e. they drive more miles per gallon and thus use less fuel.
* A single car with engine type 0 differs noticeably from the others, with an unusually high `mpg` value for its group.
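
The whisker rule above can also be checked by hand. The following is a minimal sketch, assuming quartiles computed with the default method of `stats::quantile` (other quartile definitions may give slightly different limits); `find_outliers` is a hypothetical helper written for this illustration and is not part of `mosaic`.

```{r}
# Hypothetical helper: return the observations outside [L, U], where
# L = Q1 - 1.5*IQR and U = Q3 + 1.5*IQR (assumption: default quantile method)
find_outliers <- function(x) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  x[x < lower | x > upper]
}

# Apply the rule separately within each engine type
outliers_by_vs <- tapply(mtcars$mpg, mtcars$vs, find_outliers)
outliers_by_vs
```

For `mtcars` this should flag the single car with engine type 0 that stands out in the boxplot, and no cars with engine type 1.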

----

## 2 quantitative variables: Scatter plot

* A **scatter plot** is used to visualize two quantitative variables.
* For instance, we can plot the relation between fuel consumption and horsepower (`hp`) of a car as follows:

```{r scatter0}
gf_point(mpg ~ hp, data = mtcars)
```

* This can be either split or coloured according to the engine type `vs`:

```{r}
gf_point(mpg ~ hp | factor(vs), data = mtcars)
gf_point(mpg ~ hp, col = ~factor(vs), data = mtcars)
```

* If we want a regression line along with the points we can do:

```{r scatter05}
gf_point(mpg ~ hp, col = ~factor(vs), data = mtcars) |> gf_lm()
```

# Quantile plots

----

## The empirical quantiles

The quantiles of a distribution may be used to summarize the distribution or to investigate if a sample comes from a specific distribution.

* Recall: The distribution function $F()$ of a random variable $X$ is defined as:
$$F(x')=P(X\leq x').$$
* Recall: The $q$-quantile (e.g. the 25% quantile, where $q = 0.25$) of the distribution is the value of $x$, call it $x_q$, such that $F(x_q)=q$.
* The empirical counterpart is the empirical distribution $\hat F()$: Given data points $x_1,x_2, \dots, x_n$, the empirical distribution is given by
$$\hat F(x) = \frac{\mbox{number of sample points } \le x}{n}$$
* So $\hat F(x)$ takes the values $0, 1/n, 2/n, \dots, n/n=1$ and jumps whenever there is a new data point.
* The points where $\hat F(x)$ jumps are called the empirical quantiles and they are easy to find: We rank (sort) the observations in a sample (called order statistics):
$$x_{(1)}\leq x_{(2)}\leq \ldots \leq x_{(n)}$$

---

```{r}
plot(ecdf(mtcars$mpg))
```

* It is natural to approximate $F$ at $x_{(i)}$ by the empirical distribution $\hat F()$, so
$$\hat{F}(x_{(i)}) = \frac{i}{n}.$$
* Hence: $x_{(i)}$ is approximately the $\frac{i}{n}$-quantile.
* Note: some authors use slightly different quantiles, e.g. the $\frac{i-0.5}{n}$-quantile.
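
The definition of the empirical distribution can be verified directly. The sketch below compares `ecdf` with a count done by hand; the names `F_hat` and `by_hand` are chosen for this illustration only. Note that when the sample contains ties (as `mpg` does), $\hat F(x_{(i)})$ can be larger than $\frac{i}{n}$, so the identity $\hat F(x_{(i)}) = \frac{i}{n}$ holds exactly only when all observations are distinct.

```{r}
x <- mtcars$mpg
n <- length(x)
x_sorted <- sort(x)

# Empirical distribution function as computed by R
F_hat <- ecdf(x)

# The same quantity by hand: proportion of sample points <= each order statistic
by_hand <- sapply(x_sorted, function(v) sum(x <= v) / n)
all.equal(F_hat(x_sorted), by_hand)

# Compare i/n with F_hat at the first few order statistics
head(data.frame(x_sorted, i_over_n = (1:n) / n, F_hat = F_hat(x_sorted)))
```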

---

## Quantile-quantile plots

* The quantiles may be used to investigate if a sample comes from a specific distribution (for example, a normal distribution or a uniform distribution).
* We do so by comparing the quantiles of the sample with the quantiles $q_i$ of the specific distribution we are considering:
    * Recall
    $$
    i / n = \hat F(x_{(i)}) \approx P(X\leq x_{(i)})
    $$
    * We find the quantiles $q_i$ of the specific distribution in question, i.e. for a random variable $Y$ with this distribution,
    $$
    i/n = P(Y \le q_i)
    $$
    * Plot the points $(q_i, x_{(i)})$. If the sample comes from the distribution in question, they should lie close to the identity line $y = x$.

Example: Does `mpg` come from a uniform distribution $U(10, 34)$? It does not look like it:

```{r}
x <- mtcars$mpg
n <- length(x)
x_sorted <- sort(x)
i.n <- (1:n) / n
q.i <- qunif(i.n, min(x), max(x)) ## min(x) is about 10, max(x) is about 34
qqplot(q.i, x_sorted)
abline(a=0, b=1, col="red", lwd=2)
```

---

## Normal quantile-quantile plots

* Above we needed to specify the distribution exactly, i.e. $U(10, 34)$.
* For the normal distribution things are much easier. We want to investigate whether the sample comes from a normal distribution $\texttt{norm}(\mu,\sigma)$, and we need not know $\mu$ or $\sigma$.
* Recall
$$
i / n = \hat F(x_{(i)}) \approx P(X\leq x_{(i)})
$$
* Recall this: If $Z$ has a standard normal distribution $\texttt{norm}(0,1)$ then $Y = \mu + \sigma Z$ has a $\texttt{norm}(\mu,\sigma)$ distribution.
* Let $q_i$ be the $\frac{i}{n}$-quantile of a standard normal distribution, i.e.\ $P(Z\leq q_i)= \frac{i}{n}$.
* Then
$$
\begin{aligned}
\frac{i}{n} &= P(Z\leq q_i) = P(\mu + \sigma Z \leq \mu + \sigma q_i)\\
&= P(Y \leq \mu + \sigma q_i)
\end{aligned}
$$

So $\mu + \sigma q_i$ is the corresponding $\frac{i}{n}$-quantile of a $\texttt{norm}(\mu,\sigma)$ distribution.
Hence if the sample comes from a $\texttt{norm}(\mu,\sigma)$ distribution, then the sample quantiles $x_{(i)}$ should be approximately equal to the population quantiles $\mu + \sigma q_i$:
$$
x_{(i)} \approx \mu + \sigma q_i
$$

So if we plot the points $(q_i, x_{(i)})$ and the sample comes from a $\texttt{norm}(\mu,\sigma)$ distribution, the points should lie close to a straight line with intercept $\mu$ and slope $\sigma$.

For `mpg` the plot looks mostly like a straight line, so `mpg` could be described - approximately - by a normal distribution:

```{r}
qqnorm(mtcars$mpg)
qqline(mtcars$mpg, col="red", lwd=2)
```
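
The construction behind the normal quantile-quantile plot can be reproduced step by step. This is a minimal sketch under two assumptions that are not part of the argument above: the $\frac{i-0.5}{n}$-quantiles are used instead of the $\frac{i}{n}$-quantiles (because `qnorm(1)` is infinite), and $\mu$ and $\sigma$ are replaced by the sample mean and standard deviation when drawing the reference line. The line drawn by `qqline` is constructed differently (through the quartiles), so the two pictures need not agree exactly.

```{r}
x_sorted <- sort(mtcars$mpg)
n <- length(x_sorted)

# Assumption: use (i - 0.5)/n rather than i/n to avoid qnorm(1) = Inf
probs <- ((1:n) - 0.5) / n
q <- qnorm(probs)  # standard normal quantiles q_i

# Plot (q_i, x_(i)); under normality the points follow intercept mu, slope sigma
plot(q, x_sorted, xlab = "Standard normal quantiles", ylab = "Sorted mpg")
# Assumption: estimate mu and sigma by the sample mean and standard deviation
abline(a = mean(x_sorted), b = sd(x_sorted), col = "red", lwd = 2)
```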