--- title: "Introduction to R and descriptive statistics" author: "The ASTA team" output: slidy_presentation: fig_caption: no highlight: tango theme: cerulean pdf_document: fig_caption: no highlight: tango number_section: yes toc: yes --- ```{r, include = FALSE} options(digits = 2) ## Remember to add all packages used in the code below! missing_pkgs <- setdiff(c("mosaic"), rownames(installed.packages())) if(length(missing_pkgs)>0) install.packages(missing_pkgs) ``` # Introduction to R ---- ## **Rstudio** * Make a folder on your computer where you want to keep files to use in **Rstudio**. **Do NOT use Danish characters æ, ø, å** in the folder name (or anywhere in the path to the folder). * Set the working directory to this folder: `Session -> Set Working Directory -> Choose Directory` (shortcut: Ctrl+Shift+H). * Make the change permanent by setting the default directory in: `Tools -> Global Options -> Choose Directory`. ---- ## **R** basics * Ordinary calculations: ```{r} 4.6 * (2 + 3)^4 ``` * Make a (scalar) object and print it: ```{r} a <- 4 a ``` * Make a (vector) object and print it: ```{r} b <- c(2, 5, 7) b ``` * Make a sequence of numbers and print it: ```{r} s <- 1:4 s ``` * Note: A more flexible command for sequences: ```{r} s <- seq(1, 4, by = 1) ``` * **R** does elementwise calculations: ```{r} a * b a + b b ^ 2 ``` * Sum and product of elements: ```{r} sum(b) prod(b) ``` ---- ## **R** markdown * The slides and all exercises in R (including the exam questions) are made in the special Rmarkdown format. * This allows you to combine text and R code. * You can write formulas using standard LaTeX commands. ---- ## **R** extensions * The functionality of **R** can be extended through libraries or packages (much like plugins in browsers etc.). Some are installed by default in **R** and you just need to load them. * To install a new package in **Rstudio** use the menu: `Tools -> Install Packages` * You need to know the name of the package you want to install. You can also do it through a command: ```{r eval=FALSE} install.packages("mosaic") ``` * When it is installed you can load it through the `library` command: ```{r results="hide", message=FALSE} library(mosaic) ``` * This loads the `mosaic` package which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the `mosaic` package, but you can safely ignore that. ---- ## **R** help * You get help via `?`: ```{r eval=FALSE} ?sum ``` * Use `tab` to make **Rstudio** guess what you have started typing. * Search for help: ```{r eval=FALSE} help.search("plot") ``` * You can find a cheat sheet with the **R** functions we use for this course [here](https://asta.math.aau.dk/course/asta/2022-2/?file=cheatsheet.pdf). * Save your commands in a file for later usage: + Select history tab in top right pane in **Rstudio** . + Mark the commands you want to save. + Press `To Source` button. ```{r, echo = FALSE, eval = FALSE} # OLD VERSION OF LINK SHOWN ABOVE -- Short term fix implemented, but more permanent solution should be found... # [here](https://asta.math.aau.dk/course/asta/`r gsub("^.*asta/([^/]+)/.*$", "\\1", readLines("../../build-tools/SERVER_SEMESTER_DIR"))`/?file=cheatsheet.pdf). ``` # Data in **R** ---- ## Data example * Now we will have a look at a data set concerning the 1988 vote in Chile for or against Pinochet to continue as leader. The sample consists of 2700 voters randomly selected from the Chilenean population. * The data set contains the variables: * `region`: The region in Chile where the voter lives * `population`: Population of the region. * `sex`: The gender of the voter. * `age`: The age of the voter. * `education`: Education level of the voter. * `income`: Monthly income of the voter. * `statusquo`: To which degree the voter supports the status quo. * `vote`: Should Pinochet continue? `Y` = yes, `N`= no, `U`=undecided, `A`= will abstain from voting. * More information about the data set may be found [here](https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/Chile). ```{r} Chile <- read.delim("https://asta.math.aau.dk/datasets?file=Chile.txt") head(Chile) ``` ---- ## Data types ### Quantitative variables * The measurements have numerical values. * Quantative data often comes about in one of the following ways: * **Continuous variables**: measurements of e.g. speed, temperature, etc. * **Discrete variables**: counts of e.g. number of household members, hits on a webpage, cars passing on a road in one hour, etc. * Measurements like this have a well-defined scale and in **R** they are stored as the type **numeric**. * It is important to be able to distinguish between discrete count variables and continuous variables, since this often determines how we describe the uncertainty of a measurement. ### Categorical/qualitative variables * The measurement is one of a set of given categories, e.g. sex (male/female), education level, satisfaction score (low/medium/high), etc. * Factors have two so-called scales: + **Nominal scale**: There is no natural ordering of the factor levels, e.g. sex and hair color. + **Ordinal scale**: There is a natural ordering of the factor levels, e.g. education level and satisfaction score. * The measurement is usually stored (which is also recommended) as a **factor** in **R**. The possible categories are called **levels**. Example: the levels of the factor "sex" is male/female. A factor in **R** can have a so-called **attribute** assigned, which tells if it is ordinal. ---- ## Variables in the data set ```{r} head(Chile) ``` * Quantitative variables in the `Chile` data set: * `population`, `age`, `income`, `statusquo` * Categorical variables: * `region`, `sex`, `education`, `vote` * All the categorical variables are nominal except `education`, which has three ordered categories (primary, secondary, post-secondary). # Descriptive statistics of categorical data ## Tables * To summarize the the variable `vote` we can use the function `tally` from the `mosaic` package (remember the package **must be loaded** via `library(mosaic)` if you did not do so yet): ```{r} tally( ~ vote, data = Chile) ``` * In percent: ```{r} tally( ~ vote, data = Chile, format = "percent") ``` * Here we use an **R** `formula` (characterized by the "tilde" sign `~`) to indicate that we want this variable from the dataset `Chile` (without the tilde it would look for a global variable called `vote` and use that rather than the one in the dataset). ---- ## 2 factors: Cross tabulation * To get an overview over the relation between two categorical variables, we can make a cross tabulation. * To make a table of all combinations of the two factors `vote` and `sex`, we use `tally` again: ```{r size="small"} tally( ~ vote + sex, data = Chile) ``` * We can also get the relative frequencies (in percent) columnwise: ```{r size="small"} tally( ~ vote | sex, data = Chile, format = "percent") ``` * For instance we see that $34.8\%$ of the women said they would vote yes, while this holds for only $29.4\%$ of the men. ---- ## Visualizing categorical data: Bar graph * To create a bar graph plot of table data we use the function `gf_bar` from `mosaic`. For each level of the factor, a box is drawn with the height proportional to the frequency (count) of the level. ```{r bargraph} gf_bar( ~ vote, data = Chile) ``` * The bar graph can also be split by group: ```{r bargraph_grouped} gf_bar( ~ vote | sex, data = Chile) ``` # Descriptive statistics of quantitative variables ---- ## Data example: Fuel consumption of cars * In this data set, a car magazine tested the fuel consumption of 32 cars. The variable `mpg` gives the fuel consumption in miles pr. gallon (the data set is from 1974). * The data set is built into **R** under the name `mtcars`, so it does not need to be loaded before use. ```{r} head(mtcars) ``` ## Visualizing quantitative data: Histogram * The way to get a first impression of a quantitative variable is to draw a histogram. * The histogram of a variable `x` is made as follows: * Divide the interval from the minimum value of `x` to the maximum value of `x` in an appropriate number of equal sized sub-intervals. * Draw a box over each sub-interval with the height being proportional to the number of observations in the sub-interval. * Histogram of `mpg` for the `mtcars` data. The `bins` option sets the number of subintervals to 10. ```{r hist} gf_histogram( ~ mpg, data = mtcars,bins=10) ``` ---- ## Relation between histogram and denity function * Suppose a sample comes from a population having a continuous distribution with density function $f$. * Draw a histogram where the $y$-axis is scaled such that the total area of the bars is 1. * When the number of observations (the sample size) increases we can make a finer interval division and get a more smooth histogram. * When the number of observations tends to infinity, we obtain a nice smooth curve, where the area below the curve is $1$. This curve is exactly the probability density function $f$. ```{r histToPop,echo=FALSE,results='hide',fig.width=10,fig.height=4} par(mfrow=c(1,3),cex.main = 2,cex.lab = 2,mar=c(5,5,4,1)) set.seed(100) varValue <- rnorm(50,10,2) hist(varValue,breaks="FD",ylab="Density",xlab = "y",ylim=c(0,.25),freq=F,main="Histogram of 50 obs.") text(7,.22,bquote(bar(y) == .(round(mean(varValue),2))),cex=1.5) text(14,.22,bquote(s == .(round(sd(varValue),2))),cex=1.5) varValue <- rnorm(1000,10,2) hist(varValue,breaks="FD",freq=F,ylab="Density",xlab = "y",ylim=c(0,.25),main="Histogram of 500 obs.") text(7,.22,bquote(bar(y) == .(round(mean(varValue),2))),cex=1.5) text(14,.22,bquote(s == .(round(sd(varValue),2))),cex=1.5) varValue <- (20:180)/10 plot(varValue,dnorm(varValue,10,2),ylab="Density",xlab = "y",ylim=c(0,.25),type="l",main="Histogram of population") text(7,.22,bquote(mu == 10),cex=1.5) text(14,.22,bquote(sigma == 2),cex=1.5) ``` * If the histogram looks bell-shaped this may suggest a normal distribution. ---- ## Summary statistics for quantitative data * We return to the `mtcars` example. A summary of the fuel consumption `mpg` can be retrieved using the `favstats` function: ```{r} favstats( ~ mpg, data = mtcars) ``` * The output contains the following information * **min** The minimal value in the sample is $10.4$. * **max** The maximal value in the sample is $33.9$. * **n** The sample size (number of observations) is 32. * **mean** The sample mean is $20.1$. Recall that this was the average of all observations $x_1,\ldots,x_n$, i.e. $$ \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i $$ * **sd** The sample standard deviation is $6.03$. Recall that this was given by $$ s=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2. $$ * **missing** There are no missing values. * **median** The median (or 50-percentile) is the value such that half of the sample has lower values than the median and half the sample has larger values. * **Q1** and **Q3** will be introduced on later slides. * Both the mean and the median can be considered the center of a distribution. In a symmetric distribution (such as the normal distribution) they are equal, while in a skewed distribution, they tend to be different. ---- ## Calculation of mean, median and standard deviation using **R** * The mean, median and standard deviation are just some of the summaries that can be read of the `favstats` output (shown on previous page). They may also be calculated separately in the following way: * Sample size of `mpg`: ```{r} length(mtcars$mpg) ``` * Mean of `mpg`: ```{r} mean( ~ mpg, data = mtcars) ``` * Median of `mpg`: ```{r} median( ~ mpg, data = mtcars) ``` * Standard deviation for `mpg`: ```{r} sd( ~ mpg, data = mtcars) ``` * We may also calculate the summaries within groups. For instance, for each engine type (variable `vs`) the sample mean is: ```{r} mean( ~ mpg | factor(vs), data = mtcars) ``` ---- ## Interpretation of summary statistics: The empirical rule ```{r empRule,echo=FALSE,fig.width=12,fig.height=6} set.seed(5) x <- rnorm(900) x <- (x-mean(x))/sd(x) hist(x,axes=F,xlab="",breaks="FD",ylab="",main="",cex.main = 1.5,ylim=c(-.21,.4),probability=TRUE) axis(1,at=-3:3,labels=F,pos=0) axis(1,at=-3,labels=substitute(bar(y)-3*s),pos=0) axis(1,at=-2,labels=substitute(bar(y)-2*s),pos=0) axis(1,at=-1,labels=substitute(bar(y)-s),pos=0) axis(1,at=0,labels=substitute(bar(y)),pos=0) axis(1,at=1,labels=substitute(bar(y)+s),pos=0) axis(1,at=2,labels=substitute(bar(y)+2*s),pos=0) axis(1,at=3,labels=substitute(bar(y)+3*s),pos=0) arrows(-1,-.1,1,-.1,col="red",code=3,length=.1) text(-.01,-.115,"About 68% of measurements",col="red") arrows(-2,-.15,2,-.15,col="blue",code=3,length=.1) text(-.01,-.1655,"About 95% of measurements",col="blue") arrows(-3,-.2,3,-.2,code=3,length=.1) text(-.01,-.215,"All or nearly all measurements") ``` * If the histogram of the sample looks like a bell shaped curve, then we have * about 68% of the observations lie between $\bar{y}-s$ and $\bar{y}+s$. * about 95% of the observations lie between $\bar{y}-2s$ and $\bar{y}+2s$. * All or almost all (99.7%) of the observations lie between $\bar{y}-3s$ and $\bar{y}+3s$. ---- ## Percentiles * **The $p$th percentile** is a value such that about $p$% of the population (or sample) lies below or at this value and about $(100-p)$% of the population (or sample) lies above it. ### Percentile calculation for a sample: ```{r echo=FALSE} xx<- sort(mtcars$mpg) nn<-length(xx) ``` * First, sort data from smallest to largest. For the `mpg` variable: $$ x_{(1)}=`r xx[1]`, x_{(2)}=`r xx[2]`, x_{(3)}=`r xx[3]`, \ldots, x_{(n)} = `r xx[nn]`. $$ Here the number of observations is $n=`r nn`$. ```{r echo=FALSE} p <- 10 N <- nn*p/100 ``` * Find the $10$th percentile (i. e.\ $p = `r p`$):\ * The observation number corresponding to the 10-percentile is $N = \frac{ `r nn` \cdot `r p`}{100} = `r N`$. * So the 10-percentile lies between the observations with observation number $k=`r floor(N)`$ and $k+1=`r floor(N)+1`$. That is, its value lies somewhere in the interval between $x_{(3)}=13.3$ and $x_{(4)}=14.3$. * One of several methods for estimating the 10-percentile from the value of N is defined as: $$ x_{(k)} + (N - k)(x_{(k+1)} - x_{(k)}) $$ which in this case gives $$ x_{(3)} + (3.2 - 3)(x_{(4)} - x_{(3)}) = 13.3 + 0.2 \cdot (14.3-13.3) = 13.5. $$ ## Median, quartiles and interquartile range Recall ```{r} favstats( ~ mpg, data = mtcars) ``` * 0-percentile = 10.4 is the **minimum** value. * 50-percentile = 20.1 is the **median** and it is a measure of the center of data. * 25-percentile = 15.4 is called the **lower quartile** (Q1).\ Median of lower 50% of data. * 75-percentile = 22.8 is called the **upper quartile** (Q3).\ Median of upper 50% of data. * 100-percentile = 33.9 is the **maximum** value. * **Interquartile Range (IQR)**:\ a measure of variability given by the difference of the upper and lower quartiles: 23 -15 = 8. ## Box-and-whiskers plots (or simply box plots) How to draw a box-and-whiskers plot: * Box: * Calculate the median, lower and upper quartiles. * Plot a line by the median and draw a box between the upper and lower quartiles. * Whiskers: * Calculate interquartile range and call it IQR. * Calculate the following values: * L = lower quartile - 1.5*IQR * U = upper quartile + 1.5*IQR * Draw a line from lower quartile to the smallest measurement, which is larger than *L*. * Similarly, draw a line from upper quartile to the largest measurement which is smaller than *U*. * Outliers: Measurements smaller than *L* or larger than *U* are drawn as circles. * Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers. ```{r echo = FALSE} set.seed(42) dat <- data.frame(plottype = factor("boxplot", levels = c("points", "boxplot", "whiskers")), y = c(7, runif(20,12,19), 22)) b <- boxplot(y ~ plottype, data = dat, ylim = c(6, 25), xlab = "", ylab = "", pch = 25) points(rep(1, length(dat$y)), dat$y) Q1 <- b$stats[2,2] Q3 <- b$stats[4,2] IQR <- Q3-Q1 arrows(3, Q1, 3, Q1-1.5*IQR) text(x = 3, y = c(Q1-1.5*IQR,Q1), pos = c(1, 3) , labels = c("Q1-1.5IQR", "Q1"), cex = 1) arrows(3, Q3, 3, Q3+1.5*IQR) text(x = 3, y = c(Q3, Q3+1.5*IQR), pos = c(1,3), labels = c("Q3", "Q3+1.5IQR"), cex = 1) abline(h = c(Q1-1.5*IQR, Q1, Q3, Q3+1.5*IQR), lty = 2) ``` ---- ## Boxplot for fuel consumption * Boxplot of the fuel consumption separately for each engine type: ```{r boxplot} favstats(mpg ~ vs, data = mtcars) gf_boxplot(mpg ~ factor(vs), data = mtcars) ``` * Cars with engine type 1 seem to use more fuel. * A single car with engine type 0 differs noticeably from the others with a high fuel consumption. ---- ## 2 quantitative variables: Scatter plot * A **scatter plot** is used to visualize two quantitative variables. * For instance, we can plot the relation between fuel consumption and horse powers (`hp`) of a car as follows ```{r scatter0} gf_point(mpg ~ hp, data = mtcars) ``` * This can be either split or coloured according to the engine type`vs`: ```{r} gf_point(mpg ~ hp | factor(vs), data = mtcars) gf_point(mpg ~ hp, col = ~factor(vs), data = mtcars) ``` * If we want a regression line along with the points we can do: ```{r scatter05} gf_point(mpg ~ hp, col = ~factor(vs), data = mtcars) %>% gf_lm() ``` # Quantile plots ---- ## The empirical quantiles * Recall that the distribution function of a random variable $X$ was defined as: $$F(x)=P(X\leq x).$$ * The $\frac{i}{n}$ quantiles of a distribution are the points $q_i$ such that $F(q_i)=\frac{i}{n}$, $i=1,\ldots,n$. * If we rank the observations in a sample $$x_{(1)}\leq x_{(2)}\leq \ldots \leq x_{(n)},$$ we can approximate $F$ at $x_{(i)}$ by: $$\hat{F}(x_{(i)}) = \frac{i}{n}.$$ * Interpretation: $x_{(i)}$ is approximately the $\frac{i}{n}$ quantile. * Note: various authors may use slightly different quantiles, e.g. $\frac{i-0.5}{n}$ quantiles. ---- ## Normal quantile-quantile plots * The quantiles may be used to investigate whether the sample comes from a normal distribution. * Call the $\frac{i}{n}$th quantile of a standard normal distribution $q_i$, i.e.\ $P(Z\leq q_i)= \frac{i}{n}$. * If $Y\sim \texttt{norm}(\mu,\sigma)$, then this is equivalent to is equivalent to $$P(Y\leq \mu+\sigma q_i)= \frac{i}{n}.$$ * Suppose the population follows a $\texttt{norm}(\mu,\sigma)$ distribution, then the sample quantiles $x_{(i)}$ should be approximately equal to the population quantiles $\mu+\sigma q_i$. * If we make a scatter plot of the pair $(q_i,x_{(i)})$, these should lie on a straight line. We call this a **normal Q-Q plot** (or quantile-quantile plot). * **Example:** We investigate whether the `mpg` variable in the `mtcars` data set follows a normal distribution: ```{r} qqnorm(mtcars$mpg) qqline(mtcars$mpg) ```