---
title: "Introduction to R and descriptive statistics"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_section: yes
    toc: yes
---

```{r, include = FALSE}
options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# Introduction to R

----

##  **Rstudio** 

* Make a folder on your computer where you want to keep files to use in  **Rstudio**. **Do NOT use Danish characters æ, ø, å** in the folder name (or anywhere in the path to the folder).
* Set the working directory to this folder:
`Session -> Set Working Directory -> Choose Directory`
(shortcut: Ctrl+Shift+H).
* Make the change permanent by setting the default directory in:
`Tools -> Global Options -> Choose Directory`.

----

## **R** basics

* Ordinary calculations:
```{r}
4.6 * (2 + 3)^4 
```
* Make a (scalar) object and print it:
```{r}
a <- 4 
a
```
* Make a (vector) object and print it:
```{r}
b <- c(2, 5, 7)
b
```
* Make a sequence of numbers and print it:
```{r}
s <- 1:4
s
```
* Note: A more flexible command for sequences:
```{r}
s <- seq(1, 4, by = 1)
```

* **R** does elementwise calculations:
```{r}
a * b
a + b
b ^ 2
```
* Sum and product of elements:
```{r}
sum(b)
prod(b)
```

----

## **R** markdown

* The slides and all exercises in R (including the exam questions) are made in the special Rmarkdown format.

* This allows you to combine text and R code.

* You can write formulas using standard LaTeX commands. 

----

## **R** extensions

* The functionality of **R** can be extended through libraries or packages (much like plugins in browsers etc.). Some are installed by default in **R** and you just need to load them.
* To install a new package in  **Rstudio**  use the menu:
`Tools -> Install Packages`
* You need to know the name of the package you want to install. You can also do it through a command:
```{r eval=FALSE}
install.packages("mosaic")
```
* When it is installed you can load it through the `library` command:
```{r results="hide", message=FALSE}
library(mosaic)
```
* This loads the `mosaic` package which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the `mosaic` package, but you can safely ignore that.

----

## **R** help

* You get help via `?<command>`:
```{r eval=FALSE}
?sum
```
* Use `tab` to make  **Rstudio**  guess what you have started typing.
* Search for help:
```{r eval=FALSE}
help.search("plot")
```

* You can find a cheat sheet with the **R** functions we use for this course 
[here](https://asta.math.aau.dk/course/asta/2022-2/?file=cheatsheet.pdf).
* Save your commands in a file for later usage:
    + Select history tab in top right pane in  **Rstudio** .
    + Mark the commands you want to save.
    + Press `To Source` button.

```{r, echo = FALSE, eval = FALSE}
# OLD VERSION OF LINK SHOWN ABOVE -- Short term fix implemented, but more permanent solution should be found... 
# [here](https://asta.math.aau.dk/course/asta/`r gsub("^.*asta/([^/]+)/.*$", "\\1", readLines("../../build-tools/SERVER_SEMESTER_DIR"))`/?file=cheatsheet.pdf).
```

# Data in **R**

----

## Data example

* Now we will have a look at a data set concerning the 1988 vote in Chile for or against Pinochet to continue as leader. The sample consists of 2700 voters randomly selected from the Chilenean population.
  
* The data set contains the variables:
  * `region`: The region in Chile where the voter lives
  * `population`: Population of the region.
  * `sex`: The gender of the voter.
  * `age`: The age of the voter.
  * `education`: Education level of the voter.
  * `income`: Monthly income of the voter.
  * `statusquo`: To which degree the voter supports the status quo.
  * `vote`: Should Pinochet continue? `Y` = yes, `N`= no, `U`=undecided, `A`= will abstain from voting.
  
* More information about the data set may be found [here](https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/Chile).
```{r}
Chile <- read.delim("https://asta.math.aau.dk/datasets?file=Chile.txt")
head(Chile)
```


----

## Data types

### Quantitative variables

* The measurements have numerical values.
* Quantative data often comes about in one of the following ways:
    * **Continuous variables**: measurements of e.g. speed, temperature, etc.
    * **Discrete variables**: counts of e.g. number of household members, hits on a webpage, cars passing on a road in one hour, etc.
* Measurements like this have a well-defined scale and in **R** they are stored as the type **numeric**.
* It is important to be able to distinguish between discrete count variables and continuous variables, since this often determines how we describe the uncertainty of a measurement.


### Categorical/qualitative variables

* The measurement is one of a set of given categories, e.g. sex (male/female), education level, satisfaction score (low/medium/high), etc.
* Factors have two so-called scales:
    + **Nominal scale**: There is no natural ordering of the factor levels, e.g. sex and hair color.
    + **Ordinal scale**: There is a natural ordering of the factor levels, e.g. education level and satisfaction score. 
* The measurement is usually stored (which is also recommended) as a **factor** in **R**. The possible categories are called **levels**. Example: the levels of the factor "sex" is male/female. A factor in **R** can have a so-called **attribute** assigned, which tells if it is ordinal.

----

## Variables in the data set

```{r}
head(Chile)
```

* Quantitative variables in the `Chile` data set: 

  * `population`, `age`, `income`, `statusquo`

* Categorical variables:

  * `region`, `sex`, `education`, `vote`
  
* All the categorical variables are nominal except `education`, which has three ordered categories (primary, secondary, post-secondary).


# Descriptive statistics of categorical data

## Tables

* To summarize the the variable `vote` we can use the function `tally` from the `mosaic` package (remember the package **must be loaded** via `library(mosaic)` if you did not do so yet):
```{r}
tally( ~ vote, data = Chile)
```
* In percent:
```{r}
tally( ~ vote, data = Chile, format = "percent")
```
* Here we use an **R** `formula` (characterized by the "tilde" sign `~`) to indicate that we want this variable from the dataset `Chile` (without the tilde it would look for a global variable called `vote` and use that rather than the one in the dataset).

----

## 2 factors: Cross tabulation

* To get an overview over the relation between two categorical variables, we can make a cross tabulation.

* To make a table of all combinations of the two factors `vote` and `sex`, we use `tally` again:
```{r size="small"}
tally( ~ vote + sex, data = Chile)
```
* We can also get the relative frequencies (in percent) columnwise:
```{r size="small"}
tally( ~ vote | sex, data = Chile, format = "percent")
```
* For instance we see that $34.8\%$ of the women said they would vote yes, while this holds for only $29.4\%$ of the men. 

----

## Visualizing categorical data: Bar graph

* To create a bar graph plot of table data we use the function `gf_bar` from `mosaic`.
  For each level of the factor, a box is drawn with the height proportional to the frequency (count) of the level.
```{r bargraph}
gf_bar( ~ vote, data = Chile)
```
<!-- Another color theme can be chosen, e.g.as one provided by the `mosaic` package: -->

<!-- ```{r} -->
<!-- trellis.par.set(theme = theme.mosaic()) -->
<!-- ``` -->
* The bar graph can also be split by group:
```{r bargraph_grouped}
gf_bar( ~ vote | sex, data = Chile)
```




# Descriptive statistics of quantitative variables

----

## Data example: Fuel consumption of cars

*  In this data set, a car magazine tested the fuel consumption of 32 cars. The variable `mpg` gives the fuel consumption in miles pr. gallon (the data set is from 1974).

* The data set is built into **R** under the name `mtcars`, so it does not need to be loaded before use.

```{r}
head(mtcars)
```


## Visualizing quantitative data: Histogram 

* The way to get a first impression of a quantitative variable is to draw a histogram.

* The histogram of a variable `x` is made as follows:
    * Divide the interval from the minimum value of `x` to the maximum value of `x` in an appropriate number of equal sized sub-intervals.
    * Draw a box over each sub-interval with the height being proportional to the number of observations in the sub-interval.
* Histogram of `mpg` for the `mtcars` data. The `bins` option sets the number of subintervals to 10.
```{r hist}
gf_histogram( ~ mpg, data = mtcars,bins=10)
```

----

## Relation between histogram and denity function  

* Suppose a sample comes from a population having a continuous distribution with density function $f$.

* Draw a histogram where the $y$-axis is scaled such that the total area of the bars is 1. 
* When the number of observations (the sample size) increases we can make a finer interval division and get a more smooth histogram.
* When the number of observations tends to infinity, we obtain a nice smooth curve, where the area below the curve is $1$. This curve is exactly the probability density function $f$.

```{r histToPop,echo=FALSE,results='hide',fig.width=10,fig.height=4}
par(mfrow=c(1,3),cex.main = 2,cex.lab = 2,mar=c(5,5,4,1))
set.seed(100)
varValue <- rnorm(50,10,2)
hist(varValue,breaks="FD",ylab="Density",xlab = "y",ylim=c(0,.25),freq=F,main="Histogram of 50 obs.")
text(7,.22,bquote(bar(y) == .(round(mean(varValue),2))),cex=1.5)
text(14,.22,bquote(s == .(round(sd(varValue),2))),cex=1.5)
varValue <- rnorm(1000,10,2)
hist(varValue,breaks="FD",freq=F,ylab="Density",xlab = "y",ylim=c(0,.25),main="Histogram of 500 obs.")
text(7,.22,bquote(bar(y) == .(round(mean(varValue),2))),cex=1.5)
text(14,.22,bquote(s == .(round(sd(varValue),2))),cex=1.5)
varValue <- (20:180)/10
plot(varValue,dnorm(varValue,10,2),ylab="Density",xlab = "y",ylim=c(0,.25),type="l",main="Histogram of population")
text(7,.22,bquote(mu == 10),cex=1.5)
text(14,.22,bquote(sigma == 2),cex=1.5)
```

* If the histogram looks bell-shaped this may suggest a normal distribution.

----

## Summary statistics for quantitative data

* We return to the `mtcars` example. A summary of the fuel consumption `mpg` can be retrieved using the `favstats`
function:
```{r}
favstats( ~ mpg, data = mtcars)
```
* The output contains the following information
  * **min** The minimal value in the sample is $10.4$. 
  * **max** The maximal value in the sample is $33.9$.
  * **n** The sample size (number of observations) is 32.
  * **mean** The sample mean is $20.1$. Recall that this was the  average of all observations $x_1,\ldots,x_n$, i.e.     $$
      \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i
    $$
  * **sd** The sample standard deviation  is $6.03$. Recall that this was given by
 $$
      s=\frac{1}{n-1}\sum_{i=1}^n  (x_i-\bar{x})^2.
    $$

  * **missing** There are no missing values.

  * **median** The median (or 50-percentile) is the value such that half of the sample has lower values than the median and half the sample has larger values. 
 
  * **Q1** and **Q3** will be introduced on later slides.    

* Both the mean and the median can be considered the center of a distribution. In a symmetric distribution (such as the normal distribution) they are equal, while in a skewed distribution, they tend to be different. 

----

## Calculation of mean, median and standard deviation using **R**
* The mean, median and standard deviation are just some of the summaries that 
can be read of the `favstats` output (shown on previous page). They may also
be calculated separately in the following way:

* Sample size of `mpg`:
```{r}
length(mtcars$mpg) 
```

* Mean of `mpg`:
```{r}
mean( ~ mpg, data = mtcars)
```
* Median of `mpg`:
```{r}
median( ~ mpg, data = mtcars)
```
* Standard deviation for `mpg`:
```{r}
sd( ~ mpg, data = mtcars)
```

*  We may also calculate the summaries within groups. For instance, for each engine type (variable `vs`) the sample mean is:
```{r}
mean( ~ mpg | factor(vs), data = mtcars)
```

----

## Interpretation of summary statistics: The empirical rule

```{r empRule,echo=FALSE,fig.width=12,fig.height=6}
set.seed(5)
x <- rnorm(900)
x <- (x-mean(x))/sd(x)
hist(x,axes=F,xlab="",breaks="FD",ylab="",main="",cex.main = 1.5,ylim=c(-.21,.4),probability=TRUE)
axis(1,at=-3:3,labels=F,pos=0)
axis(1,at=-3,labels=substitute(bar(y)-3*s),pos=0)
axis(1,at=-2,labels=substitute(bar(y)-2*s),pos=0)
axis(1,at=-1,labels=substitute(bar(y)-s),pos=0)
axis(1,at=0,labels=substitute(bar(y)),pos=0)
axis(1,at=1,labels=substitute(bar(y)+s),pos=0)
axis(1,at=2,labels=substitute(bar(y)+2*s),pos=0)
axis(1,at=3,labels=substitute(bar(y)+3*s),pos=0)
arrows(-1,-.1,1,-.1,col="red",code=3,length=.1)
text(-.01,-.115,"About 68% of measurements",col="red")
arrows(-2,-.15,2,-.15,col="blue",code=3,length=.1)
text(-.01,-.1655,"About 95% of measurements",col="blue")
arrows(-3,-.2,3,-.2,code=3,length=.1)
text(-.01,-.215,"All or nearly all measurements")
```

* If the histogram of the sample looks like a bell shaped curve, then we have

  * about 68% of the observations lie between $\bar{y}-s$ and $\bar{y}+s$.
  * about 95% of the observations lie between $\bar{y}-2s$ and $\bar{y}+2s$.
  * All or almost all (99.7%) of the observations lie between $\bar{y}-3s$ and $\bar{y}+3s$.

----

## Percentiles

* **The $p$th percentile** is a value such that about $p$% of the population (or sample) lies below or at this value and about $(100-p)$% of the population (or sample) lies above it.

### Percentile calculation for a sample:
```{r echo=FALSE}
xx<- sort(mtcars$mpg)
nn<-length(xx)
```
* First, sort data from smallest to largest. For the `mpg` variable:
    $$
      x_{(1)}=`r xx[1]`, x_{(2)}=`r xx[2]`, x_{(3)}=`r xx[3]`, \ldots, x_{(n)} = `r xx[nn]`.
    $$
Here the number of observations is $n=`r nn`$.
```{r echo=FALSE}
p <- 10
N <- nn*p/100
```
* Find the $10$th percentile (i. e.\ $p = `r p`$):\ 
    * The observation number corresponding to the 10-percentile is $N = \frac{ `r nn` \cdot `r p`}{100} = `r N`$. 
    * So the 10-percentile lies between the observations with observation number $k=`r floor(N)`$ and $k+1=`r floor(N)+1`$. That is, its value lies somewhere in the interval between $x_{(3)}=13.3$ and $x_{(4)}=14.3$.
    * One of several methods for estimating the 10-percentile from the value of N is defined as:
    $$
      x_{(k)} + (N - k)(x_{(k+1)} - x_{(k)}) 
    $$
which in this case gives
$$
    x_{(3)} + (3.2 - 3)(x_{(4)} - x_{(3)}) = 13.3 + 0.2 \cdot (14.3-13.3) = 13.5.
$$    
    
## Median, quartiles and interquartile range

Recall
```{r}
favstats( ~ mpg, data = mtcars)
```

* 0-percentile = 10.4 is the **minimum** value.
* 50-percentile = 20.1 is the **median** and it is a measure of the center of data.
* 25-percentile = 15.4 is called the **lower quartile** (Q1).\ Median of lower 50% of data.
* 75-percentile = 22.8 is called the **upper quartile** (Q3).\ Median of upper 50% of data.
* 100-percentile = 33.9 is the **maximum** value.
* **Interquartile Range (IQR)**:\ a measure of variability
  given by the difference of the upper and lower quartiles: 23 -15 = 8.



## Box-and-whiskers plots (or simply box plots)

How to draw a box-and-whiskers plot:

* Box:
    * Calculate the median, lower and upper quartiles.
    * Plot a line by the median and draw a box between the upper and
      lower quartiles.
* Whiskers: 
    * Calculate interquartile range and call it IQR.
    * Calculate the following values:
        * L = lower quartile - 1.5*IQR
        * U = upper quartile + 1.5*IQR
    * Draw a line from lower quartile to the smallest measurement,
    which is larger than *L*. 
    * Similarly, draw a line from upper quartile to the largest measurement 
    which is smaller than *U*.
* Outliers: Measurements smaller than *L* or larger than *U* are drawn as circles.

* Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.

```{r echo = FALSE}
set.seed(42)
dat <- data.frame(plottype = factor("boxplot", levels = c("points", "boxplot", "whiskers")),
                  y = c(7, runif(20,12,19), 22))
b <- boxplot(y ~ plottype, data = dat, ylim = c(6, 25), xlab = "", ylab = "", pch = 25)
points(rep(1, length(dat$y)), dat$y)
Q1 <- b$stats[2,2]
Q3 <- b$stats[4,2]
IQR <- Q3-Q1
arrows(3, Q1, 3, Q1-1.5*IQR)
text(x = 3, y = c(Q1-1.5*IQR,Q1), pos = c(1, 3) , labels = c("Q1-1.5IQR", "Q1"), cex = 1)
arrows(3, Q3, 3, Q3+1.5*IQR)
text(x = 3, y = c(Q3, Q3+1.5*IQR), pos = c(1,3), labels = c("Q3", "Q3+1.5IQR"), cex = 1)
abline(h = c(Q1-1.5*IQR, Q1, Q3, Q3+1.5*IQR), lty = 2)
```


----

## Boxplot for fuel consumption 

* Boxplot of the fuel consumption separately for each engine type:

```{r boxplot}
favstats(mpg ~ vs, data = mtcars)
gf_boxplot(mpg ~ factor(vs), data = mtcars)
```

* Cars with engine type 1 seem to use more fuel.
* A single car with engine type 0 differs noticeably from the others with a high fuel consumption.

----

## 2 quantitative variables: Scatter plot

* A **scatter plot** is used to visualize two quantitative variables. 

* For instance, we can plot the relation between fuel consumption and horse powers (`hp`) of a car as follows 

```{r scatter0}
gf_point(mpg ~ hp, data = mtcars)
```

* This can be either split or coloured according to the engine type`vs`:

```{r}
gf_point(mpg ~ hp | factor(vs), data = mtcars)
gf_point(mpg ~ hp, col = ~factor(vs), data = mtcars)
```

* If we want a regression line along with the points we can do:
```{r scatter05}
gf_point(mpg ~ hp, col = ~factor(vs), data = mtcars) %>% gf_lm()
```

# Quantile plots

----

## The empirical quantiles

* Recall that the distribution function of a random variable $X$ was defined as:
$$F(x)=P(X\leq x).$$
* The $\frac{i}{n}$ quantiles of a distribution are the points $q_i$ such that $F(q_i)=\frac{i}{n}$, $i=1,\ldots,n$.
* If we rank the observations in a sample 
$$x_{(1)}\leq x_{(2)}\leq \ldots \leq x_{(n)},$$
 we can approximate $F$ at $x_{(i)}$ by: 
$$\hat{F}(x_{(i)}) = \frac{i}{n}.$$
* Interpretation: $x_{(i)}$ is approximately the $\frac{i}{n}$ quantile.
* Note: various authors may use slightly different quantiles, e.g. $\frac{i-0.5}{n}$ quantiles.

----

## Normal quantile-quantile plots

* The quantiles may be used to investigate whether the sample comes from a normal distribution.

* Call the $\frac{i}{n}$th quantile of a standard normal distribution $q_i$, i.e.\ $P(Z\leq q_i)= \frac{i}{n}$.

* If $Y\sim \texttt{norm}(\mu,\sigma)$, then this is equivalent to is equivalent to $$P(Y\leq \mu+\sigma q_i)= \frac{i}{n}.$$

* Suppose the population follows a $\texttt{norm}(\mu,\sigma)$ distribution, then the sample quantiles $x_{(i)}$ should be approximately equal to the population quantiles $\mu+\sigma q_i$.

* If we make a scatter plot of the pair $(q_i,x_{(i)})$, these should lie on a straight line. We call this a **normal Q-Q plot** (or quantile-quantile plot).

  * **Example:** We investigate whether the `mpg` variable in the `mtcars` data set follows a normal distribution:
```{r}
qqnorm(mtcars$mpg)
qqline(mtcars$mpg)
```
  
  