---
title: "Intro and descriptive statistics"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_section: yes
    toc: yes
---

```{r, include = FALSE}
options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "palmerpenguins"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
knitr::opts_chunk$set(warning = FALSE)
```

# Software

##  **Rstudio** 

* Make a folder on your computer where you want to keep files to use in  **Rstudio**. 
**Do NOT use Danish characters æ, ø, å** in the folder name (or anywhere in the path to the folder).
* Set the working directory to this folder:
`Session -> Set Working Directory -> Choose Directory`
(shortcut: Ctrl+Shift+H).
* Make the change permanent by setting the default directory in:
`Tools -> Global Options -> Choose Directory`.

## **R** extensions

* The functionality of **R** can be extended through libraries or packages (much like plugins in browsers etc.). 
Some are installed by default in **R** and you just need to load them.
* To install a new package in  **Rstudio**  use the menu:
`Tools -> Install Packages`
* You need to know the name of the package you want to install. You can also do it through a command:
```{r eval=FALSE}
install.packages("mosaic")
```
* When it is installed you can load it through the `library` command:
```{r results="hide", message=FALSE}
library(mosaic)
```
* This loads the `mosaic` package which has a lot of convenient functions for this course (we will get back to that later). 
It also prints a lot of info about functions that have been changed by the `mosaic` package, but you can safely ignore that.

## **R** help

* You get help via `?<command>`:
```{r eval=FALSE}
?sum
```
* Use `tab` to make  **Rstudio**  guess what you have started typing.
* Search for help:
```{r eval=FALSE}
help.search("plot")
```

* You can find a cheat sheet with the **R** functions we use for this course 
[here](https://asta.math.aau.dk/course/asta/2022-2/?file=cheatsheet.pdf).
```{r, echo = FALSE, eval = FALSE}
# OLD VERSION OF LINK SHOWN ABOVE -- Short term fix implemented, but more permanent solution should be found... 
# [here](https://asta.math.aau.dk/course/asta/`r gsub("^.*asta/([^/]+)/.*$", "\\1", readLines("../../build-tools/SERVER_SEMESTER_DIR"))`/?file=cheatsheet.pdf).
```

# Data 

## Data example

We use data about pengiuns from the R package [palmerpenguins](https://github.com/allisonhorst/palmerpenguins)
```{r}
pingviner <- palmerpenguins::penguins
pingviner
```

* What is fundamentally different about the the variables (columns) `species` and `body_mass_g`?

## Data types

### Quantitative variables

* The measurements have numerical values.
* Quantative data often comes about in one of the following ways:
    * **Continuous variables**: measurements of time, length, size, age, mass, etc.
    * **Discrete variables**: counts of e.g. words in a text, hits on a webpage, number of arrivals to a queue in one hour, etc.
* Measurements like this have a well-defined scale and in **R** they are stored as the type **numeric**.
* It is important to be able to distinguish between discrete count variables and continuous variables, since this often determines how we describe the uncertainty of a measurement.

* Are any of the measurements in our data set quantitative?

----

### Categorical/qualitative variables

* The measurement is one of a set of given categories, e.g. sex (male/female), social status, satisfaction score (low/medium/high), etc.
* The measurement is usually stored (which is also recommended) as a **factor** in **R**. The possible categories are called **levels**. Example: the levels of the factor "sex" is male/female.
* Factors have two so-called scales:
    + **Nominal scale**: There is no natural ordering of the factor levels, e.g. sex and hair color.
    + **Ordinal scale**: There is a natural ordering of the factor levels, e.g. social status and satisfaction score. A factor in **R** can have a so-called **attribute** assigned, which tells if it is ordinal.

* Are any of the measurements in our data set categorical/qualitative?

# Graphics for quantitative variables

## Scatterplot

- To study the relation between two quantitative variables a scatterplot is used:
```{r}
gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner)
```

----

- We could also draw the graph for each species:
```{r}
gf_point(bill_length_mm ~ bill_depth_mm | species, color = ~ species, data = pingviner)
```

----

- If we want a regression line along with the points we can do:
```{r}
gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner) %>% 
  gf_lm()
```

----

## Histogram

- For a single quantitative variable a histogram offers more details:
```{r}
gf_histogram( ~ bill_length_mm, data = pingviner)
```

- How to make a histogram for some variable `x`:
    * Divide the interval from the minimum value of `x` to the maximum value of `x` in an appropriate number of equal sized sub-intervals.
    * Draw a box over each sub-interval with the height being proportional to the number of observations in the sub-interval.

```{r include=FALSE}
# Ignore this code. It's just used to write correct numbers later in the text.
y <- pingviner %>% filter(species == "Gentoo") %>% pull(bill_length_mm) %>% na.omit()
nn <- length(y)
xx <- sort(y)
```

# Summaries of quantitative variables

## Percentiles

* **The $p$th percentile** is a value such that at least $p$% of the sample lies below or at this value and at least $(100-p)$% of the sample lies above or at the value.

```{r}
Q <- quantile(bill_length_mm ~ species, data = pingviner, na.rm = TRUE)
Q
```

* 50-percentile is the **median** and it is a measure of the center of data as the number of data points below the median is the samme as the number above the median.
* 0-percentile is the **minimum** value.
* 25-percentile is called the **lower quartile** (Q1).\ Median of lower 50% of data.
* 75-percentile is called the **upper quartile** (Q3).\ Median of upper 50% of data.
* 100-percentil is the **maximum** value.
* **Interquartile Range (IQR)**: a measure of variability
  given by the difference of the upper and lower quartiles.
  
## Boxplot

Boxplot can be good for comparing groups (notice we put the values on the y-axis here as it is more conventional for boxplots):

```{r bill length vs depth}
gf_boxplot(bill_length_mm ~ species, color = ~ species, data = pingviner)
```

----

### How to draw a box plot

* Box:
    * Calculate the median, lower and upper quartiles.
    * Plot a line by the median and draw a box between the upper and
      lower quartiles.
* Whiskers: 
    * Calculate interquartile range and call it IQR.
    * Calculate the following values:
        * L = lower quartile - 1.5*IQR
        * U = upper quartile + 1.5*IQR
    * Draw a line from lower quartile to the smallest measurement,
    which is larger than *L*. 
    * Similarly, draw a line from upper quartile to the largest measurement 
    which is smaller than *U*.
* Outliers: Measurements smaller than *L* or larger than *U* are drawn as circles.

*Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.*

```{r echo = FALSE}
xx <- pingviner %>% filter(species == "Gentoo") %>% arrange(bill_length_mm) %>% pull(bill_length_mm) %>% na.omit()
nn <- length(xx)
dat <- data.frame(plottype = factor("boxplot", levels = c("points", "boxplot", "limits")),
                  y = xx)
b <- boxplot(y ~ plottype, data = dat, ylim = c(34, 60), xlab = "", ylab = "", pch = 25, main = "Gentoo bill length")
points(rep(1, length(dat$y)), dat$y)
Q1 <- Q$`25%`[3]
Q3 <- Q$`75%`[3]
IQR <- Q3-Q1
arrows(3, Q1, 3, Q1-1.5*IQR)
text(x = 3, y = c(Q1-1.5*IQR,Q1), pos = c(1, 3) , labels = c("Q1-1.5IQR", "Q1"), cex = 1)
arrows(3, Q3, 3, Q3+1.5*IQR)
text(x = 3, y = c(Q3, Q3+1.5*IQR), pos = c(1,3), labels = c("Q3", "Q3+1.5IQR"), cex = 1)
abline(h = c(Q1-1.5*IQR, Q1, Q3, Q3+1.5*IQR), lty = 2)
```



## Measures of center of data: Mean and median

* A number of numerical summaries can be retrieved using the `favstats` command:
```{r}
favstats(bill_length_mm ~ species, data = pingviner)
```
* The observed values of `bill_length_mm` are $y_1=`r y[1]`$, $y_2=`r y[2]`,\ldots,y_n=`r y[nn]`$, where there are a total of $n=`r nn`$ values.
<!---
(omitting any `NA`s, ie. missing values(Not Available - for some reason)).
?left out?
-->


As previously defined this constitutes a **sample**.

* **mean** = `r round(mean(y))` is the **average** of the sample, which is calculated by
    $$
      \bar{y}=\frac{1}{n}\sum_{i=1}^n y_i.
    $$
    We may also call $\bar{y}$ the **(empirical) mean** or the **sample mean**.
    It is calculated using `mean()` in **R**.
* **median** = `r round(median(y))` is calculated using `median()` in **R**.
* An important property of the **mean** and the **median** is that they have the same unit as the observations (e.g. millimeter).


## Measures of variability of data: range, standard deviation and variance

* The **range** is the difference of the largest and smallest observation (`range()` in **R**).
* The **(empirical) variance** (`var()` in **R**) is the average of the squared deviations from the mean:
      $$
        s^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2.
      $$
* **sd** $=$ **standard deviation** $= s=\sqrt{s^2}$ (`sd()` in **R**).
* Note:\ If the observations are measured in mm,\ the **variance** has unit $\text{mm}^2$ which is hard to interpret. The **standard deviation** on the other hand has the same unit as the observations.
* The standard deviation describes how much data varies around the (empirical) mean.

----

### The empirical rule

```{r empRule,echo=FALSE,fig.width=12,fig.height=6}
set.seed(5)
x <- rnorm(900)
x <- (x-mean(x))/sd(x)
hist(x,axes=F,xlab="",breaks="FD",ylab="",main="",cex.main = 1.5,ylim=c(-.21,.4),probability=TRUE)
axis(1,at=-3:3,labels=F,pos=0)
axis(1,at=-3,labels=substitute(bar(y)-3*s),pos=0)
axis(1,at=-2,labels=substitute(bar(y)-2*s),pos=0)
axis(1,at=-1,labels=substitute(bar(y)-s),pos=0)
axis(1,at=0,labels=substitute(bar(y)),pos=0)
axis(1,at=1,labels=substitute(bar(y)+s),pos=0)
axis(1,at=2,labels=substitute(bar(y)+2*s),pos=0)
axis(1,at=3,labels=substitute(bar(y)+3*s),pos=0)
arrows(-1,-.1,1,-.1,col="red",code=3,length=.1)
text(-.01,-.115,"About 68% of measurements",col="red")
arrows(-2,-.15,2,-.15,col="blue",code=3,length=.1)
text(-.01,-.1655,"About 95% of measurements",col="blue")
arrows(-3,-.2,3,-.2,code=3,length=.1)
text(-.01,-.215,"All or nearly all measurements")
```

If the histogram of the sample looks like a bell shaped curve, then

* about 68% of the observations lie between $\bar{y}-s$ and $\bar{y}+s$.
* about 95% of the observations lie between $\bar{y}-2s$ and $\bar{y}+2s$.
* All or almost all (99.7%) of the observations lie between $\bar{y}-3s$ and $\bar{y}+3s$.

  

