---
title: "Intro and descriptive statistics"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_sections: yes
    toc: yes
---

```{r, include = FALSE}
options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "palmerpenguins"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
knitr::opts_chunk$set(warning = FALSE)
```

# Software

## **RStudio**

* Make a folder on your computer where you want to keep the files you use in **RStudio**.
**Do NOT use Danish characters æ, ø, å** in the folder name (or anywhere in the path to the folder).
* Set the working directory to this folder:
`Session -> Set Working Directory -> Choose Directory`
(shortcut: Ctrl+Shift+H).
* Make the change permanent by setting the default directory in:
`Tools -> Global Options -> Choose Directory`.

## **R** extensions

* The functionality of **R** can be extended through libraries or packages (much like plugins in browsers etc.). 
Some are installed by default in **R** and you just need to load them.
* To install a new package in **RStudio** use the menu:
`Tools -> Install Packages`
* You need to know the name of the package you want to install. You can also do it through a command:
```{r eval=FALSE}
install.packages("mosaic")
```
* When it is installed you can load it through the `library` command:
```{r results="hide", message=FALSE}
library(mosaic)
```
* This loads the `mosaic` package which has a lot of convenient functions for this course (we will get back to that later). 
It also prints a lot of info about functions that have been changed by the `mosaic` package, but you can safely ignore that.

## **R** help

* You get help via `?<command>`:
```{r eval=FALSE}
?sum
```
* Press `Tab` to make **RStudio** autocomplete what you have started typing.
* Search for help:
```{r eval=FALSE}
help.search("plot")
```

* You can find a cheat sheet with the **R** functions we use for this course 
[here](https://asta.math.aau.dk/course/asta/2021-2/?file=cheatsheet.pdf).
```{r, echo = FALSE, eval = FALSE}
# OLD VERSION OF LINK SHOWN ABOVE -- Short term fix implemented, but more permanent solution should be found... 
# [here](https://asta.math.aau.dk/course/asta/`r gsub("^.*asta/([^/]+)/.*$", "\\1", readLines("../../build-tools/SERVER_SEMESTER_DIR"))`/?file=cheatsheet.pdf).
```

# Data 

## Data example

We use data about penguins from the R package [palmerpenguins](https://github.com/allisonhorst/palmerpenguins):
```{r}
pingviner <- palmerpenguins::penguins
pingviner
```

* What is fundamentally different about the variables (columns) `species` and `body_mass_g`?

## Data types

### Quantitative variables

* The measurements have numerical values.
* Quantitative data often comes about in one of the following ways:
    * **Continuous variables**: measurements of time, length, size, age, mass, etc.
    * **Discrete variables**: counts of e.g. words in a text, hits on a webpage, number of arrivals to a queue in one hour, etc.
* Measurements like this have a well-defined scale and in **R** they are stored as the type **numeric**.
* It is important to be able to distinguish between discrete count variables and continuous variables, since this often determines how we describe the uncertainty of a measurement.

* Are any of the measurements in our data set quantitative?
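
As a small illustration of how **R** stores quantitative measurements, here is a sketch with made-up values (the vectors `flipper_mm` and `n_eggs` are hypothetical and not part of the penguin data). Note that counts entered as whole numbers with an `L` suffix get the class `integer`, which **R** still treats as numeric:

```{r}
# Hypothetical continuous measurements (lengths in mm)
flipper_mm <- c(181.5, 186.0, 195.2)
# Hypothetical discrete counts (the L suffix asks for integer storage)
n_eggs <- c(2L, 1L, 2L)

class(flipper_mm)       # class of the continuous variable
class(n_eggs)           # class of the count variable
is.numeric(flipper_mm)  # both kinds count as numeric in R
is.numeric(n_eggs)
```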

----

### Categorical/qualitative variables

* The measurement is one of a set of given categories, e.g. sex (male/female), social status, satisfaction score (low/medium/high), etc.
* The measurement is usually stored (which is also recommended) as a **factor** in **R**. The possible categories are called **levels**. Example: the levels of the factor "sex" are male and female.
* Factors have two so-called scales:
    + **Nominal scale**: There is no natural ordering of the factor levels, e.g. sex and hair color.
    + **Ordinal scale**: There is a natural ordering of the factor levels, e.g. social status and satisfaction score. A factor in **R** can have a so-called **attribute** assigned, which tells **R** that it is ordinal (an *ordered* factor).

* Are any of the measurements in our data set categorical/qualitative?
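
The difference between the two scales can be sketched with two small made-up factors (`hair` and `satisfaction` are hypothetical examples, not columns in the penguin data). The argument `ordered = TRUE` of `factor()` is what marks a factor as ordinal:

```{r}
# Nominal: no natural ordering, levels are sorted alphabetically by default
hair <- factor(c("brown", "blond", "brown", "black"))
levels(hair)
is.ordered(hair)

# Ordinal: we state the order of the levels and mark the factor as ordered
satisfaction <- factor(c("low", "high", "medium", "low"),
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)
is.ordered(satisfaction)
# Comparisons only make sense for ordered factors
satisfaction < "high"
```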

# Graphics for quantitative variables

## Scatterplot

- To study the relation between two quantitative variables a scatterplot is used:
```{r}
gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner)
```

----

- We could also draw the graph for each species:
```{r}
gf_point(bill_length_mm ~ bill_depth_mm | species, color = ~ species, data = pingviner)
```

----

- If we want a regression line along with the points we can do:
```{r}
gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner) %>% 
  gf_lm()
```

----

- A scatterplot is not very good for a single quantitative variable:
```{r}
gf_point(1 ~ bill_length_mm, color = ~ species, data = pingviner)
```

```{r}
gf_point(species ~ bill_length_mm, color = ~ species, data = pingviner)
```

## Histogram

- For a single quantitative variable a histogram offers more details:
```{r}
gf_histogram( ~ bill_length_mm, data = pingviner)
```

- How to make a histogram for some variable `x`:
    * Divide the interval from the minimum value of `x` to the maximum value of `x` into an appropriate number of equal-sized sub-intervals.
    * Draw a box over each sub-interval with the height being proportional to the number of observations in the sub-interval.
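
The two steps above can be sketched by hand in base **R**. This is only an illustration with made-up values (`toy_lengths` and the choice of 4 sub-intervals are assumptions); `gf_histogram()` chooses its own sub-intervals, so its picture will generally differ:

```{r}
# Hypothetical measurements, not taken from the penguin data
toy_lengths <- c(36.2, 38.1, 39.5, 41.0, 42.3, 45.8, 46.1, 47.5, 49.9, 52.0)
n_bins <- 4

# Step 1: equal-sized sub-intervals from the minimum to the maximum
bin_edges <- seq(min(toy_lengths), max(toy_lengths), length.out = n_bins + 1)

# Step 2: count the observations in each sub-interval
# (include.lowest = TRUE makes sure the minimum lands in the first one)
bin_counts <- table(cut(toy_lengths, breaks = bin_edges, include.lowest = TRUE))
bin_counts
```

The heights of the boxes in the histogram are proportional to these counts.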

----

- Not great for comparing groups:
```{r}
gf_histogram( ~ bill_length_mm, color = ~species, data = pingviner)
```

## Boxplot

A boxplot can be good for comparing groups (notice that we put the values on the y-axis here, as is more conventional for boxplots):

```{r bill length vs depth}
gf_boxplot(bill_length_mm ~ species, color = ~ species, data = pingviner)
```

To understand the details of the boxplot we need to introduce **percentiles**/quantiles and in particular quartiles which can be seen here:
```{r}
Q <- quantile(bill_length_mm ~ species, data = pingviner, na.rm = TRUE)
Q
```
* 50-percentile is the **median** and it is a measure of the center of data.
* 0-percentile is the **minimum** value.
* 25-percentile is called the **lower quartile** (Q1).\ Median of lower 50% of data.
* 75-percentile is called the **upper quartile** (Q3).\ Median of upper 50% of data.
* 100-percentile is the **maximum** value.
* **Interquartile Range (IQR)**: a measure of variability
  given by the difference of the upper and lower quartiles: IQR = Q3 - Q1.

* Details about how to find/calculate percentiles are postponed to later.
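
As a small illustration, the quartiles and the IQR of a made-up vector (`toy_x` is hypothetical) can be computed as sketched below. The values come from the default rule used by `quantile()`; other rules for calculating percentiles can give slightly different numbers:

```{r}
# Hypothetical data: the numbers 1 to 9
toy_x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Minimum, lower quartile, median, upper quartile and maximum
toy_q <- quantile(toy_x, probs = c(0, 0.25, 0.5, 0.75, 1))
toy_q

# IQR by hand: upper quartile minus lower quartile
toy_iqr <- toy_q[["75%"]] - toy_q[["25%"]]
toy_iqr

# The built-in function gives the same number
IQR(toy_x)
```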

----

### How to draw a box plot

* Box:
    * Calculate the median, lower and upper quartiles.
    * Draw a line at the median and draw a box between the upper and
      lower quartiles.
* Whiskers: 
    * Calculate interquartile range and call it IQR.
    * Calculate the following values:
        * L = lower quartile - 1.5*IQR
        * U = upper quartile + 1.5*IQR
    * Draw a line from the lower quartile to the smallest measurement
    that is larger than *L*. 
    * Similarly, draw a line from the upper quartile to the largest measurement 
    that is smaller than *U*.
* Outliers: Measurements smaller than *L* or larger than *U* are drawn as circles.

*Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.*
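
The recipe above can be followed step by step on a made-up vector. This is a sketch only: `toy_y` and the helper names are hypothetical, and the quartiles are computed with the default rule of `quantile()`:

```{r}
# Hypothetical data with one unusually large value
toy_y <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)

# Box: lower and upper quartiles
q1 <- unname(quantile(toy_y, probs = 0.25))
q3 <- unname(quantile(toy_y, probs = 0.75))

# Limits L and U based on the interquartile range
iqr <- q3 - q1
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Whiskers end at the most extreme measurements inside the limits
whisker_low  <- min(toy_y[toy_y >= lower_limit])
whisker_high <- max(toy_y[toy_y <= upper_limit])

# Outliers are the measurements outside the limits
outliers <- toy_y[toy_y < lower_limit | toy_y > upper_limit]

c(L = lower_limit, U = upper_limit,
  whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```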

```{r echo = FALSE}
xx <- pingviner %>% filter(species == "Gentoo") %>% arrange(bill_length_mm) %>% pull(bill_length_mm) %>% na.omit()
nn <- length(xx)
dat <- data.frame(plottype = factor("boxplot", levels = c("points", "boxplot", "limits")),
                  y = xx)
b <- boxplot(y ~ plottype, data = dat, ylim = c(34, 60), xlab = "", ylab = "", pch = 25, main = "Gentoo bill length")
points(rep(1, length(dat$y)), dat$y)
Q1 <- Q$`25%`[3]
Q3 <- Q$`75%`[3]
IQR <- Q3-Q1
arrows(3, Q1, 3, Q1-1.5*IQR)
text(x = 3, y = c(Q1-1.5*IQR,Q1), pos = c(1, 3) , labels = c("Q1-1.5IQR", "Q1"), cex = 1)
arrows(3, Q3, 3, Q3+1.5*IQR)
text(x = 3, y = c(Q3, Q3+1.5*IQR), pos = c(1,3), labels = c("Q3", "Q3+1.5IQR"), cex = 1)
abline(h = c(Q1-1.5*IQR, Q1, Q3, Q3+1.5*IQR), lty = 2)
```

# Data wrangling

## Selecting columns/variables

- To subset columns of data use `select()` (automatically loaded from `dplyr` package by `mosaic`):
```{r subsetting data}
# Assigning the data subset to an object
bill_data <- select(pingviner, bill_length_mm, bill_depth_mm,
                    species, sex, island, year)

# The same columns can be selected more briefly by dropping the other two.
# This overwrites the object; the columns now keep their original order:
bill_data <- select(pingviner, -flipper_length_mm, -body_mass_g)
```

- The data set is always the first argument, which gives rise to this "pipe" (`%>%`) syntax:
```{r}
bill_data <- pingviner %>% select(-flipper_length_mm, -body_mass_g)
```

- We read this as: first take the dataset `pingviner` **and then** select all columns except `flipper_length_mm` and `body_mass_g`.

- The resulting dataset doesn't have flipper length and body mass:
```{r}
names(bill_data)
```

## Filtering rows/cases

- We use `filter()` to subset rows/cases. E.g. all penguins from the island Biscoe:
```{r}
pingviner %>% filter(island == "Biscoe")
```

----

- All male Gentoo penguins with flippers longer than 220 mm:
```{r}
pingviner %>% filter(sex == "male") %>% 
  filter(species == "Gentoo") %>% 
  filter(flipper_length_mm > 220)
```

- This could also have been done with a single `filter()` command (output not shown):
```{r results='hide'}
pingviner %>% filter(sex == "male" & species == "Gentoo" & flipper_length_mm > 220)
```
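
One detail worth knowing when filtering on a column with missing values (such as `sex`): `filter()` only keeps rows where the condition is `TRUE`, so rows where the condition is `NA` are dropped. A minimal sketch with a made-up data frame (`toy_sex` is hypothetical):

```{r}
library(dplyr)  # already attached if mosaic is loaded

# Hypothetical data with one missing value in sex
toy_sex <- data.frame(id = 1:4, sex = c("male", NA, "female", "male"))

# The row with missing sex is in neither of these two subsets
males     <- toy_sex %>% filter(sex == "male")
non_males <- toy_sex %>% filter(sex != "male")

# Rows with missing sex must be asked for explicitly
missing_sex <- toy_sex %>% filter(is.na(sex))

nrow(males)
nrow(non_males)
nrow(missing_sex)
```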

----

- All penguins of species Gentoo or Adelie:
```{r }
pingviner %>% filter(species == "Gentoo" | species == "Adelie")
```

- This would be the same as penguins which are not Chinstrap (output not shown):
```{r results='hide'}
pingviner %>% filter(species != "Chinstrap")
```

## Arranging rows/cases

- We use `arrange()` to arrange the order of the rows/cases:
```{r}
pingviner %>% filter(sex == "female") %>% arrange(body_mass_g)
```

- Use `arrange(desc())` for descending values:
```{r}
pingviner %>% filter(sex == "female") %>% arrange(desc(body_mass_g))
```
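
The three verbs can be combined in a single pipeline. As a sketch (the object name `heavy_gentoo` is made up for this illustration), the following takes the Gentoo penguins, keeps three columns and puts the heaviest penguins first:

```{r}
library(dplyr)  # already attached if mosaic is loaded

heavy_gentoo <- palmerpenguins::penguins %>%
  filter(species == "Gentoo") %>%            # rows: only Gentoo
  select(species, sex, body_mass_g) %>%      # columns: only these three
  arrange(desc(body_mass_g))                 # order: heaviest first

# Show the first three rows
head(heavy_gentoo, 3)
```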
