---
title: "Rmarkdown example"
author: "Team ASTA"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

You need to install the `mosaic` package, which we will use a lot throughout the
course, and the `palmerpenguins` package, which contains the data (on penguins)
analysed in this introduction. Either use the menu in RStudio or type the
command `install.packages(c("mosaic", "palmerpenguins", "tidyr"))` in the R
console (this also installs `tidyr`, which we use later).
```{r packages, message=FALSE}
# Installing packages:
# Packages only need to be installed once, so normally the installation
# commands are typed in the console and never written in a document. Since this
# is an educational document we show the two commands below, but as comments so
# that they cannot cause an error when you knit the document.
# To install, copy each of the two lines below (without the leading "# ")
# into the console and hit Enter:
# install.packages(c("mosaic", "palmerpenguins"))
# install.packages("tidyr") # Another package that we will use later

# We only need to install a package once, but we need to load it in each
# new R-session that we run:
library(palmerpenguins)
library(mosaic)

# Now that the palmerpenguins package is loaded, the penguin data is accessible
# for us through the name "penguins" even though the data set is not visible in
# the environment tab:
# head(penguin) # Spelling matters: Small mistakes can lead to errors (comment this line before knitting)
head(penguins)
```

To learn about the function `head`, or any other function that comes with a
package or is built into R, we can open the function's help page by placing the
text cursor on the function name and pressing F1, or we can type `?head` into
the console and hit Enter.
The help pages provide useful information on how to use the function and which
arguments it takes.  
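
As a small illustration of what a help page tells you: the help page for `head`
documents an argument `n` that controls how many elements are returned. The
sketch below uses the built-in vector `letters` instead of the penguin data so
that it runs on its own.

```{r}
# `letters` is a built-in vector holding the 26 lower-case letters
head(letters)          # default: the first 6 elements
head(letters, n = 3)   # the argument `n` changes how many are returned
```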

Since the penguin data came with a package, we can also read information about
the data by typing `?penguins` into the console (but since `penguins` isn't a
function we cannot use the F1 key for this).  

The data contains observations from `r nrow(penguins)` penguins, where
characteristics such as species, sex, flipper length, body mass, bill length and
bill depth (https://github.com/allisonhorst/palmerpenguins#bill-dimensions) have
been observed/measured and are stored in the correspondingly named columns.
We will often refer to each column in a data frame as a variable since we think
of its content as realizations of a random variable.
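
To make the row/column terminology concrete, here is a minimal sketch with a
made-up data frame called `toy` (the name and the numbers are invented for
illustration and are not penguin measurements). Each row is an observation and
each column is a variable.

```{r}
# A made-up data frame with three observations and two variables
toy <- data.frame(
  species = c("A", "B", "A"),
  mass    = c(3500, 4200, 3900)
)

nrow(toy)    # number of observations (rows)
names(toy)   # names of the variables (columns)
toy$mass     # one variable extracted as a vector
```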

## Data manipulation

If we are only interested in a subset of the variables in the penguin data, we
can extract these variables using `select()` from the `dplyr` package, which is automatically loaded together with `mosaic`:
```{r subsetting data}
# Assigning the data subset to an object
penguin_data <- select(penguins, species, bill_length_mm, bill_depth_mm,
                       flipper_length_mm, body_mass_g, sex)

# This particular subset can be written shorter by (overwriting the object):
penguin_data <- select(penguins, -year, -island)
```

So the first input to `select()` is a dataset while the remaining are names of variables (columns) we wish to select (or omit if we use `-`).
The special role of the first argument as the dataset is sometimes used in a different syntax we will embrace in the course:
```{r}
penguin_data <- penguins %>% select(-year, -island)
```
The so-called "pipe" (`%>%`) sends the output of one command as the first input to the next command, which lets us chain commands together. We read it as "and then".
So above we assign a value to the object `penguin_data` in the following way: first take the dataset `penguins` **and then** select all columns except `year` and `island`.
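
The equivalence between the two ways of writing a command can be checked
directly. The sketch below uses a small made-up data frame `toy` (an
illustrative name, not part of the course data) and assumes `dplyr` is
installed, which it is whenever `mosaic` is.

```{r}
library(dplyr)  # provides select() and the pipe

# Made-up data frame used only for this illustration
toy <- data.frame(id = 1:3, x = c(2.5, 3.1, 4.0), y = c("a", "b", "c"))

nested <- select(toy, -id)      # ordinary function call
piped  <- toy %>% select(-id)   # the same call written with the pipe

identical(nested, piped)        # do the two versions give the same result?
```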

If we also wanted to rename some variables (columns) we could continue the chain of commands and instead write:
```{r}
penguin_data <- penguins %>% 
  select(-year, -island) %>% 
  rename(bill_length = bill_length_mm,
         bill_depth = bill_depth_mm,
         flipper_length = flipper_length_mm,
         mass = body_mass_g)
```

Again we can check our result by printing the beginning of the dataset:
```{r}
head(penguin_data)
```

The data contain missing values:
```{r}
summary(is.na(penguin_data))
```
Here `is.na()` indicates for each element in the dataset whether it is missing (`NA`) or not, and `summary()` gives a summary of each column.
We see that there are more missing values in the sex variable than in
the variables for bill length and bill depth (for instance, row 4 is missing every
variable except species, while rows 9 to 12 are only missing the value for sex).
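
If you prefer plain counts, one possible alternative is to sum the logical
values returned by `is.na()`, since `TRUE` counts as 1. The sketch below uses a
made-up data frame `toy` (invented values) so that the counts are easy to check
by eye.

```{r}
# Made-up data with some missing values (NA)
toy <- data.frame(
  length = c(39.1, NA, 40.3, 36.7),
  sex    = c("male", NA, NA, "female")
)

is.na(toy)            # TRUE where a value is missing
colSums(is.na(toy))   # number of missing values in each column
```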

One goal with the penguin data could be to assess whether we can
determine the species of a penguin from the measured bill length and bill depth.
For this we make a reduced data set with only these variables and make sure to omit penguins with missing values in any of them:
```{r data cleaning 1}
df_species <- penguin_data %>% 
  select(species, bill_length, bill_depth) %>% 
  filter(!is.na(species)) %>%
  filter(!is.na(bill_length)) %>% 
  filter(!is.na(bill_depth))
```
Instead of checking each variable for missing values we could have used the function `drop_na()` from the `tidyr` package:
```{r data cleaning 2}
# We always need to install a package in order to use its functions, but we
# can actually use functions from packages that we didn't load if we tell R in
# which package to look. We tell this to R by writing package::function
df_species <- penguin_data %>% 
  select(species, bill_length, bill_depth) %>% 
  tidyr::drop_na() # Note that we didn't load the tidyr package
```
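
As a quick sanity check that the two cleaning approaches agree, the sketch
below applies both to a small made-up data frame `toy` (invented values). It
uses the `package::function` notation throughout, so it assumes that `dplyr`
and `tidyr` are installed but does not require them to be loaded.

```{r}
# Made-up data: rows 2 and 3 each contain a missing value
toy <- data.frame(
  species = c("A", "B", NA, "A"),
  length  = c(39.1, NA, 40.3, 36.7)
)

by_filter  <- dplyr::filter(toy, !is.na(species), !is.na(length))
by_drop_na <- tidyr::drop_na(toy)

nrow(by_filter)    # rows left after filtering each variable
nrow(by_drop_na)   # rows left after drop_na()
```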

Another goal could be to look at possible sex differences in the relations between the variables.
In this case we keep `sex` and all the other variables and delete any rows with missing values:
```{r data cleaning 3}
df_sex <- penguin_data %>% 
  tidyr::drop_na()
```

## Plotting

To assess if we can determine the species of a penguin from the measured bill
length and bill depth, we can plot these variables against each other:
```{r bill length vs depth}
gf_point(bill_length ~ bill_depth, color = ~ species, data = penguin_data)
```

The plot command above gives a warning because `penguin_data` still contains
rows with missing values, which are removed before plotting.
We can suppress warnings by inserting the chunk option `warning=FALSE` next to
the chunk name (i.e. to the right of "bill length vs depth"). A variety of chunk
options can be turned on and off by clicking the gear icon in the top right
corner of the code chunk.
In this case we can of course also avoid the warning message by using the
cleaned up data frame `df_species` that we made for this particular analysis:
```{r bill length vs depth 2}
gf_point(bill_length ~ bill_depth, color = ~ species, data = df_species)
```

Don't worry if you think that this command looks a bit complex - you will
get to know exactly how to use it during the course. For now the relevant thing
to understand is that the function `gf_point` plots the bill length and bill
depth against each other and colours the points according to the species of
the penguins.
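
The `y ~ x` notation used in the plot command is an ordinary R object called a
formula. In `gf_point()` the variable to the left of `~` goes on the vertical
axis and the one to the right on the horizontal axis. The minimal sketch below
(the object name `f` is just for illustration) shows that a formula can be
stored and inspected without any data:

```{r}
# A formula is written as  y ~ x  (left-hand side ~ right-hand side)
f <- bill_length ~ bill_depth

class(f)      # the class of the stored object
all.vars(f)   # variable names: left-hand side first, then right-hand side
```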

From the plot we clearly see that the measurements cluster together according
to the species of the penguins: the Adelie penguins tend to have
shorter bills than the Chinstrap and Gentoo penguins, while the Gentoo
penguins tend to have a smaller bill depth than the other two species.
This indicates that we can make a rather qualified guess of a penguin's species
just by knowing the bill length and depth.
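
One way to put numbers on this visual impression is to compute the mean bill
length and depth for each species. The sketch below is only one possible
approach: it uses base R's `aggregate()` on the original `penguins` data (with
the original column names) and assumes the `palmerpenguins` package is
installed.

```{r}
# Mean bill length and depth per species, using the original column names.
# With a formula, aggregate() omits rows with missing values by default.
bill_means <- aggregate(cbind(bill_length_mm, bill_depth_mm) ~ species,
                        data = palmerpenguins::penguins, FUN = mean)
bill_means
```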

To investigate sex differences within each species we can make a similar plot to
the ones above, but with different shapes of the points for each sex:
```{r sex differences}
gf_point(bill_length ~ bill_depth, color = ~ species, shape = ~ sex, data = df_sex)
```

It is a bit difficult to distinguish between the plotted symbols. Since we
look at each species separately anyway, we can instead split the plot into one
panel per species and use colour to show the sex variable:
```{r sex differences 2}
gf_point(bill_length ~ bill_depth | species, color = ~ sex, data = df_sex)
```

From this plot we clearly see a tendency: male penguins tend to have longer and
slightly deeper bills in all species. But can we use the same straight line to describe the relationship between bill length and depth for both sexes?
```{r sex differences 3}
gf_point(bill_length ~ bill_depth | species, color = ~ sex, data = df_sex) %>%
  gf_lm(group = ~1)
```

Maybe a single straight line for both sexes isn't a good idea, at
least not for the Adelie species: if we focus on the points for just
one of the sexes, a line fitted to that sex alone would clearly fit better.
So we fit a separate line for each sex:
```{r sex differences 4}
gf_point(bill_length ~ bill_depth | species, color = ~ sex, data = df_sex) %>%
  gf_lm(group = ~sex)
```

This plot reveals how difficult it is for a human eye to judge what the best
straight line for a group of points will look like. Therefore statistics and
computer software are indispensable tools.
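
To see what the software computes, here is a minimal sketch that fits a
straight line by least squares with base R's `lm()`. It assumes that the lines
in the plots above are least-squares lines of this kind, and it uses a made-up
data frame `toy` (invented numbers) instead of the penguin data.

```{r}
# Made-up points that lie roughly on a straight line
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

fit <- lm(y ~ x, data = toy)   # least-squares fit of y = a + b * x
coef(fit)                      # estimated intercept (a) and slope (b)
```

The same formula notation as in the plots is used here: the response is to the
left of `~` and the explanatory variable to the right.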

## Exercise

Now it is your turn to explore if we can predict a penguin's body mass from its flipper length.
First make the same type of plot as we started with above by changing variable names in the command below:
```{r sex differences 5}
## Change the command below to see the relation between mass and flipper length
gf_point(bill_length ~ bill_depth | species, color = ~ sex, data = df_sex)
```

You can also try to insert a new R chunk and make a plot with straight lines added
to the points.

## Additional Markdown

Note that we have given each code chunk a name so far. These names have to be
unique or we will get an error when we try to knit the full document.
However, we don't have to name a code chunk, and we can have as many unnamed
chunks as we want, but a nice thing about well chosen names is that we can more
easily navigate to a specific chunk using the button that says "(Top Level)"
located in the bottom left corner of the source pane.
```{r Non-unique name}

```

<!-- ```{r Non-unique name} -->
<!-- # comment this code chunk before knitting -->
<!-- ``` -->

In R Markdown there are two kinds of comments: R comments, which are written inside
our code chunks (`# comment`), and Markdown comments, which are used outside
the code chunks to comment out full lines (<!-- comment -->).
Note that the Markdown comment syntax can only be seen in the Rmd file.