---
title: "Resampling techniques"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    keep_tex: yes
    highlight: tango
    number_sections: yes
    toc: yes

---

```{r, include = FALSE}
options(digits = 2)
# Install any of the packages used below that are missing
missing_pkgs <- setdiff(c("mosaic", "stringr", "tibble", "tidyr", "dplyr", "caret", "boot", "splines"),
                        rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
library(tibble)
library(tidyr)
library(dplyr)
```


# Resampling techniques

Topics:

* Overfitting (the model fits too well to the observed data)
* Generalisation (how well a model performs on a new sample)
* Cross-validation (estimate out-of-sample prediction error)
* Bootstrap (estimate standard errors)

# Model complexity

----

## A linear model for tree data

```{r, include=FALSE}
library(mosaic)
```

```{r, message=FALSE }
trees <- read.delim("https://asta.math.aau.dk/datasets?file=trees.txt")
head(trees)
```

* We consider the dataset `trees`, containing the following observations on 31 trees:
  * Response: Volume - timber volume
  * Predictor: Girth - the tree diameter

* We consider a linear model.
$$Y= \alpha + \beta \cdot x + \varepsilon$$

```{r}
m0 <- lm(Volume ~ Girth, data = trees)
plotModel(m0)
summary(m0)
```

* We obtain the prediction equation
$$\hat{y}= -36.9 + 5.07 \cdot x $$

* The model has $R^2 = 0.935$
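* As a quick sanity check, we can evaluate the prediction equation at a hypothetical girth of 15 (a value chosen purely for illustration) and compare with `predict()`; the results agree up to rounding of the coefficients:

```{r}
# Prediction for a hypothetical tree with girth 15 (illustrative value)
predict(m0, newdata = data.frame(Girth = 15))
# The same from the rounded prediction equation by hand:
-36.9 + 5.07 * 15
```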


----

## A polynomial model

* We can also try to fit a second degree polynomial to the data
$$Y = \alpha + \beta_1 x + \beta_2 x^2 + \varepsilon$$

```{r}
m1 <- lm(Volume ~ poly(Girth, 2), data = trees)
plotModel(m1)
summary(m1)
```

* Prediction equation (note that `poly()` fits an *orthogonal* polynomial basis by default, so the coefficients multiply transformed terms $p_1(x)$ and $p_2(x)$ rather than $x$ and $x^2$ directly)
$$\hat{y} = 30.2 + 87.1\, p_1(x) + 14.6\, p_2(x) $$
* $R^2 = 0.962$
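
* As a sketch, refitting with `raw = TRUE` gives the coefficients of the raw powers $x$ and $x^2$ directly (the fitted values, and hence $R^2$, are identical):

```{r}
m1_raw <- lm(Volume ~ poly(Girth, 2, raw = TRUE), data = trees)
coef(m1_raw)
```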

----

## Another polynomial model

* Or a polynomial of degree 7
$$Y = \alpha + \beta_1 x + \beta_2 x^2 + \dotsm + \beta_{7} x^{7}+ \varepsilon$$

```{r}
m1_bad <- lm(Volume ~ poly(Girth, 7), data = trees)
plotModel(m1_bad)
```

* Polynomials tend to behave wildly at the ends.

----

## A natural spline

* A natural spline is a piecewise third-degree polynomial that joins smoothly at a set of knots and is constrained to be linear beyond the boundary knots. It can also be fitted to the data.

```{r}
library(splines)
m2 <- lm(Volume ~ ns(Girth, 15), data = trees)
plotModel(m2)
```

* The model is very "wiggly" to get near the data points. 
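
* The second argument of `ns()` is the degrees of freedom, which controls the flexibility. As a sketch, a smaller value (here 3, chosen just for illustration) gives a much smoother fit:

```{r}
m2_smooth <- lm(Volume ~ ns(Girth, 3), data = trees)
plotModel(m2_smooth)
```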

* How to compare this model with the others?

# Measures of fit

----

## $R^2$ and correlation

* We can compare the models using $R^2$

```{r}
summary(m0)$r.squared
summary(m1)$r.squared
summary(m2)$r.squared
```

* $R^2$ never decreases when the model is made more complex, so it favours the more complex models

* Note that $R^2$ is the squared correlation between the observed response values $y_i$ and the values predicted by the prediction equation $\hat{y}_i$
$$R^2 = \operatorname{cor}(y_i, \hat{y}_i)^2$$
```{r}
# or: m0$fitted
cor(trees$Volume,predict(m0, newdata = trees))^2
cor(trees$Volume,predict(m1, newdata = trees))^2
cor(trees$Volume,predict(m2, newdata = trees))^2
```

----

## Mean squared error

* We can also compare the models using mean squared errors (MSE)
$$MSE=\frac{1}{n} \sum_i (y_i- \hat{y}_i)^2$$

```{r}
mean((trees$Volume - predict(m0, newdata = trees))^2)
mean((trees$Volume - predict(m1, newdata = trees))^2)
mean((trees$Volume - predict(m2, newdata = trees))^2)
```


* The most complicated model has the lowest MSE

* This is no surprise: the model is fitted using least squares, i.e. by minimising the MSE on the observed data.

* The model is trained to predict the observed data points well.
  * Would it also predict well on new data points?

# Out-of-sample error 

----


## Reproducibility and random number generation

* The code below generates three random numbers three times. 
```{r}
rnorm(3)
rnorm(3)
rnorm(3)
```

* We get a new sample in each try
* If we want to be sure we always get the same, we can set a seed.

```{r}
set.seed(1)
rnorm(3)
set.seed(1)
rnorm(3)
```

----

## Out-of-sample error

* Our dataset contains $n=31$ observations

* We now split the dataset in two: 
  * A **training dataset** consisting of 20 observations
  * A **test dataset** consisting of 11 observations

```{r}
set.seed(1)
train_idx <- sample(x = seq_len(nrow(trees)), size = 20, replace = FALSE)
trees_train <- trees[train_idx, ]
nrow(trees_train)
trees_test <- trees[-train_idx, ]
nrow(trees_test)
```

----

## Training the models 

* We use only the training data for fitting the models

```{r}
m0_train <- lm(Volume ~ Girth, data = trees_train)
m1_train <- lm(Volume ~ poly(Girth, 2), data = trees_train)
m2_train <- lm(Volume ~ ns(Girth, 15), data = trees_train)
```

----

```{r}
plotModel(m0_train) + 
  geom_point(aes(Girth, Volume), data = trees_test, color = "red")
```

----

```{r}
plotModel(m1_train) + 
  geom_point(aes(Girth, Volume), data = trees_test, color = "red")
```

----

```{r}
plotModel(m2_train) + 
  geom_point(aes(Girth, Volume), data = trees_test, color = "red")
```



----


## Testing the models

* We now use the test dataset to test the models
  * We predict the response in the test dataset 
  * We then compare the predictions to the observed response

```{r}
cor(predict(m0_train, newdata = trees_test), trees_test$Volume)^2
cor(predict(m1_train, newdata = trees_test), trees_test$Volume)^2
cor(predict(m2_train, newdata = trees_test), trees_test$Volume)^2
```

* The linear model has the highest correlation between observations and predictions

```{r}
mean((predict(m0_train, newdata = trees_test) - trees_test$Volume)^2)
mean((predict(m1_train, newdata = trees_test) - trees_test$Volume)^2)
mean((predict(m2_train, newdata = trees_test) - trees_test$Volume)^2)
```

* The quadratic polynomial has the smallest MSE
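
* For reference, the test-set comparison can be collected in one call (a sketch using the models fitted above):

```{r}
models <- list(linear = m0_train, quadratic = m1_train, spline = m2_train)
# Test MSE for each model
sapply(models, function(m) mean((predict(m, newdata = trees_test) - trees_test$Volume)^2))
```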

----

## Summary

* When we test on the same data as we train the model on, we get lower MSE and higher $\operatorname{cor}(y_i,\hat{y}_i)$ for the more complex model

* When we test on new data, the more complicated model does not predict well

* **Overfitting:** A complex model tends to fit too well to the training data, but does not fit well to new data.

# Cross-validation

----

## Testing predictive ability

* Ideally, we should test a model's predictive ability on new data that was not used to fit the model

* Typically, only one dataset is available
  * A solution could be to split the dataset into training and test data
  * But this wastes data that could have been used for fitting

* Solution: repeat the splitting of the data multiple times

----

## Cross-validation

* Cross-validation provides a clever way of repeating the training and testing of a model
* Divide the data into $k$ folds
* In each iteration, fit the model on $k-1$ folds and test on the remaining fold
* E.g. $k$-fold cross-validation for $k=10$:

```{r, echo=FALSE, fig.width=8, fig.height=6}
d <- expand.grid(fold = 1:10, test = 1:10) |> 
  as_tibble() |> 
  mutate(label = case_when(
    test == fold ~ "Test data",
    TRUE ~ "Training data"
  ))
ggplot(d, aes(xmin = fold-0.4, xmax = fold+0.4, ymin = test, ymax = test+1)) +
  geom_rect(aes(fill = label)) + 
  geom_text(aes(fold, test + 0.5, label = test)) +
  theme_void() +
  labs(fill = NULL)
```

* Benefits:
  * We use most of the data for fitting the model
  * Each observation is used once for testing
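
* To make the procedure concrete, here is a minimal hand-rolled sketch of 4-fold CV for the linear model (the `caret` package, introduced below, automates this):

```{r}
set.seed(1)
k <- 4
# Randomly assign each observation to one of k folds
fold <- sample(rep(1:k, length.out = nrow(trees)))
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(Volume ~ Girth, data = trees[fold != i, ]) # train on k-1 folds
  pred <- predict(fit, newdata = trees[fold == i, ])   # predict on fold i
  mean((pred - trees$Volume[fold == i])^2)             # test MSE for fold i
})
mean(cv_mse)  # cross-validated MSE
```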

----

## Example

* The number of folds depends on the size of the dataset (often $k=5$ or $k=10$)
* We have 31 observations
* 4-fold cross-validation seems suitable

* Divide the data into 4 folds

```{r, echo=FALSE, fig.width=5, fig.height=8}
set.seed(1)
d_trees_raw <- tibble(idx = split(seq_len(nrow(trees)), sample(rep(1:4, length.out = nrow(trees))))) 

d_trees_raw |> 
  mutate(test = row_number()) |> 
  rowwise() |> 
  mutate(indices = paste0(idx, collapse = ",")) |> 
  ggplot(aes(xmin = 1, xmax = 2, ymin = test, ymax = test+1)) +
  geom_rect(fill = "white", color = "black") + 
  geom_text(aes(1.5, test + 0.5, label = indices)) +
  theme_void() +
  labs(fill = NULL)
```

* "Rotate" which folds are training data and which one is test data:

```{r, echo=FALSE, fig.width=8, fig.height=6}
d_trees <- d_trees_raw |> 
  mutate(test = row_number()) |>
  mutate(fold = list(1:4)) |> 
  unnest(fold) |> 
  rowwise() |> 
  mutate(indices = paste0(idx, collapse = ",")) |> 
  mutate(label = case_when(
    test == fold ~ "Test data",
    TRUE ~ "Training data"
  )) 

ggplot(d_trees, aes(xmin = fold-0.4, xmax = fold+0.4, ymin = test, ymax = test+1)) +
  geom_rect(aes(fill = label)) + 
  geom_text(aes(fold, test + 0.5, label = indices)) +
  theme_void() +
  labs(fill = NULL)
```

----

## Repeated CV

* Cross-validation may be repeated several times

```{r, echo=FALSE, fig.width=8, fig.height=6}
set.seed(1)
d_trees_raw <- tibble(repeat_no = 1:5) |> 
  group_by(repeat_no) |> 
  mutate(idx = list(split(seq_len(nrow(trees)), sample(rep(1:4, length.out = nrow(trees)))))) |> 
  unnest(idx)

d_trees_raw |> 
  group_by(repeat_no) |> 
  mutate(test = row_number()) |> 
  rowwise() |> 
  mutate(indices = paste0(idx, collapse = ",")) |> 
  ggplot(aes(xmin = repeat_no-0.4, xmax = repeat_no+0.4, ymin = test, ymax = test+1)) +
  geom_rect(fill = "white", color = "black") + 
  geom_text(aes(repeat_no, test + 0.5, label = indices)) +
  theme_void() +
  labs(fill = NULL)
```


---- 

## Cross-validation in R

* The `caret` package can be used for cross-validation in R

```{r, message=FALSE,warning=FALSE}
library(caret)
```

<https://cran.r-project.org/package=caret>

<https://topepo.github.io/caret/>

* We first set up the cross-validation

```{r}
train_control <- trainControl(method = "repeatedcv",
                              # number of folds (k = 4)
                              number = 4,
                              # the whole CV procedure is repeated five times
                              repeats = 5)
```
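
* Here `method = "repeatedcv"` requests repeated $k$-fold CV; a single (non-repeated) round of $k$-fold CV would use `method = "cv"`, e.g.:

```{r}
# Non-repeated 4-fold CV (shown for comparison; not used below)
train_control_cv <- trainControl(method = "cv", number = 4)
```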

* Then we carry out the cross-validation

```{r}
set.seed(1)
m0_cv <- train(Volume ~ Girth, data = trees, trControl = train_control, method = "lm")
m1_cv <- train(Volume ~ poly(Girth, 2), data = trees, trControl = train_control, method = "lm")
m2_cv <- train(Volume ~ ns(Girth, 15), data = trees, trControl = train_control, method = "lm")
```

----

## Result of cross-validation

```{r}
m0_cv
```

* **RMSE** is the root mean squared error, i.e. $\sqrt{MSE}$

----

## More on RMSE

* Here is the resulting RMSE for all folds and all repetitions:

```{r}
m0_cv$resample
```

* The average of these RMSE values is the overall RMSE

```{r}
mean(m0_cv$resample$RMSE)
```

* This can also be obtained directly:

```{r}
m0_cv$results$RMSE
```


----

## Model comparison

* We obtain the cross-validated **RMSE** for the three models

```{r}
m0_cv$results$RMSE
m1_cv$results$RMSE
m2_cv$results$RMSE
```

* The quadratic model has the lowest **RMSE** and hence the best predictive power
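
* As a sketch, `caret` can also gather the resampling results from several `train` objects for a side-by-side summary (strictly, the models should share the same CV folds, e.g. via the `index` argument of `trainControl()`; here we only illustrate the mechanics):

```{r}
# Collect the resampling results of the three fitted train objects
rs <- resamples(list(linear = m0_cv, quadratic = m1_cv, spline = m2_cv))
summary(rs)
```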

<!-- ---- -->

<!-- Extra (with `tidyverse`, remember `library(tidyverse)`): -->

<!-- ```{r} -->
<!-- d <- bind_rows( -->
<!--   m0_cv$resample |> mutate(Model = "Volume ~ Girth"), -->
<!--   m1_cv$resample |> mutate(Model = "Volume ~ poly(Girth, 2)") -->
<!-- ) -->
<!-- d_mean <- d |>  -->
<!--   group_by(Model) |>  -->
<!--   summarise(RMSE = mean(RMSE), .groups = "drop") -->

<!-- ggplot(d, aes(Model, RMSE)) +  -->
<!--   geom_boxplot() +  -->
<!--   geom_jitter(height = 0, width = 0.1) +  -->
<!--   geom_point(data = d_mean, color = "red", shape = 4, size = 5) -->
<!-- ``` -->



# Non-parametric bootstrap

----

## Sampling variability

* When we estimate a parameter from a sample, there is some uncertainty due to the fact that the sample is random

* A new sample would result in new estimates

* The standard error is the standard deviation of the estimate when we repeat the sampling many times
  * Measures the uncertainty of the estimate

* However, we only have one sample available

----

## Bootstrap principle

* Idea: create new samples by resampling $n$ observations from the original data **with replacement** (the same observation may be sampled several times)
* This mimics drawing new samples from the population

* **Example:** Data indices:
```{r}
index <- c(1, 2, 3, 4, 5)
```
* Bootstrap sample indices:
```{r}
set.seed(1)
boot_index <- sample(index, replace = TRUE)
boot_index
```

* Observation 1 appears twice in the bootstrap sample
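
* Tabulating the indices makes the resampling explicit:

```{r}
# Count how often each observation occurs in the bootstrap sample
table(boot_index)
```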

----

## Bootstrap data example

* We want to fit a linear model to the tree data. The coefficients of the linear model can be extracted with `coef()`:

```{r}
m0 <- lm(Volume ~ Girth, data = trees)
coef(m0)
```

* We want to estimate their standard errors using bootstrap.

* To prepare for the bootstrap, we define a function that takes a vector of indices of the bootstrap observations as input, fits the linear regression on those observations, and extracts the coefficients:

```{r}
model_coef <- function(index){
  coef(lm(Volume ~ Girth, data = trees, subset = index))
}
model_coef(1:nrow(trees))
set.seed(1)
model_coef(sample(1:nrow(trees), replace = TRUE))
model_coef(sample(1:nrow(trees), replace = TRUE))
```

----

## Bootstrap data example - continued

* We now create 1000 bootstrap samples and estimate the linear regression coefficients for each

* We view the first ten results

```{r}
set.seed(1)
bootstrap_coefs <- replicate(1000, {
  model_coef(sample(1:nrow(trees), replace = TRUE))
})
bootstrap_coefs[, 1:10]
```

* Below we plot the regression lines for the original data (red) and for the first ten bootstrap samples (grey)

```{r, echo=FALSE}
p <- gf_point(Volume ~ Girth, data = trees) 

for (i in 1:10) {
  p <- p %>% 
    gf_abline(intercept = bootstrap_coefs[1, i], slope = bootstrap_coefs[2, i], color = "darkgrey") 
}

p %>% gf_abline(intercept = coef(m0)[1], slope = coef(m0)[2], color = "red")
```


----

## Bootstrap estimates for the standard error

* We estimate the standard error by taking the standard deviation of the 1000 parameter estimates

```{r}
apply(bootstrap_coefs, 1, sd) # applies the function sd to each row in the matrix bootstrap_coefs
```

* This can be compared to the standard errors found by lm() using theoretical formulas

```{r}
summary(m0)
```
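
* As a side note, the same replicates can also give percentile-based confidence intervals, here using the 2.5% and 97.5% quantiles (a sketch):

```{r}
# Percentile interval for each coefficient across the 1000 replicates
apply(bootstrap_coefs, 1, quantile, probs = c(0.025, 0.975))
```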




----

## The boot package

* Bootstrapping can be done automatically using the `boot` package in R: 
<https://cran.r-project.org/package=boot>

```{r, message=FALSE}
library(boot)
```

* We now need a function that takes both the dataset and an index vector as arguments and returns the linear regression coefficients.

```{r}
model_coef_boot <- function(data, index){
  coef(lm(Volume ~ Girth, data = data, subset = index))
}
```

* Then the bootstrap is carried out as follows

```{r}
set.seed(1)
b <- boot(trees, model_coef_boot, R = 1000)
b
coef(summary(m0))
```
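
* The individual replicates are stored in `b$t` (one row per bootstrap sample), so the standard errors reported by `boot` can also be computed by hand:

```{r}
# Standard deviation of each coefficient across the replicates
apply(b$t, 2, sd)
```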

# Bootstrap by resampling residuals

* Idea:
  * First fit regression line
  * Compute residuals $\hat{\varepsilon}_i = y_i - \hat{y}_i$
  * Create a new dataset by replacing $y_i$ with $\hat{y}_i + \hat{\varepsilon}_{i,new}$, where $\hat{\varepsilon}_{i,new}$ is sampled with replacement from the residuals $\hat{\varepsilon}_1, \dots, \hat{\varepsilon}_n$

* Can be used if residuals are not normally distributed

* First fit the model

```{r}
m0 <- lm(Volume ~ Girth, data = trees)
```

* Construct 1000 new samples with resampled residuals.

```{r}
set.seed(1)
res_bootstrap_coefs <- replicate(1000, {
  new_y <- m0$fitted.values + sample(m0$residuals, replace = TRUE)
  coef(lm(new_y ~ trees$Girth))
})
```

* We view the first ten sets of coefficients and compute the standard deviation of each parameter across all samples

```{r}
res_bootstrap_coefs[, 1:10]
apply(res_bootstrap_coefs, 1, sd)
```

* Compare with ordinary bootstrap

```{r}
apply(bootstrap_coefs, 1, sd)
```

* Compare with lm()

```{r}
coef(summary(m0))
```

