library(MASS)
library(tidyverse)
library(broom)
library(modelr)
theme_set(theme_bw(base_size = 16))

Reference: Chapters 2 and 5 from Introduction to Statistical Learning with applications in R (ISLR) by James, Witten, Hastie and Tibshirani. Freely available from statlearning.com.

Cross-validation

Cross-validation is primarily used for two purposes: model selection (deciding how flexible the model should be, including tuning hyper-parameters) and model assessment (estimating the test error of the final model).

Two types of error

We distinguish the training error from the test error. The test error is the expected error we commit when making predictions on future observations. The training error is very often a poor estimate of the test error and will almost always underestimate it.

We also refer to the test error as the generalization error. We are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.
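A minimal simulation makes the distinction concrete (the data-generating function, noise level, sample sizes, and polynomial degree below are arbitrary choices for illustration): fit a model on one sample and compute its error both on that sample and on a fresh one.

set.seed(1)
f <- function(x) sin(2 * x)                      # the "true", normally unknown, function
train_df <- tibble(x = runif(100, 0, 3), y = f(x) + rnorm(100, sd = 0.3))
test_df  <- tibble(x = runif(100, 0, 3), y = f(x) + rnorm(100, sd = 0.3))
fit <- lm(y ~ poly(x, 5), data = train_df)
mean((train_df$y - predict(fit, train_df))^2)    # training MSE
mean((test_df$y - predict(fit, test_df))^2)      # test (generalization) MSE, typically larger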

Model complexity

One of the most common uses of cross-validation is model selection: how flexible should the model be, i.e. how easily should it adapt to small (local) changes in the data?

An illustrative example

Figure (from ISLR): Left: data simulated from \(f\), shown in black. Three estimates of \(f\) are shown: the linear regression line (orange curve) and two smoothing spline fits (blue and green curves). Right: training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel.

And now with a more linear data-generating model.

And the opposite: a highly non-linear one.

The more flexible the method, the higher the variance of the fitted model: it changes substantially when estimated on new data. Conversely, methods that change little when exposed to new data typically have a larger bias.

Let \(y_0 = f(x_0) + \varepsilon\). Then the expected test error can be decomposed into \[\mathbb{E}\!\left[(y_0 - \hat{f}(x_0))^2\right] = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\varepsilon),\] where

  • Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different \(\hat{f}\).
  • Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases.
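We can check the decomposition empirically with a small simulation sketch (the function \(f\), noise level, evaluation point x0, and polynomial degree are again arbitrary illustration choices): refit the same model on many training sets and look at the spread and the systematic offset of \(\hat{f}(x_0)\).

set.seed(1)
f  <- function(x) sin(2 * x)
x0 <- 1.5                            # point at which we evaluate the fit
fhat_x0 <- map_dbl(1:500, function(i) {
  train <- tibble(x = runif(50, 0, 3), y = f(x) + rnorm(50, sd = 0.3))
  predict(lm(y ~ poly(x, 2), data = train), tibble(x = x0))
})
var(fhat_x0)                         # variance: spread of \hat{f}(x0) across training sets
(mean(fhat_x0) - f(x0))^2            # squared bias: systematic error of \hat{f}(x0)
# Increasing the polynomial degree raises the variance and lowers the bias.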

\(k\)-fold cross-validation

Ideally we would have sufficient data to set aside about 25% for testing (assessment of the final prediction error), 25% for validation (used in the model selection phase to assess model complexity), and the remaining 50% for training the model.
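As a sketch, such a split can be drawn at random; with mtcars the 50/25/25 proportions happen to give whole numbers of rows:

set.seed(1)
n <- nrow(mtcars)
part <- sample(rep(c("train", "validation", "test"), times = n * c(0.5, 0.25, 0.25)))
splits <- split(mtcars, part)
map_int(splits, nrow)   # 16 training, 8 validation, 8 test rows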

However, we often want to make better use of the data and obtain more than a single estimate of the test error. \(k\)-fold cross-validation makes this possible.

The data are split into \(k\) folds of roughly equal size. Each fold in turn plays the role of test data, while the remaining \(k-1\) folds together form the training data. This yields \(k\) estimates of the test error, which can be used to assess the model complexity needed to capture the phenomenon at hand.
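Before turning to modelr's helper below, a hand-rolled sketch (with a deliberately simple lm(mpg ~ hp) model on mtcars) shows the mechanics:

set.seed(1)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(mtcars)))  # randomly assign each row to a fold
cv_mse <- map_dbl(1:k, function(j) {
  fit <- lm(mpg ~ hp, data = mtcars[fold != j, ])    # train on the other k - 1 folds
  held_out <- mtcars[fold == j, ]
  mean((held_out$mpg - predict(fit, held_out))^2)    # test error on the held-out fold
})
cv_mse        # k estimates of the test error
mean(cv_mse)  # the usual k-fold CV estimate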

Cross-validation is also used to tune hyper-parameters: parameters that are not estimated by the statistical learning algorithm but are held fixed while the other model parameters are estimated.

We will make extensive use of cross-validation in the last week of the course.

Example in R

The flexibility of a regression model can be increased by, for example, allowing predictors to interact or to enter at higher orders (i.e. polynomial terms).

mtcars %>% ggplot(aes(x = hp, y = mpg)) + 
  geom_point() + 
  geom_smooth(se = FALSE) +                                 # flexible loess smoother
  geom_smooth(se = FALSE, method = "lm", colour = "black")  # rigid straight-line fit

Should there be a term to account for the curvature?

mtcars_cv <- crossv_kfold(mtcars, k = 10)  # split mtcars into 10 folds
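crossv_kfold returns a tibble with an id column (.id) and two list-columns, train and test, holding modelr resample objects (lazy row-index pointers into mtcars):

mtcars_cv
as.data.frame(mtcars_cv$test[[1]])  # materialise the first held-out fold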

mtcars_power <- function(power, data = mtcars_cv){
  data %>% 
    mutate(
      # fit a polynomial of degree `power` on each training fold
      models = map(data$train, ~ lm(mpg ~ poly(hp, power), data = .)),
      # RMSE on the data each model was trained on ...
      train = map2_dbl(models, data$train, rmse),
      # ... and on the corresponding held-out fold
      test = map2_dbl(models, data$test, rmse)
    ) %>% 
    select(train, test) %>% 
    mutate(hp_power = power)
}

mtcars_rmse <- map_df(1:6, mtcars_power) %>%   # polynomial degrees 1 through 6
  gather(error_type, rmse, train, test) %>%    # long format for plotting
  mutate(hp_power = factor(hp_power))

mtcars_rmse_mean <- mtcars_rmse %>% 
  group_by(hp_power, error_type) %>% 
  summarise(`mean rmse` = mean(rmse, trim = 0.05)) %>%  # trimmed mean across folds
  ungroup()

mtcars_rmse %>% 
  ggplot(aes(x = hp_power, y = rmse, fill = error_type)) + 
  geom_boxplot() + 
  scale_y_log10()

mtcars_rmse_mean %>% 
  ggplot(aes(x = hp_power, y = `mean rmse`, colour = error_type, group = error_type)) + 
  geom_point() +
  geom_line()