library(MASS)
library(tidyverse)
library(broom)
library(modelr)
theme_set(theme_bw(base_size = 16))

Bootstrap

Bootstrap is very power yet simple technique that has saved many headaches. Instead of thinking very hard to how to derive specific uncertainty estimates of parameter values or transformations thereof, the bootstrap re-sampling technique can help.

Some references:

As listed in the above Wikipedia article, there are many types of bootstraps. However, the most common ones to use is the case re-sampling and residual re-sampling.

Case resampling

The simplest form of the bootstrap procedure is to re-sample the data with replacement to obtain a sample of the same as the original data, say \(n\). Let \(x\) denote the original data, then \(x^*_b\), \(b = 1,\dots,B\) are the \(B\) bootstrap samples all of size \(n\).

The basic principle is that the deviation that an estimate, \(\hat\theta\), parameter of interest, \(\theta\), can be mimicked by the deviation that the estimates \(\hat\theta^*_b\) has around \(\hat\theta\).

That is, \(\theta - \hat\theta\) follows the same distribution as \(\hat\theta - \hat\theta^*\). For the former we only knows \(\hat\theta\), whereas for the latter we know both estimates. By re-sampling many times, \(B = 1,000\), we can compute the bootstrap standard deviations and bootstrap confidence intervals.

Resampling residuals

  1. Fit the model and retain the fitted values \(\hat{y}_i\) and the residuals \(\hat{e}_i = y_i - \hat{y}_i\), \(i = 1,\dots, n\).
  2. For each pair, \((x_i, y_i)\), where \(x_i\) is the explanatory variable(s), add a randomly re-sampled residual, \(\hat{e}_j\), to the fitted value \(\hat{y}_i\). That is, synthetic responses \(y^*_i = \hat{y}_i + \hat{e}_j\) is generated based on random re-allocation of the noise
  3. Refit the model using the synthetic response \(y^*_i\), and store the coefficients \(\hat\beta^*\)
  4. Repeat steps 2 and 3 a \(B\) times.

Using modelr

To assess the uncertainties in the parameter estimates of an regression we may use the bootstrap.

boot <- modelr::bootstrap(mtcars, 1000)
models <- boot$strap %>% map(.f = ~lm(mpg ~ hp, data = .x))
params <- map_df(models, broom::tidy, quick = TRUE, .id = "id")

We can compute summaries based on the bootstrap estimates

params_summary <- params %>% 
  group_by(term) %>% 
  summarise(q = list(
    c(conf = quantile(estimate, p = c(0.025, 0.975)) %>% set_names(c("low", "high")), 
      estimate = mean(estimate),
      std.error = sd(estimate)) %>% 
      enframe())
    ) %>% 
  unnest() %>% 
  spread(name, value)
params_summary
# A tibble: 2 x 5
  term        conf.high conf.low estimate std.error
  <chr>           <dbl>    <dbl>    <dbl>     <dbl>
1 hp            -0.0461  -0.0999  -0.0700    0.0137
2 (Intercept)   34.3     26.4     30.2       2.06  

Since we are in the world of linear models, the theory tells us the distributions of the estimators and how to compute confidence intervals.

mtcars_coef <- mtcars %>% 
  lm(mpg ~ hp, data = .) %>% 
  tidy(conf.int = TRUE)
mtcars_coef
# A tibble: 2 x 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)  30.1       1.63       18.4  6.64e-18  26.8      33.4   
2 hp           -0.0682    0.0101     -6.74 1.79e- 7  -0.0889   -0.0476

In the car package, the Boot function enables residual based bootstrap estimates.

library(car)
mtcars_resid_boot <- Boot(lm(mpg ~ hp, data = mtcars), method = "residual", R = 999)

mtcars_resid_boots <- mtcars_resid_boot$t %>% as_tibble() %>% 
  gather(term, estimate) %>% 
  group_by(term) %>% 
  summarise(q = list(
    c(conf = quantile(estimate, p = c(0.025, 0.975)) %>% set_names(c("low", "high")), 
      estimate = mean(estimate),
      std.error = sd(estimate)) %>% 
      enframe())
    ) %>% 
  unnest() %>% 
  spread(name, value)
mtcars_resid_boots
# A tibble: 2 x 5
  term        conf.high conf.low estimate std.error
  <chr>           <dbl>    <dbl>    <dbl>     <dbl>
1 hp            -0.0477  -0.0872  -0.0678    0.0101
2 (Intercept)   33.5     26.8     30.0       1.65  

We may plot and compare the bootstrap confidence intervals to the theoretical ones.

mtcars_pars <- list(
  theoretical = mtcars_coef %>% mutate(at = -2), 
  bootstrap = params_summary %>% mutate(at = -4),
  `residual bootstrap` = mtcars_resid_boots %>% mutate(at = -6)
  ) %>% 
  bind_rows(.id = "method")
mtcars_pars %>% 
  ggplot(aes(x = estimate, colour = method)) + 
  geom_histogram(data = params, aes(colour = NULL)) + 
  geom_point(aes(y = at)) + 
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = at)) + 
  facet_wrap(~term, scales = "free_x")

We see that the residual based approach in this case gives a more accurate

More stuff

  • For a richer set of functionality the boot package is recommendable.
  • For regression residual bootstraps the Boot function from the car package works for lm and nls models. For other types of models the error term is not additive (as is required for the procedure above)