library(MASS)
library(tidyverse)
library(broom)
library(modelr)
theme_set(theme_bw(base_size = 16))
The bootstrap is a very powerful yet simple technique that has saved me many headaches. Instead of thinking very hard about how to derive specific uncertainty estimates of parameter values, or transformations thereof, the bootstrap re-sampling technique can help.
Some references:

- Bootstrapping (statistics) on Wikipedia: https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

As listed in the above Wikipedia article, there are many types of bootstraps. However, the most common ones to use are case re-sampling and residual re-sampling.
The simplest form of the bootstrap procedure is to re-sample the data with replacement to obtain a sample of the same size as the original data, say \(n\). Let \(x\) denote the original data; then \(x^*_b\), \(b = 1,\dots,B\), are the \(B\) bootstrap samples, all of size \(n\).
The basic principle is that the deviation of an estimate, \(\hat\theta\), from the parameter of interest, \(\theta\), can be mimicked by the deviation of the bootstrap estimates \(\hat\theta^*_b\) around \(\hat\theta\).
That is, \(\theta - \hat\theta\) follows approximately the same distribution as \(\hat\theta - \hat\theta^*\). For the former we only know \(\hat\theta\), whereas for the latter we know both estimates. By re-sampling many times, say \(B = 1{,}000\), we can compute bootstrap standard deviations and bootstrap confidence intervals.
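To make the principle concrete, the whole procedure fits in a few lines of base R. Here is a minimal sketch for the mean of a sample, where the data x and the choice \(B = 1{,}000\) are purely illustrative:
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)       # illustrative data
theta_hat <- mean(x)                    # estimate on the original data
theta_star <- replicate(1000, mean(sample(x, replace = TRUE)))  # bootstrap estimates
sd(theta_star)                          # bootstrap standard deviation
quantile(theta_star, c(0.025, 0.975))   # percentile confidence interval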
modelr
To assess the uncertainties in the parameter estimates of a regression, we may use the bootstrap.
boot <- modelr::bootstrap(mtcars, 1000)  # 1,000 case re-samples of the data
models <- boot$strap %>% map(.f = ~lm(mpg ~ hp, data = .x))  # re-fit the model on each re-sample
params <- map_df(models, broom::tidy, quick = TRUE, .id = "id")  # collect all coefficient estimates
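Each element of boot$strap is a lightweight resample object storing the original data together with the re-sampled row indices; if needed, a sample can be materialised explicitly:
boot$strap[[1]]                                 # a resample object
boot$strap[[1]] %>% as.data.frame() %>% head()  # the actual re-sampled rows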
We can compute summaries based on the bootstrap estimates:
params_summary <- params %>%
  group_by(term) %>%
  summarise(q = list(
    c(conf = quantile(estimate, p = c(0.025, 0.975)) %>% set_names(c("low", "high")),
      estimate = mean(estimate),
      std.error = sd(estimate)) %>%
      enframe())
  ) %>%
  unnest() %>%
  spread(name, value)
params_summary
# A tibble: 2 x 5
  term        conf.high conf.low estimate std.error
  <chr>           <dbl>    <dbl>    <dbl>     <dbl>
1 hp            -0.0461  -0.0999  -0.0700    0.0137
2 (Intercept)   34.3     26.4     30.2       2.06
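The enframe()/unnest()/spread() steps above are one way to get everything into a single tibble; a more direct sketch of the same summary (computing estimate last so the column of that name is not shadowed inside summarise) could be:
params %>%
  group_by(term) %>%
  summarise(conf.low  = unname(quantile(estimate, 0.025)),
            conf.high = unname(quantile(estimate, 0.975)),
            std.error = sd(estimate),
            estimate  = mean(estimate))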
Since we are in the world of linear models, theory tells us the distributions of the estimators and how to compute confidence intervals.
mtcars_coef <- mtcars %>%
lm(mpg ~ hp, data = .) %>%
tidy(conf.int = TRUE)
mtcars_coef
# A tibble: 2 x 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)  30.1       1.63       18.4  6.64e-18  26.8      33.4
2 hp           -0.0682    0.0101     -6.74 1.79e- 7  -0.0889   -0.0476
car

In the car package, the Boot function enables residual-based bootstrap estimates.
library(car)
mtcars_resid_boot <- Boot(lm(mpg ~ hp, data = mtcars), method = "residual", R = 999)
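The returned object is a standard boot object, and car also ships summary and confint methods for it, which offer a ready-made alternative to the manual summary below:
summary(mtcars_resid_boot)                 # bootstrap bias and standard errors
confint(mtcars_resid_boot, type = "perc")  # percentile confidence intervals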
mtcars_resid_boots <- mtcars_resid_boot$t %>%
  as_tibble() %>%
  gather(term, estimate) %>%
  group_by(term) %>%
  summarise(q = list(
    c(conf = quantile(estimate, p = c(0.025, 0.975)) %>% set_names(c("low", "high")),
      estimate = mean(estimate),
      std.error = sd(estimate)) %>%
      enframe())
  ) %>%
  unnest() %>%
  spread(name, value)
mtcars_resid_boots
# A tibble: 2 x 5
  term        conf.high conf.low estimate std.error
  <chr>           <dbl>    <dbl>    <dbl>     <dbl>
1 hp            -0.0477  -0.0872  -0.0678    0.0101
2 (Intercept)   33.5     26.8     30.0       1.65
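Under the hood, residual re-sampling keeps the design fixed and re-samples only the residuals. A minimal sketch of the idea (the names fit, y_star, and boot_coefs are illustrative):
fit <- lm(mpg ~ hp, data = mtcars)
boot_coefs <- replicate(999, {
  y_star <- fitted(fit) + sample(residuals(fit), replace = TRUE)  # new response
  coef(lm(y_star ~ hp, data = mtcars))                            # re-fit
})
apply(boot_coefs, 1, sd)  # compare with the std.error column above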
We may plot and compare the bootstrap confidence intervals to the theoretical ones.
mtcars_pars <- list(
  theoretical = mtcars_coef %>% mutate(at = -2),
  bootstrap = params_summary %>% mutate(at = -4),
  `residual bootstrap` = mtcars_resid_boots %>% mutate(at = -6)
) %>%
  bind_rows(.id = "method")
mtcars_pars %>%
  ggplot(aes(x = estimate, colour = method)) +
  geom_histogram(data = params, aes(colour = NULL)) +
  geom_point(aes(y = at)) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = at)) +
  facet_wrap(~term, scales = "free_x")
We see that the residual-based approach in this case gives a more accurate approximation of the theoretical results. For more elaborate bootstrap analyses, the boot package is recommendable. The Boot function from the car package works for lm and nls models. For other types of models the error term is not additive (as is required for the residual re-sampling procedure above), so case re-sampling should be used instead.
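For completeness, a minimal sketch of case re-sampling with the boot package mentioned above; the statistic function receives the data and an index vector (coef_fun is an illustrative name):
library(boot)
coef_fun <- function(data, idx) coef(lm(mpg ~ hp, data = data[idx, ]))
b <- boot(mtcars, statistic = coef_fun, R = 1000)
boot.ci(b, type = "perc", index = 2)  # percentile CI for the hp slope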