The ASTA team
Topics:
## Girth Height Volume
## 1 8.3 70 10
## 2 8.6 65 10
## 3 8.8 63 10
## 4 10.5 72 16
## 5 10.7 81 19
## 6 10.8 83 20
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.943 3.365 -11.0 7.6e-12 ***
## Girth 5.066 0.247 20.5 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.2 on 29 degrees of freedom
## Multiple R-squared: 0.935, Adjusted R-squared: 0.933
## F-statistic: 419 on 1 and 29 DF, p-value: <2e-16
We obtain the prediction equation \[\hat{y}= -36.9 + 5.07 \cdot x \]
The model has \(R^2 = 0.935\)
##
## Call:
## lm(formula = Volume ~ poly(Girth, 2), data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.489 -2.429 -0.372 2.076 7.645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.171 0.599 50.37 < 2e-16 ***
## poly(Girth, 2)1 87.073 3.335 26.11 < 2e-16 ***
## poly(Girth, 2)2 14.592 3.335 4.38 0.00015 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.3 on 28 degrees of freedom
## Multiple R-squared: 0.962, Adjusted R-squared: 0.959
## F-statistic: 350 on 2 and 28 DF, p-value: <2e-16
The model is very “wiggly” to get near the data points.
How to compare this model with the others?
## [1] 0.94
## [1] 0.96
## [1] 0.98
\(R^2\) is always higher for more complex models
Note that \(R^2\) is the squared correlation between the observed response values \(y_i\) and the values predicted by the prediction equation \(\hat{y}_i\) \[R^2 = cor(y_i, \hat{y}_i)^2\]
## [1] 0.94
## [1] 0.96
## [1] 0.98
## [1] 17
## [1] 10
## [1] 4.9
The more complicated model has lowest MSE
The model is fitted using least squares, i.e. minimising MSE.
The model is trained to predict the datapoints well.
## [1] 0.16 -1.68 -0.47
## [1] -1.570 -0.579 -0.014
## [1] 0.33 0.69 -0.65
## [1] -0.63 0.18 -0.84
## [1] -0.63 0.18 -0.84
Our dataset contains \(n=31\) observations
We now split the dataset in two:
set.seed(1)
train_idx <- sample(x = seq_len(nrow(trees)), size = 20, replace = FALSE)
trees_train <- trees[train_idx, ]
nrow(trees_train)
## [1] 20
## [1] 11
## [1] 0.98
## [1] 0.97
## [1] 0.016
## [1] 40
## [1] 17
## [1] 2074
When we test on the same data as we train the model on, we get lower MSE and higher \(cor(y_i,\hat{y}_i)\) for the more complex model
When we test on new data, the more complicated model does not predict well
Overfitting: A complex model tends to fit too well to the training data, but does not fit well to new data.
Ideally, we should test a models predictive ability on new data that was not used to fit the model
Typically, only one dataset is available
Solution: repeat the splitting of data multiple times
Number of folds depends on size of dataset (often \(k=5\) or \(k=10\))
We have 31 observations
4-fold cross validation seems suitable
Divide data into 4 fold
https://cran.r-project.org/package=caret
https://topepo.github.io/caret/
train_control <- trainControl(method = "repeatedcv",
# k-fold CV
number = 4,
# repeated five times
repeats = 5)
## Linear Regression
##
## 31 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 5 times)
## Summary of sample sizes: 24, 23, 23, 23, 23, 23, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.5 0.95 3.7
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## RMSE Rsquared MAE Resample
## 1 4.7 0.91 4.2 Fold1.Rep1
## 2 4.2 0.91 3.6 Fold2.Rep1
## 3 5.3 0.99 3.3 Fold3.Rep1
## 4 4.7 0.96 4.2 Fold4.Rep1
## 5 3.8 0.93 3.2 Fold1.Rep2
## 6 5.6 0.93 4.6 Fold2.Rep2
## 7 2.7 0.97 2.0 Fold3.Rep2
## 8 5.0 0.95 4.7 Fold4.Rep2
## 9 3.5 0.97 2.7 Fold1.Rep3
## 10 4.7 0.93 4.2 Fold2.Rep3
## 11 4.6 0.96 3.6 Fold3.Rep3
## 12 5.2 0.96 4.2 Fold4.Rep3
## 13 4.4 0.93 3.6 Fold1.Rep4
## 14 5.3 0.95 4.2 Fold2.Rep4
## 15 3.5 0.96 2.7 Fold3.Rep4
## 16 4.1 0.95 3.4 Fold4.Rep4
## 17 3.7 0.98 3.0 Fold1.Rep5
## 18 4.4 0.92 3.7 Fold2.Rep5
## 19 6.2 0.98 4.6 Fold3.Rep5
## 20 4.8 0.92 4.2 Fold4.Rep5
## [1] 4.5
## [1] 4.5
## [1] 4.5
## [1] 3.5
## [1] 86
When we estimate a parameter from a sample, there is some uncertainty due to the fact that the sample is random
A new sample would result in new estimates
The standard error is the standard deviation of the estimate when we repeat the sampling many times
However, we only have one sample available
Idea: Create new samples by resampling \(n\) observations from original data with replacement (the same observation may be sampled several times)
Mimic new samples
Example: Data indices:
## [1] 1 4 1 2 5
## (Intercept) Girth
## -36.9 5.1
We want to estimate their standard errors using bootstrap.
To prepare for the bootstrap, we define a function that takes as input a vector of indices of the bootstrap observations and does linear regression and extracts coefficients:
model_coef <- function(index){
coef(lm(Volume ~ Girth, data = trees, subset = index))
}
model_coef(1:nrow(trees))
## (Intercept) Girth
## -36.9 5.1
## (Intercept) Girth
## -29.8 4.5
## (Intercept) Girth
## -34.4 4.8
We now create 1000 bootstrap samples and estimate the linear regression coefficients for each
We view the first ten results
set.seed(1)
bootstrap_coefs <- replicate(1000, {
model_coef(sample(1:nrow(trees), replace = TRUE))
})
bootstrap_coefs[, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## (Intercept) -29.8 -34.4 -34.3 -42.6 -36 -35.0 -36.5 -39.4 -34.0 -32.1
## Girth 4.5 4.8 4.8 5.5 5 4.9 5.1 5.2 4.8 4.7
## (Intercept) Girth
## 3.98 0.32
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.943 3.365 -11.0 7.6e-12 ***
## Girth 5.066 0.247 20.5 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.2 on 29 degrees of freedom
## Multiple R-squared: 0.935, Adjusted R-squared: 0.933
## F-statistic: 419 on 1 and 29 DF, p-value: <2e-16
boot
package in R: https://cran.r-project.org/package=boot##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = trees, statistic = model_coef_boot, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -36.9 0.372 4.05
## t2* 5.1 -0.038 0.33
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9 3.37 -11 7.6e-12
## Girth 5.1 0.25 20 8.6e-19
Idea:
Can be used if residuals are not normally distributed
First fit the model
set.seed(1)
res_bootstrap_coefs <- replicate(1000, {
new_y <- m0$fitted.values + sample(m0$residuals, replace = TRUE)
coef(lm(new_y ~ trees$Girth))
})
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## (Intercept) -38.4 -38.5 -34.0 -36 -35 -31.7 -38.6 -40.8 -30.5 -38.8
## trees$Girth 5.2 5.1 4.8 5 5 4.6 5.2 5.3 4.6 5.2
## (Intercept) trees$Girth
## 3.31 0.24
## (Intercept) Girth
## 3.98 0.32
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9 3.37 -11 7.6e-12
## Girth 5.1 0.25 20 8.6e-19