---
title: "Punting exercise"
author: ""
date: ""
output: 
  html_document:
    fig_height: 3
    fig_width: 5
  pdf_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 3
    fig_width: 5
---

```{r, setup, include=FALSE}
require(mosaic)   # Load additional packages here 

# Some customization.  You can alter or delete as desired (if you know what you are doing).
trellis.par.set(theme=theme.mosaic()) # change default color scheme for lattice
knitr::opts_chunk$set(
  tidy=FALSE,     # display code as typed
  size="small")   # slightly smaller font for code
```

## Data

In this exercise, we consider the punting data from [WMMY] Case Study 12.2. An explanation of the variables can be found in the case study. 
```{r}
punting <- read.csv("https://asta.math.aau.dk/eng/static/datasets?file=puntingData.csv", header = TRUE)
head(punting)
```

## Exercise 1 

(a) Answer Exercise 12.47 a. and b. (for b. you can use the `predict()` function; see the slides).
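A minimal sketch of the `lm()`/`predict()` workflow. The response `Hang`, the predictors `RLS` and `LLS`, and the predictor values in `new_obs` are placeholders; substitute the specification actually required by Exercise 12.47.

```{r}
# Sketch only: response and predictors are assumptions, not the exercise's model.
punting <- read.csv("https://asta.math.aau.dk/eng/static/datasets?file=puntingData.csv", header = TRUE)
fit <- lm(Hang ~ RLS + LLS, data = punting)
summary(fit)
# Prediction (with a prediction interval) at hypothetical predictor values:
new_obs <- data.frame(RLS = 170, LLS = 170)
predict(fit, newdata = new_obs, interval = "prediction")
```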

(b) Split the dataset in two: a training dataset with 8 observations and a test dataset with 5 observations
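One way to do the split is to sample 8 row indices at random for the training set and use the remaining 5 rows as the test set:

```{r}
punting <- read.csv("https://asta.math.aau.dk/eng/static/datasets?file=puntingData.csv", header = TRUE)
set.seed(1)                                # for a reproducible split
train_idx <- sample(nrow(punting), 8)
train <- punting[train_idx, ]              # 8 training observations
test  <- punting[-train_idx, ]             # remaining 5 test observations
```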

(c) Fit a model on the training data with the predictors RLS, LLS, LHF and RHF. Compute the MSE when evaluated on the test data and the training data, respectively.
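A sketch of fitting on the training data and computing both MSEs. The response `Hang` is an assumption; use the response from Exercise 1 (a). The same pattern applies to the smaller model in the next part.

```{r}
punting <- read.csv("https://asta.math.aau.dk/eng/static/datasets?file=puntingData.csv", header = TRUE)
set.seed(1)
train_idx <- sample(nrow(punting), 8)
train <- punting[train_idx, ]
test  <- punting[-train_idx, ]
# 'Hang' as response is a placeholder for the response in Exercise 1 (a)
fit4 <- lm(Hang ~ RLS + LLS + LHF + RHF, data = train)
# Mean squared prediction error of 'model' on 'data':
mse <- function(model, data, response)
  mean((data[[response]] - predict(model, newdata = data))^2)
mse(fit4, train, "Hang")   # training MSE
mse(fit4, test, "Hang")    # test MSE
```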

(d) Fit a model on the training data with the predictors LLS and RHF. Compute the MSE when evaluated on the test data and the training data, respectively.

(e) Which of the models has the best training MSE? Which has the best test MSE?

This exercise shows that a model with many predictors (relative to the number of observations) tends to overfit: it fits the training data very well but predicts poorly on new data.


## Exercise 2 

(a) In the model from Exercise 1 a., use the bootstrap to estimate the standard errors of the regression coefficients. Compare with the standard errors in the output from `lm()`.
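A sketch of the case-resampling bootstrap, implemented directly with `replicate()` (the model `Hang ~ RLS + LLS` is a placeholder for the model from Exercise 1 (a)):

```{r}
punting <- read.csv("https://asta.math.aau.dk/eng/static/datasets?file=puntingData.csv", header = TRUE)
B <- 1000
# Resample rows with replacement and refit the model each time:
coef_boot <- replicate(B, {
  idx <- sample(nrow(punting), replace = TRUE)
  coef(lm(Hang ~ RLS + LLS, data = punting[idx, ]))
})
apply(coef_boot, 1, sd)   # bootstrap standard errors, one per coefficient
# Compare with the model-based standard errors from lm():
summary(lm(Hang ~ RLS + LLS, data = punting))$coefficients[, "Std. Error"]
```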

(b) Use resampling of residuals to estimate the standard errors. Compare with the results from Exercise 2 a.
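For residual resampling, the model is fitted once; each bootstrap response is the fitted values plus residuals drawn with replacement (again, `Hang ~ RLS + LLS` is a placeholder for the model from Exercise 1 (a)):

```{r}
punting <- read.csv("https://asta.math.aau.dk/eng/static/datasets?file=puntingData.csv", header = TRUE)
fit <- lm(Hang ~ RLS + LLS, data = punting)   # placeholder model
fitted_vals <- fitted(fit)
res <- resid(fit)
B <- 1000
coef_boot <- replicate(B, {
  # Bootstrap response: fitted values plus resampled residuals
  y_star <- fitted_vals + sample(res, replace = TRUE)
  coef(lm(y_star ~ RLS + LLS, data = punting))
})
apply(coef_boot, 1, sd)   # standard errors from residual resampling
```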


## Exercise 3

In this exercise, we let Distance be the response, see [WMMY] Exercise 12.50 for an explanation of this variable.

(a) Consider a multiple regression with Distance as response. Use cross-validation to compare all possible models using a subset of the variables Power, LLS and RHF as predictors. Which model do you prefer?
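One way to compare all subsets is leave-one-out cross-validation done by hand: for each of the 7 non-empty subsets of {Power, LLS, RHF}, refit the model leaving out one observation at a time and average the squared prediction errors. This is a sketch; k-fold CV or `cv.glm()` from the boot package would also work.

```{r}
punting <- read.csv("https://asta.math.aau.dk/eng/static/datasets?file=puntingData.csv", header = TRUE)
predictors <- c("Power", "LLS", "RHF")
# All 7 non-empty subsets of the three predictors:
subsets <- unlist(lapply(1:3, function(k) combn(predictors, k, simplify = FALSE)),
                  recursive = FALSE)
loocv_mse <- sapply(subsets, function(vars) {
  form <- as.formula(paste("Distance ~", paste(vars, collapse = " + ")))
  errs <- sapply(seq_len(nrow(punting)), function(i) {
    fit <- lm(form, data = punting[-i, ])           # fit without observation i
    punting$Distance[i] - predict(fit, newdata = punting[i, , drop = FALSE])
  })
  mean(errs^2)                                      # leave-one-out MSE
})
names(loocv_mse) <- sapply(subsets, paste, collapse = "+")
loocv_mse   # prefer the model with the smallest cross-validated MSE
```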

