---
title: "Analysis of Covariance"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_sections: yes
    toc: yes
---

```{r, include = FALSE}
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "grid", "jpeg"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# The regression problem
## Example

* We will study the dataset in Agresti Table 13.1, available as `Income.txt` on the course website. We read the data into R:

```{r}
Income <- read.delim("https://asta.math.aau.dk/datasets?file=Income.txt")
```

* We have a sample with measurements of 3 variables:
    + `y = inc`: Quantitative variable, which is yearly income. This will be our response.
    + `x = educ`: Quantitative predictor, which is the number of years of education.
    + `z = race`: Explanatory factor with levels `b` (black), `h` (hispanic) and `w` (white).
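
* A quick look at the first rows shows how the variables are coded:

```{r}
head(Income)
```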

* We always start with some graphics (remember the function  `gf_point` for plotting points and `gf_lm` for adding a regression line).

```{r message=FALSE}
library(mosaic)
gf_point(inc ~ educ, color = ~race, data = Income) %>% gf_lm()
```

* An unclear picture, but a tendency to increasing income with increasing education.
* The trend lines for the three races are different. But is the difference significant? Or can the difference be explained by sampling variation?
* Such a regression with both qualitative and quantitative predictors is called an analysis of covariance (ANCOVA). When the model only contains qualitative predictors, the problem is known as analysis of variance (ANOVA) which is the topic of the next lecture.

# Dummy coding
## Dummy coding

* First, we will look at the model **without interaction**, i.e. the effect of `education` is the same for all races, which corresponds to parallel lines.

* We also have to introduce dummy coding of the factor $z$:
    + $z_1=1$ if `race=b` and zero otherwise
    + $z_2=1$ if `race=h` and zero otherwise
    
* This determines the regression model: $$E(y|x,z) = \alpha + \beta x + \beta_1 z_1 + \beta_2 z_2$$ which corresponds to **parallel** regression lines, one for each race.

* `w`: ($z_1=0,z_2=0$) $E(y|x) = \alpha + \beta x$

* `b`: ($z_1=1,z_2=0$) $E(y|x) = \alpha + \beta_1 + \beta x$.

* `h`: ($z_1=0,z_2=1$) $E(y|x) = \alpha + \beta_2 + \beta x$.  

* $\beta_1$ is the difference in `Intercept` between black and white.

* $\beta_2$ is the difference in `Intercept` between Hispanic and white.

## Example

* We want to tell `R` that we want `w` as reference level for `race` (by default the levels are ordered lexicographically, i.e. `(b, h, w)`, so `b` would be the reference). Since `read.delim` reads `race` as a character vector (from R 4.0 on), we first convert it to a factor:

```{r}
Income$race <- relevel(factor(Income$race), ref = "w")
```
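
* With `w` as reference, R's automatic dummy coding now matches $z_1$ and $z_2$ above. A minimal peek at the design matrix illustrates this:

```{r}
# Columns raceb and raceh are exactly the dummies z1 and z2 defined above
head(model.matrix(~ educ + race, data = Income))
```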

* Then we use `+` in the model formula to only have additive effects of `educ` and `race`, i.e. a model without interaction:

```{r}
model1 <- lm(inc ~ educ + race, data = Income)
summary(model1)
```

* The common slope for `educ` is estimated to be $\widehat\beta = `r round(model1$coefficients["educ"], 4)`$, with corresponding p-value=$4.42\times 10^{-10}$, so the slope is significantly different from zero.
* There is a clear positive effect of `educ` on `income`.
* The estimate for `w`-intercept is $\widehat\alpha=-15.6635$, which isn't significantly different from zero if we test at level 5\% (this test is not really of interest).
* **The difference** between `b`- and `w`-intercept (`raceb`) is $\widehat\beta_1=-10.8744$, which is significant with p-value=1.74\%.
* There is no significant difference between `h`- and `w`-intercept.
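
* The individual tests can be supplemented with 95\% confidence intervals; e.g. the interval for `raceb` quantifies the uncertainty of the estimated intercept difference between `b` and `w`:

```{r}
confint(model1)
```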

```{r}
plotModel(model1)
```

## Example: Prediction equations

```{r}
summary(model1)
```

* Reference/baseline group (white): $$\widehat y = -15.66 + 4.43 x$$
* Black: $$\widehat y = -15.66 - 10.87 + 4.43 x = -26.54 + 4.43 x $$
* Hispanic: $$\widehat y = -15.66 - 4.93 + 4.43 x = -20.60 + 4.43 x $$
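
* The prediction equations can also be evaluated directly with `predict`. As a sketch, the predicted income at 10 years of education for each group (the value 10 is just an arbitrary illustration):

```{r}
# Should match the three prediction equations above evaluated at x = 10
predict(model1, newdata = data.frame(educ = 10, race = c("w", "b", "h")))
```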

## Agresti -- summary

```{r, fig.width=12, echo=FALSE, fig.align='center'}
# was ![](https://asta.math.aau.dk/static-files/asta/img/noInterac.jpg)
url <- "https://asta.math.aau.dk/static-files/asta/img/noInterac.jpg"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(jpeg::readJPEG(z))
invisible(file.remove(z))
```

# Model with interaction
## Interaction

* In the following we will expand the model to include interaction between the effects of race and education on income. Before proceeding, let us recall what interaction means (and doesn't mean) in this context:
* Interaction between the effects of race and education on income does **not** mean that the values of education and race themselves are related or affect each other.
* Interaction between the effects of race and education on income means that the relationship between education and income depends on the value of race. I.e. for each fixed value of race the slope of the line relating education and income may have a different value.
* Often we just refer to this as "interaction between education and race" when it really should read "interaction between the effects of race and education on income".

## Interaction

* We now expand the regression model to include interaction between $x$ and each of the dummies $z_1$ and $z_2$: $$E(y|x,z) = \alpha + \beta x + \beta_1 z_1 + \beta_2 z_2 + \beta_3 z_1x + \beta_4 z_2x.$$
* This yields a regression line for each race:
* `w` ($z_1=0, z_2=0$): $E(y|x) = \alpha + \beta x$
* `b` ($z_1=1, z_2=0$): $E(y|x) = \alpha + \beta_1 + (\beta + \beta_3) x$.
* `h` ($z_1=0,z_2=1$): $E(y|x) = \alpha + \beta_2 + (\beta + \beta_4) x$.
* $\beta_1$ is **the difference** in `Intercept` between black and white, while $\beta_3$ is **the difference** in `slope` between black and white.
* $\beta_2$ is **the difference** in `Intercept` between Hispanic and white, while $\beta_4$ is **the difference** in `slope` between Hispanic and white.

## Example: Prediction equations

* When we use `*` in the model formula we include interaction between `educ` and `race`:

```{r}
model2 <- lm(inc ~ educ * race, data = Income)
summary(model2)
```

* Reference/baseline group (white): $$\widehat y = -25.87 + 5.21 x$$
* Black: $$\widehat y = -25.87 + 19.33 + (5.21 - 2.41)x = -6.54 + 2.80 x $$
* Hispanic: $$\widehat y = -25.87 + 9.26 + (5.21 - 1.12)x = -16.60 + 4.09 x $$
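
* The group-wise slopes can be read off programmatically from the coefficient vector; a minimal sketch:

```{r}
b <- coef(model2)
# Slope for each group: reference slope plus the relevant interaction term
c(w = unname(b["educ"]),
  b = unname(b["educ"] + b["educ:raceb"]),
  h = unname(b["educ"] + b["educ:raceh"]))
```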

## Example: Individual tests

```{r}
summary(model2)
```

* **The difference** in slope between `b` and `w` (`educ:raceb`) is estimated to be $\widehat\beta_3=-2.4107$. With p-value=$9.33\%$ there is no significant difference.
* Furthermore, there isn't any significant difference in slope between `h` and `w`. In other words, there is probably no interaction between `educ` and `race`.

```{r}
gf_point(inc ~ educ, color = ~ race, data = Income) %>% gf_lm()
```

# Test for no interaction 
## Test for no interaction 

```{r two-column}
plotModel(model1)
plotModel(model2)
```

```{r}
summary(model1)$r.squared
summary(model2)$r.squared
```

* Is `model2` significantly better than `model1`? I.e. is $R^2$ significantly higher for `model2`?

## Hypothesis and test statistic

* The simpler `model1` is obtained from the more complicated `model2` by setting $\beta_3=0$ and $\beta_4=0$, so the null hypothesis "the simpler additive model describes data sufficiently well compared to the complicated interaction model" is really the simple mathematical hypothesis: $$ H_0: \beta_3=0, \beta_4=0.$$
* We will look at the difference between $R^2$ for the two models, but as before (for multiple linear regression) we have to convert this to an $F$ statistic which we can then calculate a $p$-value for.
* Formula for $F_{obs}$ (no need to learn this by heart):
$$
F_{obs} = \frac{(R_2^2-R_1^2)/(\text{df}_1-\text{df}_2)}{(1-R_2^2)/\text{df}_2}
$$
where $\text{df}_1$ and $\text{df}_2$ are $n$ minus the number of model parameters for the two models (i.e. $80-4=76$ and $80-6=74$ in our case).
* The formula for $F_{obs}$ can be rewritten in terms of sums of squared errors (SSE) for each model (no need to memorize it):
$$F_{obs} = \frac{(SSE_1 - SSE_2)/(\text{df}_1 - \text{df}_2)}{(SSE_2)/\text{df}_2}.$$
* In the literature SSE is sometimes denoted RSS for **Residual Sum of Squares**; i.e. SSE = RSS.
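
* As a sanity check before letting R do the work, $F_{obs}$ and its p-value can be computed by hand from the two $R^2$ values (a sketch, using $n=80$ as above):

```{r}
r1 <- summary(model1)$r.squared
r2 <- summary(model2)$r.squared
df1 <- 80 - 4  # n minus the number of parameters in model1
df2 <- 80 - 6  # n minus the number of parameters in model2
Fobs <- ((r2 - r1) / (df1 - df2)) / ((1 - r2) / df2)
Fobs
1 - pf(Fobs, df1 - df2, df2)  # p-value from the F distribution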

## Test for no interaction in R

* In R the calculations are done using `anova`:

```{r}
anova(model1, model2)
```

* The F-test for dropping the interaction `educ:race` has F-value=$1.465$, which is far from significant with p-value=$23.77\%$.

# Hierarchy of models
## Hierarchy of models

* `Interaction:` The most general model with main effects `educ` and `race` and interaction `educ:race`:

```{r}
Interaction <- lm(inc ~ educ * race, data = Income)
```

* `MainEffects:` The model where there are additive effects of `educ` and `race`.

```{r}
MainEffects <- lm(inc ~ educ + race, data = Income)
```

* `educEff:` Model where there is only an effect of `educ` (simple linear regression).

```{r}
educEff <- lm(inc ~ educ, data = Income)
```

* `raceEff:` Model where there is only an effect of `race` (a different mean for each group -- more on this in the ANOVA lecture).

```{r}
raceEff <- lm(inc ~ race, data = Income)
```

* Corresponding to Agresti Table 13.10, we can make F-tests for 3 pairwise comparisons of these models.

## Example

* Comparing `MainEffects` and `Interaction` is what we have already done.

```{r}
anova(MainEffects, Interaction)
```

* We recognize $F=1.465$ with p-value=$23.77\%$, i.e. the interaction model isn't significantly better than the additive model. So no `educ:race` interaction.

* In the same manner we can compare `educEff` and `MainEffects`. I.e. we investigate whether the effect of `race` can be left out.

```{r}
anova(educEff, MainEffects)
```

* The effect of `race`, if any, is weak (p-value=$5.292\%$).

* Finally, we compare `raceEff` and `MainEffects`. Clearly `educ` cannot be left out (p-value=$4.422\times 10^{-10}$).

```{r}
anova(raceEff, MainEffects)
```

## Example

* The methods generalize to models with more than 2 predictors.
* We return to the dataset `Ericksen`, where we study the response `crime`:
```{r}
Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
model <- lm(crime ~ city * highschool + city * poverty, data = Ericksen)
```
* The variables are:
    + `crime`: Quantitative variable
    + `city`: `city` or `state`
    + `highschool`: Quantitative variable
    + `poverty`: Quantitative variable
* The model has 3 predictors with main effects and includes
    + interaction between `city` and `highschool`
    + interaction between `city` and `poverty`.

```{r}
summary(model)
```

* There is no significant interaction between `city` and `highschool` (p-value=$34.1523\%$).
* I.e. the effect of `highschool` on crime is the same in metropolitan areas (`city=city`) and non-metropolitan areas (`city=state`).
* There is clear interaction between `city` and `poverty` (p-value=$0.2773\%$).
* I.e. the effect of `poverty` on crime differs between metropolitan and non-metropolitan areas.
* For `city=state` the effect of `poverty` on crime is smaller than in the major cities.
* Hence, poverty has a larger effect on crime in the major cities than outside them.
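
* The two `poverty` slopes can be extracted from the fitted model. A sketch (the coefficient names assume `city` is the reference level, R's alphabetical default):

```{r}
b <- coef(model)
# Effect of poverty on crime in metropolitan (city) and
# non-metropolitan (state) areas
c(city = unname(b["poverty"]),
  state = unname(b["poverty"] + b["citystate:poverty"]))
```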


## Multicollinearity and variance inflation factors

* Ideally the predictors in linear regression should be **uncorrelated**, which is almost never the case.
* The consequence of two predictors being correlated (**collinear**) is that the uncertainty of the parameter estimates increases: the squared standard error is inflated by a factor commonly called the variance inflation factor (VIF).
* If multiple pairs of predictors are collinear, we say that the model suffers from **multicollinearity**. 
* If we have a model with $p$ predictors, then the VIF of $x_j$ is: 
$$\text{VIF}_j = \frac{1}{1 - R^2_j},$$ 
where $R_j^2$ is the multiple $R^2$ value of a model using $x_j$ as a response and the remaining $p-1$ predictors as explanatory variables.
* The larger $\text{VIF}_j$ is, the stronger the collinearity between $x_j$ and the remaining predictors.
* **Rule of thumb:** If a $\text{VIF}$ is larger than 10 the collinearity is too high.
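
* As a sketch, the VIF of `highschool` can be computed directly from this definition, here using an additive version of the Ericksen model (ready-made alternatives exist, e.g. `vif()` from the `car` package):

```{r}
# R^2 from regressing highschool on the remaining predictors,
# then VIF from the formula above
aux <- lm(highschool ~ city + poverty, data = Ericksen)
1 / (1 - summary(aux)$r.squared)
```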
