---
title: "Analysis of Variance"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_sections: yes
    toc: yes  
---

```{r, include = FALSE}
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "gridExtra"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# One way analysis of variance
## Example 

* The data set `chickwts` is available in `R`, and on the course webpage.
* 71 newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. 
* Their weights in grams after six weeks are given along with feed types, i.e. we have a sample with corresponding measurements of 2 variables:
    + `weight`: a numeric variable giving the chick weight.
    + `feed`: a factor giving the feed type.
* Always start with some graphics:

```{r, warning=FALSE, message=FALSE}
library(mosaic)
gf_boxplot(weight ~ feed, data = chickwts)
```

## The ANOVA Model

* We measure the response $y$ which in this case is `weight`.
* We want to study the effect of the factor $x$ on $y$. In this case $x=$`feed`, which divides the sample into $g=6$ groups.
* The mean responses within the groups are denoted $\mu_1,\mu_2,\ldots,\mu_g$.
* We will assume that
    + $y=\mu_x+\epsilon$, when $y$ is a response in group $x$
    + the errors $\epsilon$ are a sample from a population with mean zero and standard deviation $\sigma$.
    + The standard deviation of the population in each group is the same and equals $\sigma$.
    + The response variable, $y$, is normally distributed within each group.
* The ANOVA test is a _test of equal means_ for the different groups.
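* The common-$\sigma$ assumption can be checked informally by comparing the group-wise standard deviations (a quick base-`R` diagnostic, not part of the formal test):

```{r}
# Group-wise standard deviations of weight; roughly similar values
# support the assumption of a common sigma across groups
aggregate(weight ~ feed, data = chickwts, FUN = sd)
```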

# Estimation of mean values
## Estimates

* The least squares estimate $\widehat\mu_x$ of the population mean in group $x$ is given by the average of the response measurements in group $x$.
* For a given measured response $y$ we let $\widehat y$ denote the model's prediction of $y$, i.e. $$\widehat y = \widehat\mu_x$$ if $y$ is a response for an observation in group $x$.
* We use `mean` to find the average weight for each group:

```{r}
mean(weight ~ feed, data = chickwts)
```

* We can e.g. see that $\widehat y=323.6$, when `feed=casein` but $\widehat y=160.2$, when `feed=horsebean`.
* Is this a significant difference?
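* As a check on the definition of $\widehat y$: the fitted values from `lm` coincide with the group averages (a small base-`R` verification):

```{r}
# Fit the one-way ANOVA model; each fitted value is its group's average
fit <- lm(weight ~ feed, data = chickwts)
group_means <- tapply(chickwts$weight, chickwts$feed, mean)
# TRUE: the model's predictions equal the group means
all.equal(unname(fitted(fit)), unname(group_means[chickwts$feed]))
```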

## Contrast coding

* In many cases there is a group corresponding to "no treatment" and we are interested in the effect of different treatments.
* In this example we only have different feed types; `R` sorts the levels of `feed` lexicographically, so `casein` becomes the reference.
* We can specify the model via:
    + `Intercept` corresponding to the mean response for the reference (`casein`).
    + For each of the other groups we have a **contrast**, which measures **the difference** between the mean value for that group and the reference group.
* For a given contrast we can calculate standard error, t-score and p-value, and thereby investigate whether there is a difference between this group and the reference group.
* In Agresti this is referred to as using **dummy variables**.
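* The dummy variables `R` creates behind the scenes can be inspected with `model.matrix`:

```{r}
# Design matrix: an intercept column plus one 0/1 dummy column per
# non-reference feed level (casein is the reference)
X <- model.matrix(weight ~ feed, data = chickwts)
head(X, 3)
```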


## Example

```{r}
model <- lm(weight ~ feed, data = chickwts)
summary(model)
```

* We get information about contrasts and their significance:
* The `Intercept`, i.e. the mean `weight` for `casein`, is significantly different from zero ($p < 2\times 10^{-16}$) (of course, chickens grow a lot over 6 weeks).
* The weight difference between `casein` and `horsebean` is highly significant ($p = 2\times 10^{-9}$).
* There is no significant weight difference between `casein` and `sunflower` ($p = 81\%$).
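* Alongside the $p$-values, confidence intervals for the intercept and the contrasts are obtained with `confint`:

```{r}
model <- lm(weight ~ feed, data = chickwts)
# 95% confidence intervals; an interval containing 0 means no
# significant difference from the reference group (casein)
confint(model)
```

* Note that the interval for `feedsunflower` contains zero, in line with the large $p$-value above.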

# Overall test for effect
## Graphical representation of models

* We have two alternative explanations of the data.
* Simple model with one parameter (mean): "The feed type doesn't matter. The weight is just random around a common mean value".
* Complex model with six parameters (means): "The feed type is important. For each feed type we get a different mean value and the weights are random around these values."

```{r fig.width=10, echo = FALSE, message = FALSE}
library(gridExtra, quietly = TRUE)
red_dot <- mean(chickwts$weight)
red_dots <- aggregate(weight ~ feed, data = chickwts, mean)
p1 <- gf_point("" ~ weight, data = chickwts) %>% gf_labs(y = "common mean value") + geom_point(aes(x=red_dot), color="red") 
p2 <- gf_point(feed ~ weight, data=chickwts) + geom_point(aes(x=weight, y = feed), col = "red", data = red_dots)
grid.arrange(p1, p2, ncol = 2)
```

## Hypotheses and test statistic

* Is the complex model significantly better (i.e. is there any effect of the explanatory grouping variable)? We can write the corresponding hypotheses in two different ways $$H_0: \mu_1 = \mu_2 = \dots=\mu_g \quad \mbox{against} \quad  H_a: \mbox{ At least 2 of the population means are different}$$
* Alternatively $$H_0: \mbox{ All contrasts are equal to zero. } \quad H_a: \mbox{ At least one contrast is non-zero}.$$
* We will (indirectly) use $R^2$ to do the test. If it is large, the complex model has good predictive power compared to the simple model. To judge significance we use $$F_{obs} = \frac{(n-g)R^2}{(g-1)(1-R^2)} = \frac{(TSS-SSE)/(g-1)}{SSE/(n-g)}.$$
* Large values of $R^2$ implies large values of $F_{obs}$, which points to the alternative hypothesis.
* I.e. once we have calculated the observed value $F_{obs}$, we find the $p$-value as the probability that a new experiment would result in a larger value.
* TSS: the error sum of squares under the simple model (common mean). SSE: the error sum of squares under the complex model (different means).
* TSS-SSE: how much the error sum of squares increases when the means are restricted to be equal.
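* The equality of the two expressions for $F_{obs}$ can be verified numerically on the `chickwts` data:

```{r}
fit <- lm(weight ~ feed, data = chickwts)
n <- nrow(chickwts)
g <- nlevels(chickwts$feed)
TSS <- sum((chickwts$weight - mean(chickwts$weight))^2)  # common mean
SSE <- sum(residuals(fit)^2)                             # group means
R2  <- (TSS - SSE) / TSS
F1  <- ((TSS - SSE) / (g - 1)) / (SSE / (n - g))
F2  <- (n - g) * R2 / ((g - 1) * (1 - R2))
c(F1 = F1, F2 = F2)  # identical, and equal to the F statistic in summary(fit)
```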

## Interpretation of $F$ statistic - Variance between/within groups

* It can be shown that the numerator of $F_{obs}$ is a measure of **the variance between the groups**, i.e. how much the group means (the "boxes") vary around the total average (the red line).

* Likewise, it can be shown that the denominator of $F_{obs}$ is a measure of **the variance within the groups**, i.e. how "tall" the boxes in the boxplot are.

```{r chickwts-boxplot-line,echo=FALSE,fig.width=7,fig.height=6}
gf_boxplot(weight ~ feed, data = chickwts) %>%   gf_abline(color="red", slope = 0, intercept = mean(chickwts$weight))
```

<!-- * If the boxes' deviations from the red line are to be explained by randomness, then the two types of variances should be of same magnitude. This is measured by the F-test statistic, which can be stated as $$F_{obs} = \frac{\mbox{variance between groups}}{\mbox{variance within groups}}$$ -->

* The bigger the deviations between the red line and the box means are, relative to the variation within the boxes, the less we trust $H_0$. This is measured by the $F$-test statistic, which can be stated as $$F_{obs} = \frac{\mbox{variance between groups}}{\mbox{variance within groups}}$$

## Example

```{r}
model <- lm(weight ~ feed, data = chickwts)
summary(model)
```

* The last line gives us the value of $F_{obs} = 15.36$  and the corresponding $p$-value ($5.9 \times 10^{-10}$). Clearly there is a significant difference between the types of `feed`.

# Two way analysis of variance
## Additive effects

* The data set `ToothGrowth` is available in **R** and on the webpage. For more info about this data, use `?ToothGrowth`.
* The data describe the tooth length of guinea pigs, where some received vitamin C and others orange juice, each at three different dosages.

```{r, echo = FALSE}
data("ToothGrowth")
ToothGrowth$dose <- factor(ToothGrowth$dose,
                           levels = c(0.5, 1, 2),
                           labels = c("LO", "ME", "HI"))
```

* A total of $60$ observations on 3 variables.
    + `len`: the tooth length.
    + `supp`: the type of supplement (`OJ` or `VC`).
    + `dose`: the dosage (`LO`, `ME`, `HI`).
* We will study the response `len` with the predictors `supp` and `dose`.
* At first we look at the model with additive effects
    + `len`=$\mu$ + `"effect of supp"`+ `"effect of dose"` + `error`
* This is also called the main effects model since it does not contain interaction terms.
* The parameter $\mu$ corresponds to the `Intercept` and is the mean tooth length in the reference group (supp `OJ`, dose `LO`).
* The effect of `supp` is the difference in mean when changing from `OJ` to `VC`.
* The effect of `dose` is the difference in mean when changing from `LO` to either `ME` or `HI`.

## Dummy coding

* Let us introduce dummy variables:
    + $s_C=1$ if `supp` is `VC` and zero otherwise.
    + $d_M=1$ if dose is `ME` and zero otherwise.
    + $d_H=1$ if dose is `HI` and zero otherwise.
* Then we state the model $$\mbox{length}=\mu+\beta_1 s_C+\beta_2 d_M+\beta_3 d_H + \mbox{error} .$$
* Interpretation:
    + $\mu$ is the expected tooth length when `supp` is `OJ` and `dose` is `LO` ($s_C=d_M=d_H=0$).
    + $\beta_1$ is the effect of changing the supplement from `OJ` to `VC` ($s_C=1$).
    + $\beta_2$ is the effect of increasing the dosage from `LO` to `ME` ($d_M=1$).
    + $\beta_3$ is the effect of increasing the dosage from `LO` to `HI` ($d_H=1$).
    
* As a two-way table:

$$ \begin{array}{cccc}
   & LO & ME & HI \\
OJ & \mu & \mu+\beta_2 & \mu+ \beta_3\\
VC & \mu +\beta_1 & \mu+\beta_1 + \beta_2 & \mu+ \beta_1 + \beta_3\\
\end{array}
$$

## Main effect model in **R**

* The main effects model is fitted by
```{r}
MainEff <- lm(len ~ supp + dose, data = ToothGrowth)
summary(MainEff)
```
* The model has 4 parameters.
* The $F$ test at the end compares with the (null) model with only one overall mean parameter. <!-- Does it seem like `supp` and `dose` has an additive effect? -->
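* The additive structure can be checked numerically: over the six `supp` $\times$ `dose` cells the difference between the two `supp` rows is constant, matching the two-way table above (a self-contained chunk that reloads and recodes the data):

```{r}
# Fitted cell means from the main-effects model, laid out as supp x dose
data("ToothGrowth")
ToothGrowth$dose <- factor(ToothGrowth$dose, levels = c(0.5, 1, 2),
                           labels = c("LO", "ME", "HI"))
MainEff <- lm(len ~ supp + dose, data = ToothGrowth)
grid <- expand.grid(supp = c("OJ", "VC"), dose = c("LO", "ME", "HI"))
grid$fit <- predict(MainEff, grid)
xtabs(fit ~ supp + dose, data = grid)
```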

## Testing effect of supp

* Alternative model without effect of supp:
```{r size="small"}
doseEff <- lm(len ~ dose, data = ToothGrowth)
```

* We can compare $R^2$ values to see if `doseEff` (Model 1) is sufficient to explain the data compared to `MainEff` (Model 2). This is done by converting to an $F$-statistic:
$$
F_{obs} = \frac{(R_2^2 - R_1^2)/(df_1 - df_2)}{(1 - R_2^2)/df_2} = \frac{(SSE_1 - SSE_2)/(df_1 - df_2)}{(SSE_2)/df_2}.
$$
* $SSE_1-SSE_2$: the increase in the error sum of squares when using Model 1 instead of Model 2.
* In **R** the calculations are done using `anova`:
```{r size="small"}
anova(doseEff, MainEff)
```

* The $p$-value is about $0.0004$, so we reject the hypothesis that `supp` has no effect. Thus we prefer Model 2.
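* The `anova` result can be reproduced by hand from the two models' residual sums of squares (for an `lm` fit, `deviance` returns the SSE; the chunk reloads the data so it is self-contained):

```{r}
data("ToothGrowth")
ToothGrowth$dose <- factor(ToothGrowth$dose, levels = c(0.5, 1, 2),
                           labels = c("LO", "ME", "HI"))
doseEff <- lm(len ~ dose, data = ToothGrowth)          # Model 1
MainEff <- lm(len ~ supp + dose, data = ToothGrowth)   # Model 2
SSE1 <- deviance(doseEff); df1 <- df.residual(doseEff)
SSE2 <- deviance(MainEff); df2 <- df.residual(MainEff)
Fobs <- ((SSE1 - SSE2) / (df1 - df2)) / (SSE2 / df2)
c(F = Fobs, p = pf(Fobs, df1 - df2, df2, lower.tail = FALSE))
```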

## Testing effect of dose

* Alternative model without effect of dose:
```{r size="small"}
suppEff <- lm(len ~ supp, data = ToothGrowth)
anova(suppEff, MainEff)
```

* The $p$-value is $\approx 0$, so we reject the hypothesis that `dose` has no effect. Thus we prefer Model 2.

# Interaction
## Example

* We will extend the model by introducing an interaction between `supp` and `dose`.


* Interaction plot:

```{r}
with(ToothGrowth, interaction.plot(dose, supp, len, col = 2:3))
```

* For each of the supplement types we plot the average tooth length as a function of dosage.
* If the main effects model is correct then the difference between supplements is the same for all levels of dosage, i.e. the curves should be parallel - except for noise.
* This does not seem to be the case.

* This is how the plot _should_ look _if_ the main effects model (no interaction) is correct:

```{r, echo = FALSE, message = FALSE}
new_data <- expand.grid(supp = c("OJ", "VC"),
                        dose = c("LO", "ME", "HI"))
new_data[] <- lapply(new_data, function(x) factor(x))
new_data$len_mean <- predict(MainEff, new_data)
xyplot(len_mean ~ dose, data = new_data, groups = supp,
       ylab = "mean tooth length",
       type = "l",
       col = 2:3,
       lty = c(2, 1),
              key = list(
                title = "supp",
                x = .85, y = 1,
                text = list(c("OJ", "VC")),
                lines = list(lty = c(2, 1), type = "l", col = 2:3)))
```

* Parallel lines mean that the effect of supplement does not depend on dose!

## Dummy coding

* The extended model can be formulated as
$$
\mathtt{length} = \mu+\beta_1 s_C+\beta_2 d_M+\beta_3 d_H+ \beta_4 s_C d_M+\beta_5 s_C d_H+\mathtt{error}
$$
* Interpretation:
    + $\mu$ is the expected tooth length for `supp` `OJ` and `dose` `LO` ($s_C=d_M=d_H=0$).
    + $\beta_1$ is the effect of changing `supp` from `OJ` to `VC` when `dose` is `LO` ($s_C=1,d_M=d_H=0$).
    + $\beta_2$ is the effect of increasing `dose` from `LO` to `ME` when `supp` is `OJ` ($s_C=0,d_M=1$).
    + $\beta_3$ is the effect of increasing `dose` from `LO` to `HI` when `supp` is `OJ` ($s_C=0,d_H=1$).
    + $\beta_4$ is the additional effect of both changing `supp` from `OJ` to `VC` and increasing `dose` from `LO` to `ME` ($s_C=1,d_M=1$).
    + $\beta_5$ is the additional effect of both changing `supp` from `OJ` to `VC` and increasing `dose` from `LO` to `HI` ($s_C=1,d_H=1$).
    
    
* As a two-way table:

$$ \begin{array}{cccc}
   & LO & ME & HI \\
OJ & \mu & \mu+\beta_2 & \mu+ \beta_3\\
VC & \mu +\beta_1 & \mu+\beta_1 + \beta_2 +\beta_4 & \mu+ \beta_1 + \beta_3 + \beta_5\\
\end{array}
$$

* Further examples:
    + effect of changing from `supp` `OJ` to `VC` if `dose` is `LO` is $\mu+\beta_1-\mu=\beta_1$
    + effect of changing from `supp` `OJ` to `VC` if `dose` is `ME` is $\mu+\beta_1+\beta_2+\beta_4- \mu-\beta_2=\beta_1+\beta_4$
    + effect of changing from `supp` `OJ` to `VC` if `dose` is `HI` is $\mu+\beta_1+\beta_3+\beta_5-\mu-\beta_3=\beta_1+\beta_5$
    + if $\beta_4=0$ and $\beta_5=0$ the effect of changing from `OJ` to `VC` does not depend on `dose`
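* As a numerical check of this table: the interaction model has one free parameter per cell, so its fitted cell means reproduce the raw group averages exactly (a self-contained chunk that reloads the data):

```{r}
data("ToothGrowth")
ToothGrowth$dose <- factor(ToothGrowth$dose, levels = c(0.5, 1, 2),
                           labels = c("LO", "ME", "HI"))
Interaction <- lm(len ~ supp * dose, data = ToothGrowth)
grid <- expand.grid(supp = c("OJ", "VC"), dose = c("LO", "ME", "HI"))
grid$fit <- predict(Interaction, grid)
fitted_means <- xtabs(fit ~ supp + dose, data = grid)
raw_means <- tapply(ToothGrowth$len,
                    list(ToothGrowth$supp, ToothGrowth$dose), mean)
max(abs(fitted_means - raw_means))  # essentially zero
```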

## Example

* We fit the interaction model by changing `+` to `*` in the model formula from before:
```{r}
Interaction <- lm(len ~ supp*dose, data = ToothGrowth)
```

* Now we can think of an experiment with 6 groups corresponding to each combination of the predictors.

<!-- * Looking at the group averages it looks like, the supplement types behave quite differently depending on `dose`: -->

<!-- ```{r} -->
<!-- mean(len ~ supp + dose, data = ToothGrowth) -->
<!-- ``` -->
<!-- * But is that significant? -->

* Is the added interaction significant? We compare the main effects model with the more complex interaction model using `anova`:

```{r}
anova(MainEff, Interaction)
```

* With a $p$-value of `r 100*anova(MainEff, Interaction)[6][2,1]`%, there is a significant interaction `supp:dose`, i.e. the lack of parallel curves in the interaction plot is significant.

```{r}
summary(Interaction)
```

<!-- * The additional effect of both changing from `supp` `OJ` to `VC` and increasing `dose` from `LO` to `HI` ($\beta_5$=`suppVC:doseHI`) is highly significant. -->
* Note that the negative effect of changing from `OJ` to `VC` at low `dose` ($\beta_1$) is cancelled by the positive interaction parameter ($\beta_5$ = `suppVC:doseHI`), so there is almost no difference between `OJ` and `VC` when `dose` is high (compare with the interaction plot).
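* The cancellation can be read off the coefficients directly: $\beta_1+\beta_5$ is the estimated `OJ`-to-`VC` difference when `dose` is `HI` (self-contained chunk, reloading the data):

```{r}
data("ToothGrowth")
ToothGrowth$dose <- factor(ToothGrowth$dose, levels = c(0.5, 1, 2),
                           labels = c("LO", "ME", "HI"))
Interaction <- lm(len ~ supp * dose, data = ToothGrowth)
b <- coef(Interaction)
# beta_1 + beta_5: close to zero, i.e. almost no OJ/VC difference at HI
unname(b["suppVC"] + b["suppVC:doseHI"])
```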

## Hierarchical principle
* In the presence of an interaction effect it does not make sense to test for the absence of main effects! Each factor does have an effect; it just varies depending on the level of the other factor.
* Hence, start by investigating whether there is an interaction effect.
* If yes: no further tests!
* If no: you may test the main effects if relevant for your study.
