---
title: "Analysis of Variance"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_sections: yes
    toc: yes  
---

```{r, include = FALSE}
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "gridExtra"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# One way analysis of variance
## Example 

* The data set `chickwts` is available in `R`, and on the course webpage.
* Newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. 
* Their weights in grams after six weeks are given along with feed types, i.e. we have a sample with corresponding measurements of 2 variables:
    + `weight`: a numeric variable giving the chick weight.
    + `feed`: a factor giving the feed type.
* Always start with some graphics:

```{r, warning=FALSE, message=FALSE}
library(mosaic)
gf_boxplot(weight ~ feed, data = chickwts)
```

## The ANOVA Model

* We measure the response $y$ which in this case is `weight`.
* We want to study the effect of the factor $x$ on $y$. In this case $x=$`feed` and divides the sample in $g=6$ groups.
* The mean responses within the groups are denoted $\mu_1,\mu_2,\ldots,\mu_g$.
* We will assume that
    + $y=\mu_x+\epsilon$, when $y$ is a response in group $x$
    + the error terms $\epsilon$ are a sample from a population with mean zero and standard deviation $\sigma$.
    + The standard deviation within each group is the same and equals $\sigma$.
    + The response variable, $y$, is normally distributed within each group.
* The ANOVA test is a _test of independence_ between the quantitative response variable and the qualitative explanatory variable.
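
The common-$\sigma$ assumption can be checked informally by comparing the group standard deviations; a base-R sketch (using `tapply` rather than the `mosaic` formula interface):

```{r}
# Group-wise standard deviations: these should be of roughly similar size
# if the assumption of a common sigma across groups is reasonable.
tapply(chickwts$weight, chickwts$feed, sd)
```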

# Estimation of mean values
## Estimates

* The least squares estimate $\widehat\mu_x$ of the population mean is given by the average of the response measurements in group $x$.
* For a given measured response $y$ we let $\widehat y$ denote the model's prediction of $y$, i.e. $$\widehat y = \widehat\mu_x$$ if $y$ is a response for an observation in group $x$.
* We use `mean` to find the mean for each group:

```{r}
mean(weight ~ feed, data = chickwts)
```

* We can e.g. see that $\widehat y=323.6$, when `feed=casein` but $\widehat y=160.2$, when `feed=horsebean`.
* Is this difference significant?
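
That $\widehat y$ equals the group average can be verified through the fitted values of the corresponding linear model (a base-R sketch):

```{r}
fit <- lm(weight ~ feed, data = chickwts)
# The fitted value is identical for every chick in the casein group ...
unique(fitted(fit)[chickwts$feed == "casein"])
# ... and agrees with the raw group average:
mean(chickwts$weight[chickwts$feed == "casein"])
```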

## Contrast coding

* In many cases there is a group corresponding to "no treatment" and we are interested in the effect of different treatments.
* In this example we only have different feed types, whose levels `R` sorts alphabetically, so `casein` becomes the reference.
* We can specify the model via:
    + `Intercept` corresponding to the mean response for the reference (`casein`).
    + For each of the other groups we have a **contrast**, which measures **the difference** between the mean value for that group and the reference group.
* For a given contrast we can calculate standard error, t-score and p-value, and thereby investigate whether there is a difference between this group and the reference group.
* In Agresti this is referred to as using **dummy variables**.
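
The dummy coding `R` uses can be inspected directly via the design matrix (first rows shown; the reference group `casein` is absorbed into the intercept):

```{r}
# Each non-reference feed gets its own 0/1 dummy column;
# rows from the casein group have all dummies equal to zero.
head(model.matrix(~ feed, data = chickwts))
```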

## Example

```{r}
model <- lm(weight ~ feed, data = chickwts)
summary(model)
```

* We get information about contrasts and their significance:
* `Intercept`, corresponding to `casein`, has a mean `weight` significantly different from zero ($p < 2\times 10^{-16}$).
* The weight difference between `casein` and `horsebean` is extremely significant ($p = 2\times 10^{-9}$).
* There is no significant weight difference between `casein` and `sunflower` ($p = 0.81$).

# Overall test for effect
## Graphical representation of models

* We have two alternative explanations of the data.
* Simple model with one parameter (mean): "The feed type doesn't matter. The weight is just random around a common mean value".
* Complex model with six parameters (means): "The feed type is important. For each feed type we get a different mean value and the weights are random around these values."

```{r fig.width=10, echo = FALSE, message = FALSE}
library(gridExtra, quietly = TRUE)
red_dot <- mean(chickwts$weight)
red_dots <- aggregate(weight ~ feed, data = chickwts, mean)
p1 <- gf_point("" ~ weight, data = chickwts) %>% gf_labs(y = "common mean value") + geom_point(aes(x=red_dot), color="red") 
p2 <- gf_point(feed ~ weight, data=chickwts) + geom_point(aes(x=weight, y = feed), col = "red", data = red_dots)
grid.arrange(p1, p2, ncol = 2)
```

## Hypotheses and test statistic

* Is the complex model significantly better (i.e. is there any effect of the explanatory grouping variable)? We can write the corresponding hypotheses in two different ways $$H_0: \mu_1 = \mu_2 = \ldots=\mu_g \quad \mbox{against} \quad  H_a: \mbox{ At least 2 of the population means are different}$$
* Alternatively $$H_0: \mbox{ All contrasts are equal to zero. } \quad H_a: \mbox{ At least one contrast is non-zero}.$$
* We will (indirectly) use $R^2$ to do the test. If it is large, the complex model has good predictive power compared to the simple model. To judge significance we use $$F_{obs} = \frac{(n-g)R^2}{(g-1)(1-R^2)} = \frac{(TSS-SSE)/(g-1)}{SSE/(n-g)}.$$
* Large values of $R^2$ imply large values of $F_{obs}$, which point to the alternative hypothesis.
* I.e. when we have calculated the observed value $F_{obs}$, then we have to find the probability that a new experiment would result in a larger value.
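
The formula can be checked by hand against `R`'s `summary` output (a base-R sketch):

```{r}
fit <- lm(weight ~ feed, data = chickwts)
r2 <- summary(fit)$r.squared
n <- nrow(chickwts); g <- nlevels(chickwts$feed)
F_obs <- (n - g) * r2 / ((g - 1) * (1 - r2))
F_obs                                    # matches F = 15.36 from summary(fit)
# p-value: probability of a larger value under the F(g-1, n-g) distribution
pf(F_obs, g - 1, n - g, lower.tail = FALSE)
```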

## Interpretation of $F$ statistic - Variance between/within groups

* It can be shown that the numerator of $F_{obs}$ is a measure of **the variance between the groups**, i.e. how much the "boxes" vary around the overall average (the red line).

* Likewise, it can be shown that the denominator of $F_{obs}$ is a measure of **the variance within the groups**, i.e. how "tall" the boxes in the boxplot are.

```{r chickwts-boxplot-line,echo=FALSE,fig.width=7,fig.height=6}
gf_boxplot(weight ~ feed, data = chickwts) %>%   gf_abline(color="red", slope = 0, intercept = mean(chickwts$weight))
```

* If the boxes' deviations from the red line are to be explained by randomness, then the two types of variance should be of the same magnitude. This is measured by the F-test statistic, which can be stated as $$F_{obs} = \frac{\mbox{variance between groups}}{\mbox{variance within groups}}$$
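
This decomposition can be verified numerically; a base-R sketch, where `ave` returns each observation's group mean:

```{r}
y <- chickwts$weight
g_means <- ave(y, chickwts$feed)                  # each observation's group mean
n <- length(y); g <- nlevels(chickwts$feed)
between <- sum((g_means - mean(y))^2) / (g - 1)   # variance between groups
within  <- sum((y - g_means)^2) / (n - g)         # variance within groups
between / within                                  # = F_obs
```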

## Example

```{r}
model <- lm(weight ~ feed, data = chickwts)
summary(model)
```

* The last line gives us the value of $F_{obs} = 15.36$  and the corresponding $p$-value ($5.9 \times 10^{-10}$). Clearly there is a significant difference between the types of `feed`.

# Two way analysis of variance
## Additive effects

* The data set `ToothGrowth` is available in **R** and on the webpage. For more info about this data, use `?ToothGrowth`.
* The data describes the tooth length of guinea pigs, where some receive vitamin C and others are given orange juice, at different dosages.

```{r, echo = FALSE}
data("ToothGrowth")
ToothGrowth$dose <- factor(ToothGrowth$dose,
                           levels = c(0.5, 1, 2),
                           labels = c("LO", "ME", "HI"))
```

* A total of $60$ observations on 3 variables.
    + `[,1] len` The tooth length
    + `[,2] supp` The type of the supplement (`OJ` or `VC`)
    + `[,3] dose` The dosage (`LO`, `ME`, `HI`)
* We will study the response `len` with the predictors `supp` and `dose`.
* At first we look at the model with additive effects
    + `len`=$\mu$ + `"effect of supp"`+ `"effect of dose"` + `error`
* This is also called the main effects model since it does not contain interaction terms.
* The parameter $\mu$ corresponds to the `Intercept` and is the mean tooth length in the reference group (supp `OJ`, dose `LO`).
* The effect of `supp` is the difference in mean when changing from `OJ` to `VC`.
* The effect of `dose` is the difference in mean when changing from `LO` to either `ME` or `HI`.

## Dummy coding

* Let us introduce dummy variables:
    + $s_C=1$ if supp `VC` and zero otherwise.
    + $d_M=1$ if dose is `ME` and zero otherwise.
    + $d_H=1$ if dose is `HI` and zero otherwise.
* Then we state the model $$\mbox{length}=\mu+\beta_1 s_C+\beta_2 d_M+\beta_3 d_H + \mbox{error} .$$
* Interpretation:
    + $\mu$ is the expected tooth length when supp is `OJ` and `dose` is `LO` ($s_C=d_M=d_H=0$).
    + $\beta_1$ is the effect of changing supplement from `OJ` to `VC` ($s_C=1$).
    + $\beta_2$ is the effect of increasing dosage from `LO` to `ME` ($d_M=1$).
    + $\beta_3$ is the effect of increasing dosage from `LO` to `HI` ($d_H=1$).
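
The three dummy variables correspond to columns of the design matrix (here with the `dose` factor recoded to `LO`/`ME`/`HI` as above):

```{r}
# Columns: intercept, s_C (suppVC), d_M (doseME), d_H (doseHI)
head(model.matrix(~ supp + dose, data = ToothGrowth))
```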

## Main effect model in **R**

* The main effects model is fitted by
```{r}
MainEff <- lm(len ~ supp + dose, data = ToothGrowth)
summary(MainEff)
```
* The model has 4 parameters.
* The $F$ test at the end compares with the (null) model with only one overall mean parameter. Does it seem like `supp` and `dose` have an effect?

## Testing effect of supp

* Alternative model without effect of supp:
```{r size="small"}
doseEff <- lm(len ~ dose, data = ToothGrowth)
```

* We can compare $R^2$ to see if `doseEff` (Model 1) is sufficient to explain the data compared to `MainEff` (Model 2). This is done by converting to an $F$-statistic:
$$
F_{obs} = \frac{(R_2^2 - R_1^2)/(df_1 - df_2)}{(1 - R_2^2)/df_2} = \frac{(SSE_1 - SSE_2)/(df_1 - df_2)}{(SSE_2)/df_2}.
$$
* In **R** the calculations are done using `anova`:
```{r size="small"}
anova(doseEff, MainEff)
```

## Testing effect of dose

* Alternative model without effect of dose:
```{r size="small"}
suppEff <- lm(len ~ supp, data = ToothGrowth)
anova(suppEff, MainEff)
```

# Interaction
## Example

* We will extend the model by introducing an interaction between `supp` and `dose`.
* A graphical check for no interaction in the main effects model:

```{r, echo = FALSE, message = FALSE}
new_data <- expand.grid(supp = c("OJ", "VC"),
                        dose = c("LO", "ME", "HI"))
new_data[] <- lapply(new_data, function(x) factor(x))
new_data$len_mean <- predict(MainEff, new_data)
xyplot(len_mean ~ dose, data = new_data, groups = supp,
       ylab = "mean tooth length",
       type = "l",
       col = 2:3,
       lty = c(2, 1),
              key = list(
                title = "supp",
                x = .85, y = 1,
                text = list(c("OJ", "VC")),
                lines = list(lty = c(2, 1), type = "l", col = 2:3)))
```

* Interaction plot:

```{r}
with(ToothGrowth, interaction.plot(dose, supp, len, col = 2:3))
```

* For each of the supplement types we plot the average tooth length as a function of dosage.
* If the main effects model is correct then the difference between supplements is the same for all levels of dosage, i.e. the curves should be parallel - except for noise.
* This does not seem to be the case.

## Dummy coding

* The extended model can be formulated as
$$
\mathtt{length} = \mu+\beta_1 s_C+\beta_2 d_M+\beta_3 d_H+ \beta_4 s_C d_M+\beta_5 s_C d_H+\mathtt{error}
$$
* Interpretation:
    + $\mu$ is the expected tooth length for `supp` `OJ` and `dose` `LO` ($s_C=d_M=d_H=0$).
    + $\beta_1$ is the effect of changing from `supp` `OJ` to `VC`, `dose` is `LO` ($s_C=1,d_M=d_H=0$).
    + $\beta_2$ is the effect of increasing `dose` from `LO` to `ME`, when `supp` is `OJ` ($s_C=0,d_M=1$).
    + $\beta_3$ is the effect of increasing `dose` from `LO` to `HI`, when `supp` is `OJ` ($s_C=0,d_H=1$).
    + $\beta_4$ is an additional effect of both changing from `supp` `OJ` to `VC` and increasing `dose` from `LO` to `ME` ($s_C=1,d_M=1$)
    + $\beta_5$ is an additional effect of both changing from `supp` `OJ` to `VC` and increasing `dose` from `LO` to `HI` ($s_C=1,d_H=1$)
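
In the design matrix, the two interaction dummies are literally the products $s_C d_M$ and $s_C d_H$ of the main-effect columns (again assuming the recoded `dose` factor):

```{r}
# Two extra columns compared to the main effects model:
# suppVC:doseME = s_C * d_M and suppVC:doseHI = s_C * d_H.
colnames(model.matrix(~ supp * dose, data = ToothGrowth))
```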

## Example

* We fit the interaction model by changing plus to multiply in the model expression from before:
```{r}
Interaction <- lm(len ~ supp*dose, data = ToothGrowth)
```

* Now we can think of an experiment with 6 groups corresponding to each combination of the predictors.
* Looking at the group averages, it looks like the supplement types behave quite differently depending on `dose`:

```{r}
mean(len ~ supp + dose, data = ToothGrowth)
```
* But is that significant?

```{r}
anova(MainEff, Interaction)
```

* With a p-value of `r 100*anova(MainEff, Interaction)[2, 6]`% there is a significant interaction `supp:dose`, i.e. the lack of parallel curves in the interaction plot is significant.

```{r}
summary(Interaction)
```

* The additional effect of both changing from `supp` `OJ` to `VC` and increasing `dose` from `LO` to `HI` ($\beta_5$=`suppVC:doseHI`) is highly significant.
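
As a check on the interpretation, the fitted mean for `supp` `VC` and `dose` `HI` is $\mu+\beta_1+\beta_3+\beta_5$, which in the full interaction model equals the raw group average:

```{r}
b <- coef(Interaction)
# mu + beta1 (suppVC) + beta3 (doseHI) + beta5 (suppVC:doseHI) ...
b["(Intercept)"] + b["suppVC"] + b["doseHI"] + b["suppVC:doseHI"]
# ... agrees with the average for that combination of the predictors:
with(ToothGrowth, mean(len[supp == "VC" & dose == "HI"]))
```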
