---
title: "Log-linear models"
author: The ASTA team
output:
  slidy_presentation:
    highlight: tango
    theme: cerulean
    fig_caption: false
  pdf_document:
    toc: yes
    highlight: tango
    number_section: true
    fig_caption: false
---

```{r, include = FALSE}
## Remember to add all packages used in the code below!
# missing_pkgs <- setdiff(c("igraph"), rownames(installed.packages()))
# if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# Log-linear models
## Model specification

* We shall consider the data set `living`, which is a Danish survey on standard of living. The variables are
    + `B` (Housing, Bolig): bad/acceptable/good
    + `H` (Health, Helbred): bad/good
    + `I` (Isolated, Isoleret): yes/no
    + `A` (Anxiety, Angst): yes/no
    + `N`: The number of respondents for each combination of the 4 factors above
* We want to order the factor levels such that the reference level is the "negative" one:
```{r}
living <- read.delim("https://asta.math.aau.dk/datasets?file=living.txt", stringsAsFactors = TRUE)
```
```{r}
living$B <- relevel(living$B, "Bad")
living$I <- relevel(living$I, "Yes")
living$A <- relevel(living$A, "Yes")
```
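
* We can check the ordering by listing the levels of each factor; the first level shown is the reference:
```{r}
# The first level of each factor is the reference level in the models below
lapply(living[, c("B", "H", "I", "A")], levels)
```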

## Example

* First we study the interaction between housing and health, so we aggregate the data and only look at the association between `B` and `H` without controlling for `I` and `A`:
```{r}
BH <- aggregate(N ~ B + H, FUN = sum, data = living)
BH
```

## Model specification

* Like last time we can write down a model for (the logarithm of) the expected frequencies $f_e$ by using dummy variables.
* We let $z_{b1}, z_{b2}$ and $z_{h1}$ denote dummy variables for the levels of `B` and `H` (the reference level has all dummy variables equal to 0):
$$
 \log(f_e) = \alpha + \beta_{b1}z_{b1} + \beta_{b2}z_{b2} +
    \beta_{h1}z_{h1} + \beta_{b1h1}z_{b1}z_{h1} + \beta_{b2h1}z_{b2}z_{h1}.
    $$
* Note that this time we have included an interaction term, which in this case implies that we do not assume independence between `B` and `H` in the model.
* This model contains all possible terms, and there are as many parameters (6) as there are cells (6) in the table. This is called the **saturated** model.
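* To see how this looks in practice, `model.matrix` shows the design matrix of the saturated model: 6 columns (intercept, two dummies for `B`, one for `H` and their two products) matching the 6 parameters:
```{r}
# Design matrix of the saturated model: one column per parameter
model.matrix(~ B * H, data = BH)
```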

## Example

* We fit the model using `glm`:
```{r}
model <- glm(N ~ B * H, family = poisson, data = BH)
```

* The parameter estimates (of the contrasts, i.e. the differences to the reference level `B: Bad, H: Bad`) are:

```{r}
summary(model)
```
* The combination `Good:Good` increases the response, i.e. there is an over-representation of people with good housing conditions and good health.
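* As a sanity check: in the saturated model the expected frequencies agree exactly with the observed counts:
```{r}
# Saturated model: fitted (expected) values equal the observed counts
cbind(BH, fitted = fitted(model))
```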

## Model form

* Log-linear models quickly become cumbersome to write down.
* For log-linear models the structure of the model is often more interesting than the value of the parameters.
* Therefore we typically just use the model form
$$
\text{B} + \text{H} + \text{B}:\text{H}
$$
where $\text{B} + \text{H}$ are the main effects and $\text{B}:\text{H}$ means that we have interaction between B and H.
* We shall stick to the **hierarchical principle**, which means that if $\text{B}:\text{H}$ is included then we must include $\text{B} + \text{H}$. For that reason we use the shorthand notation
$$
\text{B}*\text{H} = \text{B} + \text{H} + \text{B}:\text{H}
$$
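
* This is exactly the formula notation **R** uses; we can inspect the expansion of `B*H` with `terms`:
```{r}
# R expands B * H into the main effects plus the interaction
attr(terms(~ B * H), "term.labels")
```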


## Model comparison

* If we suggest a simpler model than the saturated model it will always provide a poorer fit to the given data.
* We need to find a model which is as simple (few parameters) as possible but which at the same time fits the data well.
* Last time we used the $\chi^2$ statistic to judge whether the model $\text{B} + \text{H}$ (independence between $\text{B}$ and $\text{H}$) was good enough compared to the saturated model $\text{B}*\text{H}$. This is only possible for models with two variables.
* More generally we shall consider the **deviance** of a model, which is approximately given by
$$\sum\frac{(\mathtt{observed - expected})^2}{\mathtt{expected}}$$
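
* As a small sketch, we can compute this approximation by hand for the independence model `B + H` and compare it with the deviance reported by `glm`:
```{r}
# Independence model and the Pearson-type approximation to its deviance
indep <- glm(N ~ B + H, family = poisson, data = BH)
sum((BH$N - fitted(indep))^2 / fitted(indep))  # the approximation above
deviance(indep)                                # deviance computed by glm
```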


## Deviance

* We look at the output from the function `drop1`:
```{r}
drop1(model, test = "Chisq")
```

* The deviance for the saturated model (`<none>` in the output) is zero, since observed = expected.
* The deviance increases by 40.8 when we remove the interaction `B:H`.
* Removing `B:H` corresponds to the null hypothesis $H_0: \beta_{b1h1}=\beta_{b2h1}=0$.
* The deviance is compared with a $\chi^2$-distribution with $df=2$ to determine whether it is significant.
* Since the p-value is $1.4\times10^{-9}$ the interaction is significant and cannot be left out.
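* We can verify the p-value directly:
```{r}
# Upper tail probability of a chi-square distribution with df = 2
pchisq(40.8, df = 2, lower.tail = FALSE)
```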

# Higher order interaction
## Higher order interaction

* We shall consider models with interaction between more than two variables.
* E.g. there may be a combined effect of adding water, fertilizer and light to a plant at the same time, i.e. the effect of adding water and fertilizer depends on whether the light is on.
* We call this a **3-way interaction**.
* We shall still respect the hierarchical principle:
    + If we include a 3-way interaction, then we must include all main effects and 2-way interactions of the 3 variables.
* Again we use shorthand notation
$$
\text{B}*\text{H}*\text{A} = \text{B} + \text{H} + \text{A} +
    \text{B}:\text{H}+\text{B}:\text{A} +\text{H}:\text{A}+\text{B}:\text{H}:\text{A} 
$$
* Similar considerations hold for 4-way interactions like $\text{B}*\text{H}*\text{A}*\text{I}$, etc.

## Bigger example

* We fit the saturated model to the full dataset and look at the estimated parameters (only the last half of the values are printed to save space):
```{r size="small"}
satmodel <- glm(N ~ B * H * A * I, family = poisson, data = living)
summary(satmodel)
```

* We see that the four-way interaction and all three-way interactions look like they could be left out. 
* However, each p-value relates to removing that single term while leaving everything else in the model, so we have to remove terms one at a time and test against the new, reduced model.
* This can be done by successive use of `drop1` and `update` as explained in the following.

----

```{r}
drop1(satmodel, test = "Chisq")
```

* The output of `drop1` reveals that the four-way interaction is insignificant, so we remove it and save the updated model like this:

```{r size="scriptsize"}
reducedmodel <- update(satmodel, . ~ . - B:H:A:I)
```
* We use `drop1` again:

```{r}
drop1(reducedmodel, test = "Chisq")
```
* We see `B:H:I` could be removed and use `update` again:

```{r}
reducedmodel <- update(reducedmodel, . ~ . - B:H:I)
```
* We can continue this way as long as we like. 
* If instead we have a specific simple model in mind, we can define this model and test it against the saturated model with `anova` as shown in the following.
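* As an aside, backward elimination can also be automated with `step`; this is a sketch based on AIC rather than the $\chi^2$ tests above, so it may end with a slightly different model:
```{r eval=FALSE}
# Automated backward elimination using AIC (not chi-square tests)
autoreduced <- step(satmodel, direction = "backward", trace = 0)
formula(autoreduced)
```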

## Test of simple model against the saturated model

* We fit the simpler model to the full dataset:
```{r tidy=FALSE,results='hide'}
simplemodel <- glm(N ~ B*H + B*A + B*I + H*A + H*I + I*A, family = poisson, data = living)
```
* We compare the two models using `anova`:

```{r}
anova(simplemodel, satmodel, test = "Chisq")
```
* The deviance between the two models is 10.697 with $df=9$, which has a p-value of 29.7%, so we prefer the simpler model without the four-way and three-way interactions.
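* Again the p-value can be verified directly:
```{r}
# Deviance difference 10.697 on 9 degrees of freedom
pchisq(10.697, df = 9, lower.tail = FALSE)
```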

## Model reduction

* Let us check whether we can make further model reductions, where we again use the function `drop1`.

```{r}
drop1(simplemodel, test = "Chisq")
```

* Conclusion: All of the pairwise interaction terms are significant.

## Final model and parameter estimates

* In conclusion our final model is `simplemodel`.

```{r}
summary(simplemodel)
```

* We see that all significant interactions are positive, as expected (why?).

* `(Intercept)` is the log of the expected count when all factors are at their "negative" reference level. So the expected count is $\exp(2.17) = 8.76$, whereas the observed count is $8$.
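* The same calculation in **R**:
```{r}
# Expected count in the reference cell (all factors at their reference level)
exp(coef(simplemodel)["(Intercept)"])
```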

## Predicted values

* What is the expected number of people without anxiety (`A`), with acceptable housing (`B`) and good health (`H`), who are isolated (`I`)?

```{r}
summary(simplemodel)
```
* The logarithm of the expected count is: $$2.17042+0.65323-1.76785+0.35067-0.04121-0.12759+0.44076=1.67843$$
* so the expected number is $\exp(1.67843) = 5.36$, whereas the observed number is $5$.
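* We can let `predict` do this calculation. A sketch, assuming the factor levels are spelled `Acceptable`, `Good`, `No` and `Yes` (check with `levels` if they differ):
```{r eval=FALSE}
# Hypothetical level names -- verify against levels(living$B) etc.
newcell <- data.frame(B = "Acceptable", H = "Good", A = "No", I = "Yes")
predict(simplemodel, newdata = newcell, type = "response")
```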

----

* We can of course make **R** calculate the expected counts for all cells of the table and add them to the data:
```{r}
living$expected <- fitted(simplemodel)
living[1:12,]
```

# Graphical representation
## Graphical representation

* We make a graphical representation by
    + drawing a circle for each variable.
    + connecting variables which enter the same model term.
* Example: Assume the model is
$$
\text{A}*\text{B} + \text{B}*\text{H}*\text{I}
$$
* Then the graphical representation is

```{r, echo=FALSE}
mydf <- data.frame(ID = c("A", "B", "B", "H"), MatchedID = c("B", "H", "I", "I"))
plot(igraph::graph_from_data_frame(mydf, directed = FALSE), vertex.size = 30, vertex.label.cex = 1)
```


## Interpretation of graphical model

* __Independence__: If A enters the model formula but doesn't enter any interaction term (e.g. $\text{A}*\text{B}$, $\text{A}*\text{H}$, etc.), then A is independent of the other variables.
* E.g. $\text{A} + \text{B}*\text{H} + \text{B}*\text{I}$

```{r, echo=FALSE}
mydf <- data.frame(ID = c("B", "B"), MatchedID = c("I", "H"))
mygraph <- igraph::graph_from_data_frame(mydf, directed = FALSE)
mygraph <- igraph::add_vertices(mygraph, 1, name = "A")
plot(mygraph, vertex.size = 30, vertex.label.cex = 1)
```

* __Explained association__: If B and H are "connected" via other terms, but don't enter the same term, then the association between them is explained by the other variables. I.e. the model cannot include e.g. $\text{B}*\text{H}$, $\text{B}*\text{H}*\text{A}$ or $\text{A}*\text{B}*\text{H}*\text{I}$.
* E.g. $\text{B}*\text{I} + \text{A}*\text{H}*\text{I}$

```{r, echo=FALSE}
mydf <- data.frame(ID = c("A", "A", "B", "I"), MatchedID = c("H", "I", "I", "H"))
plot(igraph::graph_from_data_frame(mydf, directed = FALSE), vertex.size = 30, vertex.label.cex = 1)
```


## Interpretation of graphical model

* __Homogeneous association__: If $\text{A}*\text{H}$ enters the model, but $\text{A}*\text{H}$ doesn't enter more complicated terms, then the association between A and H is homogeneous. 
* I.e. the model cannot contain $\text{A}*\text{H}*\text{I}$, $\text{A}*\text{B}*\text{H}$ or $\text{A}*\text{B}*\text{H}*\text{I}$.

* E.g. $\text{A} * \text{H} + \text{A}*\text{I}*\text{B} + \text{B}*\text{H}$

```{r, echo=FALSE}
mydf <- data.frame(ID = c("A", "A", "A", "B", "B"), MatchedID = c("B", "H", "I", "I", "H"))
plot(igraph::graph_from_data_frame(mydf, directed = FALSE), vertex.size = 30, vertex.label.cex = 1)
```

* __Heterogeneous association__: If $\text{A}*\text{H}$ enters the model as part of a more complicated term, then the association between A and H is heterogeneous.
* I.e. the model *must* contain at least one of $\text{A}*\text{H}*\text{I}$, $\text{A}*\text{B}*\text{H}$ or $\text{A}*\text{B}*\text{H}*\text{I}$.
* E.g. $\text{A}*\text{B}*\text{H} + \text{A}*\text{I}*\text{B}$

```{r, echo=FALSE}
mydf <- data.frame(ID = c("A", "A", "A", "B", "B"), MatchedID = c("B", "H", "I", "I", "H"))
plot(igraph::graph_from_data_frame(mydf, directed = FALSE), vertex.size = 30, vertex.label.cex = 1)
```


## Final model of the example

* In the example the final model was:

$$
\text{B} * \text{I} + \text{H} * \text{I} + \text{I} * \text{A} + \text{B}*\text{H}+ \text{B}*\text{A} + \text{H}*\text{A}
$$

```{r, echo=FALSE}
mydf <- data.frame(ID = c("A", "A", "A", "B", "B", "H"), MatchedID = c("B", "H", "I", "I", "H", "I"))
plot(igraph::graph_from_data_frame(mydf, directed = FALSE), vertex.size = 30, vertex.label.cex = 1)
```

* We can see directly from the graph that:
    + we don't have any independent variables, since all variables are connected;
    + we do not have any explained associations.
* From the formula we have homogeneous association between all pairs of variables.
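* Because the associations are homogeneous, each exponentiated interaction coefficient is a conditional odds ratio (for the corresponding level contrast) that is the same at every level of the remaining variables:
```{r}
# Interaction terms in exp(coef) are conditional odds ratios under
# the homogeneous association model
exp(coef(simplemodel))
```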