---
title: "Contingency tables"
author: The ASTA team
output:
  slidy_presentation:
    highlight: tango
    theme: cerulean
    fig_caption: false
  pdf_document:
    toc: yes
    highlight: tango
    number_sections: true
    fig_caption: false
---

```{r, include = FALSE}
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "grid", "jpeg"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# Contingency tables 
## A contingency table

* We return to the dataset `popKids`, where we study the **association** between two **factors**: `Goals` and `Urban.Rural`.
* Based on a sample we cross-tabulate the two factors and get a so-called **contingency table**.

```{r message=FALSE}
popKids <- read.delim("https://asta.math.aau.dk/datasets?file=PopularKids.txt")
library(mosaic)
tab <- tally(~Urban.Rural + Goals, data = popKids, margins = TRUE)
tab
``` 

### A conditional distribution

* Another representation of the data is the percentage distribution of `Goals` for each level of `Urban.Rural`, i.e.\ the sum in each row of the table is $100$ (up to rounding):
```{r}
tab <- tally(~Urban.Rural + Goals, data = popKids)
addmargins(round(100 * prop.table(tab, 1)), margin = 1:2)
``` 
* Here we will talk about the **conditional distribution** of `Goals` given `Urban.Rural`.
* An important question could be:
    + Are the goals of the kids different when they come from urban, suburban or rural areas? I.e.\ are the rows in the table significantly different?
* There is (almost) no difference between urban and suburban, but it looks like rural is different, as the plot sketched below also suggests.
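A quick way to inspect this visually is a mosaic plot of the table (a minimal sketch using base R's `mosaicplot`; within each `Urban.Rural` column the boxes show the conditional distribution of `Goals`):

```{r}
# Mosaic plot of Goals given Urban.Rural; similar column patterns indicate independence
mosaicplot(tab, color = TRUE, main = "Goals by Urban.Rural")
```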

# Independence
## Independence

* Recall that two factors are **independent** when, in the population, the conditional distribution of one factor is the same for every level of the other factor.
* Otherwise the factors are said to be **dependent**.
* If we, for example, have the following conditional **population distributions** of `Goals` given `Urban.Rural`:

```{r echo = FALSE}
dat <- data.frame(Goals = rep(c(rep("Grades", 5), rep("Popular", 3), rep("Sports", 2)), 3))
dat$Urban.Rural <- rep(c("Rural", "Suburban", "Urban"), each = 10)
100 * tally(~Urban.Rural + Goals, data = dat)
``` 

* Then the factors `Goals` and `Urban.Rural` are independent.
* We take a sample and "measure" the factors $F_1$ and $F_2$, e.g. `Goals` and `Urban.Rural` for a random child.
* The hypothesis of interest today is: $$H_0: F_1 \mbox{ and } F_2 \mbox{ are independent, } \quad  H_a: F_1 \mbox{ and } F_2 \mbox{ are dependent.}$$

## The Chi-squared test for independence

* Our best guess of the distribution of `Goals` is the relative frequencies in the sample:
```{r}
n <- margin.table(tab)
pctGoals <- round(100 * margin.table(tab, 2)/n, 1)
pctGoals
``` 

* If we assume independence, then this is also a guess of the conditional distributions of `Goals` given `Urban.Rural`. 
* The corresponding expected counts in the sample are then:
```{r echo=FALSE}
test <- chisq.test(tab, correct = FALSE)
exptab <- as.table(format(round(addmargins(test$expected), 1)))
pctvec <- paste("(",c(rep(paste(pctGoals),each = 4), rep("100", 4)), "%)", sep = "")
pctexptab <- as.table(matrix(paste(exptab, pctvec), 4, 4, dimnames = dimnames(exptab)))
pctexptab
``` 

## Calculation of the expected table

```{r}
pctexptab
``` 

* We note that
    + The relative frequency for a given column is columnTotal divided by tableTotal. For example `Grades`, which is $\frac{247}{478}=51.7\%$.
    + The expected value in a given cell in the table is then the cell's relative column frequency multiplied by the cell's rowTotal. For example `Rural` and `Grades`: $149\times 51.7\%=77.0$.

* This can be summarized as follows (see the sketch below):
    + The expected value in a cell is the product of the cell's rowTotal and columnTotal divided by tableTotal.
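The expected table can also be computed directly from the margins of the counts-only table `tab` (a minimal sketch using base R):

```{r}
# Expected counts under independence: (rowTotal x columnTotal) / tableTotal for each cell
fe <- outer(rowSums(tab), colSums(tab)) / sum(tab)
round(fe, 1)
```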

## Chi-squared ($\chi^2$) test statistic

* We have an **observed table**:
```{r}
tab
``` 

* And an **expected table**, if $H_0$ is true:

```{r echo=FALSE,size="footnotesize"}
exptab
``` 

* If these tables are "far from each other", then we reject $H_0$.  We want to measure the distance via the Chi-squared test statistic:
    + $X^2=\sum\frac{(f_o-f_e)^2}{f_e}$: Sum over all cells in the table
    + $f_o$ is the frequency in a cell in the observed table
    + $f_e$ is the corresponding frequency in the expected table.
* We have: $$X^2_{obs} = \frac{(57-77)^2}{77} + \ldots + \frac{(26-33.5)^2}{33.5}=18.8,$$ as verified in the sketch below.
* Is this a large distance?
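A direct computation of $X^2_{obs}$, using the observed table `tab` and the expected table `fe` from the earlier sketch:

```{r}
# Pearson's X^2: sum over all cells of (observed - expected)^2 / expected
sum((tab - fe)^2 / fe)
```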

## $\chi^2$-test template

* We want to test the hypothesis $H_0$ of independence in a table with $r$ rows and $c$ columns:
    + We take a sample and calculate $X^2_{obs}$, the observed value of the test statistic.
    + p-value: Assume $H_0$ is true. What is then the chance of obtaining a larger $X^2$ than $X^2_{obs}$, if we repeat the experiment?

* Under $H_0$, the distribution of $X^2$ can be approximated by the $\mathbf{\chi^2}$-**distribution** with $df=(r-1)(c-1)$ degrees of freedom.
* For `Goals` and `Urban.Rural` we have $r=c=3$, i.e.\ $df=4$ and $X^2_{obs}=18.8$, so the p-value is:
```{r}
1 - pdist("chisq", 18.8, df = 4)
``` 
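The same upper-tail probability can be found with base R alone (without the mosaic package):

```{r}
# Upper tail of the chi-squared distribution with 4 degrees of freedom
pchisq(18.8, df = 4, lower.tail = FALSE)
```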

* There is clearly a significant association between `Goals` and `Urban.Rural`.

## The function `chisq.test`

* All of the above calculations can be obtained with the function `chisq.test`.
```{r}
tab <- tally(~ Urban.Rural + Goals, data = popKids)
testStat <- chisq.test(tab, correct = FALSE)
testStat
testStat$expected
```

----

* The frequency data can also be put directly into a matrix.
```{r}
data <- c(57, 87, 103, 50, 42, 49, 42, 22, 26)
tab <- matrix(data, nrow = 3, ncol = 3)
row.names(tab) <- c("Rural", "Suburban", "Urban")
colnames(tab) <- c("Grades", "Popular", "Sports")
tab
chisq.test(tab)
```

# The $\chi^2$-distribution
## The $\chi^2$-distribution

* The $\chi^2$-distribution with $df$ degrees of freedom:
    + Takes only non-negative values, and $X^2=0$ happens only if $f_o = f_e$ in every cell.
    + Has mean $\mu=df$
    + Has standard deviation $\sigma=\sqrt{2df}$ (mean and standard deviation are checked by simulation below)
    + Is skewed to the right, but approaches a normal distribution as $df$ grows.
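The mean and standard deviation are easy to check by simulation (a sketch; `rchisq` draws random values from the $\chi^2$-distribution):

```{r}
# For df = 10 we expect mean close to 10 and sd close to sqrt(20) = 4.47
x <- rchisq(100000, df = 10)
c(mean = mean(x), sd = sd(x))
```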

```{r dchisq,echo=FALSE}
x <- seq(0, 40, length = 1000)
dfs <- c(1, 5, 10, 20)
plot(x, dchisq(x, df = dfs[2]), col = 2, type = "l", lwd = 2, xlab = "", ylab = "")
lines(x, dchisq(x, df = dfs[1]), col = 1, lwd = 2)
lines(x, dchisq(x, df = dfs[3]), col = 3, lwd = 2)
lines(x, dchisq(x, df = dfs[4]), col = 4, lwd = 2)
legend("topright", legend = paste("df =", dfs), col = 1:4, lwd = 2)
``` 

# Agresti - Summary
## Summary

* For the Chi-squared statistic, $X^2$, to be appropriate we require that all expected values satisfy $f_e\geq 5$ (checked in the sketch below).
* Now we can summarize the ingredients in the Chi-squared test for independence. 
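This rule of thumb can be checked directly on the `chisq.test` object `testStat` computed earlier:

```{r}
# All expected cell counts should be at least 5
all(testStat$expected >= 5)
```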

```{r, fig.width=10, echo=FALSE, fig.align='center'}
# was ![](https://asta.math.aau.dk/static-files/asta/img/AGR.jpg)
url <- "https://asta.math.aau.dk/static-files/asta/img/AGR.jpg"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(jpeg::readJPEG(z))
invisible(file.remove(z))
```


# Standardized residuals
## Residual analysis

* If we reject the hypothesis of independence it can be of interest to identify the significant deviations.
* In a given cell in the table, $f_o-f_e$ is the deviation between data and the expected values under the null hypothesis.
* We assume that $f_e\geq 5$.
* If $H_0$ is true, then the standard error of $f_o - f_e$ is given by $$se=\sqrt{f_e (1-\mbox{rowProportion}) (1-\mbox{columnProportion})}$$
* The corresponding $z$-score $$z=\frac{f_o-f_e}{se}$$ should lie between $\pm 2$ in about 95\% of the cells. Values above 3 or below $-3$ should not appear.
* In the popKids table, the cell `Rural`/`Grades` has $f_e = 77.0$ and $f_o = 57$. Here columnProportion $= 247/478 = 51.7\%$ and rowProportion $= 149/478 = 31.2\%$.
* We can then calculate (verified in the sketch below) $$z = \frac{57-77}{\sqrt{77(1-0.517)(1-0.312)}} = -3.95.$$
* Compared to the null hypothesis there are far too few rural kids who find grades important.
* In summary: the standardized residuals allow for a cell-by-cell comparison of $f_o$ and $f_e$.
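Plugging the numbers from above into the formula (a minimal sketch):

```{r}
# Standardized residual for the (Rural, Grades) cell
fo_cell <- 57; fe_cell <- 77.0
rowProp <- 149 / 478; colProp <- 247 / 478
(fo_cell - fe_cell) / sqrt(fe_cell * (1 - rowProp) * (1 - colProp))
```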

## Residual analysis in `R`

* In `R` we can extract the standardized residuals from the output of `chisq.test`:
```{r}
tab <- tally(~ Urban.Rural + Goals, data = popKids)
testStat <- chisq.test(tab, correct = FALSE)
testStat$stdres
```

# Models for table data in `R`
## Example

* We will study the dataset `HairEyeColor`.
```{r}
HairEyeColor <- read.delim("https://asta.math.aau.dk/datasets?file=HairEyeColor.txt")
head(HairEyeColor)
``` 

* Data is organized such that the variable `Freq` gives the frequency of each combination of the factors `Hair`, `Eye` and `Sex`.
* For example: 32 observations are men with black hair and brown eyes. 
* We are interested in the association between eye color and hair color, ignoring `Sex`.
* We aggregate data, so we have a table with frequencies for each combination of `Hair` and `Eye`.
```{r}
HairEye <- aggregate(Freq ~ Eye + Hair, FUN = sum, data = HairEyeColor)
HairEye
``` 
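The same cross-tabulation can be displayed as a table with base R's `xtabs`, which sums `Freq` over the omitted factor `Sex` (the long format from `aggregate` is what we need for `glm` below):

```{r}
xtabs(Freq ~ Hair + Eye, data = HairEyeColor)
```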

## Model specification

* We can write down a model for (the logarithm of) the expected frequencies by using dummy variables $z_{e1},z_{e2},z_{e3}$ and $z_{h1},z_{h2},z_{h3}$ to denote the different levels of `Eye` and `Hair` (the reference level has all dummy variables equal to 0):
$$
\log(f_e) = \alpha + \beta_{e1}z_{e1} + \beta_{e2}z_{e2} + \beta_{e3}z_{e3} + \beta_{h1}z_{h1} + \beta_{h2}z_{h2} + \beta_{h3}z_{h3}.
$$
* Note that we haven't included an interaction term, which in this case implies that we assume independence between `Eye` and `Hair` in the model.
* Since our response variable is now a count, it is no longer a linear model (`lm`) as we have been used to (linear regression).
* Instead it is a so-called generalized linear model, and the relevant `R` command is `glm` (the dummy coding it uses is sketched below).
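To see the dummy coding, we can inspect the design matrix before fitting (a sketch; the reference levels are the alphabetically first ones, `Hair = Black` and `Eye = Blue`):

```{r}
# First rows of the design matrix: one dummy column per non-reference level
head(model.matrix(~ Hair + Eye, data = HairEye))
```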

## Model specification in **R**

```{r}
model <- glm(Freq ~ Hair + Eye, family = poisson, data = HairEye)
``` 

* The argument `family = poisson` tells `R` that the response is a discrete count and not a continuous variable.

```{r}
summary(model)
``` 

* The residual deviance, $146.44$ with $df=9$, plays the role of $X^2$ here; it is very clearly significant, so we reject the null hypothesis of independence between hair and eye color.
```{r}
1 - pdist("chisq", 146.44, df = 9)
``` 
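For comparison, the classical Pearson $X^2$ test of the same independence hypothesis can be run on the aggregated table; the two statistics differ somewhat but lead to the same conclusion:

```{r}
# Pearson chi-squared test on the Hair x Eye table (same H0 as the deviance test)
chisq.test(xtabs(Freq ~ Hair + Eye, data = HairEye))
```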

## Expected values and standardized residuals

* We also want to look at expected values and standardized (studentized) residuals.
* The null hypothesis predicts $e^{3.67 + 0.02} = 40.1$ observations with brown eyes and black hair, but we have observed 68.
* This is significantly too many, since the standardized residual is 5.86.
* The null hypothesis predicts 47.2 with brown eyes and blond hair, but we have seen 7. This is significantly too few, since the standardized residual is -9.42.
```{r results='hide',size="footnotesize"}
HairEye$fitted <- fitted(model)
HairEye$resid <- rstudent(model)
HairEye
``` 
```{r echo=FALSE,size="footnotesize"}
options(digits = 3)
HairEye
``` 

