The ASTA team
popularKids
, where we study association between 2 factors: Goals
and Urban.Rural
.popKids <- read.delim("https://asta.math.aau.dk/datasets?file=PopularKids.txt")
library(mosaic)
tab <- tally(~Urban.Rural + Goals, data = popKids, margins = TRUE)
tab
## Goals
## Urban.Rural Grades Popular Sports Total
## Rural 57 50 42 149
## Suburban 87 42 22 151
## Urban 103 49 26 178
## Total 247 141 90 478
Goals
for each level of Urban.Rural
, i.e. the sum in each row of the table is \(100\) (up to rounding):tab <- tally(~Urban.Rural + Goals, data = popKids)
addmargins(round(100 * prop.table(tab, 1)),margin = 1:2)
## Goals
## Urban.Rural Grades Popular Sports Sum
## Rural 38 34 28 100
## Suburban 58 28 15 101
## Urban 58 28 15 101
## Sum 154 90 58 302
Goals
given Urban.Rural
.Goals
given Urban.Rural
:## Goals
## Urban.Rural Grades Popular Sports
## Rural 500 300 200
## Suburban 500 300 200
## Urban 500 300 200
Goals
and Urban.Rural
are independent.Goals
and Urban.Rural
for a random child.Goals
is the relative frequencies in the sample:n <- margin.table(tab)
pctGoals <- round(100 * margin.table(tab, 2)/n, 1)
pctGoals
## Goals
## Grades Popular Sports
## 51.7 29.5 18.8
Goals
given Urban.Rural
.## Goals
## Urban.Rural Grades Popular Sports Sum
## Rural 77.0 (51.7%) 44.0 (29.5%) 28.1 (18.8%) 149.0 (100%)
## Suburban 78.0 (51.7%) 44.5 (29.5%) 28.4 (18.8%) 151.0 (100%)
## Urban 92.0 (51.7%) 52.5 (29.5%) 33.5 (18.8%) 178.0 (100%)
## Sum 247.0 (51.7%) 141.0 (29.5%) 90.0 (18.8%) 478.0 (100%)
pctexptab
## Goals
## Urban.Rural Grades Popular Sports Sum
## Rural 77.0 (51.7%) 44.0 (29.5%) 28.1 (18.8%) 149.0 (100%)
## Suburban 78.0 (51.7%) 44.5 (29.5%) 28.4 (18.8%) 151.0 (100%)
## Urban 92.0 (51.7%) 52.5 (29.5%) 33.5 (18.8%) 178.0 (100%)
## Sum 247.0 (51.7%) 141.0 (29.5%) 90.0 (18.8%) 478.0 (100%)
Grades
, which is \(\frac{247}{478}=51.7\%\).Rural
and Grades
: \(149\times 51.7\%=77.0\).tab
## Goals
## Urban.Rural Grades Popular Sports
## Rural 57 50 42
## Suburban 87 42 22
## Urban 103 49 26
## Goals
## Urban.Rural Grades Popular Sports Sum
## Rural 77.0 44.0 28.1 149.0
## Suburban 78.0 44.5 28.4 151.0
## Urban 92.0 52.5 33.5 178.0
## Sum 247.0 141.0 90.0 478.0
Goals
and Urban.Rural
we have \(r=c=3\), i.e. \(df=4\) and \(X^2_{obs}=18.8\), so the p-value is:1 - pdist("chisq", 18.8, df = 4)
## [1] 0.0008603303
Goals
and Urban.Rural
.chisq.test
.chisq.test
.tab <- tally(~ Urban.Rural + Goals, data = popKids)
testStat <- chisq.test(tab, correct = FALSE)
testStat
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 18.828, df = 4, p-value = 0.0008497
testStat$expected
## Goals
## Urban.Rural Grades Popular Sports
## Rural 76.99372 43.95188 28.05439
## Suburban 78.02720 44.54184 28.43096
## Urban 91.97908 52.50628 33.51464
data <- c(57, 87, 103, 50, 42, 49, 42, 22, 26)
tab <- matrix(data, nrow = 3, ncol = 3)
row.names(tab) <- c("Rural", "Suburban", "Urban")
colnames(tab) <- c("Grades", "Popular", "Sports")
tab
## Grades Popular Sports
## Rural 57 50 42
## Suburban 87 42 22
## Urban 103 49 26
chisq.test(tab)
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 18.828, df = 4, p-value = 0.0008497
Rural and Grade
we got \(f_e = 77.0\) and \(f_o = 57\). Here columnProportion\(=51.7\%\) and rowProportion\(=149/478=31.2\%\).R
R
we can extract the standardized residuals from the output of chisq.test
:tab <- tally(~ Urban.Rural + Goals, data = popKids)
testStat <- chisq.test(tab, correct = FALSE)
testStat$stdres
## Goals
## Urban.Rural Grades Popular Sports
## Rural -3.9508449 1.3096235 3.5225004
## Suburban 1.7666608 -0.5484075 -1.6185210
## Urban 2.0865780 -0.7274327 -1.8186224
R
HairEyeColor
.HairEyeColor <- read.delim("https://asta.math.aau.dk/datasets?file=HairEyeColor.txt")
head(HairEyeColor)
## Hair Eye Sex Freq
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
Freq
gives the frequency of each combination of the factors Hair
, Eye
and Sex
.Hair
and Eye
.HairEye <- aggregate(Freq ~ Eye + Hair, FUN = sum, data = HairEyeColor)
HairEye
## Eye Hair Freq
## 1 Blue Black 20
## 2 Brown Black 68
## 3 Green Black 5
## 4 Hazel Black 15
## 5 Blue Blond 94
## 6 Brown Blond 7
## 7 Green Blond 16
## 8 Hazel Blond 10
## 9 Blue Brown 84
## 10 Brown Brown 119
## 11 Green Brown 29
## 12 Hazel Brown 54
## 13 Blue Red 17
## 14 Brown Red 26
## 15 Green Red 14
## 16 Hazel Red 14
To denote the different levels of Eye
and Hair
(the reference level has all dummy variables equal to 0): \[
\log(f_e) = \alpha + \beta_{e1}z_{e1} + \beta_{e2}z_{e2} + \beta_{e3}z_{e3} + \beta_{h1}z_{h1} + \beta_{h2}z_{h2} + \beta_{h3}z_{h3}.
\]
Eye
and Hair
in the model.Since our response variable now is a count it is no longer a linear model (lm
) as we have been used to (linear regression). * Instead it is a so-called generalized linear model and the relevant R
command is glm
.
model <- glm(Freq ~ Hair + Eye, family = poisson, data = HairEye)
family = poisson
ensures that R
knows that data should be interpreted as discrete counts and not a continuous variable.summary(model)
##
## Call:
## glm(formula = Freq ~ Hair + Eye, family = poisson, data = HairEye)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.326 -2.065 -0.212 1.235 6.172
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.66926 0.11055 33.191 < 2e-16 ***
## HairBlond 0.16206 0.13089 1.238 0.21569
## HairBrown 0.97386 0.11294 8.623 < 2e-16 ***
## HairRed -0.41945 0.15279 -2.745 0.00604 **
## EyeBrown 0.02299 0.09590 0.240 0.81054
## EyeGreen -1.21175 0.14239 -8.510 < 2e-16 ***
## EyeHazel -0.83804 0.12411 -6.752 1.46e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 453.31 on 15 degrees of freedom
## Residual deviance: 146.44 on 9 degrees of freedom
## AIC: 241.04
##
## Number of Fisher Scoring iterations: 5
1 - pdist("chisq", 146.44, df = 9)
## [1] 0
HairEye$fitted <- fitted(model)
HairEye$resid <- rstudent(model)
HairEye
## Eye Hair Freq fitted resid
## 1 Blue Black 20 39.22 -4.492
## 2 Brown Black 68 40.14 5.856
## 3 Green Black 5 11.68 -2.508
## 4 Hazel Black 15 16.97 -0.583
## 5 Blue Blond 94 46.12 9.368
## 6 Brown Blond 7 47.20 -9.423
## 7 Green Blond 16 13.73 0.719
## 8 Hazel Blond 10 19.95 -2.936
## 9 Blue Brown 84 103.87 -3.437
## 10 Brown Brown 119 106.28 2.151
## 11 Green Brown 29 30.92 -0.511
## 12 Hazel Brown 54 44.93 2.023
## 13 Blue Red 17 25.79 -2.399
## 14 Brown Red 26 26.39 -0.101
## 15 Green Red 14 7.68 2.368
## 16 Hazel Red 14 11.15 0.961