The ASTA team
We study the association between two factors, Goals and Urban.Rural, in the dataset popularKids.
import pandas as pd
popKids = pd.read_csv("https://asta.math.aau.dk/datasets?file=PopularKids.dat", sep="\t")
popKids = popKids.rename(columns={'Urban/Rural': 'Urban.Rural'})
tab_totals = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'], margins=True)
tab_totals
## Goals Grades Popular Sports All
## Urban.Rural
## Rural 57 50 42 149
## Suburban 87 42 22 151
## Urban 103 49 26 178
## All 247 141 90 478
We compute the distribution (in percent) of Goals for each level of Urban.Rural, i.e. the sum in each row of the table is \(100\) (up to rounding):
tab = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'], margins=False)
tab_pct = (tab.div(tab.sum(axis=1), axis=0) * 100)
tab_pct['All'] = tab_pct.sum(axis=1)
tab_pct.round()
## Goals Grades Popular Sports All
## Urban.Rural
## Rural 38.0 34.0 28.0 100.0
## Suburban 58.0 28.0 15.0 100.0
## Urban 58.0 28.0 15.0 100.0
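The row percentages can also be obtained directly with crosstab's normalize parameter. A minimal sketch with a small hypothetical data frame standing in for popKids (the rows below are made up for illustration):

```python
import pandas as pd

# hypothetical stand-in for popKids with the same two columns
df = pd.DataFrame({
    'Urban.Rural': ['Rural', 'Rural', 'Urban', 'Urban', 'Suburban', 'Suburban'],
    'Goals':       ['Grades', 'Sports', 'Grades', 'Popular', 'Grades', 'Grades'],
})
# normalize='index' divides each row by its row total, so each row sums to 1
tab_pct = pd.crosstab(df['Urban.Rural'], df['Goals'], normalize='index') * 100
print(tab_pct.round(1))
```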
A hypothetical table where the conditional distribution of Goals given Urban.Rural is the same for every level of Urban.Rural:
## Goals
## Urban.Rural Grades Popular Sports
## Rural 500 300 200
## Suburban 500 300 200
## Urban 500 300 200
In this hypothetical table Goals and Urban.Rural are independent: knowing Urban.Rural tells us nothing about Goals for a random child. Back to the data, we compute the marginal distribution of Goals:
tab = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'])
n = tab.values.sum()
pctGoals = (tab.sum(axis=0) / n * 100).round(1)
pctGoals
## Goals
## Grades 51.7
## Popular 29.5
## Sports 18.8
## dtype: float64
The expected table under independence, where every conditional distribution of Goals given Urban.Rural equals the marginal distribution:
## Goals
## Urban.Rural Grades Popular Sports Sum
## Rural 77.0 (51.7%) 44.0 (29.5%) 28.1 (18.8%) 149.0 (100%)
## Suburban 78.0 (51.7%) 44.5 (29.5%) 28.4 (18.8%) 151.0 (100%)
## Urban 92.0 (51.7%) 52.5 (29.5%) 33.5 (18.8%) 178.0 (100%)
## Sum 247.0 (51.7%) 141.0 (29.5%) 90.0 (18.8%) 478.0 (100%)
The expected counts are computed from the marginal proportions. For example, the overall proportion choosing Grades is \(\frac{247}{478}=51.7\%\), so the expected count for Rural and Grades is \(149\times 51.7\%=77.0\). Observed counts:
## Goals Grades Popular Sports
## Urban.Rural
## Rural 57 50 42
## Suburban 87 42 22
## Urban 103 49 26
## Goals
## Urban.Rural Grades Popular Sports Sum
## Rural 77.0 44.0 28.1 149.0
## Suburban 78.0 44.5 28.4 151.0
## Urban 92.0 52.5 33.5 178.0
## Sum 247.0 141.0 90.0 478.0
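The expected table can be computed directly as row total times column total divided by the grand total. A sketch using the observed counts from above (numpy is assumed; the code that produced the table is not shown):

```python
import numpy as np
import pandas as pd

obs = pd.DataFrame([[57, 50, 42], [87, 42, 22], [103, 49, 26]],
                   index=['Rural', 'Suburban', 'Urban'],
                   columns=['Grades', 'Popular', 'Sports'])
n = obs.values.sum()
# expected count in cell (i, j): rowTotal_i * colTotal_j / n
expected = pd.DataFrame(np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n,
                        index=obs.index, columns=obs.columns)
print(expected.round(1))
```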
For Goals and Urban.Rural we have \(r=c=3\), i.e. \(df=(r-1)(c-1)=4\), and \(X^2_{obs}=18.8\), so the p-value is:
## np.float64(0.0008603302817890013)
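The p-value above is the right tail of a \(\chi^2_4\) distribution at \(18.8\). The code producing it is not shown; a sketch assuming scipy:

```python
from scipy.stats import chi2

# P(chi^2_4 >= 18.8): survival function of the chi-squared distribution
p_value = chi2.sf(18.8, df=4)
print(p_value)  # ≈ 0.00086
```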
In statsmodels we can obtain the expected values and the test for independence of Goals and Urban.Rural:
import statsmodels.api as sm
tab = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'])
chisqtab = sm.stats.Table(tab)
chisqtab.fittedvalues
## Goals Grades Popular Sports
## Urban.Rural
## Rural 76.993724 43.951883 28.054393
## Suburban 78.027197 44.541841 28.430962
## Urban 91.979079 52.506276 33.514644
## df 4
## pvalue 0.0008496551610398528
## statistic 18.827626180696555
import numpy as np
data = np.array([
57, 50, 42,
87, 42, 22,
103, 49, 26]).reshape(3,3)
tab = pd.DataFrame(data,
index=["Rural", "Suburban", "Urban"],
columns=["Grades", "Popular", "Sports"])
tab
## Grades Popular Sports
## Rural 57 50 42
## Suburban 87 42 22
## Urban 103 49 26
Expected values and test for the manually entered table, identical to before:
## Grades Popular Sports
## Rural 76.993724 43.951883 28.054393
## Suburban 78.027197 44.541841 28.430962
## Urban 91.979079 52.506276 33.514644
## df 4
## pvalue 0.0008496551610398528
## statistic 18.827626180696555
A cell's standardized residual measures how far the observed count deviates from the expected count: \[
r = \frac{f_o - f_e}{\sqrt{f_e(1-\text{rowProportion})(1-\text{columnProportion})}}.
\] For Rural and Grades we got \(f_e = 77.0\) and \(f_o = 57\). Here columnProportion \(=247/478=51.7\%\) and rowProportion \(=149/478=31.2\%\).
import statsmodels.api as sm
tab = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'])
table = sm.stats.Table(tab)
table.standardized_resids
## Goals Grades Popular Sports
## Urban.Rural
## Rural -3.950845 1.309623 3.522500
## Suburban 1.766661 -0.548407 -1.618521
## Urban 2.086578 -0.727433 -1.818622
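The first residual can be verified by hand from \((f_o - f_e)/\sqrt{f_e(1-\text{rowProportion})(1-\text{columnProportion})}\). A sketch for the Rural/Grades cell:

```python
import math

f_o = 57                  # observed count for Rural/Grades
f_e = 149 * 247 / 478     # expected count = rowTotal * colTotal / n
row_prop = 149 / 478      # proportion of Rural children
col_prop = 247 / 478      # proportion choosing Grades
r = (f_o - f_e) / math.sqrt(f_e * (1 - row_prop) * (1 - col_prop))
print(round(r, 2))  # ≈ -3.95
```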
Contingency table:
Two-way ANOVA:
The dataset HairEyeColor:
HairEyeColor = pd.read_csv("https://asta.math.aau.dk/datasets?file=HairEyeColor.txt", sep="\t")
HairEyeColor.head(6)
## Hair Eye Sex Freq
## 0 Black Brown Male 32
## 1 Brown Brown Male 53
## 2 Red Brown Male 10
## 3 Blond Brown Male 3
## 4 Black Blue Male 11
## 5 Brown Blue Male 50
Freq gives the frequency of each combination of the factors Hair, Eye and Sex. We aggregate over Sex to study the association between Hair and Eye:
## Eye Hair Freq
## 0 Blue Black 20
## 1 Blue Blond 94
## 2 Blue Brown 84
## 3 Blue Red 17
## 4 Brown Black 68
## 5 Brown Blond 7
## 6 Brown Brown 119
## 7 Brown Red 26
## 8 Green Black 5
## 9 Green Blond 16
## 10 Green Brown 29
## 11 Green Red 14
## 12 Hazel Black 15
## 13 Hazel Blond 10
## 14 Hazel Brown 54
## 15 Hazel Red 14
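The aggregation over Sex was presumably a groupby-sum. A sketch on a few rows of the standard HairEyeColor data (the female counts below are assumed from that dataset, not shown above):

```python
import pandas as pd

# illustrative subset: (Black hair, Brown eyes) and (Brown hair, Blue eyes) rows
HairEyeColor = pd.DataFrame({
    'Hair': ['Black', 'Black', 'Brown', 'Brown'],
    'Eye':  ['Brown', 'Brown', 'Blue', 'Blue'],
    'Sex':  ['Male', 'Female', 'Male', 'Female'],
    'Freq': [32, 36, 50, 34],
})
# sum Freq over Sex for each (Eye, Hair) combination
HairEye = HairEyeColor.groupby(['Eye', 'Hair'], as_index=False)['Freq'].sum()
print(HairEye)
```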
The log-linear model of independence between Eye and Hair (the reference level has all dummy variables equal to 0): \[
\log(f_e) = \alpha + \beta_{e1}z_{e1} + \beta_{e2}z_{e2} + \beta_{e3}z_{e3} + \beta_{h1}z_{h1} + \beta_{h2}z_{h2} + \beta_{h3}z_{h3}.
\]
We fit this model for Eye and Hair with glm:
import statsmodels.formula.api as smf
model = smf.glm('Freq ~ Hair + Eye', data=HairEye, family=sm.families.Poisson()).fit()
The family argument (Poisson) ensures that the data are interpreted as discrete counts and not as a continuous variable. The model summary:
| Dep. Variable: | Freq | No. Observations: | 16 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 9 |
| Model Family: | Poisson | Df Model: | 6 |
| Link Function: | Log | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -113.52 |
| Date: | Mon, 03 Nov 2025 | Deviance: | 146.44 |
| Time: | 17:48:58 | Pearson chi2: | 138. |
| No. Iterations: | 5 | Pseudo R-squ. (CS): | 1.000 |
| Covariance Type: | nonrobust | | |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 3.6693 | 0.111 | 33.191 | 0.000 | 3.453 | 3.886 |
| Hair[T.Blond] | 0.1621 | 0.131 | 1.238 | 0.216 | -0.094 | 0.419 |
| Hair[T.Brown] | 0.9739 | 0.113 | 8.623 | 0.000 | 0.752 | 1.195 |
| Hair[T.Red] | -0.4195 | 0.153 | -2.745 | 0.006 | -0.719 | -0.120 |
| Eye[T.Brown] | 0.0230 | 0.096 | 0.240 | 0.811 | -0.165 | 0.211 |
| Eye[T.Green] | -1.2118 | 0.142 | -8.510 | 0.000 | -1.491 | -0.933 |
| Eye[T.Hazel] | -0.8380 | 0.124 | -6.752 | 0.000 | -1.081 | -0.595 |
The p-value for the goodness-of-fit test of the model (deviance \(146.44\) on \(9\) degrees of freedom):
## np.float64(0.0)
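A sketch reproducing this goodness-of-fit p-value from the deviance \(146.44\) on \(9\) df shown in the summary (scipy assumed; the original code is not shown):

```python
from scipy.stats import chi2

# deviance G^2 = 146.44 on 9 residual degrees of freedom
print(1 - chi2.cdf(146.44, df=9))  # underflows to 0.0 in double precision
print(chi2.sf(146.44, df=9))       # the actual tail probability, astronomically small
```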
We also want to look at expected values and standardized (studentized) residuals.
The null hypothesis predicts \(e^{3.67 + 0.02} = 40.1\) children with brown eyes and black hair, but we have observed 68.
This is significantly too many, since the standardized residual is 6.1.
The null hypothesis predicts 47.2 with brown eyes and blond hair, but we have seen 7. This is significantly too few, since the standardized residual is -8.3.
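The predicted count for brown eyes and black hair can be checked from the coefficient table (a sketch using the rounded coefficients above):

```python
import math

# intercept (reference level: Black hair, Blue eyes) plus the Eye[T.Brown] coefficient
fitted = math.exp(3.6693 + 0.0230)
print(round(fitted, 1))  # ≈ 40.1
```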
HairEye['fitted'] = model.fittedvalues
HairEye['resid'] = model.get_influence().resid_studentized
HairEye
## Eye Hair Freq fitted resid
## 0 Blue Black 20 39.222973 -4.253816
## 1 Blue Blond 94 46.123311 9.967550
## 2 Blue Brown 84 103.868243 -3.397883
## 3 Blue Red 17 25.785473 -2.311052
## 4 Brown Black 68 40.135135 6.136520
## 5 Brown Blond 7 47.195946 -8.328248
## 6 Brown Brown 119 106.283784 2.164282
## 7 Brown Red 26 26.385135 -0.100824
## 8 Green Black 5 11.675676 -2.287896
## 9 Green Blond 16 13.729730 0.732023
## 10 Green Brown 29 30.918919 -0.508263
## 11 Green Red 14 7.675676 2.576569
## 12 Hazel Black 15 16.966216 -0.575026
## 13 Hazel Blond 10 19.951014 -2.737977
## 14 Hazel Brown 54 44.929054 2.050216
## 15 Hazel Red 14 11.153716 0.989512