Contingency tables

The ASTA team

Contingency tables

A contingency table

import pandas as pd

# read the PopularKids data (tab-separated) and rename the 'Urban/Rural' column to 'Urban.Rural'
popKids = pd.read_csv("https://asta.math.aau.dk/datasets?file=PopularKids.dat", sep="\t")
popKids = popKids.rename(columns={'Urban/Rural': 'Urban.Rural'})

tab_totals = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'], margins=True)
tab_totals
## Goals        Grades  Popular  Sports  All
## Urban.Rural                              
## Rural            57       50      42  149
## Suburban         87       42      22  151
## Urban           103       49      26  178
## All             247      141      90  478

A conditional distribution

tab = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'], margins=False)
# divide each row by its row total to get the conditional distribution of Goals, in percent
tab_pct = (tab.div(tab.sum(axis=1), axis=0) * 100)
# each row now sums to 100
tab_pct['All'] = tab_pct.sum(axis=1)
tab_pct.round()
## Goals        Grades  Popular  Sports    All
## Urban.Rural                                
## Rural          38.0     34.0    28.0  100.0
## Suburban       58.0     28.0    15.0  100.0
## Urban          58.0     28.0    15.0  100.0
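The same conditional distribution can also be obtained directly from pd.crosstab via its normalize argument; a minimal sketch:

# normalize='index' divides each row by its row total; multiply by 100 to get percentages
(pd.crosstab(popKids['Urban.Rural'], popKids['Goals'], normalize='index') * 100).round(1)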

Independence

Independence

An example of counts for which Goals and Urban.Rural are independent: the conditional distribution of Goals is the same in every row (50% Grades, 30% Popular, 20% Sports), so knowing the area tells us nothing about the distribution of goals.

## Goals        Grades  Popular  Sports
## Urban.Rural                         
## Rural           500      300     200
## Suburban        500      300     200
## Urban           500      300     200
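A small sketch reproducing the hypothetical table above and checking that every row has the same conditional distribution (the name indep is just illustrative):

import pandas as pd

# hypothetical counts with identical conditional distributions in every row
indep = pd.DataFrame({'Grades': [500, 500, 500],
                      'Popular': [300, 300, 300],
                      'Sports': [200, 200, 200]},
                     index=['Rural', 'Suburban', 'Urban'])
# every row gives the same percentages: 50 / 30 / 20
(indep.div(indep.sum(axis=1), axis=0) * 100).round()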

The Chi-squared test for independence

tab = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'])
# total number of observations
n = tab.values.sum()
# overall (marginal) distribution of Goals, in percent
pctGoals = (tab.sum(axis=0) / n * 100).round(1)
pctGoals
## Goals
## Grades     51.7
## Popular    29.5
## Sports     18.8
## dtype: float64

The expected table if Goals and Urban.Rural were independent: each area keeps its row total, but within every row the counts follow the overall distribution of Goals (51.7% / 29.5% / 18.8%):

##            Goals
## Urban.Rural Grades        Popular       Sports        Sum         
##    Rural     77.0 (51.7%)  44.0 (29.5%)  28.1 (18.8%) 149.0 (100%)
##    Suburban  78.0 (51.7%)  44.5 (29.5%)  28.4 (18.8%) 151.0 (100%)
##    Urban     92.0 (51.7%)  52.5 (29.5%)  33.5 (18.8%) 178.0 (100%)
##    Sum      247.0 (51.7%) 141.0 (29.5%)  90.0 (18.8%) 478.0 (100%)

Calculation of expected table

##            Goals
## Urban.Rural Grades        Popular       Sports        Sum         
##    Rural     77.0 (51.7%)  44.0 (29.5%)  28.1 (18.8%) 149.0 (100%)
##    Suburban  78.0 (51.7%)  44.5 (29.5%)  28.4 (18.8%) 151.0 (100%)
##    Urban     92.0 (51.7%)  52.5 (29.5%)  33.5 (18.8%) 178.0 (100%)
##    Sum      247.0 (51.7%) 141.0 (29.5%)  90.0 (18.8%) 478.0 (100%)
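The expected count in cell \((i,j)\) is \(e_{ij} = \frac{(\text{row } i \text{ total})\,(\text{column } j \text{ total})}{n}\). A minimal sketch of this calculation with numpy (the name expected is illustrative):

import numpy as np

tab = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'])
n = tab.values.sum()
# expected count in each cell: (row total * column total) / n
expected = pd.DataFrame(np.outer(tab.sum(axis=1), tab.sum(axis=0)) / n,
                        index=tab.index, columns=tab.columns)
expected.round(1)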

Chi-squared (\(\chi^2\)) test statistic

tab
## Goals        Grades  Popular  Sports
## Urban.Rural                         
## Rural            57       50      42
## Suburban         87       42      22
## Urban           103       49      26

The corresponding expected counts under independence:

##            Goals
## Urban.Rural Grades Popular Sports Sum  
##    Rural     77.0   44.0    28.1  149.0
##    Suburban  78.0   44.5    28.4  151.0
##    Urban     92.0   52.5    33.5  178.0
##    Sum      247.0  141.0    90.0  478.0
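The test statistic adds up \(\frac{(f_o - f_e)^2}{f_e}\) over all cells, where \(f_o\) is the observed and \(f_e\) the expected count. A small sketch of the hand calculation, using statsmodels to supply the expected counts:

import statsmodels.api as sm

observed = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'])
expected = sm.stats.Table(observed).fittedvalues
# chi-squared statistic: sum over all cells of (observed - expected)^2 / expected
((observed - expected) ** 2 / expected).values.sum()
# about 18.83, to be compared with a chi-squared distribution with (3-1)(3-1) = 4 degrees of freedom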

\(\chi^2\)-test template.

from scipy.stats import chi2

1 - chi2.cdf(18.8, 4)
## np.float64(0.0008603302817890013)
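As a check, chi2.sf gives the same tail probability directly, and chi2.ppf gives the critical value; the 5% level below is just the conventional choice:

from scipy.stats import chi2

chi2.sf(18.8, 4)     # same p-value as 1 - chi2.cdf(18.8, 4)
chi2.ppf(0.95, 4)    # critical value for a test at the 5% level, about 9.49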

Perform the test using software

import statsmodels.api as sm

tab = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'])
chisqtab = sm.stats.Table(tab)
chisqtab.fittedvalues
## Goals           Grades    Popular     Sports
## Urban.Rural                                 
## Rural        76.993724  43.951883  28.054393
## Suburban     78.027197  44.541841  28.430962
## Urban        91.979079  52.506276  33.514644
print(chisqtab.test_nominal_association())
## df          4
## pvalue      0.0008496551610398528
## statistic   18.827626180696555
import numpy as np

# enter the same counts directly as a 3x3 numpy array
data = np.array([
  57, 50, 42,
  87, 42, 22,
  103, 49, 26]).reshape(3,3)
tab = pd.DataFrame(data,
                   index=["Rural", "Suburban", "Urban"],
                   columns=["Grades", "Popular", "Sports"])
tab
##           Grades  Popular  Sports
## Rural         57       50      42
## Suburban      87       42      22
## Urban        103       49      26
chisqtab = sm.stats.Table(tab)
chisqtab.fittedvalues
##              Grades    Popular     Sports
## Rural     76.993724  43.951883  28.054393
## Suburban  78.027197  44.541841  28.430962
## Urban     91.979079  52.506276  33.514644
print(chisqtab.test_nominal_association())
## df          4
## pvalue      0.0008496551610398528
## statistic   18.827626180696555
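SciPy offers the same test through chi2_contingency; a minimal sketch (recent SciPy versions also expose the result as an object with statistic, pvalue, dof and expected_freq attributes, but tuple unpacking works as well):

from scipy.stats import chi2_contingency

# statistic, p-value, degrees of freedom and the table of expected counts
stat, pval, dof, expected = chi2_contingency(tab)
stat, pval, dof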

The \(\chi^2\)-distribution

The \(\chi^2\)-distribution

Agresti - Summary

Summary

Standardized residuals

Residual analysis

Residual analysis

import statsmodels.api as sm 

tab = pd.crosstab(popKids['Urban.Rural'], popKids['Goals'])
table = sm.stats.Table(tab)
table.standardized_resids
## Goals          Grades   Popular    Sports
## Urban.Rural                              
## Rural       -3.950845  1.309623  3.522500
## Suburban     1.766661 -0.548407 -1.618521
## Urban        2.086578 -0.727433 -1.818622
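A common rule of thumb is that cells with a standardized residual beyond roughly \(\pm 2\) (or \(\pm 3\) in large tables) deviate noticeably from independence; a small sketch flagging them:

resid = table.standardized_resids
# True marks cells whose standardized residual exceeds 2 in absolute value
resid.abs() > 2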

Why not just use two-way ANOVA?

Contingency table:

Two-way ANOVA:

Models for table data

Example

HairEyeColor = pd.read_csv("https://asta.math.aau.dk/datasets?file=HairEyeColor.txt", sep="\t")
HairEyeColor.head(6)
##     Hair    Eye   Sex  Freq
## 0  Black  Brown  Male    32
## 1  Brown  Brown  Male    53
## 2    Red  Brown  Male    10
## 3  Blond  Brown  Male     3
## 4  Black   Blue  Male    11
## 5  Brown   Blue  Male    50
HairEye = HairEyeColor.groupby(['Eye', 'Hair'], as_index=False)['Freq'].sum()
HairEye
##       Eye   Hair  Freq
## 0    Blue  Black    20
## 1    Blue  Blond    94
## 2    Blue  Brown    84
## 3    Blue    Red    17
## 4   Brown  Black    68
## 5   Brown  Blond     7
## 6   Brown  Brown   119
## 7   Brown    Red    26
## 8   Green  Black     5
## 9   Green  Blond    16
## 10  Green  Brown    29
## 11  Green    Red    14
## 12  Hazel  Black    15
## 13  Hazel  Blond    10
## 14  Hazel  Brown    54
## 15  Hazel    Red    14
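For a quick overview, the long-format counts can be pivoted into an ordinary Eye-by-Hair contingency table; a minimal sketch:

# reshape the long-format counts into a 4x4 contingency table
HairEye.pivot(index='Eye', columns='Hair', values='Freq')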

Model specification

Model specification
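The formula Freq ~ Hair + Eye below, with the Poisson family and log link, corresponds to the main-effects-only (independence) log-linear model: writing \(\mu_{ij}\) for the expected count in the cell with hair colour \(i\) and eye colour \(j\), the model states \(\log(\mu_{ij}) = \text{intercept} + (\text{Hair } i \text{ effect}) + (\text{Eye } j \text{ effect})\). The two variables enter additively on the log scale, which is exactly the hypothesis of independence.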

import statsmodels.formula.api as smf

model = smf.glm('Freq ~ Hair + Eye', data=HairEye, family=sm.families.Poisson()).fit()
model.summary()
##                  Generalized Linear Model Regression Results
## ==============================================================================
## Dep. Variable:                   Freq   No. Observations:                   16
## Model:                            GLM   Df Residuals:                        9
## Model Family:                 Poisson   Df Model:                            6
## Link Function:                    Log   Scale:                          1.0000
## Method:                          IRLS   Log-Likelihood:                -113.52
## Date:                Mon, 03 Nov 2025   Deviance:                       146.44
## Time:                        17:48:58   Pearson chi2:                     138.
## No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
## Covariance Type:            nonrobust
## =================================================================================
##                     coef    std err          z      P>|z|      [0.025      0.975]
## ---------------------------------------------------------------------------------
## Intercept         3.6693      0.111     33.191      0.000       3.453       3.886
## Hair[T.Blond]     0.1621      0.131      1.238      0.216      -0.094       0.419
## Hair[T.Brown]     0.9739      0.113      8.623      0.000       0.752       1.195
## Hair[T.Red]      -0.4195      0.153     -2.745      0.006      -0.719      -0.120
## Eye[T.Brown]      0.0230      0.096      0.240      0.811      -0.165       0.211
## Eye[T.Green]     -1.2118      0.142     -8.510      0.000      -1.491      -0.933
## Eye[T.Hazel]     -0.8380      0.124     -6.752      0.000      -1.081      -0.595
## =================================================================================

1 - chi2.cdf(146.44, 9)
## np.float64(0.0)
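The deviance and its degrees of freedom can also be taken directly from the fitted model rather than typed in by hand; a small sketch giving the same goodness-of-fit test:

from scipy.stats import chi2

# deviance of the independence model compared to a chi-squared distribution with df_resid degrees of freedom
chi2.sf(model.deviance, model.df_resid)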

Expected values and standardized residuals

# expected counts under the fitted independence model
HairEye['fitted'] = model.fittedvalues
# studentized residuals from the GLM influence measures
HairEye['resid'] = model.get_influence().resid_studentized
HairEye
##       Eye   Hair  Freq      fitted     resid
## 0    Blue  Black    20   39.222973 -4.253816
## 1    Blue  Blond    94   46.123311  9.967550
## 2    Blue  Brown    84  103.868243 -3.397883
## 3    Blue    Red    17   25.785473 -2.311052
## 4   Brown  Black    68   40.135135  6.136520
## 5   Brown  Blond     7   47.195946 -8.328248
## 6   Brown  Brown   119  106.283784  2.164282
## 7   Brown    Red    26   26.385135 -0.100824
## 8   Green  Black     5   11.675676 -2.287896
## 9   Green  Blond    16   13.729730  0.732023
## 10  Green  Brown    29   30.918919 -0.508263
## 11  Green    Red    14    7.675676  2.576569
## 12  Hazel  Black    15   16.966216 -0.575026
## 13  Hazel  Blond    10   19.951014 -2.737977
## 14  Hazel  Brown    54   44.929054  2.050216
## 15  Hazel    Red    14   11.153716  0.989512
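To see which cells drive the poor fit, the residuals can be arranged as an Eye-by-Hair table; a minimal sketch:

# residuals in table form; values far outside +/-2 point to cells that contradict independence
HairEye.pivot(index='Eye', columns='Hair', values='resid').round(2)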