Analysis of Covariance

The ASTA team

The regression problem

Example

import pandas as pd

Income = pd.read_csv("https://asta.math.aau.dk/datasets?file=Income.txt", sep='\t')
import seaborn as sns

p = sns.lmplot(x='educ', y='inc', hue='race', data=Income, ci=None)

Dummy coding

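Dummy (treatment) coding turns a categorical variable with k levels into k - 1 indicator columns, with the omitted level acting as the reference. A minimal sketch of what the formula interface does internally, here reproduced with `pandas.get_dummies` on a toy series whose values mirror the race levels:

```python
import pandas as pd

# Toy categorical variable with the same levels as race
race = pd.Series(['w', 'b', 'h', 'w', 'b'], dtype='category')
race = race.cat.reorder_categories(['w', 'b', 'h'])

# Treatment coding: the first level ('w') is dropped and becomes the
# reference; each remaining level gets its own 0/1 indicator column.
dummies = pd.get_dummies(race, prefix='race', drop_first=True)
print(dummies)
```

In the model below, statsmodels (via patsy) performs this expansion automatically, which is why the coefficient names come out as race[T.b] and race[T.h].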
Example

Income['race'] = Income['race'].astype('category').cat.reorder_categories(
    ['w', 'b', 'h'], 
    ordered = True
)
import statsmodels.formula.api as smf

model1 = smf.ols('inc ~ educ + race', data=Income).fit()
model1.summary(slim = True)
OLS Regression Results
==============================================================================
Dep. Variable:            inc   R-squared:                0.462
Model:                    OLS   Adj. R-squared:           0.441
No. Observations:          80   F-statistic:              21.75
Covariance Type:    nonrobust   Prob (F-statistic):    2.85e-10
------------------------------------------------------------------------------
                coef   std err        t    P>|t|    [0.025    0.975]
Intercept   -15.6635     8.412   -1.862    0.066   -32.418     1.091
race[T.b]   -10.8744     4.473   -2.431    0.017   -19.783    -1.966
race[T.h]    -4.9338     4.763   -1.036    0.304   -14.421     4.553
educ          4.4317     0.619    7.158    0.000     3.199     5.665
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example: Prediction equations

In the main-effects model all three groups share the educ slope; only the intercepts differ. Reading off the coefficients of model1, the prediction equations are:

predicted inc = -15.66 + 4.43 * educ                          (w, reference)
predicted inc = (-15.66 - 10.87) + 4.43 * educ = -26.54 + 4.43 * educ   (b)
predicted inc = (-15.66 - 4.93) + 4.43 * educ  = -20.60 + 4.43 * educ   (h)

Example: Plot

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

coef = model1.params
educ_vals = np.linspace(Income['educ'].min(), Income['educ'].max(), 100)

plt.figure(figsize=(8,6))
sns.scatterplot(x='educ', y='inc', hue='race', data=Income)

for r in Income['race'].cat.categories:
    intercept = coef['Intercept'] + coef.get(f'race[T.{r}]', 0)
    slope = coef['educ']
    plt.plot(educ_vals, intercept + slope * educ_vals, label=f'{r} line')

plt.xlabel("Education")
plt.ylabel("Income")
plt.legend()
plt.show()

Agresti – summary

Model with interaction

Interaction

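An interaction between a numeric and a categorical predictor lets each group have its own slope in addition to its own intercept. A small simulated sketch (the group labels, slopes, and noise level are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
g = rng.choice(['a', 'b'], size=n)                  # two groups
x = rng.uniform(0, 10, size=n)
# Group 'a': intercept 1, slope 2; group 'b': intercept 3, slope 5
y = np.where(g == 'a', 1 + 2 * x, 3 + 5 * x) + rng.normal(scale=1.0, size=n)
df = pd.DataFrame({'y': y, 'x': x, 'g': g})

# 'y ~ x * g' is shorthand for 'y ~ x + g + x:g'
fit = smf.ols('y ~ x * g', data=df).fit()
# Slope for the reference group 'a' is the x coefficient;
# the slope for group 'b' adds the interaction coefficient
slope_b = fit.params['x'] + fit.params['x:g[T.b]']
```

The fitted x coefficient should come out near 2 and slope_b near 5, recovering the two group-specific slopes.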
Example: Prediction equations

model2 = smf.ols('inc ~ educ * race', data=Income).fit()
model2.summary(slim = True)
OLS Regression Results
==============================================================================
Dep. Variable:            inc   R-squared:                0.482
Model:                    OLS   Adj. R-squared:           0.448
No. Observations:          80   F-statistic:              13.80
Covariance Type:    nonrobust   Prob (F-statistic):    1.62e-09
------------------------------------------------------------------------------
                     coef   std err        t    P>|t|    [0.025    0.975]
Intercept        -25.8688    10.498   -2.464    0.016   -46.787    -4.951
race[T.b]         19.3333    18.293    1.057    0.294   -17.116    55.782
race[T.h]          9.2640    24.280    0.382    0.704   -39.114    57.642
educ               5.2095     0.783    6.655    0.000     3.650     6.769
educ:race[T.b]    -2.4107     1.418   -1.700    0.093    -5.236     0.414
educ:race[T.h]    -1.1208     2.006   -0.559    0.578    -5.118     2.876
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example: Individual tests

From model2's coefficient table, the individual t-tests for the two interaction terms are:

educ:race[T.b]: t = -1.700, p = 0.093
educ:race[T.h]: t = -0.559, p = 0.578

Neither group-specific slope differs significantly from the reference slope at the 5% level. Note, however, that these are tests of one term at a time; whether the interaction can be dropped altogether is a joint hypothesis, which is tested below by comparing the two models.
p = sns.lmplot(x='educ', y='inc', hue='race', data=Income, ci=None)

Test for no interaction

model1.rsquared
## np.float64(0.46199055232513464)
model2.rsquared
## np.float64(0.4824821580587537)
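The R-squared increase from the main-effects model to the interaction model can be converted directly into an F-statistic: with d extra parameters and df1 residual degrees of freedom in the larger model, F = ((R1² - R0²)/d) / ((1 - R1²)/df1), which is algebraically the same as the usual residual-sum-of-squares form. Plugging in the two values above:

```python
r2_0 = 0.46199055232513464   # inc ~ educ + race
r2_1 = 0.4824821580587537    # inc ~ educ * race
d, df1 = 2, 74               # extra parameters; residual df of larger model

F = ((r2_1 - r2_0) / d) / ((1 - r2_1) / df1)
print(round(F, 5))  # 1.46505
```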

Hypothesis and test statistic
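Here H0 is the smaller model without interaction. With SSE0, SSE1 the residual sums of squares and df0, df1 the residual degrees of freedom of the two models, the test statistic is F = ((SSE0 - SSE1)/(df0 - df1)) / (SSE1/df1). Using the residual sums of squares that anova_lm reports for these two models:

```python
sse0, df0 = 18164.248072, 76   # inc ~ educ + race
sse1, df1 = 17472.411504, 74   # inc ~ educ * race

F = ((sse0 - sse1) / (df0 - df1)) / (sse1 / df1)
print(round(F, 5))  # 1.46505
```

Under H0, F follows an F-distribution with (df0 - df1, df1) degrees of freedom.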

Test for no interaction in Python

from statsmodels.stats.anova import anova_lm

anova_lm(model1, model2)
##    df_resid           ssr  df_diff     ss_diff        F   Pr(>F)
## 0      76.0  18164.248072      0.0         NaN      NaN      NaN
## 1      74.0  17472.411504      2.0  691.836568  1.46505  0.23769

Hierarchy of models

Interaction = smf.ols('inc ~ educ * race', data=Income).fit()
MainEffects = smf.ols('inc ~ educ + race', data=Income).fit()
educEff = smf.ols('inc ~ educ', data=Income).fit()
raceEff = smf.ols('inc ~ race', data=Income).fit()

Example

anova_lm(MainEffects, Interaction)
##    df_resid           ssr  df_diff     ss_diff        F   Pr(>F)
## 0      76.0  18164.248072      0.0         NaN      NaN      NaN
## 1      74.0  17472.411504      2.0  691.836568  1.46505  0.23769
anova_lm(educEff, MainEffects)
##    df_resid           ssr  df_diff      ss_diff         F    Pr(>F)
## 0      78.0  19624.832018      0.0          NaN       NaN       NaN
## 1      76.0  18164.248072      2.0  1460.583947  3.055573  0.052922
anova_lm(raceEff, MainEffects)
##    df_resid           ssr  df_diff       ss_diff         F        Pr(>F)
## 0      77.0  30409.480000      0.0           NaN       NaN           NaN
## 1      76.0  18164.248072      1.0  12245.231928  51.23458  4.422192e-10

Example

Ericksen = pd.read_csv("https://asta.math.aau.dk/datasets?file=Ericksen.txt", sep='\t')
model = smf.ols('crime ~ city * highschool + city * poverty', data=Ericksen).fit()
model.summary(slim = True)
OLS Regression Results
==============================================================================
Dep. Variable:          crime   R-squared:                0.658
Model:                    OLS   Adj. R-squared:           0.629
No. Observations:          66   F-statistic:              23.06
Covariance Type:    nonrobust   Prob (F-statistic):    7.75e-13
------------------------------------------------------------------------------
                               coef   std err        t    P>|t|    [0.025    0.975]
Intercept                   61.1456    18.125    3.373    0.001    24.889    97.402
city[T.state]               18.1526    20.413    0.889    0.377   -22.680    58.985
highschool                  -1.5711     0.606   -2.592    0.012    -2.784    -0.358
city[T.state]:highschool     0.7025     0.733    0.959    0.342    -0.763     2.168
poverty                      5.3105     1.433    3.705    0.000     2.443     8.178
city[T.state]:poverty       -5.1862     1.662   -3.121    0.003    -8.510    -1.862
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Multicollinearity and variance inflation factors