The ASTA team
The data set Income.txt is available on the course website. We read the data into Python:

```python
import pandas as pd

Income = pd.read_csv("https://asta.math.aau.dk/datasets?file=Income.txt", sep='\t')
```

- y=income: Quantitative variable, which is yearly income. This will be our response.
- x=education: Quantitative predictor, which is the number of years of education.
- z=race: Explanatory factor with levels b (black), h (hispanic) and w (white).

First, we will look at the model without interaction, i.e. the effect of education is the same for all races, which corresponds to parallel lines.
We also have to introduce dummy coding of the factor \(z\):
- \(z_1 = 1\) if race=b and zero otherwise
- \(z_2 = 1\) if race=h and zero otherwise

This determines the regression model: \[E(y|x,z) = \alpha + \beta x + \beta_1 z_1 + \beta_2 z_2\] which corresponds to parallel regression lines for each race:
- w (\(z_1=0, z_2=0\)): \(E(y|x) = \alpha + \beta x\)
- b (\(z_1=1, z_2=0\)): \(E(y|x) = \alpha + \beta_1 + \beta x\)
- h (\(z_1=0, z_2=1\)): \(E(y|x) = \alpha + \beta_2 + \beta x\)
- \(\beta_1\) is the difference in intercept between black and white.
- \(\beta_2\) is the difference in intercept between Hispanic and white.
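To make the dummy coding concrete, here is a small illustrative sketch (toy data, not the Income data set) using `pandas.get_dummies`, with w placed first so it becomes the dropped reference level:

```python
import pandas as pd

# Toy example illustrating the dummy coding:
# with reference level w, z_b indicates race=b and z_h indicates race=h.
race = pd.Categorical(['w', 'b', 'h', 'w'], categories=['w', 'b', 'h'])
dummies = pd.get_dummies(race, prefix='z', drop_first=True).astype(int)
print(dummies)
# rows: w -> (0, 0), b -> (1, 0), h -> (0, 1), w -> (0, 0)
```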
We need race to be a factor (grouping variable), and we want w as reference level for race (default is lexicographical ordering, i.e. (b, h, w), and b would then be the reference):

```python
Income['race'] = Income['race'].astype('category').cat.reorder_categories(
    ['w', 'b', 'h'],
    ordered=True
)
```

We use `+` in the model formula to only have additive effects of educ and race, i.e. a model without interaction:

```python
import statsmodels.formula.api as smf

model1 = smf.ols('inc ~ educ + race', data=Income).fit()
model1.summary(slim=True)
```

| Dep. Variable: | inc | R-squared: | 0.462 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.441 |
| No. Observations: | 80 | F-statistic: | 21.75 |
| Covariance Type: | nonrobust | Prob (F-statistic): | 2.85e-10 |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -15.6635 | 8.412 | -1.862 | 0.066 | -32.418 | 1.091 |
| race[T.b] | -10.8744 | 4.473 | -2.431 | 0.017 | -19.783 | -1.966 |
| race[T.h] | -4.9338 | 4.763 | -1.036 | 0.304 | -14.421 | 4.553 |
| educ | 4.4317 | 0.619 | 7.158 | 0.000 | 3.199 | 5.665 |
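The estimates, p-values and confidence intervals discussed below can be pulled directly from the fitted model object via `params`, `pvalues` and `conf_int()`. A self-contained sketch on synthetic stand-in data (names and numbers are illustrative, not the Income data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Income data (illustration only)
rng = np.random.default_rng(1)
n = 80
df = pd.DataFrame({
    'educ': rng.uniform(8, 20, n),
    'race': pd.Categorical(rng.choice(['w', 'b', 'h'], n),
                           categories=['w', 'b', 'h']),
})
df['inc'] = -15 + 4.4 * df['educ'] + rng.normal(0, 10, n)

fit = smf.ols('inc ~ educ + race', data=df).fit()
print(fit.params['educ'])          # slope estimate for educ
print(fit.pvalues['educ'])         # its p-value
print(fit.conf_int().loc['educ'])  # 95% confidence interval
```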
The effect of educ is estimated to \(\widehat\beta = 4.4316685\), with corresponding p-value \(4.42\times 10^{-10}\), which is significantly different from zero: there is a significant effect of educ on income.

The w-intercept is \(\widehat\alpha=-15.6635\), which isn't significantly different from zero if we test at level 5% (this test is not really of interest).

The difference between the b- and w-intercept (race[T.b]) is \(\widehat\beta_1=-10.8744\), which is significant with p-value=1.74%.

There is no significant difference between the h- and w-intercept (race[T.h], p-value=30.4%).
We can plot the data with the fitted parallel regression lines:

```python
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

coef = model1.params
educ_vals = np.linspace(Income['educ'].min(), Income['educ'].max(), 100)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='educ', y='inc', hue='race', data=Income)
for r in Income['race'].cat.categories:
    # the reference level w has no dummy coefficient, hence the default 0
    intercept = coef['Intercept'] + coef.get(f'race[T.{r}]', 0)
    slope = coef['educ']
    plt.plot(educ_vals, intercept + slope * educ_vals, label=f'{r} line')
plt.xlabel("Education")
plt.ylabel("Income")
plt.legend()
plt.show()
```

Next we allow the effect of education to depend on race, i.e. the lines may have different slopes. The model with interaction is \[E(y|x,z) = \alpha + \beta x + \beta_1 z_1 + \beta_2 z_2 + \beta_3 z_1 x + \beta_4 z_2 x\] which corresponds to:

- w (\(z_1=0, z_2=0\)): \(E(y|x) = \alpha + \beta x\)
- b (\(z_1=1, z_2=0\)): \(E(y|x) = \alpha + \beta_1 + (\beta + \beta_3) x\)
- h (\(z_1=0, z_2=1\)): \(E(y|x) = \alpha + \beta_2 + (\beta + \beta_4) x\)

As before, \(\beta_1\) is the difference in intercept between black and white, while \(\beta_3\) is the difference in slope between black and white.

Likewise, \(\beta_2\) is the difference in intercept between Hispanic and white, while \(\beta_4\) is the difference in slope between Hispanic and white.

We use `*` in the model formula to include interaction between educ and race:

| Dep. Variable: | inc | R-squared: | 0.482 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.448 |
| No. Observations: | 80 | F-statistic: | 13.80 |
| Covariance Type: | nonrobust | Prob (F-statistic): | 1.62e-09 |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -25.8688 | 10.498 | -2.464 | 0.016 | -46.787 | -4.951 |
| race[T.b] | 19.3333 | 18.293 | 1.057 | 0.294 | -17.116 | 55.782 |
| race[T.h] | 9.2640 | 24.280 | 0.382 | 0.704 | -39.114 | 57.642 |
| educ | 5.2095 | 0.783 | 6.655 | 0.000 | 3.650 | 6.769 |
| educ:race[T.b] | -2.4107 | 1.418 | -1.700 | 0.093 | -5.236 | 0.414 |
| educ:race[T.h] | -1.1208 | 2.006 | -0.559 | 0.578 | -5.118 | 2.876 |
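The summary above presumably comes from a fit of the form `model2 = smf.ols('inc ~ educ * race', data=Income).fit()` (the name model2 matches the comparisons below). A self-contained sketch on synthetic stand-in data, showing how `educ * race` expands into main effects plus interaction terms:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# With the real data the call would be:
#   model2 = smf.ols('inc ~ educ * race', data=Income).fit()
# Here synthetic stand-in data is used for illustration.
rng = np.random.default_rng(2)
n = 80
df = pd.DataFrame({
    'educ': rng.uniform(8, 20, n),
    'race': pd.Categorical(rng.choice(['w', 'b', 'h'], n),
                           categories=['w', 'b', 'h']),
})
df['inc'] = -15 + 4.4 * df['educ'] + rng.normal(0, 10, n)

model2 = smf.ols('inc ~ educ * race', data=df).fit()
# the design has an intercept, two race dummies, educ, and two interactions
print(sorted(model2.params.index))
# ['Intercept', 'educ', 'educ:race[T.b]', 'educ:race[T.h]',
#  'race[T.b]', 'race[T.h]']
```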
The difference in slope between b and w (educ:race[T.b]) is estimated to \(\widehat\beta_3=-2.4107\). With p-value=9.33% there is no significant difference. The same holds for the difference in slope between h and w. In other words, there is probably no interaction between educ and race.

The \(R^2\) values of the two models (from `model1.rsquared` and `model2.rsquared`):

```
## np.float64(0.46199055232513464)
## np.float64(0.4824821580587537)
```
Is model2 significantly better than model1? I.e. is \(R^2\) significantly higher for model2?

model1 is obtained from the more complicated model2 by setting \(\beta_3=0\) and \(\beta_4=0\), so the null hypothesis "the simpler additive model describes data sufficiently well compared to the complicated interaction model" is really the simple mathematical hypothesis: \[ H_0: \beta_3=0,\ \beta_4=0.\]

This is tested with `anova_lm`:

```
##    df_resid           ssr  df_diff     ss_diff        F   Pr(>F)
## 0      76.0  18164.248072      0.0         NaN      NaN      NaN
## 1      74.0  17472.411504      2.0  691.836568  1.46505  0.23769
```
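The table above is produced by `anova_lm` from `statsmodels.stats.anova`. A self-contained sketch on synthetic stand-in data (with the Income data one would pass model1 and model2 from before):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in data (illustration only)
rng = np.random.default_rng(3)
n = 80
df = pd.DataFrame({
    'educ': rng.uniform(8, 20, n),
    'race': pd.Categorical(rng.choice(['w', 'b', 'h'], n),
                           categories=['w', 'b', 'h']),
})
df['inc'] = -15 + 4.4 * df['educ'] + rng.normal(0, 10, n)

m1 = smf.ols('inc ~ educ + race', data=df).fit()   # additive model
m2 = smf.ols('inc ~ educ * race', data=df).fit()   # interaction model
# F-test of H0: the extra interaction coefficients are all zero
print(anova_lm(m1, m2))
```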
The interaction educ:race has F-value=\(1.465\), which in no way is significant, with p-value=\(23.77\%\).

We now consider a hierarchy of models:

- Interaction: The most general model with main effects educ and race and interaction educ:race.
- MainEffects: The model where there are additive effects of educ and race.
- educEff: Model where there only is an effect of educ (simple linear regression).
- raceEff: Model where there only is an effect of race (a different mean for each group – more on this in the ANOVA lecture).

Comparing MainEffects and Interaction is what we have already done:

```
##    df_resid           ssr  df_diff     ss_diff        F   Pr(>F)
## 0      76.0  18164.248072      0.0         NaN      NaN      NaN
## 1      74.0  17472.411504      2.0  691.836568  1.46505  0.23769
```
We recognize \(F=1.465\) with p-value=\(23.77\%\), i.e. model2 isn’t significantly better than model1. So no educ:race interaction.
In the same manner we can compare educEff and MainEffects. I.e. we investigate whether the effect of race can be left out.
```
##    df_resid           ssr  df_diff      ss_diff         F    Pr(>F)
## 0      78.0  19624.832018      0.0          NaN       NaN       NaN
## 1      76.0  18164.248072      2.0  1460.583947  3.055573  0.052922
```
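The F statistic in this comparison can be reproduced by hand from the residual sums of squares and degrees of freedom in the table above:

```python
# Comparing educEff (restricted) with MainEffects (full),
# using the numbers from the anova table above:
ssr0, df0 = 19624.832018, 78   # educEff
ssr1, df1 = 18164.248072, 76   # MainEffects
F = ((ssr0 - ssr1) / (df0 - df1)) / (ssr1 / df1)
print(round(F, 6))  # 3.055573, matching the table
```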
The effect of race, if any, is weak, with p-value=\(5.292\%\).
Finally, we compare raceEff and MainEffects. Clearly educ cannot be left out (P-value=\(4.422\times 10^{-10}\)).
```
##    df_resid           ssr  df_diff       ss_diff         F        Pr(>F)
## 0      77.0  30409.480000      0.0           NaN       NaN           NaN
## 1      76.0  18164.248072      1.0  12245.231928  51.23458  4.422192e-10
```
As another example, we consider the data set Ericksen, where we study the response crime:

```python
Ericksen = pd.read_csv("https://asta.math.aau.dk/datasets?file=Ericksen.txt", sep='\t')
model = smf.ols('crime ~ city * highschool + city * poverty', data=Ericksen).fit()
```

- crime: Quantitative variable (the response)
- city: Factor with levels city and state
- highschool: Quantitative variable
- poverty: Quantitative variable

The model has interaction between city and highschool, and between city and poverty.

| Dep. Variable: | crime | R-squared: | 0.658 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.629 |
| No. Observations: | 66 | F-statistic: | 23.06 |
| Covariance Type: | nonrobust | Prob (F-statistic): | 7.75e-13 |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 61.1456 | 18.125 | 3.373 | 0.001 | 24.889 | 97.402 |
| city[T.state] | 18.1526 | 20.413 | 0.889 | 0.377 | -22.680 | 58.985 |
| highschool | -1.5711 | 0.606 | -2.592 | 0.012 | -2.784 | -0.358 |
| city[T.state]:highschool | 0.7025 | 0.733 | 0.959 | 0.342 | -0.763 | 2.168 |
| poverty | 5.3105 | 1.433 | 3.705 | 0.000 | 2.443 | 8.178 |
| city[T.state]:poverty | -5.1862 | 1.662 | -3.121 | 0.003 | -8.510 | -1.862 |
There is no significant interaction between city and highschool (p-value=34.2%), i.e. the effect of highschool on crime is the same in the metropolitan areas (city=city) and the non-metropolitan areas (city=state).

There is significant interaction between city and poverty (p-value=0.3%), i.e. the effect of poverty on crime is different in metropolitan and non-metropolitan areas.

When city=state, the effect of poverty (on crime) is smaller than in the major cities.
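A small computation with the quoted coefficients makes the last point concrete: the slope of poverty when city=state is the main effect plus the interaction term:

```python
# Coefficients taken from the summary table above
poverty_city = 5.3105                       # slope of poverty when city=city
interaction = -5.1862                       # city[T.state]:poverty
poverty_state = poverty_city + interaction  # slope of poverty when city=state
print(round(poverty_state, 4))  # 0.1243
```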