The ASTA team
We consider the dataset `Income.txt` from the course website. We read in the data in Python using pandas:

import pandas as pd
Income = pd.read_csv("https://asta.math.aau.dk/datasets?file=Income.txt", sep='\t')
The variables are:

- \(y\) = `income`: quantitative variable, which is yearly income. This will be our response.
- \(x\) = `education`: quantitative predictor, which is the number of years of education.
- \(z\) = `race`: explanatory factor with levels `b` (black), `h` (hispanic) and `w` (white).

First, we will look at the model without interaction, i.e. the effect of education is the same for all races, which corresponds to parallel lines.
We also have to introduce dummy coding of the factor \(z\):

- \(z_1 = 1\) if `race=b` and zero otherwise
- \(z_2 = 1\) if `race=h` and zero otherwise

This determines the regression model: \[E(y|x,z) = \alpha + \beta x + \beta_1 z_1 + \beta_2 z_2,\] which corresponds to parallel regression lines for each race:
- `w` (\(z_1=0, z_2=0\)): \(E(y|x) = \alpha + \beta x\)
- `b` (\(z_1=1, z_2=0\)): \(E(y|x) = \alpha + \beta_1 + \beta x\)
- `h` (\(z_1=0, z_2=1\)): \(E(y|x) = \alpha + \beta_2 + \beta x\)
- \(\beta_1\) is the difference in `Intercept` between black and white.
- \(\beta_2\) is the difference in `Intercept` between Hispanic and white.
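The dummy coding can be illustrated on a toy factor (a sketch with made-up values; in the model fit below, statsmodels builds these indicator columns automatically):

```python
import pandas as pd

# Toy version of the race factor (hypothetical values, for illustration)
race = pd.Series(['w', 'b', 'h', 'w'], dtype='category')

# One indicator column per level; dropping 'w' makes it the reference level,
# leaving the dummies z1 (race=b) and z2 (race=h)
dummies = pd.get_dummies(race).astype(int).drop(columns='w')
print(dummies)
```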
We need `race` to be a factor (grouping variable) and we want `w` as the reference level for `race` (the default is lexicographical ordering, i.e. `(b, h, w)`, and `b` would then be the reference):

Income['race'] = Income['race'].astype('category').cat.reorder_categories(
    ['w', 'b', 'h'],
    ordered=True
)
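A small illustration of the reordering on a toy series (not the dataset itself):

```python
import pandas as pd

race = pd.Series(['b', 'h', 'w'], dtype='category')
# Default category order is lexicographic, so 'b' would be the reference level
print(list(race.cat.categories))

race = race.cat.reorder_categories(['w', 'b', 'h'], ordered=True)
# Now 'w' comes first and acts as the reference level in model formulas
print(list(race.cat.categories))
```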
We use `+` in the model formula to only have additive effects of `educ` and `race`, i.e. a model without interaction:

import statsmodels.formula.api as smf
model1 = smf.ols('inc ~ educ + race', data=Income).fit()
model1.summary(slim=True)
Dep. Variable: | inc | R-squared: | 0.462 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.441 |
No. Observations: | 80 | F-statistic: | 21.75 |
Covariance Type: | nonrobust | Prob (F-statistic): | 2.85e-10 |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -15.6635 | 8.412 | -1.862 | 0.066 | -32.418 | 1.091 |
race[T.b] | -10.8744 | 4.473 | -2.431 | 0.017 | -19.783 | -1.966 |
race[T.h] | -4.9338 | 4.763 | -1.036 | 0.304 | -14.421 | 4.553 |
educ | 4.4317 | 0.619 | 7.158 | 0.000 | 3.199 | 5.665 |
From the summary:

- The effect of `educ` is estimated to \(\widehat\beta = 4.4317\), with corresponding p-value \(4.42\times 10^{-10}\), which is significantly different from zero: there is a significant effect of `educ` on `income`.
- The `w`-intercept is \(\widehat\alpha = -15.6635\), which isn't significantly different from zero if we test at level 5% (this test is not really of interest).
- The difference between the `b`- and `w`-intercepts (`race[T.b]`) is \(\widehat\beta_1 = -10.8744\), which is significant with p-value 1.74%.
- There is no significant difference between the `h`- and `w`-intercepts (`race[T.h]`, p-value 30.4%).
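The intercepts of the three parallel fitted lines follow by plain arithmetic on the estimates printed in the table above:

```python
alpha = -15.6635   # Intercept (w is the reference level)
b1 = -10.8744      # race[T.b]: difference in intercept, b vs w
b2 = -4.9338       # race[T.h]: difference in intercept, h vs w
beta = 4.4317      # common slope of educ

# Intercepts of the w, b and h lines
print(round(alpha, 4), round(alpha + b1, 4), round(alpha + b2, 4))
```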
We can plot the data together with the three parallel fitted lines:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

coef = model1.params
educ_vals = np.linspace(Income['educ'].min(), Income['educ'].max(), 100)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='educ', y='inc', hue='race', data=Income)
for r in Income['race'].cat.categories:
    # 'w' is the reference level, so its dummy coefficient is absent (defaults to 0)
    intercept = coef['Intercept'] + coef.get(f'race[T.{r}]', 0)
    slope = coef['educ']  # common slope: no interaction in model1
    plt.plot(educ_vals, intercept + slope * educ_vals, label=f'{r} line')
plt.xlabel("Education")
plt.ylabel("Income")
plt.legend()
plt.show()
We now allow the effect of education to depend on race, i.e. we include interaction between \(x\) and the dummies: \[E(y|x,z) = \alpha + \beta x + \beta_1 z_1 + \beta_2 z_2 + \beta_3 z_1 x + \beta_4 z_2 x.\] This corresponds to a separate regression line for each race:

- `w` (\(z_1=0, z_2=0\)): \(E(y|x) = \alpha + \beta x\)
- `b` (\(z_1=1, z_2=0\)): \(E(y|x) = \alpha + \beta_1 + (\beta + \beta_3) x\)
- `h` (\(z_1=0, z_2=1\)): \(E(y|x) = \alpha + \beta_2 + (\beta + \beta_4) x\)

Interpretation of the parameters:

- \(\beta_1\) is the difference in `Intercept` between black and white, while \(\beta_3\) is the difference in slope between black and white.
- \(\beta_2\) is the difference in `Intercept` between Hispanic and white, while \(\beta_4\) is the difference in slope between Hispanic and white.

Using `*` in the model formula we include interaction between `educ` and `race`:

model2 = smf.ols('inc ~ educ * race', data=Income).fit()
model2.summary(slim=True)

Dep. Variable: | inc | R-squared: | 0.482 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.448 |
No. Observations: | 80 | F-statistic: | 13.80 |
Covariance Type: | nonrobust | Prob (F-statistic): | 1.62e-09 |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -25.8688 | 10.498 | -2.464 | 0.016 | -46.787 | -4.951 |
race[T.b] | 19.3333 | 18.293 | 1.057 | 0.294 | -17.116 | 55.782 |
race[T.h] | 9.2640 | 24.280 | 0.382 | 0.704 | -39.114 | 57.642 |
educ | 5.2095 | 0.783 | 6.655 | 0.000 | 3.650 | 6.769 |
educ:race[T.b] | -2.4107 | 1.418 | -1.700 | 0.093 | -5.236 | 0.414 |
educ:race[T.h] | -1.1208 | 2.006 | -0.559 | 0.578 | -5.118 | 2.876 |
From the summary:

- The difference in slope between `b` and `w` (`educ:race[T.b]`) is estimated to \(\widehat\beta_3 = -2.4107\). With p-value 9.33% there is no significant difference.
- Similarly, there is no significant difference in slope between `h` and `w` (`educ:race[T.h]`, p-value 57.8%).

In other words, there is probably no interaction between `educ` and `race`.

The \(R^2\) values of `model1` and `model2`:

## np.float64(0.46199055232513464)
## np.float64(0.4824821580587537)
Is `model2` significantly better than `model1`? I.e. is \(R^2\) significantly higher for `model2`?

`model1` is obtained from the more complicated `model2` by setting \(\beta_3=0\) and \(\beta_4=0\), so the null hypothesis "the simpler additive model describes data sufficiently well compared to the complicated interaction model" is really the simple mathematical hypothesis: \[H_0: \beta_3=0,\ \beta_4=0.\]

This can be tested with `anova_lm`:

from statsmodels.stats.anova import anova_lm
anova_lm(model1, model2)

## df_resid ssr df_diff ss_diff F Pr(>F)
## 0 76.0 18164.248072 0.0 NaN NaN NaN
## 1 74.0 17472.411504 2.0 691.836568 1.46505 0.23769
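As a check, the F statistic and p-value in the table can be recomputed from the two residual sums of squares (numbers copied from the output above; scipy supplies the F distribution):

```python
from scipy.stats import f

ssr0, df0 = 18164.248072, 76   # additive model (model1): residual SS and df
ssr1, df1 = 17472.411504, 74   # interaction model (model2): residual SS and df

# F = (drop in SSR per dropped parameter) / (residual variance of the full model)
F = ((ssr0 - ssr1) / (df0 - df1)) / (ssr1 / df1)
p = f.sf(F, df0 - df1, df1)
print(round(F, 5), round(p, 5))
```

This reproduces \(F=1.46505\) and \(\Pr(>F)=0.23769\) from the table.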
The interaction `educ:race` has F-value \(1.465\), which is in no way significant, with p-value \(23.77\%\).

We now consider a hierarchy of models:

- `Interaction`: the most general model, with main effects of `educ` and `race` and the interaction `educ:race`.
- `MainEffects`: the model with only additive effects of `educ` and `race`.
- `educEff`: the model with only an effect of `educ` (simple linear regression).
- `raceEff`: the model with only an effect of `race` (a different mean for each group; more on this in the ANOVA lecture).
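The four models can be fitted with `smf.ols`; here is a sketch on a small synthetic data frame standing in for the real data (with the dataset loaded as above, replace `toy` by `Income`):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Income data (made-up values, for illustration only)
rng = np.random.default_rng(1)
n = 60
toy = pd.DataFrame({
    'educ': rng.integers(8, 21, n),
    'race': pd.Categorical(rng.choice(['w', 'b', 'h'], size=n),
                           categories=['w', 'b', 'h']),
})
toy['inc'] = 5 + 4 * toy['educ'] + rng.normal(0, 10, size=n)

Interaction = smf.ols('inc ~ educ * race', data=toy).fit()   # main effects + educ:race
MainEffects = smf.ols('inc ~ educ + race', data=toy).fit()   # additive effects only
educEff = smf.ols('inc ~ educ', data=toy).fit()              # simple linear regression
raceEff = smf.ols('inc ~ race', data=toy).fit()              # one mean per race group

# Number of parameters in each model: 6, 4, 2, 3
print([len(m.params) for m in (Interaction, MainEffects, educEff, raceEff)])
```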
Comparing `MainEffects` and `Interaction` is what we have already done:

## df_resid ssr df_diff ss_diff F Pr(>F)
## 0 76.0 18164.248072 0.0 NaN NaN NaN
## 1 74.0 17472.411504 2.0 691.836568 1.46505 0.23769
We recognize \(F=1.465\) with p-value \(23.77\%\): `model2` isn't significantly better than `model1`, so there is no significant `educ:race` interaction.
In the same manner we can compare `educEff` and `MainEffects`, i.e. we investigate whether the effect of `race` can be left out:
## df_resid ssr df_diff ss_diff F Pr(>F)
## 0 78.0 19624.832018 0.0 NaN NaN NaN
## 1 76.0 18164.248072 2.0 1460.583947 3.055573 0.052922
If any, the effect of `race` is weak, with p-value \(5.29\%\).
Finally, we compare `raceEff` and `MainEffects`. Clearly `educ` cannot be left out (p-value \(4.422\times 10^{-10}\)):
## df_resid ssr df_diff ss_diff F Pr(>F)
## 0 77.0 30409.480000 0.0 NaN NaN NaN
## 1 76.0 18164.248072 1.0 12245.231928 51.23458 4.422192e-10
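Since this last comparison removes a single parameter (`educ`), the F statistic is simply the square of the `educ` t statistic from the additive model's summary; quick arithmetic on the printed numbers:

```python
t_educ = 7.158   # t statistic of educ in the additive model summary
F = 51.23458     # F statistic from the model comparison above

# t^2 equals F for a one-parameter comparison (up to rounding of the printouts)
print(round(t_educ ** 2, 2))
```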
We consider the dataset `Ericksen`, where we study the response `crime`:

Ericksen = pd.read_csv("https://asta.math.aau.dk/datasets?file=Ericksen.txt", sep='\t')
model = smf.ols('crime ~ city * highschool + city * poverty', data=Ericksen).fit()

The variables are:

- `crime`: quantitative variable
- `city`: factor with levels `city` and `state`
- `highschool`: quantitative variable
- `poverty`: quantitative variable

The model has interaction between `city` and `highschool`, and between `city` and `poverty`:

Dep. Variable: | crime | R-squared: | 0.658 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.629 |
No. Observations: | 66 | F-statistic: | 23.06 |
Covariance Type: | nonrobust | Prob (F-statistic): | 7.75e-13 |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 61.1456 | 18.125 | 3.373 | 0.001 | 24.889 | 97.402 |
city[T.state] | 18.1526 | 20.413 | 0.889 | 0.377 | -22.680 | 58.985 |
highschool | -1.5711 | 0.606 | -2.592 | 0.012 | -2.784 | -0.358 |
city[T.state]:highschool | 0.7025 | 0.733 | 0.959 | 0.342 | -0.763 | 2.168 |
poverty | 5.3105 | 1.433 | 3.705 | 0.000 | 2.443 | 8.178 |
city[T.state]:poverty | -5.1862 | 1.662 | -3.121 | 0.003 | -8.510 | -1.862 |
From the summary we see:

- There is no significant interaction between `city` and `highschool` (p-value 34.2%): the effect of `highschool` on crime is the same in the metropolitan areas (`city=city`) and the non-metropolitan areas (`city=state`).
- There is significant interaction between `city` and `poverty` (p-value 0.3%): the effect of `poverty` on crime is different in metropolitan and non-metropolitan areas. When `city=state`, the effect of `poverty` (on crime) is smaller than in the major cities.
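The `poverty` slope in each group follows directly from the coefficient table, by plain arithmetic on the printed estimates:

```python
poverty_city = 5.3105               # slope of poverty when city == 'city' (reference)
poverty_state = 5.3105 + (-5.1862)  # add the interaction term city[T.state]:poverty
print(round(poverty_city, 4), round(poverty_state, 4))
```

So in the non-metropolitan areas the poverty slope is close to zero.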