The ASTA team
Income.txt
on the course website. We read in data in RStudioIncome <- read.delim("https://asta.math.aau.dk/datasets?file=Income.txt")
y=income
: Quantitative variable, which is yearly income. This will be our response.x=education
: Quantitative predictor, which is the number of years of education.z=race
: Explanatory factor with levels b
(black), h
(hispanic) and w
(white).gf_point
for plotting points and gf_lm
for adding a regression line).library(mosaic)
gf_point(inc ~ educ, color = ~race, data = Income) %>% gf_lm()
First, we will look at the model without interaction, i.e. the effect of education
is the same for all races, which corresponds to parallel lines.
race=b
and zero otherwiserace=h
and zero otherwiseThis determines the regression model: \[E(y|x,z) = \alpha + \beta x + \beta_1 z_1 + \beta_2 z_2\] which corresponds to parallel regressions lines for each race.
w
: (\(z_1=0,z_2=0\)) \(E(y|x) = \alpha + \beta x\)
b
: (\(z_1=1,z_2=0\)) \(E(y|x) = \alpha + \beta_1 + \beta x\).
h
: (\(z_1=0,z_2=1\)) \(E(y|x) = \alpha + \beta_2 + \beta x\).
\(\beta_1\) is the difference in Intercept
between black and white.
\(\beta_2\) is the difference in Intercept
between Hispanic and white.
R
that we want w
as reference for race (default is lexicographical ordering, i.e. (b, h, w)
and b
would then be the reference):Income$race <- relevel(Income$race, "w")
+
in the model formula to only have additive effects of educ
and race
, i.e. a model without interaction:model1 <- lm(inc ~ educ + race, data = Income)
summary(model1)
##
## Call:
## lm(formula = inc ~ educ + race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.664 -9.622 -1.642 6.552 57.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6635 8.4121 -1.862 0.0665 .
## educ 4.4317 0.6191 7.158 4.42e-10 ***
## raceb -10.8744 4.4730 -2.431 0.0174 *
## raceh -4.9338 4.7632 -1.036 0.3036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 76 degrees of freedom
## Multiple R-squared: 0.462, Adjusted R-squared: 0.4408
## F-statistic: 21.75 on 3 and 76 DF, p-value: 2.853e-10
educ
is estimated to be \(\widehat\beta = 4.4316685\), with corresponding p-value=\(4.42\times 10^{-10}\) which is significantly different from zero.educ
on income
.w
-intercept is \(\widehat\alpha=-15.6635\), which isn’t significantly different from zero if we test at level 5% (this test is not really of interest).b
- and w
-intercept (raceb
) is \(\widehat\beta_1=-10.8744\), which is significant with p-value=1.74%.h
- and w
-intercept.plotModel(model1)
summary(model1)
##
## Call:
## lm(formula = inc ~ educ + race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.664 -9.622 -1.642 6.552 57.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6635 8.4121 -1.862 0.0665 .
## educ 4.4317 0.6191 7.158 4.42e-10 ***
## raceb -10.8744 4.4730 -2.431 0.0174 *
## raceh -4.9338 4.7632 -1.036 0.3036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 76 degrees of freedom
## Multiple R-squared: 0.462, Adjusted R-squared: 0.4408
## F-statistic: 21.75 on 3 and 76 DF, p-value: 2.853e-10
w
(\(z_1=0, z_2=0\)): \(E(y|x) = \alpha + \beta x\)b
(\(z_1=1, z_2=0\)): \(E(y|x) = \alpha + \beta_1 + (\beta + \beta_3) x\).h
(\(z_1=0,z_2=1\)): \(E(y|x) = \alpha + \beta_2 + (\beta + \beta_4) x\).Intercept
between black and white, while \(\beta_3\) is the difference in slope
between black and white.Intercept
between Hispanic and white, while \(\beta_4\) is the difference in slope
between Hispanic and white.*
in the model formula we include interaction between educ
and race
:model2 <- lm(inc ~ educ * race, data = Income)
summary(model2)
##
## Call:
## lm(formula = inc ~ educ * race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.064 -9.448 -1.453 6.167 56.936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.8688 10.4982 -2.464 0.0161 *
## educ 5.2095 0.7828 6.655 4.3e-09 ***
## raceb 19.3333 18.2928 1.057 0.2940
## raceh 9.2640 24.2797 0.382 0.7039
## educ:raceb -2.4107 1.4177 -1.700 0.0933 .
## educ:raceh -1.1208 2.0060 -0.559 0.5781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.37 on 74 degrees of freedom
## Multiple R-squared: 0.4825, Adjusted R-squared: 0.4475
## F-statistic: 13.8 on 5 and 74 DF, p-value: 1.618e-09
summary(model2)
##
## Call:
## lm(formula = inc ~ educ * race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.064 -9.448 -1.453 6.167 56.936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.8688 10.4982 -2.464 0.0161 *
## educ 5.2095 0.7828 6.655 4.3e-09 ***
## raceb 19.3333 18.2928 1.057 0.2940
## raceh 9.2640 24.2797 0.382 0.7039
## educ:raceb -2.4107 1.4177 -1.700 0.0933 .
## educ:raceh -1.1208 2.0060 -0.559 0.5781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.37 on 74 degrees of freedom
## Multiple R-squared: 0.4825, Adjusted R-squared: 0.4475
## F-statistic: 13.8 on 5 and 74 DF, p-value: 1.618e-09
b
and w
(educ:raceb
) is estimated to \(\widehat\beta_3=-2.4107\). With p-value=9.33% there is no significant difference.h
and w
. In other words there is probably not interaction between educ
and race
.gf_point(inc ~ educ, color = ~ race, data = Income) %>% gf_lm()
plotModel(model1)
plotModel(model2)
summary(model1)$r.squared
## [1] 0.4619906
summary(model2)$r.squared
## [1] 0.4824822
model2
significantly better than model1
? I.e. is \(R^2\) significantly higher for model2
?model1
is obtained from the more complicated model2
by setting \(\beta_3=0\) and \(\beta_4=0\), so the null hypothesis “the simpler additive model describes data sufficiently well compared to the complicated interaction model” is really the simple mathematical hypothesis: \[ H_0: \beta_3=0, \beta_4=0.\]anova
:anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: inc ~ educ + race
## Model 2: inc ~ educ * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 18164
## 2 74 17472 2 691.84 1.465 0.2377
educ:race
has F-value=\(1.465\), which in no way is significant with p-value=\(23.77\%\).Interaction:
The most general model with main effects educ
and race
and interaction educ:race
:Interaction <- lm(inc ~ educ * race, data = Income)
MainEffects:
The model where there are additive effects of educ
and race
.MainEffects <- lm(inc ~ educ + race, data = Income)
educEff:
Model where there only is an effect of educ
(simple lin. reg.).educEff <- lm(inc ~ educ, data = Income)
raceEff:
Model where there only is an effect of race
(a different mean for each group – more on this in the ANOVA lecture).raceEff <- lm(inc ~ race, data = Income)
MainEffects
and Interaction
is what we have already done.anova(MainEffects, Interaction)
## Analysis of Variance Table
##
## Model 1: inc ~ educ + race
## Model 2: inc ~ educ * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 18164
## 2 74 17472 2 691.84 1.465 0.2377
We recognize \(F=1.465\) with p-value=\(23.77\%\), i.e. model2
isn’t significantly better than model1
. So no educ:race
interaction.
In the same manner we can compare educEff
and MainEffects
. I.e. we investigate whether the effect of race
can be left out.
anova(educEff, MainEffects)
## Analysis of Variance Table
##
## Model 1: inc ~ educ
## Model 2: inc ~ educ + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 78 19625
## 2 76 18164 2 1460.6 3.0556 0.05292 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If any, the effect of race
is weak with p-value=\(5.292\%\).
Finally, we compare raceEff
and MainEffects
. Clearly educ
cannot be left out (P-value=\(4.422\times 10^{-10}\)).
anova(raceEff, MainEffects)
## Analysis of Variance Table
##
## Model 1: inc ~ race
## Model 2: inc ~ educ + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 30410
## 2 76 18164 1 12245 51.235 4.422e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ericksen
, where we study the response crime
:Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
model <- lm(crime ~ city * highschool + city * poverty, data = Ericksen)
crime
: Quantitative variablecity
: city
or state
highschool
: Quantitative variablepoverty
: Quantitative variablecity
and highschool
city
and poverty
.summary(model)
##
## Call:
## lm(formula = crime ~ city * highschool + city * poverty, data = Ericksen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.741 -8.745 -1.557 7.820 47.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.1456 18.1254 3.373 0.001305 **
## citystate 18.1526 20.4131 0.889 0.377413
## highschool -1.5711 0.6062 -2.592 0.011979 *
## poverty 5.3105 1.4333 3.705 0.000463 ***
## citystate:highschool 0.7025 0.7327 0.959 0.341523
## citystate:poverty -5.1862 1.6619 -3.121 0.002773 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.16 on 60 degrees of freedom
## Multiple R-squared: 0.6577, Adjusted R-squared: 0.6292
## F-statistic: 23.06 on 5 and 60 DF, p-value: 7.748e-13
city
and highschool
.highschool
on crime is the same in metropolitan areas (city=city
) and the non-metropolitan areas (city=state
).city
and poverty
.poverty
on crime is different in metropolitan and non-metropolitan areas.city=state
, the effect of poverty
(on crime) is smaller than in the major cities.