The ASTA team

The dataset Income.txt is available on the course website. We read the data into RStudio:

Income <- read.delim("https://asta.math.aau.dk/datasets?file=Income.txt")

- y = income: Quantitative variable, which is yearly income. This will be our response.
- x = education: Quantitative predictor, which is the number of years of education.
- z = race: Explanatory factor with levels b (black), h (hispanic) and w (white).

We plot the data (using gf_point for plotting points and gf_lm for adding a regression line):

library(mosaic)
gf_point(inc ~ educ, color = ~race, data = Income) %>% gf_lm()
First, we will look at the model without interaction, i.e. the effect of education is the same for all races, which corresponds to parallel lines.

We define two dummy variables:

- z1 = 1 if race = b and zero otherwise
- z2 = 1 if race = h and zero otherwise

This determines the regression model E(y|x,z) = α + βx + β1z1 + β2z2, which corresponds to parallel regression lines for each race:

- w: (z1=0, z2=0) E(y|x) = α + βx
- b: (z1=1, z2=0) E(y|x) = α + β1 + βx
- h: (z1=0, z2=1) E(y|x) = α + β2 + βx

Here β1 is the difference in intercept between black and white, and β2 is the difference in intercept between Hispanic and white.
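As a small sketch (with made-up coefficient values, not the estimates fitted below), the dummy coding can be written out directly to see why the lines are parallel:

```r
# Sketch of the dummy coding; alpha, beta, beta1, beta2 are illustrative values,
# not the fitted estimates.
alpha <- -15; beta <- 4.4; beta1 <- -10.9; beta2 <- -4.9

mean_income <- function(x, race) {
  z1 <- as.numeric(race == "b")  # dummy for black
  z2 <- as.numeric(race == "h")  # dummy for hispanic
  alpha + beta * x + beta1 * z1 + beta2 * z2
}

# One extra year of education adds beta for every race: the lines are parallel
mean_income(13, "w") - mean_income(12, "w")
mean_income(13, "b") - mean_income(12, "b")
```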
We have to tell R that race is a factor (grouping variable), and we want w as reference level for race (the default is lexicographical ordering, i.e. (b, h, w), and b would then be the reference):

Income$race <- factor(Income$race)
Income$race <- relevel(Income$race, "w")

We use + in the model formula to only have additive effects of educ and race, i.e. a model without interaction:

model1 <- lm(inc ~ educ + race, data = Income)
summary(model1)
##
## Call:
## lm(formula = inc ~ educ + race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.664 -9.622 -1.642 6.552 57.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6635 8.4121 -1.862 0.0665 .
## educ 4.4317 0.6191 7.158 4.42e-10 ***
## raceb -10.8744 4.4730 -2.431 0.0174 *
## raceh -4.9338 4.7632 -1.036 0.3036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 76 degrees of freedom
## Multiple R-squared: 0.462, Adjusted R-squared: 0.4408
## F-statistic: 21.75 on 3 and 76 DF, p-value: 2.853e-10
- The effect of educ is estimated to be ˆβ = 4.4317, with corresponding p-value = 4.42×10−10, which is significantly different from zero. I.e. there is a significant effect of educ on income.
- The w-intercept is ˆα = −15.6635, which isn't significantly different from zero if we test at level 5% (this test is not really of interest).
- The difference between the b- and w-intercepts (raceb) is ˆβ1 = −10.8744, which is significant with p-value = 1.74%.
- There is no significant difference between the h- and w-intercepts.

We can plot the fitted model:

plotModel(model1)
Model with interaction: We extend the model so the slope may also depend on race:

E(y|x,z) = α + βx + β1z1 + β2z2 + β3z1x + β4z2x

This corresponds to a separate line for each race:

- w: (z1=0, z2=0) E(y|x) = α + βx
- b: (z1=1, z2=0) E(y|x) = α + β1 + (β + β3)x
- h: (z1=0, z2=1) E(y|x) = α + β2 + (β + β4)x

Here β1 is the difference in intercept between black and white, while β3 is the difference in slope between black and white. Similarly, β2 is the difference in intercept between Hispanic and white, while β4 is the difference in slope between Hispanic and white.

Using * in the model formula we include interaction between educ and race:

model2 <- lm(inc ~ educ * race, data = Income)
summary(model2)
##
## Call:
## lm(formula = inc ~ educ * race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.064 -9.448 -1.453 6.167 56.936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.8688 10.4982 -2.464 0.0161 *
## educ 5.2095 0.7828 6.655 4.3e-09 ***
## raceb 19.3333 18.2928 1.057 0.2940
## raceh 9.2640 24.2797 0.382 0.7039
## educ:raceb -2.4107 1.4177 -1.700 0.0933 .
## educ:raceh -1.1208 2.0060 -0.559 0.5781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.37 on 74 degrees of freedom
## Multiple R-squared: 0.4825, Adjusted R-squared: 0.4475
## F-statistic: 13.8 on 5 and 74 DF, p-value: 1.618e-09
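From the estimates above, the race-specific slopes are obtained by adding the interaction estimates to the reference slope (a sketch with the rounded coefficients from the output):

```r
# Estimates copied (rounded) from summary(model2) above
beta_hat  <- 5.2095    # slope of educ for the reference level w
beta3_hat <- -2.4107   # educ:raceb, slope difference b vs w
beta4_hat <- -1.1208   # educ:raceh, slope difference h vs w

slopes <- c(w = beta_hat, b = beta_hat + beta3_hat, h = beta_hat + beta4_hat)
round(slopes, 2)  # w 5.21, b 2.80, h 4.09
```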
- The difference in slope between b and w (educ:raceb) is estimated to be ˆβ3 = −2.4107. With p-value = 9.33% there is no significant difference.
- Similarly, there is no significant difference in slope between h and w.
- In other words, there is probably no interaction between educ and race.

We can compare the data and the two fitted models graphically:

gf_point(inc ~ educ, color = ~race, data = Income) %>% gf_lm()
plotModel(model1)
plotModel(model2)
summary(model1)$r.squared
## [1] 0.4619906
summary(model2)$r.squared
## [1] 0.4824822
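The adjusted R² values reported in the summaries can be recovered from R², the sample size n = 80 and the number of predictors; a sketch of the standard formula (this helper is not a mosaic function):

```r
# Adjusted R^2 penalizes R^2 by the number of predictors p (intercept excluded)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_r2(0.4619906, n = 80, p = 3)  # model1: educ + 2 race dummies, ~0.4408
adj_r2(0.4824822, n = 80, p = 5)  # model2: + 2 interaction terms, ~0.4475
```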
Is model2 significantly better than model1? I.e. is R2 significantly higher for model2?

model1 is obtained from the more complicated model2 by setting β3 = 0 and β4 = 0, so the null hypothesis "the simpler additive model describes data sufficiently well compared to the complicated interaction model" is really the simple mathematical hypothesis H0: β3 = 0, β4 = 0.

This can be tested using anova:

anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: inc ~ educ + race
## Model 2: inc ~ educ * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 18164
## 2 74 17472 2 691.84 1.465 0.2377
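The F-statistic in the table can be computed by hand from the two residual sums of squares (a sketch using the rounded RSS values above):

```r
# F-test comparing the nested models, by hand
rss1 <- 18164; df1 <- 76   # additive model (model1)
rss2 <- 17472; df2 <- 74   # interaction model (model2)

Fval <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
pval <- pf(Fval, df1 - df2, df2, lower.tail = FALSE)
round(c(F = Fval, p = pval), 4)  # F ~1.465, p ~0.238
```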
The test for removing the interaction educ:race has F-value = 1.465, which is in no way significant, with p-value = 23.77%.

Hierarchy of models: We now consider four models:

- Interaction: The most general model with main effects educ and race and interaction educ:race:

Interaction <- lm(inc ~ educ * race, data = Income)

- MainEffects: The model with additive effects of educ and race:

MainEffects <- lm(inc ~ educ + race, data = Income)

- educEff: The model with only an effect of educ (simple linear regression):

educEff <- lm(inc ~ educ, data = Income)

- raceEff: The model with only an effect of race (a different mean for each group; more on this later):

raceEff <- lm(inc ~ race, data = Income)

Comparing MainEffects and Interaction is what we have already done:

anova(MainEffects, Interaction)
## Analysis of Variance Table
##
## Model 1: inc ~ educ + race
## Model 2: inc ~ educ * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 18164
## 2 74 17472 2 691.84 1.465 0.2377
We recognize F = 1.465 with p-value = 23.77%, i.e. the interaction model isn't significantly better than the additive model. So there is no educ:race interaction.
In the same manner we can compare educEff and MainEffects, i.e. we investigate whether the effect of race can be left out.
anova(educEff, MainEffects)
## Analysis of Variance Table
##
## Model 1: inc ~ educ
## Model 2: inc ~ educ + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 78 19625
## 2 76 18164 2 1460.6 3.0556 0.05292 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
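As before, the F-statistic can be verified by hand from the rounded RSS values in the table:

```r
# F-test for dropping race from the additive model, by hand
rss_small <- 19625; df_small <- 78   # educEff: inc ~ educ
rss_big   <- 18164; df_big   <- 76   # MainEffects: inc ~ educ + race

Fval <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
pval <- pf(Fval, df_small - df_big, df_big, lower.tail = FALSE)
round(c(F = Fval, p = pval), 4)  # F ~3.056, p ~0.053
```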
If there is any effect of race, it is weak, with p-value = 5.292%.
Finally, we compare raceEff and MainEffects. Clearly educ cannot be left out (p-value = 4.422×10−10).
anova(raceEff, MainEffects)
## Analysis of Variance Table
##
## Model 1: inc ~ race
## Model 2: inc ~ educ + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 30410
## 2 76 18164 1 12245 51.235 4.422e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model raceEff above has a different mean income for each race:

- α is the mean income for the reference level (w)
- β1 is the difference between b and w in mean income
- β2 is the difference between h and w in mean income

In terms of the dummy variables:

- w: (z1=0, z2=0) E(y) = α
- b: (z1=1, z2=0) E(y) = α + β1
- h: (z1=0, z2=1) E(y) = α + β2

We plot the data with a jitter and mark the group means:

gf_point(inc ~ race, data = Income, color = ~race) %>%
  gf_jitter(width = 0.1) %>%
  gf_point(mean ~ race, data = favstats(inc ~ race, data = Income),
           size = 6, shape = 4, stroke = 2, show.legend = FALSE)
summary(raceEff)
##
## Call:
## lm(formula = inc ~ race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.48 -12.48 -4.48 7.52 77.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.480 2.810 15.115 <2e-16 ***
## raceb -14.730 5.708 -2.581 0.0118 *
## raceh -11.480 6.009 -1.910 0.0598 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.87 on 77 degrees of freedom
## Multiple R-squared: 0.0993, Adjusted R-squared: 0.0759
## F-statistic: 4.244 on 2 and 77 DF, p-value: 0.01784
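The estimated group means follow from the coefficients above, since α is the w mean and β1, β2 are differences from it (a sketch with the rounded estimates):

```r
# Group mean incomes implied by the raceEff estimates above
alpha_hat <- 42.480    # mean income for the reference level w
beta1_hat <- -14.730   # raceb: difference b - w
beta2_hat <- -11.480   # raceh: difference h - w

means <- c(w = alpha_hat, b = alpha_hat + beta1_hat, h = alpha_hat + beta2_hat)
means  # w 42.48, b 27.75, h 31.00
```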
- The difference between w and b is significant (p-value is 1.18%).
- The difference between w and h is nearly significant (p-value is 5.98%).

Example: We consider the dataset Ericksen, where we study the response crime:

Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
model <- lm(crime ~ city * highschool + city * poverty, data = Ericksen)

- crime: Quantitative variable
- city: Factor with levels city and state
- highschool: Quantitative variable
- poverty: Quantitative variable

The model has main effects of all predictors, interaction between city and highschool, and interaction between city and poverty.

summary(model)
##
## Call:
## lm(formula = crime ~ city * highschool + city * poverty, data = Ericksen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.741 -8.745 -1.557 7.820 47.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.1456 18.1254 3.373 0.001305 **
## citystate 18.1526 20.4131 0.889 0.377413
## highschool -1.5711 0.6062 -2.592 0.011979 *
## poverty 5.3105 1.4333 3.705 0.000463 ***
## citystate:highschool 0.7025 0.7327 0.959 0.341523
## citystate:poverty -5.1862 1.6619 -3.121 0.002773 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.16 on 60 degrees of freedom
## Multiple R-squared: 0.6577, Adjusted R-squared: 0.6292
## F-statistic: 23.06 on 5 and 60 DF, p-value: 7.748e-13
- There is no significant interaction between city and highschool, i.e. the effect of highschool on crime is the same in metropolitan areas (city=city) and non-metropolitan areas (city=state).
- There is significant interaction between city and poverty, i.e. the effect of poverty on crime is different in metropolitan and non-metropolitan areas.
- When city=state, the effect of poverty (on crime) is smaller than in the major cities.
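Combining the estimates above shows how much smaller the poverty slope is outside the major cities (a sketch with the rounded coefficients from the summary):

```r
# Slope of poverty on crime in each city stratum, from the estimates above
beta_pov        <- 5.3105    # poverty slope when city = city (reference)
delta_pov_state <- -5.1862   # citystate:poverty, change in slope for city = state

c(city = beta_pov, state = beta_pov + delta_pov_state)  # city ~5.31, state ~0.12
```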