The ASTA team
Income.txt
on the course website. We read in data in RStudioy=income
: Quantitative variable, which is yearly income. This will be our response.x=education
: Quantitative predictor, which is the number of years of education.z=race
: Explanatory factor with levels b
(black), h
(hispanic) and w
(white).gf_point
for plotting points and gf_lm
for adding a regression line).## Warning: Using the `size` aesthetic with geom_line was deprecated in ggplot2 3.4.0.
## â„ą Please use the `linewidth` aesthetic instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
First, we will look at the model without interaction, i.e. the effect of education
is the same for all races, which corresponds to parallel lines.
We also have to introduce dummy coding of the factor \(z\):
race=b
and zero otherwiserace=h
and zero otherwiseThis determines the regression model: \[E(y|x,z) = \alpha + \beta x + \beta_1 z_1 + \beta_2 z_2\] which corresponds to parallel regressions lines for each race.
w
: (\(z_1=0,z_2=0\)) \(E(y|x) = \alpha + \beta x\)
b
: (\(z_1=1,z_2=0\)) \(E(y|x) = \alpha + \beta_1 + \beta x\).
h
: (\(z_1=0,z_2=1\)) \(E(y|x) = \alpha + \beta_2 + \beta x\).
\(\beta_1\) is the difference in Intercept
between black and white.
\(\beta_2\) is the difference in Intercept
between Hispanic and white.
R
that we want race
to be a factor (grouping variable) and we want w
as reference level for race (default is lexicographical ordering, i.e. (b, h, w)
and b
would then be the reference):+
in the model formula to only have additive effects of educ
and race
, i.e. a model without interaction:##
## Call:
## lm(formula = inc ~ educ + race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.664 -9.622 -1.642 6.552 57.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6635 8.4121 -1.862 0.0665 .
## educ 4.4317 0.6191 7.158 4.42e-10 ***
## raceb -10.8744 4.4730 -2.431 0.0174 *
## raceh -4.9338 4.7632 -1.036 0.3036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 76 degrees of freedom
## Multiple R-squared: 0.462, Adjusted R-squared: 0.4408
## F-statistic: 21.75 on 3 and 76 DF, p-value: 2.853e-10
educ
is estimated to be \(\widehat\beta = 4.4316685\), with corresponding p-value=\(4.42\times 10^{-10}\) which is significantly different from zero.educ
on income
.w
-intercept is \(\widehat\alpha=-15.6635\), which isn’t significantly different from zero if we test at level 5% (this test is not really of interest).b
- and w
-intercept (raceb
) is \(\widehat\beta_1=-10.8744\), which is significant with p-value=1.74%.h
- and w
-intercept.##
## Call:
## lm(formula = inc ~ educ + race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.664 -9.622 -1.642 6.552 57.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6635 8.4121 -1.862 0.0665 .
## educ 4.4317 0.6191 7.158 4.42e-10 ***
## raceb -10.8744 4.4730 -2.431 0.0174 *
## raceh -4.9338 4.7632 -1.036 0.3036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 76 degrees of freedom
## Multiple R-squared: 0.462, Adjusted R-squared: 0.4408
## F-statistic: 21.75 on 3 and 76 DF, p-value: 2.853e-10
w
(\(z_1=0, z_2=0\)): \(E(y|x) = \alpha + \beta x\)b
(\(z_1=1, z_2=0\)): \(E(y|x) = \alpha + \beta_1 + (\beta + \beta_3) x\).h
(\(z_1=0,z_2=1\)): \(E(y|x) = \alpha + \beta_2 + (\beta + \beta_4) x\).Intercept
between black and white, while \(\beta_3\) is the difference in slope
between black and white.Intercept
between Hispanic and white, while \(\beta_4\) is the difference in slope
between Hispanic and white.*
in the model formula we include interaction between educ
and race
:##
## Call:
## lm(formula = inc ~ educ * race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.064 -9.448 -1.453 6.167 56.936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.8688 10.4982 -2.464 0.0161 *
## educ 5.2095 0.7828 6.655 4.3e-09 ***
## raceb 19.3333 18.2928 1.057 0.2940
## raceh 9.2640 24.2797 0.382 0.7039
## educ:raceb -2.4107 1.4177 -1.700 0.0933 .
## educ:raceh -1.1208 2.0060 -0.559 0.5781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.37 on 74 degrees of freedom
## Multiple R-squared: 0.4825, Adjusted R-squared: 0.4475
## F-statistic: 13.8 on 5 and 74 DF, p-value: 1.618e-09
##
## Call:
## lm(formula = inc ~ educ * race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.064 -9.448 -1.453 6.167 56.936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.8688 10.4982 -2.464 0.0161 *
## educ 5.2095 0.7828 6.655 4.3e-09 ***
## raceb 19.3333 18.2928 1.057 0.2940
## raceh 9.2640 24.2797 0.382 0.7039
## educ:raceb -2.4107 1.4177 -1.700 0.0933 .
## educ:raceh -1.1208 2.0060 -0.559 0.5781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.37 on 74 degrees of freedom
## Multiple R-squared: 0.4825, Adjusted R-squared: 0.4475
## F-statistic: 13.8 on 5 and 74 DF, p-value: 1.618e-09
b
and w
(educ:raceb
) is estimated to \(\widehat\beta_3=-2.4107\). With p-value=9.33% there is no significant difference.h
and w
. In other words there is probably not interaction between educ
and race
.## [1] 0.4619906
## [1] 0.4824822
model2
significantly better than model1
? I.e. is \(R^2\) significantly higher for model2
?model1
is obtained from the more complicated model2
by setting \(\beta_3=0\) and \(\beta_4=0\), so the null hypothesis “the simpler additive model describes data sufficiently well compared to the complicated interaction model” is really the simple mathematical hypothesis: \[ H_0: \beta_3=0, \beta_4=0.\]anova
:## Analysis of Variance Table
##
## Model 1: inc ~ educ + race
## Model 2: inc ~ educ * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 18164
## 2 74 17472 2 691.84 1.465 0.2377
educ:race
has F-value=\(1.465\), which in no way is significant with p-value=\(23.77\%\).Interaction:
The most general model with main effects educ
and race
and interaction educ:race
:MainEffects:
The model where there are additive effects of educ
and race
.educEff:
Model where there only is an effect of educ
(simple lin. reg.).raceEff:
Model where there only is an effect of race
(a different mean for each group – more on this later).MainEffects
and Interaction
is what we have already done.## Analysis of Variance Table
##
## Model 1: inc ~ educ + race
## Model 2: inc ~ educ * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 18164
## 2 74 17472 2 691.84 1.465 0.2377
We recognize \(F=1.465\) with p-value=\(23.77\%\), i.e. model2
isn’t significantly better than model1
. So no educ:race
interaction.
In the same manner we can compare educEff
and MainEffects
. I.e. we investigate whether the effect of race
can be left out.
## Analysis of Variance Table
##
## Model 1: inc ~ educ
## Model 2: inc ~ educ + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 78 19625
## 2 76 18164 2 1460.6 3.0556 0.05292 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If any, the effect of race
is weak with p-value=\(5.292\%\).
Finally, we compare raceEff
and MainEffects
. Clearly educ
cannot be left out (P-value=\(4.422\times 10^{-10}\)).
## Analysis of Variance Table
##
## Model 1: inc ~ race
## Model 2: inc ~ educ + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 30410
## 2 76 18164 1 12245 51.235 4.422e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
raceEff
above has a different mean income for each race.
w
)b
and w
in mean incomeh
and w
in mean incomew
: (\(z_1=0,z_2=0\)) \(E(y|x) = \alpha\)b
: (\(z_1=1,z_2=0\)) \(E(y|x) = \alpha + \beta_1\).h
: (\(z_1=0,z_2=1\)) \(E(y|x) = \alpha + \beta_2\).gf_point(inc ~ race, data = Income, color = ~race) %>%
gf_jitter(width = 0.1) %>%
gf_point(mean ~ race, data = favstats(inc ~ race, data = Income),
size = 6, shape = 4, stroke = 2, show.legend = FALSE)
##
## Call:
## lm(formula = inc ~ race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.48 -12.48 -4.48 7.52 77.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.480 2.810 15.115 <2e-16 ***
## raceb -14.730 5.708 -2.581 0.0118 *
## raceh -11.480 6.009 -1.910 0.0598 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.87 on 77 degrees of freedom
## Multiple R-squared: 0.0993, Adjusted R-squared: 0.0759
## F-statistic: 4.244 on 2 and 77 DF, p-value: 0.01784
w
and b
is significant (p-value is 1.18%)w
and h
is nearly significant (p-value is 5.98%)Ericksen
, where we study the response crime
:Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
model <- lm(crime ~ city * highschool + city * poverty, data = Ericksen)
crime
: Quantitative variablecity
: city
or state
highschool
: Quantitative variablepoverty
: Quantitative variablecity
and highschool
city
and poverty
.##
## Call:
## lm(formula = crime ~ city * highschool + city * poverty, data = Ericksen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.741 -8.745 -1.557 7.820 47.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.1456 18.1254 3.373 0.001305 **
## citystate 18.1526 20.4131 0.889 0.377413
## highschool -1.5711 0.6062 -2.592 0.011979 *
## poverty 5.3105 1.4333 3.705 0.000463 ***
## citystate:highschool 0.7025 0.7327 0.959 0.341523
## citystate:poverty -5.1862 1.6619 -3.121 0.002773 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.16 on 60 degrees of freedom
## Multiple R-squared: 0.6577, Adjusted R-squared: 0.6292
## F-statistic: 23.06 on 5 and 60 DF, p-value: 7.748e-13
city
and highschool
.highschool
on crime is the same in metropolitan areas (city=city
) and the non-metropolitan areas (city=state
).city
and poverty
.poverty
on crime is different in metropolitan and non-metropolitan areas.city=state
, the effect of poverty
(on crime) is smaller than in the major cities.