The ASTA team
Income.txt on the course website. We read in data in RStudioy=income: Quantitative variable, which is yearly income. This will be our response.x=education: Quantitative predictor, which is the number of years of education.z=race: Explanatory factor with levels b(black), h(hispanic) and w(white).gf_point for plotting points and gf_lm for adding a regression line).## Warning: Using the `size` aesthetic with geom_line was deprecated in ggplot2 3.4.0.
## ℹ Please use the `linewidth` aesthetic instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
First, we will look at the model without interaction, i.e. the effect of education is the same for all races, which corresponds to parallel lines.
We also have to introduce dummy coding of the factor \(z\):
race=b and zero otherwiserace=h and zero otherwiseThis determines the regression model: \[E(y|x,z) = \alpha + \beta x + \beta_1 z_1 + \beta_2 z_2\] which corresponds to parallel regressions lines for each race.
w: (\(z_1=0,z_2=0\)) \(E(y|x) = \alpha + \beta x\)
b: (\(z_1=1,z_2=0\)) \(E(y|x) = \alpha + \beta_1 + \beta x\).
h: (\(z_1=0,z_2=1\)) \(E(y|x) = \alpha + \beta_2 + \beta x\).
\(\beta_1\) is the difference in Intercept between black and white.
\(\beta_2\) is the difference in Intercept between Hispanic and white.
R that we want race to be a factor (grouping variable) and we want w as reference level for race (default is lexicographical ordering, i.e. (b, h, w) and b would then be the reference):+ in the model formula to only have additive effects of educ and race, i.e. a model without interaction:##
## Call:
## lm(formula = inc ~ educ + race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.664 -9.622 -1.642 6.552 57.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6635 8.4121 -1.862 0.0665 .
## educ 4.4317 0.6191 7.158 4.42e-10 ***
## raceb -10.8744 4.4730 -2.431 0.0174 *
## raceh -4.9338 4.7632 -1.036 0.3036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 76 degrees of freedom
## Multiple R-squared: 0.462, Adjusted R-squared: 0.4408
## F-statistic: 21.75 on 3 and 76 DF, p-value: 2.853e-10
educ is estimated to be \(\widehat\beta = 4.4316685\), with corresponding p-value=\(4.42\times 10^{-10}\) which is significantly different from zero.educ on income.w-intercept is \(\widehat\alpha=-15.6635\), which isn’t significantly different from zero if we test at level 5% (this test is not really of interest).b- and w-intercept (raceb) is \(\widehat\beta_1=-10.8744\), which is significant with p-value=1.74%.h- and w-intercept.##
## Call:
## lm(formula = inc ~ educ + race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.664 -9.622 -1.642 6.552 57.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6635 8.4121 -1.862 0.0665 .
## educ 4.4317 0.6191 7.158 4.42e-10 ***
## raceb -10.8744 4.4730 -2.431 0.0174 *
## raceh -4.9338 4.7632 -1.036 0.3036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 76 degrees of freedom
## Multiple R-squared: 0.462, Adjusted R-squared: 0.4408
## F-statistic: 21.75 on 3 and 76 DF, p-value: 2.853e-10
w (\(z_1=0, z_2=0\)): \(E(y|x) = \alpha + \beta x\)b (\(z_1=1, z_2=0\)): \(E(y|x) = \alpha + \beta_1 + (\beta + \beta_3) x\).h (\(z_1=0,z_2=1\)): \(E(y|x) = \alpha + \beta_2 + (\beta + \beta_4) x\).Intercept between black and white, while \(\beta_3\) is the difference in slope between black and white.Intercept between Hispanic and white, while \(\beta_4\) is the difference in slope between Hispanic and white.* in the model formula we include interaction between educ and race:##
## Call:
## lm(formula = inc ~ educ * race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.064 -9.448 -1.453 6.167 56.936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.8688 10.4982 -2.464 0.0161 *
## educ 5.2095 0.7828 6.655 4.3e-09 ***
## raceb 19.3333 18.2928 1.057 0.2940
## raceh 9.2640 24.2797 0.382 0.7039
## educ:raceb -2.4107 1.4177 -1.700 0.0933 .
## educ:raceh -1.1208 2.0060 -0.559 0.5781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.37 on 74 degrees of freedom
## Multiple R-squared: 0.4825, Adjusted R-squared: 0.4475
## F-statistic: 13.8 on 5 and 74 DF, p-value: 1.618e-09
##
## Call:
## lm(formula = inc ~ educ * race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.064 -9.448 -1.453 6.167 56.936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.8688 10.4982 -2.464 0.0161 *
## educ 5.2095 0.7828 6.655 4.3e-09 ***
## raceb 19.3333 18.2928 1.057 0.2940
## raceh 9.2640 24.2797 0.382 0.7039
## educ:raceb -2.4107 1.4177 -1.700 0.0933 .
## educ:raceh -1.1208 2.0060 -0.559 0.5781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.37 on 74 degrees of freedom
## Multiple R-squared: 0.4825, Adjusted R-squared: 0.4475
## F-statistic: 13.8 on 5 and 74 DF, p-value: 1.618e-09
b and w (educ:raceb) is estimated to \(\widehat\beta_3=-2.4107\). With p-value=9.33% there is no significant difference.h and w. In other words there is probably not interaction between educ and race.## [1] 0.4619906
## [1] 0.4824822
model2 significantly better than model1? I.e. is \(R^2\) significantly higher for model2?model1 is obtained from the more complicated model2 by setting \(\beta_3=0\) and \(\beta_4=0\), so the null hypothesis “the simpler additive model describes data sufficiently well compared to the complicated interaction model” is really the simple mathematical hypothesis: \[ H_0: \beta_3=0, \beta_4=0.\]anova:## Analysis of Variance Table
##
## Model 1: inc ~ educ + race
## Model 2: inc ~ educ * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 18164
## 2 74 17472 2 691.84 1.465 0.2377
educ:race has F-value=\(1.465\), which in no way is significant with p-value=\(23.77\%\).Interaction: The most general model with main effects educ and race and interaction educ:race:MainEffects: The model where there are additive effects of educ and race.educEff: Model where there only is an effect of educ (simple lin. reg.).raceEff: Model where there only is an effect of race (a different mean for each group – more on this in the ANOVA lecture).MainEffects and Interaction is what we have already done.## Analysis of Variance Table
##
## Model 1: inc ~ educ + race
## Model 2: inc ~ educ * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 18164
## 2 74 17472 2 691.84 1.465 0.2377
We recognize \(F=1.465\) with p-value=\(23.77\%\), i.e. model2 isn’t significantly better than model1. So no educ:race interaction.
In the same manner we can compare educEff and MainEffects. I.e. we investigate whether the effect of race can be left out.
## Analysis of Variance Table
##
## Model 1: inc ~ educ
## Model 2: inc ~ educ + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 78 19625
## 2 76 18164 2 1460.6 3.0556 0.05292 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If any, the effect of race is weak with p-value=\(5.292\%\).
Finally, we compare raceEff and MainEffects. Clearly educ cannot be left out (P-value=\(4.422\times 10^{-10}\)).
## Analysis of Variance Table
##
## Model 1: inc ~ race
## Model 2: inc ~ educ + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 30410
## 2 76 18164 1 12245 51.235 4.422e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ericksen, where we study the response crime:Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
model <- lm(crime ~ city * highschool + city * poverty, data = Ericksen)crime: Quantitative variablecity: city or statehighschool: Quantitative variablepoverty: Quantitative variablecity and highschoolcity and poverty.##
## Call:
## lm(formula = crime ~ city * highschool + city * poverty, data = Ericksen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.741 -8.745 -1.557 7.820 47.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.1456 18.1254 3.373 0.001305 **
## citystate 18.1526 20.4131 0.889 0.377413
## highschool -1.5711 0.6062 -2.592 0.011979 *
## poverty 5.3105 1.4333 3.705 0.000463 ***
## citystate:highschool 0.7025 0.7327 0.959 0.341523
## citystate:poverty -5.1862 1.6619 -3.121 0.002773 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.16 on 60 degrees of freedom
## Multiple R-squared: 0.6577, Adjusted R-squared: 0.6292
## F-statistic: 23.06 on 5 and 60 DF, p-value: 7.748e-13
city and highschool.highschool on crime is the same in metropolitan areas (city=city) and the non-metropolitan areas (city=state).city and poverty.poverty on crime is different in metropolitan and non-metropolitan areas.city=state, the effect of poverty (on crime) is smaller than in the major cities.