The ASTA team
Income.txt on the course website. We read in data in RStudioIncome <- read.delim("https://asta.math.aau.dk/datasets?file=Income.txt")y=income: Quantitative variable, which is yearly income. This will be our response.x=education: Quantitative predictor, which is the number of years of education.z=race: Explanatory factor with levels b(black), h(hispanic) and w(white).gf_point for plotting points and gf_lm for adding a regression line).library(mosaic)
gf_point(inc ~ educ, color = ~race, data = Income) %>% gf_lm()First, we will look at the model without interaction, i.e. the effect of education is the same for all races, which corresponds to parallel lines.
race=b and zero otherwiserace=h and zero otherwiseThis determines the regression model: \[E(y|x,z) = \alpha + \beta x + \beta_1 z_1 + \beta_2 z_2\] which corresponds to parallel regressions lines for each race.
w: (\(z_1=0,z_2=0\)) \(E(y|x) = \alpha + \beta x\)
b: (\(z_1=1,z_2=0\)) \(E(y|x) = \alpha + \beta_1 + \beta x\).
h: (\(z_1=0,z_2=1\)) \(E(y|x) = \alpha + \beta_2 + \beta x\).
\(\beta_1\) is the difference in Intercept between black and white.
\(\beta_2\) is the difference in Intercept between Hispanic and white.
R that we want race to be a factor (grouping variable) and we want w as reference level for race (default is lexicographical ordering, i.e. (b, h, w) and b would then be the reference):Income$race <- factor(Income$race)
Income$race <- relevel(Income$race, "w")+ in the model formula to only have additive effects of educ and race, i.e. a model without interaction:model1 <- lm(inc ~ educ + race, data = Income)
summary(model1)##
## Call:
## lm(formula = inc ~ educ + race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.664 -9.622 -1.642 6.552 57.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6635 8.4121 -1.862 0.0665 .
## educ 4.4317 0.6191 7.158 4.42e-10 ***
## raceb -10.8744 4.4730 -2.431 0.0174 *
## raceh -4.9338 4.7632 -1.036 0.3036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 76 degrees of freedom
## Multiple R-squared: 0.462, Adjusted R-squared: 0.4408
## F-statistic: 21.75 on 3 and 76 DF, p-value: 2.853e-10
educ is estimated to be \(\widehat\beta = 4.4316685\), with corresponding p-value=\(4.42\times 10^{-10}\) which is significantly different from zero.educ on income.w-intercept is \(\widehat\alpha=-15.6635\), which isn’t significantly different from zero if we test at level 5% (this test is not really of interest).b- and w-intercept (raceb) is \(\widehat\beta_1=-10.8744\), which is significant with p-value=1.74%.h- and w-intercept.plotModel(model1)summary(model1)##
## Call:
## lm(formula = inc ~ educ + race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.664 -9.622 -1.642 6.552 57.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6635 8.4121 -1.862 0.0665 .
## educ 4.4317 0.6191 7.158 4.42e-10 ***
## raceb -10.8744 4.4730 -2.431 0.0174 *
## raceh -4.9338 4.7632 -1.036 0.3036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 76 degrees of freedom
## Multiple R-squared: 0.462, Adjusted R-squared: 0.4408
## F-statistic: 21.75 on 3 and 76 DF, p-value: 2.853e-10
w (\(z_1=0, z_2=0\)): \(E(y|x) = \alpha + \beta x\)b (\(z_1=1, z_2=0\)): \(E(y|x) = \alpha + \beta_1 + (\beta + \beta_3) x\).h (\(z_1=0,z_2=1\)): \(E(y|x) = \alpha + \beta_2 + (\beta + \beta_4) x\).Intercept between black and white, while \(\beta_3\) is the difference in slope between black and white.Intercept between Hispanic and white, while \(\beta_4\) is the difference in slope between Hispanic and white.* in the model formula we include interaction between educ and race:model2 <- lm(inc ~ educ * race, data = Income)
summary(model2)##
## Call:
## lm(formula = inc ~ educ * race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.064 -9.448 -1.453 6.167 56.936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.8688 10.4982 -2.464 0.0161 *
## educ 5.2095 0.7828 6.655 4.3e-09 ***
## raceb 19.3333 18.2928 1.057 0.2940
## raceh 9.2640 24.2797 0.382 0.7039
## educ:raceb -2.4107 1.4177 -1.700 0.0933 .
## educ:raceh -1.1208 2.0060 -0.559 0.5781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.37 on 74 degrees of freedom
## Multiple R-squared: 0.4825, Adjusted R-squared: 0.4475
## F-statistic: 13.8 on 5 and 74 DF, p-value: 1.618e-09
summary(model2)##
## Call:
## lm(formula = inc ~ educ * race, data = Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.064 -9.448 -1.453 6.167 56.936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.8688 10.4982 -2.464 0.0161 *
## educ 5.2095 0.7828 6.655 4.3e-09 ***
## raceb 19.3333 18.2928 1.057 0.2940
## raceh 9.2640 24.2797 0.382 0.7039
## educ:raceb -2.4107 1.4177 -1.700 0.0933 .
## educ:raceh -1.1208 2.0060 -0.559 0.5781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.37 on 74 degrees of freedom
## Multiple R-squared: 0.4825, Adjusted R-squared: 0.4475
## F-statistic: 13.8 on 5 and 74 DF, p-value: 1.618e-09
b and w (educ:raceb) is estimated to \(\widehat\beta_3=-2.4107\). With p-value=9.33% there is no significant difference.h and w. In other words there is probably not interaction between educ and race.gf_point(inc ~ educ, color = ~ race, data = Income) %>% gf_lm()plotModel(model1)plotModel(model2)summary(model1)$r.squared## [1] 0.4619906
summary(model2)$r.squared## [1] 0.4824822
model2 significantly better than model1? I.e. is \(R^2\) significantly higher for model2?model1 is obtained from the more complicated model2 by setting \(\beta_3=0\) and \(\beta_4=0\), so the null hypothesis “the simpler additive model describes data sufficiently well compared to the complicated interaction model” is really the simple mathematical hypothesis: \[ H_0: \beta_3=0, \beta_4=0.\]anova:anova(model1, model2)## Analysis of Variance Table
##
## Model 1: inc ~ educ + race
## Model 2: inc ~ educ * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 18164
## 2 74 17472 2 691.84 1.465 0.2377
educ:race has F-value=\(1.465\), which in no way is significant with p-value=\(23.77\%\).Interaction: The most general model with main effects educ and race and interaction educ:race:Interaction <- lm(inc ~ educ * race, data = Income)MainEffects: The model where there are additive effects of educ and race.MainEffects <- lm(inc ~ educ + race, data = Income)educEff: Model where there only is an effect of educ (simple lin. reg.).educEff <- lm(inc ~ educ, data = Income)raceEff: Model where there only is an effect of race (a different mean for each group – more on this in the ANOVA lecture).raceEff <- lm(inc ~ race, data = Income)MainEffects and Interaction is what we have already done.anova(MainEffects, Interaction)## Analysis of Variance Table
##
## Model 1: inc ~ educ + race
## Model 2: inc ~ educ * race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 18164
## 2 74 17472 2 691.84 1.465 0.2377
We recognize \(F=1.465\) with p-value=\(23.77\%\), i.e. model2 isn’t significantly better than model1. So no educ:race interaction.
In the same manner we can compare educEff and MainEffects. I.e. we investigate whether the effect of race can be left out.
anova(educEff, MainEffects)## Analysis of Variance Table
##
## Model 1: inc ~ educ
## Model 2: inc ~ educ + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 78 19625
## 2 76 18164 2 1460.6 3.0556 0.05292 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If any, the effect of race is weak with p-value=\(5.292\%\).
Finally, we compare raceEff and MainEffects. Clearly educ cannot be left out (P-value=\(4.422\times 10^{-10}\)).
anova(raceEff, MainEffects)## Analysis of Variance Table
##
## Model 1: inc ~ race
## Model 2: inc ~ educ + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 30410
## 2 76 18164 1 12245 51.235 4.422e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ericksen, where we study the response crime:Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
model <- lm(crime ~ city * highschool + city * poverty, data = Ericksen)crime: Quantitative variablecity: city or statehighschool: Quantitative variablepoverty: Quantitative variablecity and highschoolcity and poverty.summary(model)##
## Call:
## lm(formula = crime ~ city * highschool + city * poverty, data = Ericksen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.741 -8.745 -1.557 7.820 47.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.1456 18.1254 3.373 0.001305 **
## citystate 18.1526 20.4131 0.889 0.377413
## highschool -1.5711 0.6062 -2.592 0.011979 *
## poverty 5.3105 1.4333 3.705 0.000463 ***
## citystate:highschool 0.7025 0.7327 0.959 0.341523
## citystate:poverty -5.1862 1.6619 -3.121 0.002773 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.16 on 60 degrees of freedom
## Multiple R-squared: 0.6577, Adjusted R-squared: 0.6292
## F-statistic: 23.06 on 5 and 60 DF, p-value: 7.748e-13
city and highschool.highschool on crime is the same in metropolitan areas (city=city) and the non-metropolitan areas (city=state).city and poverty.poverty on crime is different in metropolitan and non-metropolitan areas.city=state, the effect of poverty (on crime) is smaller than in the major cities.