Example
- The data set `chickwts` is available on the course webpage.
- 71 newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement.
- Their weights in grams after six weeks are given along with the feed types, i.e. we have a sample with corresponding measurements of 2 variables:
  - `weight`: a numeric variable giving the chick weight.
  - `feed`: a factor giving the feed type.
- Always start with some graphics:
import pandas as pd
chickwts = pd.read_csv("https://asta.math.aau.dk/datasets?file=chickwts.txt", sep='\t')
chickwts.head(3)
## weight feed
## 0 179 horsebean
## 1 160 horsebean
## 2 136 horsebean
import seaborn as sns
import matplotlib.pyplot as plt
p = sns.boxplot(x='feed', y='weight', data=chickwts)
plt.show()

Estimates
- The least squares estimate of the population mean, \(\widehat\mu_x\), is given by the average of the response measurements in group \(x\).
- For a given measured response \(y\) we let \(\widehat y\) denote the model’s prediction of \(y\), i.e. \[\widehat y = \widehat\mu_x\] if \(y\) is a response for an observation in group \(x\).
- We use `mean` to find the mean for each group:
chickwts.groupby('feed')['weight'].mean()
## feed
## casein 323.583333
## horsebean 160.200000
## linseed 218.750000
## meatmeal 276.909091
## soybean 246.428571
## sunflower 328.916667
## Name: weight, dtype: float64
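The rule \(\widehat y = \widehat\mu_x\) can be sketched directly in pandas; here with a tiny hypothetical two-group data set (not the chickwts data):

```python
import pandas as pd

# Toy data with two groups (hypothetical values, not the chickwts data)
df = pd.DataFrame({'feed': ['a', 'a', 'b', 'b'],
                   'weight': [10.0, 14.0, 20.0, 22.0]})

# The least squares estimate of each group mean is the group average ...
mu_hat = df.groupby('feed')['weight'].mean()

# ... and the prediction y_hat for an observation is its group's mean
df['y_hat'] = df['feed'].map(mu_hat)
```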
- We can e.g. see that \(\widehat y=323.6\) when `feed=casein`, but \(\widehat y=160.2\) when `feed=horsebean`.
- Is the difference significant?
Example
import statsmodels.formula.api as smf
model = smf.ols('weight ~ feed', data=chickwts).fit()
model.summary(slim = True)
OLS Regression Results

| Dep. Variable: | weight | R-squared: | 0.542 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.506 |
| No. Observations: | 71 | F-statistic: | 15.36 |
| Covariance Type: | nonrobust | Prob (F-statistic): | 5.94e-10 |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 323.5833 | 15.834 | 20.436 | 0.000 | 291.961 | 355.206 |
| feed[T.horsebean] | -163.3833 | 23.485 | -6.957 | 0.000 | -210.287 | -116.480 |
| feed[T.linseed] | -104.8333 | 22.393 | -4.682 | 0.000 | -149.554 | -60.112 |
| feed[T.meatmeal] | -46.6742 | 22.896 | -2.039 | 0.046 | -92.400 | -0.948 |
| feed[T.soybean] | -77.1548 | 21.578 | -3.576 | 0.001 | -120.249 | -34.061 |
| feed[T.sunflower] | 5.3333 | 22.393 | 0.238 | 0.812 | -39.388 | 50.054 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- We get information about the contrasts and their significance:
- `Intercept`, corresponding to `casein`, has a weight different from zero (\(p < 2\times 10^{-16}\)) (of course, chickens grow a lot over 6 weeks).
- The weight difference between `casein` and `horsebean` is extremely significant (\(p = 2\times 10^{-9}\)).
- There is no significant weight difference between `casein` and `sunflower` (\(p = 81\%\)).
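The coefficients in the table are contrasts under treatment (dummy) coding. A minimal sketch with hypothetical numbers shows how the intercept recovers the reference-group mean and a dummy coefficient recovers a group difference:

```python
import numpy as np

# Treatment coding by hand for a hypothetical two-group example
# ('a' plays the role of the reference level, like casein above)
y = np.array([10.0, 14.0, 20.0, 22.0])
group = np.array(['a', 'a', 'b', 'b'])

# Design matrix: a column of ones (intercept) and a dummy for group 'b'
X = np.column_stack([np.ones(len(y)), (group == 'b').astype(float)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0] is the reference-group mean; beta[1] is the contrast b - a
```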
Example
import statsmodels.formula.api as smf
model = smf.ols('weight ~ feed', data=chickwts).fit() # same as earlier
model.summary(slim = True)
OLS Regression Results

| Dep. Variable: | weight | R-squared: | 0.542 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.506 |
| No. Observations: | 71 | F-statistic: | 15.36 |
| Covariance Type: | nonrobust | Prob (F-statistic): | 5.94e-10 |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 323.5833 | 15.834 | 20.436 | 0.000 | 291.961 | 355.206 |
| feed[T.horsebean] | -163.3833 | 23.485 | -6.957 | 0.000 | -210.287 | -116.480 |
| feed[T.linseed] | -104.8333 | 22.393 | -4.682 | 0.000 | -149.554 | -60.112 |
| feed[T.meatmeal] | -46.6742 | 22.896 | -2.039 | 0.046 | -92.400 | -0.948 |
| feed[T.soybean] | -77.1548 | 21.578 | -3.576 | 0.001 | -120.249 | -34.061 |
| feed[T.sunflower] | 5.3333 | 22.393 | 0.238 | 0.812 | -39.388 | 50.054 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- The `F-statistic` gives us the value \(F_{obs} = 15.36\) and the corresponding \(p\)-value (\(5.9 \times 10^{-10}\)). Clearly there is a significant difference between the types of feed.
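As a check, the \(F\)-statistic can be computed from \(R^2\) alone. A small sketch using the rounded values from the summary above (so the result only matches the printed 15.36 up to rounding of \(R^2\)):

```python
# F-statistic from R^2 for the comparison with the null model that has
# only an overall mean; values taken (rounded) from the summary above
r2, k, n = 0.542, 6, 71          # R-squared, number of groups, sample size
F = (r2 / (k - 1)) / ((1 - r2) / (n - k))
# F is approximately 15.4, matching the printed 15.36 up to rounding
```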
Additive effects
- The data set `ToothGrowth` is available on the webpage.
- The data describe the tooth length of guinea pigs, where some receive a vitamin C treatment and others are given orange juice, in different dosages.
ToothGrowth = pd.read_csv("https://asta.math.aau.dk/datasets?file=ToothGrowth.txt", sep='\t')
ToothGrowth['dose'] = pd.Categorical(
    ToothGrowth['dose'].map({0.5: 'LO', 1: 'ME', 2: 'HI'}),
    categories=['LO', 'ME', 'HI'],
    ordered=True
)
ToothGrowth.head(3)
## len supp dose
## 0 4.2 VC LO
## 1 11.5 VC LO
## 2 7.3 VC LO
- A total of \(60\) observations on 3 variables:
  - `len`: the tooth length.
  - `supp`: the type of supplement (`OJ` or `VC`).
  - `dose`: the dosage (`LO`, `ME`, `HI`).
- We will study the response `len` with the predictors `supp` and `dose`.
- At first we look at the model with additive effects: `len` = \(\mu\) + "effect of supp" + "effect of dose" + error.
- This is also called the main effects model since it does not contain interaction terms.
- The parameter \(\mu\) corresponds to the `Intercept` and is the mean tooth length in the reference group (`supp=OJ`, `dose=LO`).
- The effect of `supp` is the difference in mean when changing from `OJ` to `VC`.
- The effect of `dose` is the difference in mean when changing from `LO` to either `ME` or `HI`.
Main effects model in Python
- The main effects model is fitted by
MainEff = smf.ols('len ~ supp + dose', data=ToothGrowth).fit()
MainEff.summary(slim = True)
OLS Regression Results

| Dep. Variable: | len | R-squared: | 0.762 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.750 |
| No. Observations: | 60 | F-statistic: | 59.88 |
| Covariance Type: | nonrobust | Prob (F-statistic): | 1.78e-17 |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 12.4550 | 0.988 | 12.603 | 0.000 | 10.475 | 14.435 |
| supp[T.VC] | -3.7000 | 0.988 | -3.744 | 0.000 | -5.680 | -1.720 |
| dose[T.ME] | 9.1300 | 1.210 | 7.543 | 0.000 | 6.705 | 11.555 |
| dose[T.HI] | 15.4950 | 1.210 | 12.802 | 0.000 | 13.070 | 17.920 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- The model has 4 parameters.
- The \(F\) test at the end compares with the (null) model with only one overall mean parameter.
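A fitted cell mean under the additive model is just a sum of coefficients; a small worked example using the coefficients from the `MainEff` summary above:

```python
# Under the additive model a fitted cell mean is the intercept plus the
# relevant dummy coefficients (values from the MainEff summary above)
intercept, supp_vc, dose_hi = 12.4550, -3.7000, 15.4950

fit_oj_lo = intercept                        # reference group: OJ, LO
fit_vc_hi = intercept + supp_vc + dose_hi    # supp=VC, dose=HI: 24.25
```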
Testing effect of supp
- Alternative model without effect of supp:
doseEff = smf.ols('len ~ dose', data=ToothGrowth).fit()
- We can compare \(R^2\) values to see whether `doseEff` (Model 1) is sufficient to explain the data compared to `MainEff` (Model 2). This is done by converting to an \(F\)-statistic: \[
F_{obs} = \frac{(R_2^2 - R_1^2)/(df_1 - df_2)}{(1 - R_2^2)/df_2} = \frac{(SSE_1 - SSE_2)/(df_1 - df_2)}{SSE_2/df_2}.
\]
- \(SSE_1-SSE_2\): the increase in the error sum of squares when using Model 1 instead of Model 2.
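A quick sketch of the formula, plugging in the residual sums of squares from this comparison (`doseEff` has \(SSE_1 = 1025.775\) on 57 df; `MainEff` has \(SSE_2 = 820.425\) on 56 df):

```python
# F-statistic for comparing the two nested models by hand
sse1, df1 = 1025.775, 57   # Model 1: len ~ dose
sse2, df2 = 820.425, 56    # Model 2: len ~ supp + dose

F_obs = ((sse1 - sse2) / (df1 - df2)) / (sse2 / df2)
# F_obs is approximately 14.02
```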
- In Python the calculations are done using `anova_lm`:
from statsmodels.stats.anova import anova_lm
anova_lm(doseEff, MainEff)
## df_resid ssr df_diff ss_diff F Pr(>F)
## 0 57.0 1025.775 0.0 NaN NaN NaN
## 1 56.0 820.425 1.0 205.35 14.016638 0.000429
- The \(p\)-value is 0.0004, hence we reject that `supp` has no effect. Thus we prefer Model 2 (`MainEff`).
Testing effect of dose
- Alternative model without effect of dose:
suppEff = smf.ols('len ~ supp', data=ToothGrowth).fit()
anova_lm(suppEff, MainEff)
## df_resid ssr df_diff ss_diff F Pr(>F)
## 0 58.0 3246.859333 0.0 NaN NaN NaN
## 1 56.0 820.425000 2.0 2426.434333 82.810935 1.871163e-17
- The \(p\)-value is \(\approx 0\), hence we reject that `dose` has no effect. Thus we prefer Model 2 (`MainEff`).
Example
- We fit the interaction model by changing `+` to `*` in the model expression from before:
Interaction = smf.ols('len ~ supp*dose', data=ToothGrowth).fit()
- Now we can think of an experiment with 6 groups corresponding to each combination of the predictors.
- Is the added interaction significant? We compare the main effects model with the more complex interaction model using `anova_lm`:
anova_lm(MainEff, Interaction)
## df_resid ssr df_diff ss_diff F Pr(>F)
## 0 56.0 820.425 0.0 NaN NaN NaN
## 1 54.0 712.106 2.0 108.319 4.106991 0.02186
- With a \(p\)-value of 2.2% there is a significant interaction `supp:dose`, i.e. the lack of parallel curves in the interaction plot is significant.
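The interaction plot referred to here simply connects the cell means for each supplement across doses; a minimal sketch with a tiny hypothetical data set (not the real ToothGrowth measurements):

```python
import pandas as pd

# Cell means behind an interaction plot (hypothetical data)
toy = pd.DataFrame({
    'supp': ['OJ', 'OJ', 'VC', 'VC', 'OJ', 'OJ', 'VC', 'VC'],
    'dose': ['LO', 'LO', 'LO', 'LO', 'HI', 'HI', 'HI', 'HI'],
    'len':  [12.0, 14.0, 8.0, 10.0, 24.0, 26.0, 24.0, 26.0],
})
# One mean per (dose, supp) cell; default aggregation is the mean
cell_means = toy.pivot_table(values='len', index='dose', columns='supp')

# Plotting these means, e.g. sns.pointplot(x='dose', y='len', hue='supp',
# data=toy), gives the interaction plot; non-parallel lines suggest interaction
```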
Interaction.summary(slim = True)
OLS Regression Results

| Dep. Variable: | len | R-squared: | 0.794 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.775 |
| No. Observations: | 60 | F-statistic: | 41.56 |
| Covariance Type: | nonrobust | Prob (F-statistic): | 2.50e-17 |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 13.2300 | 1.148 | 11.521 | 0.000 | 10.928 | 15.532 |
| supp[T.VC] | -5.2500 | 1.624 | -3.233 | 0.002 | -8.506 | -1.994 |
| dose[T.ME] | 9.4700 | 1.624 | 5.831 | 0.000 | 6.214 | 12.726 |
| dose[T.HI] | 12.8300 | 1.624 | 7.900 | 0.000 | 9.574 | 16.086 |
| supp[T.VC]:dose[T.ME] | -0.6800 | 2.297 | -0.296 | 0.768 | -5.285 | 3.925 |
| supp[T.VC]:dose[T.HI] | 5.3300 | 2.297 | 2.321 | 0.024 | 0.725 | 9.935 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- Note that the negative effect of changing from `OJ` to `VC` when `dose` is low is cancelled by the positive interaction parameter (\(\beta_5\) for `supp[T.VC]:dose[T.HI]`), meaning there is almost no difference between `OJ` and `VC` when `dose` is high (compare with the interaction plot).
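Using the coefficients from the interaction-model summary above, the VC-versus-OJ difference at each dose is a short sum:

```python
# Difference (VC - OJ) in mean tooth length at each dose, from the
# interaction-model coefficients above
supp_vc = -5.25                  # VC effect at the reference dose LO
int_me, int_hi = -0.68, 5.33     # interaction terms

diff_lo = supp_vc                # -5.25 at low dose
diff_me = supp_vc + int_me       # -5.93 at medium dose
diff_hi = supp_vc + int_hi       # about 0.08: almost no difference at high dose
```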