The ASTA team
Consider two populations:
Population 1 has mean \(\mu_1\) and standard deviation \(\sigma_1\).
Population 2 has mean \(\mu_2\) and standard deviation \(\sigma_2\).
We want to compare the means by looking at the difference \(\mu_1-\mu_2.\)
We now take a sample from each population.
The sample from Population 1 has sample mean \(\bar{x}_1\), sample standard deviation \(s_1\) and sample size \(n_1\).
The sample from Population 2 has sample mean \(\bar{x}_2\), sample standard deviation \(s_2\) and sample size \(n_2\).
The two samples can be either independent or paired.
Example: Suppose we consider the fuel consumption of cars.
If we compare two samples of cars with different engine types, then the two samples are independent, since each car can only have one of the two engine types.
If we compare the fuel consumption of cars at two different speed levels by testing each car at both speed levels, then the samples are paired.
We first consider the situation where we have two independent samples of a quantitative variable.
We estimate the difference \(\mu_1-\mu_2\) by \[d=\bar{x}_1-\bar{x}_2.\]
Assume that we can find the estimated standard error \(se_d\) of the difference.
If the samples come from two normal distributions, or if both samples are large (\(n_1,n_2\geq 30\)), then one can show \[T_{obs}=\frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)}{se_d}\sim \texttt{t}(df),\] where \(\texttt{t}(df)\) is a \(t\)-distribution with \(df\) degrees of freedom.
By the usual procedure, we can use this to construct a confidence interval for the unknown population difference of means \(\mu_1-\mu_2\) by \[ (\bar{x}_1-\bar{x}_2)\pm t_{crit}se_d, \] where the critical \(t\)-score, \(t_{crit}\), is determined by the confidence level and the \(df\).
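As a minimal R sketch of this procedure (ci_diff is a hypothetical helper name, not part of the lecture code; qt is base R's t-quantile function, equivalent to qdist("t", ...) from mosaic):

# Sketch: confidence interval for mu1 - mu2 given d, se_d and df
ci_diff <- function(d, se_d, df, conf = 0.95) {
  t_crit <- qt(1 - (1 - conf) / 2, df)  # critical t-score for the confidence level
  c(d - t_crit * se_d, d + t_crit * se_d)
}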
We test the null-hypothesis of no difference, \(H_0:\mu_1-\mu_2 = 0\), against the alternative \(H_a:\mu_1 - \mu_2\neq 0.\)
If the null hypothesis is true, then the test statistic: \[T_{obs} = \frac{(\bar{X}_1-\bar{X}_2) - 0}{se_d},\] has a \(t\)-distribution with \(df\) degrees of freedom.
The p-value is the probability of observing something further away from 0 than \(t_{obs}\) in a \(\texttt{t}(df)\) distribution.
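In R, this two-sided \(p\)-value can be sketched with the base-R distribution function pt (equivalent to the pdist calls used below; the helper name is ours):

# Two-sided p-value for an observed t-score t_obs (sketch)
p_value <- function(t_obs, df) 2 * pt(-abs(t_obs), df)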
It remains to find the estimated standard error \(se_d\) and the degrees of freedom \(df\). We distinguish between two cases: either the two populations have the same variance \(\sigma_1^2 = \sigma_2^2\), or they have different variances \(\sigma_1^2 \neq \sigma_2^2\).
If the variances are equal, the standard error of \(\bar{X}_1-\bar{X}_2\) is \[\sqrt{\tfrac{\sigma_1^2}{n_1}+\tfrac{\sigma^2_2}{n_2}}.\]
We estimate the common variance by the pooled variance \[s_p^2=\tfrac{(n_1-1)s^2_1+(n_2-1)s_2^2}{n_1+n_2-2},\] which gives the estimated standard error \[se_d=\sqrt{\tfrac{s_p^2}{n_1}+\tfrac{s^2_p}{n_2}}=s_p\sqrt{\tfrac{1}{n_1}+\tfrac{1}{n_2}}.\]
The degrees of freedom are \(df=n_1+n_2-2\).
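A small R sketch of these two formulas (se_pooled is a hypothetical helper name; the numbers refer to the mtcars example below):

# Pooled standard error from group summary statistics (sketch)
se_pooled <- function(s1, s2, n1, n2) {
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)  # pooled variance
  sqrt(sp2 * (1 / n1 + 1 / n2))                               # se of the difference
}
se_pooled(3.86, 5.38, 18, 14)  # approx. 1.63 for the mtcars example below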
We return to the mtcars data. We study the association between the variables vs and mpg (engine type and fuel consumption). We will perform a significance test of the null-hypothesis that there is no difference in mean fuel consumption between the two engine types.
library(mosaic)
fv <- favstats(mpg ~ vs, data = mtcars)
fv
## vs min Q1 median Q3 max mean sd n missing
## 1 0 10.4 14.8 15.7 19.1 26.0 16.6 3.86 18 0
## 2 1 17.8 21.4 22.8 29.6 33.9 24.6 5.38 14 0
With \(d=\bar{x}_1-\bar{x}_2\approx -7.94\) and pooled variance \(s_p^2=\tfrac{17\cdot 3.86^2 + 13\cdot 5.38^2}{30}\approx 20.99\), we get \(se_d = s_p\sqrt{\tfrac{1}{18}+\tfrac{1}{14}}\approx 1.63\), so \(t_{obs} = d/se_d \approx -4.864\). The degrees of freedom are \(df=n_1 + n_2-2 = 30\).
We find the \(p\)-value:
2 * pdist("t", q = -4.864, df = 30, xlim = c(-5, 5))
## [1] 3.419648e-05
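For reference (not shown in the original output), the whole pooled analysis can be reproduced directly with t.test and var.equal = TRUE:

# Pooled two-sample t-test; should reproduce t_obs = -4.864 and df = 30 from above
t.test(mpg ~ vs, data = mtcars, var.equal = TRUE)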
If the variances are unequal, then we simply insert the two estimates \(s_1^2\) and \(s_2^2\) for \(\sigma_1^2\) and \(\sigma_2^2\) in the formula for the standard error to obtain the estimated standard error \[ se_d=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}. \]
The degrees of freedom \(df\) for \(se_d\) can be estimated by a complicated formula, which we will not present here (see p.365 in the book).
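For completeness, a sketch of this standard error together with the Welch–Satterthwaite approximation of \(df\) (this is the "complicated formula" referred to above; the helper names are ours):

# Welch standard error and approximate degrees of freedom (sketch)
se_welch <- function(s1, s2, n1, n2) sqrt(s1^2 / n1 + s2^2 / n2)
df_welch <- function(s1, s2, n1, n2) {
  v1 <- s1^2 / n1; v2 <- s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}
df_welch(3.86, 5.38, 18, 14)  # approx. 22.7, matching the example below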
We return to the mtcars data and again test the null-hypothesis that there is no difference in mean fuel consumption between the two engine types, this time without assuming equal variances.
library(mosaic)
fv <- favstats(mpg ~ vs, data = mtcars)
fv
## vs min Q1 median Q3 max mean sd n missing
## 1 0 10.4 14.8 15.7 19.1 26.0 16.6 3.86 18 0
## 2 1 17.8 21.4 22.8 29.6 33.9 24.6 5.38 14 0
Here \(se_d=\sqrt{\tfrac{3.86^2}{18}+\tfrac{5.38^2}{14}}\approx 1.70\), so \(t_{obs} = d/se_d \approx -4.667\). The degrees of freedom can be found using R (see below) to be \(df=22.716\).
We find the \(p\)-value:
2 * pdist("t", q = -4.6671, df = 22.716, xlim = c(-5, 5))
## [1] 0.0001098212
A 95% confidence interval for \(\mu_1-\mu_2\) is given as usual by \[ d \pm t_{crit} se_d,\] where the critical \(t\)-score for \(df=22.716\) is found by:
qdist("t", p = 1-0.05/2, df=22.716, xlim = c(-3, 3))
## [1] 2.07009
This gives the interval \[[-7.94 - 2.07\cdot 1.70\,;\, -7.94 + 2.07\cdot 1.70]= [-11.5\,;\,-4.4].\]
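Or numerically in R, with the values taken from above:

# Manual confidence interval; approx. -11.46 and -4.42
-7.94 + c(-1, 1) * 2.07009 * 1.70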
All of this can be done directly in R with t.test:
t.test(mpg ~ vs, data = mtcars, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: mpg by vs
## t = -4.6671, df = 22.716, p-value = 0.0001098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.462508 -4.418445
## sample estimates:
## mean in group 0 mean in group 1
## 16.61667 24.55714
In order to decide whether to use the t-test with equal or unequal variance, we may test the hypothesis \(H_0: \sigma_1^2 = \sigma_2^2\).
As test statistic we use \[F_{obs} = \frac{s_1^2}{s_2^2}.\]
If the null-hypothesis is true, we expect \(F_{obs}\) to take values close to 1. Small and large values are critical for \(H_0\).
If \(F_{obs}>1\), the \(p\)-value is two times the probability of getting something larger than \(F_{obs}\); if \(F_{obs}<1\), it is two times the probability of getting something smaller. We reject the null-hypothesis if this \(p\)-value is less than the significance level.
In the mtcars example, we first compute the sample variances:
var(mpg ~ vs, data = mtcars)
## 0 1
## 14.90500 28.93341
We compute \(F_{obs} = \frac{s_1^2}{s_2^2} = \frac{14.9}{28.9} = 0.516\).
The probability of observing something smaller than \(F_{obs}\) in an \(F\)-distribution with \(df_1=n_1-1 = 17\) and \(df_2=n_2-1 = 13\):
pdist("f", 0.516, df1=17, df2=13)
## [1] 0.1004094
The p-value is \(2\cdot 0.1004= 0.2008\). We multiply by two because the test is two-sided (large values would also have been critical).
Thus, we find no significant evidence against the null-hypothesis of equal variances.
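For reference (not part of the original example), base R provides exactly this \(F\)-test as var.test:

# F-test for equality of two variances; should match the manual calculation above
var.test(mpg ~ vs, data = mtcars)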
We now consider the case where we have two samples from two different populations but the observations in the two samples are paired.
We apply the one-sample t-test from Lecture 2.1 to test whether the mean difference is zero.
Example: We consider a dataset with the reaction times of students, measured with and without speaking on the phone. The variables are:
student (integer): a simple id.
reaction_time (numeric): average reaction time in milliseconds.
phone (factor): yes/no, indicating whether the student was speaking on the phone.
We read in the data:
reaction <- read.delim("https://asta.math.aau.dk/datasets?file=reaction.txt")
head(reaction, n = 3)
## student reaction_time phone
## 1 1 604 no
## 2 2 556 no
## 3 3 540 no
# Split the data according to the value of phone
yes <- subset(reaction, phone == "yes")
no <- subset(reaction, phone == "no")
# Pair the measurements by student and compute the differences
reaction_diff <- data.frame(student = no$student, yes = yes$reaction_time, no = no$reaction_time)
reaction_diff$diff <- reaction_diff$yes - reaction_diff$no
head(reaction_diff)
## student yes no diff
## 1 1 636 604 32
## 2 2 623 556 67
## 3 3 615 540 75
## 4 4 672 522 150
## 5 5 601 459 142
## 6 6 600 544 56
t.test( ~ diff, data = reaction_diff)
##
## One Sample t-test
##
## data: diff
## t = 5.4563, df = 31, p-value = 5.803e-06
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 31.70186 69.54814
## sample estimates:
## mean of x
## 50.625
With a \(p\)-value of 0.0000058 we reject the null-hypothesis that speaking on the phone has no influence on the reaction time.
We can avoid the manual calculations and let R perform the significance test by using t.test with paired = TRUE:
t.test(reaction_time ~ phone, data = reaction, paired = TRUE)
##
## Paired t-test
##
## data: reaction_time by phone
## t = -5.4563, df = 31, p-value = 5.803e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -69.54814 -31.70186
## sample estimates:
## mean of the differences
## -50.625
An explanatory variable (also called an independent variable or covariate) is a variable that divides the data into groups.
We are interested in the effect of the explanatory variable on the response variable.
For instance, in the mtcars data, mpg is the response variable and vs is the explanatory variable.
In this lecture we consider the case with one discrete explanatory variable. Module 3 is concerned with the case of one or more continuous variables.
We are now going to consider a situation where we have \(k\) populations with mean values \(\mu_1,\ldots,\mu_k\).
We assume that each population follows a normal distribution and that the standard deviation is the same in all populations.
We are interested in the null-hypothesis that all \(k\) populations have the same mean, i.e.
\[H_0: \mu_1 = \dotsm=\mu_k, \qquad H_a: \text{ not all } \mu_1,\ldots,\mu_k \text{ are the same}.\]
We take a sample from each population.
Example: The dataset chickwts is available in R, and on the course webpage. It contains the variables:
weight: a numeric variable giving the chicken weight.
feed: a factor giving the feed type.
We compare the weight distributions for the different feed types using box plots:
library(mosaic)
gf_boxplot(weight ~ feed, data = chickwts)
We estimate the mean in each group by the sample mean inside that group, i.e. \(\hat{\mu}_i = \bar{x}_i\), \(i=1,\ldots, k\).
We use mean to find the sample mean of each group:
mean(weight ~ feed, data = chickwts)
## casein horsebean linseed meatmeal soybean sunflower
## 323.5833 160.2000 218.7500 276.9091 246.4286 328.9167
For example, the estimated mean weight is \(323.58\) when feed=casein, but \(160.2\) when feed=horsebean.
We can also fit this model with lm, which additionally provides standard errors and tests:
model <- lm(weight ~ feed, data = chickwts)
summary(model)
##
## Call:
## lm(formula = weight ~ feed, data = chickwts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -123.909 -34.413 1.571 38.170 103.091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 323.583 15.834 20.436 < 2e-16 ***
## feedhorsebean -163.383 23.485 -6.957 2.07e-09 ***
## feedlinseed -104.833 22.393 -4.682 1.49e-05 ***
## feedmeatmeal -46.674 22.896 -2.039 0.045567 *
## feedsoybean -77.155 21.578 -3.576 0.000665 ***
## feedsunflower 5.333 22.393 0.238 0.812495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.85 on 65 degrees of freedom
## Multiple R-squared: 0.5417, Adjusted R-squared: 0.5064
## F-statistic: 15.36 on 5 and 65 DF, p-value: 5.936e-10
There are 6 different feeds. R chooses the lexicographically smallest, casein, as the reference group.
Intercept is the estimated mean \(\hat{\mu}_{casein}=323.583\) in the reference group.
feedhorsebean estimates the contrast \(\alpha_{horsebean}\) between the casein and horsebean groups to be \(\hat{\alpha}_{horsebean}=-163.383\); the estimated mean in the horsebean group is thus \(323.583 - 163.383 = 160.2\).
Alternatively, the hypotheses can be stated in terms of contrasts: \[H_0: \alpha_2 = \alpha_3 = \dots=\alpha_k =0, \quad H_a: \text{ at least one contrast is non-zero}.\]
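As a quick sketch of how the contrasts combine, using base R's coef on the model fitted above:

# Reference-group mean plus contrast recovers the horsebean group mean (approx. 160.2)
coef(model)["(Intercept)"] + coef(model)["feedhorsebean"]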
Idea: Compare variation within groups and variation between groups.
We use the test statistic \[F_{obs} = \frac{(TSS-SSE)/(k-1)}{SSE/(n-k)},\] where \(TSS\) is the total sum of squares, \(SSE\) is the sum of squared errors within the groups, \(n\) is the total number of observations, and \(k\) is the number of groups.
Thus, \[F_{obs} = \frac{\text{variation between groups}}{\text{variation within groups}}.\]
A large variation between groups compared to the variation within groups points against \(H_0\).
Thus, large values are critical for the null-hypothesis.
Under the null-hypothesis, \(F_{obs}\) follows an \(F\)-distribution with \(df_1=k-1\) and \(df_2=n-k\) degrees of freedom.
A \(p\)-value for the null-hypothesis is the probability of observing something larger than \(F_{obs}\) in an \(F\)-distribution with \(df_1\) and \(df_2\) degrees of freedom.
For instance if \(F_{obs}=15.36\) with \(df_1=5\) and \(df_2=65\) degrees of freedom:
1 - pdist("f", 15.36, df1=5, df2=65)
## [1] 5.967948e-10
In practice, we read \(F_{obs}\) and the \(p\)-value directly from the summary of the fitted model:
model <- lm(weight ~ feed, data = chickwts)
summary(model)
##
## Call:
## lm(formula = weight ~ feed, data = chickwts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -123.909 -34.413 1.571 38.170 103.091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 323.583 15.834 20.436 < 2e-16 ***
## feedhorsebean -163.383 23.485 -6.957 2.07e-09 ***
## feedlinseed -104.833 22.393 -4.682 1.49e-05 ***
## feedmeatmeal -46.674 22.896 -2.039 0.045567 *
## feedsoybean -77.155 21.578 -3.576 0.000665 ***
## feedsunflower 5.333 22.393 0.238 0.812495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.85 on 65 degrees of freedom
## Multiple R-squared: 0.5417, Adjusted R-squared: 0.5064
## F-statistic: 15.36 on 5 and 65 DF, p-value: 5.936e-10
We recognize \(F_{obs}=15.36\) and the \(p\)-value \(5.936\cdot 10^{-10}\) in the bottom line of the output. Thus, we reject the null-hypothesis: there is a significant effect of feed.
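For reference (not shown in the lecture output), the same \(F\)-test can also be displayed as a classical analysis-of-variance table with base R's anova:

# ANOVA table for the fitted model; contains the same F-statistic and p-value
anova(model)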