Comparison of two groups

The ASTA team

Response variable and explanatory variable

We conduct an experiment, where we at random choose 50 IT-companies and 50 service companies and measure their profit ratio. Is there association between company type (IT/service) and profit ratio?
In other words we compare samples from 2 different populations. For each company we register:
- The binary variable company type, which is called the explanatory variable and divides data in 2 groups.
- The quantitative variable profit ratio, which is called the response variable.

Dependent/independent samples

In the example with profit ratio of 50 IT-companies and 50 service companies we have independent samples, since the same company cannot be in both groups.
Now, think of another type of experiment, where we at random choose 50 IT-companies and measure their profit ratio in both 2009 and 2010. Then we may be interested in whether there is association between year and profit ratio?
In this example we have dependent samples, since the same company is in both groups.
Dependent samples may also be referred to as paired samples.

Comparison of two means (Independent samples)

We consider the situation, where we have two quantitative samples:
- Population 1 has mean \(\mu_1\), which is estimated by \(\hat{\mu}_1=\bar{y}_1\) based on a sample of size \(n_1\).
- Population 2 has mean \(\mu_2\), which is estimated by \(\hat{\mu}_2=\bar{y}_2\) based on a sample of size \(n_2\).
- We are interested in the difference \(\mu_2-\mu_1\), which is estimated by \(d=\bar{y}_2-\bar{y}_1\).
- Assume that we can find the estimated standard error \(se_d\) of the difference and that this has degrees of freedom \(df\).
- Assume that the samples either are large or come from a normal population.
Then we can construct a
- confidence interval for the unknown population difference of means \(\mu_2-\mu_1\) by \[ (\bar{y}_2-\bar{y}_1)\pm t_{crit}se_d, \] where the critical \(t\)-score, \(t_{crit}\), determines the confidence level.
- significance test:
  - for the null hypothesis \(H_0:\ \mu_2-\mu_1=0\) and alternative hypothesis \(H_a:\ \mu_2-\mu_1\neq 0\).
  - which uses the test statistic: \(t_{obs} = \frac{(\bar{y}_2-\bar{y}_1) - 0}{se_d}\), that has to be evaluated in a \(t\)-distribution with \(df\) degrees of freedom.

Comparison of two means (Independent samples)

In the independent samples situation it can be shown that \[ se_d=\sqrt{se_1^2+se_2^2}, \] where \(se_1\) and \(se_2\) are estimated standard errors for the sample means in populations 1 and 2, respectively.
We recall, that for these we have \(se=\frac{s}{\sqrt{n}}\), i.e. \[ se_d=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}, \] where \(s_1\) and \(s_2\) are estimated standard deviations for population 1 and 2, respectively.
The degrees of freedom \(df\) for \(se_d\) can be estimated by a complicated formula, which we will not present here.
For the confidence interval and the significance test we note that:
- If both \(n_1\) and \(n_2\) are above 30, then we can use the standard normal distribution (\(z\)-score) rather than the \(t\)-distribution (\(t\)-score).
- If \(n_1\) or \(n_2\) are below 30, then we let R calculate the degrees of freedom and \(p\)-value/confidence interval.

Example: Comparing two means (independent samples)

We return to the Chile data. We study the association between the variables sex and statusquo (scale of support for the status-quo). So, we will perform a significance test to test for difference in the mean of statusquo for male and females.

Chile <- read.delim("https://asta.math.aau.dk/datasets?file=Chile.txt")
library(mosaic)
fv <- favstats(statusquo ~ sex, data = Chile)
fv

##   sex   min     Q1 median    Q3  max    mean    sd    n missing
## 1   F -1.80 -0.975  0.121 1.033 2.02  0.0657 1.003 1368      11
## 2   M -1.74 -1.032 -0.216 0.861 2.05 -0.0684 0.993 1315       6

Difference: \(d = 0.0657 - (-0.0684) = 0.1341\).
Estimated standard deviations: \(s_1 = 1.0032\) (females) and \(s_2 = 0.9928\) (males).
Sample sizes: \(n_1 = 1368\) and \(n_2 = 1315\).
Estimated standard error of difference: \(se_d = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{1.0032^2}{1368} + \frac{0.9928^2}{1315}} = 0.0385\).
Observed \(t\)-score for \(H_0:\ \mu_1-\mu_2=0\) is: \(\quad t_{obs} = \frac{d-0}{se_d} = \frac{0.1341}{0.0385} = 3.4786\).
Since both sample sizes are “pretty large” (> 30), we can use the \(z\)-score instead of the \(t\)-score for finding the \(p\)-value (i.e. we use the standard normal distribution):

1 - pdist("norm", q = 3.4786, xlim = c(-4, 4))

## [1] 0.0002520202

Then the \(p\)-value is \(2\cdot 0.00025 = 0.0005\), so we reject the null hypothesis.
We can leave all the calculations to R by using t.test:

t.test(statusquo ~ sex, data = Chile)

## 
##  Welch Two Sample t-test
## 
## data:  statusquo by sex
## t = 3.4786, df = 2678.7, p-value = 0.0005121
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05849179 0.20962982
## sample estimates:
## mean in group F mean in group M 
##      0.06570627     -0.06835453

We recognize the \(t\)-score \(3.4786\) and the \(p\)-value \(0.0005\). The estimated degrees of freedom \(df = 2679\) is so large that we can not tell the difference between results obtained using \(z\)-score and \(t\)-score.

Comparison of two means: confidence interval (independent samples)

We have already found all the ingredients to construct a confidence interval for \(\mu_2-\mu_1\):
- \(d=\bar{y}_2-\bar{y}_1\) estimates \(\mu_2-\mu_1\).
- \(se_d=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\) estimates the standard error of \(d\).
Then: \[ d\pm t_{crit}se_d \] is a confidence interval for \(\mu_2-\mu_1\).
The critical \(t\)-score, \(t_{crit}\) is chosen corresponding to the wanted confidence level. If \(n_1\) and \(n_2\) both are greater than 30, then \(t_{crit} = 2\) yields a confidence level of approximately 95%.

Comparison of two means: paired \(t\)-test (dependent samples)

Experiment:
- You choose 10 Netto stores at random, where you measure the average expedition time by the cash registers over some period of time.
- Now, new cash registers are installed in all 10 stores, and you repeat the experiment.
It is interesting to investigate whether or not the new cash registers have changed the expedition time.
So we have 2 samples corresponding to old/new technology. In this case we have dependent samples, since we have 2 measurement in each store.
We use the following strategy for analysis:
- For each store calculate the change in average expedition time when we change from old to new technology.
- The changes \(d_1,d_2,\ldots,d_{10}\) are now considered as ONE sample from a population with mean \(\mu\).
- Test the hypothesis \(H_0: \mu=0\) as usual (using a \(t\)-test for testing the mean as in the previous lecture).

Netto store example

Data is organized in a data frame with 2 variables, before and after, containing the average expedition time before and after installation of the new technology. Instead of doing manual calculations we let R perform the significance test (using t.test with paired = TRUE as our samples are paired/dependent):

Netto <- read.delim("https://asta.math.aau.dk/datasets?file=Netto.txt")
head(Netto, n = 3)

##     before    after
## 1 3.730611 3.440214
## 2 2.623338 2.314733
## 3 3.795295 3.586334

t.test(Netto$before, Netto$after, paired = TRUE)

## 
##  Paired t-test
## 
## data:  Netto$before and Netto$after
## t = 5.7204, df = 9, p-value = 0.0002868
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1122744 0.2591578
## sample estimates:
## mean of the differences 
##               0.1857161

With a \(p\)-value of \(0.00029\) we reject that the expedition time is the same after installing new technology.

Comparison of two proportions

We consider the situation, where we have two qualitative samples and we investigate whether a given property is present or not:
- Let the proportion of population 1 which has the property be \(\pi_1\), which is estimated by \(\hat{\pi}_1\) based on a sample of size \(n_1\).
- Let the proportion of population 2 which has the property be \(\pi_2\), which is estimated by \(\hat{\pi}_2\) based on a sample of size \(n_2\).
- We are interested in the difference \(\pi_2-\pi_1\), which is estimated by \(d=\hat{\pi}_2-\hat{\pi}_1\).
- Assume that we can find the estimated standard error \(se_d\) of the difference.
Then we can construct
- an approximate confidence interval for the difference, \(\pi_2 - \pi_1\).
- a significance test.

Comparison of two proportions: Independent samples

In the situation where we have independent samples we know that \[ se_d=\sqrt{se_1^2+se_2^2}, \] where \(se_1\) and \(se_2\) are the estimated standard errors for the sample proportion in population 1 and 2, respectively.
We recall, that these are given by \(se=\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}\), i.e. \[ se_d = \sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1}+\frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}. \]
A (approximate) confidence interval for \(\pi_2-\pi_1\) is obtained by the usual construction:
\[ (\hat{\pi}_2-\hat{\pi}_1)\pm z_{crit}se_d, \] where the critical \(z\)-score determines the confidence level.

Approximate test for comparing two proportions (independent samples)

We consider the null hypothesis \(H_0\): \(\pi_1=\pi_2\) (equivalently \(H_0: \pi_1 - \pi_2 = 0\)) and the alternative hypothesis \(H_a\): \(\pi_1 \neq \pi_2\).
Assuming \(H_0\) is true, we have a common proportion \(\pi\), which is estimated by \[ \hat{\pi}=\frac{n_1\hat{\pi}_1+n_2\hat{\pi}_2}{n_1+n_2}, \] i.e. we aggregate the populations and calculate the relative frequency of the property (with other words: we estimate the proportion, \(\pi\), as if the two samples were one).
Rather than using the estimated standard error of the difference from previous, we use the following that holds under \(H_0\): \[ se_0=\sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \]
The observed test statistic/\(z\)-score for \(H_0\) is then: \[ z_{obs}=\frac{(\hat{\pi}_2-\hat{\pi}_1) - 0}{se_0}, \] which is evaluated in the standard normal distribution.
The \(p\)-value is calculated in the usual way.

WARNING: The approximation is only good, when \(n_1\hat{\pi},\ n_1(1-\hat{\pi}),\ n_2\hat{\pi},\ n_2(1-\hat{\pi})\) all are greater than 5.

Example: Approximate confidence interval and test for comparing proportions

We return to the Chile dataset. We make a new binary variable indicating whether the person intends to vote no or something else (and we remember to tell R that it should think of this as a grouping variable, i.e. a factor):

Chile$voteNo <- relevel(factor(Chile$vote == "N"), ref = "TRUE")

We study the association between the variables sex and voteNo:

tab <- tally( ~ sex + voteNo, data = Chile, useNA = "no")
tab

##    voteNo
## sex TRUE FALSE
##   F  363   946
##   M  526   697

This gives us all the ingredients needed in the hypothesis test:

Estimated proportion of men that vote no: \(\hat{\pi}_1=\frac{526}{526+697}=0.430\)
Estimated proportion of women that vote no: \(\hat{\pi}_2=\frac{363}{363+946}=0.277\)
Estimated common proportion: \(\hat{\pi}=\frac{1223 \times 0.430 + 1309 \times 0.277}{1309+1223}=\frac{526 + 363}{1309+1223}=0.351.\)
Estimated difference \(d=\hat{\pi}_2-\hat{\pi}_1=0.277-0.430=-0.153\)

Further,

Standard error of difference:
\(se_d=\sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1}+\frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}} = \sqrt{\frac{0.430(1-0.430)}{1223}+\frac{0.277(1-0.277)}{1309}}= 0.0188\).
Approximate 95% confidence interval for difference: \(d\pm 1.96se_d=(-0.190, -0.116)\).
Standard error of difference when \(H_0:\ \pi_1=\pi_2\) is true:
\(se_0=\sqrt{\hat{\pi}(1-\hat{\pi})(\frac{1}{n_1}+\frac{1}{n_2})} = 0.0190\).
The observed test statistic/\(z\)-score: \(z_{obs}=\frac{d}{se_0}=-8.06\). The test for \(H_0\) against \(H_a: \pi_1\not=\pi_2\) yields a \(p\)-value that is practically zero, i.e. we can reject that the proportions are equal.

Automatic calculation in R

Chile2 <- subset(Chile, !is.na(voteNo))
prop.test(voteNo ~ sex, data = Chile2, correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  tally(voteNo ~ sex)
## X-squared = 64.777, df = 1, p-value = 8.389e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1896305 -0.1159275
## sample estimates:
##    prop 1    prop 2 
## 0.2773109 0.4300899

Fisher’s exact test

If \(n_1\hat{\pi},\ n_1(1-\hat{\pi}),\ n_2\hat{\pi},\ n_2(1-\hat{\pi})\) are not all greater than 5, then the approximate test cannot be trusted. Instead you can use Fisher’s exact test:

fisher.test(tab)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab
## p-value = 1.04e-15
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.4292768 0.6021525
## sample estimates:
## odds ratio 
##  0.5085996

Again the \(p\)-value is seen to be extremely small, so we definitely reject the null hypothesis of equal voteNo proportions for women and men.

Agresti: Overview of comparison of two groups