Comparison of two groups

The ASTA team

Response variable and explanatory variable

Dependent/independent samples

Comparison of two means (Independent samples)

Example: Comparing two means (independent samples)

We return to the Chile data. We study the association between the variables sex and statusquo (a scale of support for the status quo). We will perform a significance test for a difference in the mean of statusquo between males and females.

import pandas as pd

Chile = pd.read_csv("https://asta.math.aau.dk/datasets?file=Chile.txt", sep = "\t")

stats = Chile.groupby("sex")["statusquo"].agg(
    ['mean',
     'std',
     'count'
]).reset_index() # reset_index() turns the group labels back into a regular column
stats
##   sex      mean       std  count
## 0   F  0.065706  1.003212   1368
## 1   M -0.068355  0.992803   1315
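
From these summary statistics we can compute the observed test statistic by hand. A minimal sketch (the variable names are just illustrative):

import numpy as np

# Summary statistics from the table above
mean_F, sd_F, n_F = 0.065706, 1.003212, 1368
mean_M, sd_M, n_M = -0.068355, 0.992803, 1315

se = np.sqrt(sd_F**2/n_F + sd_M**2/n_M)   # estimated standard error of the difference
t_obs = (mean_F - mean_M)/se              # observed test statistic, approx. 3.4786
print(t_obs)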

The upper-tail probability of the observed test statistic, approximated by the standard normal distribution, is:

from scipy.stats import norm
1 - norm.cdf(x = 3.4786, loc = 0, scale = 1)
## np.float64(0.00025202016841718855)

For a two-sided test the \(p\)-value is twice this tail probability, i.e. roughly 0.0005, which agrees well with the result of the \(t\)-test below:

from scipy.stats import ttest_ind

female = Chile.loc[Chile['sex'] == "F", 'statusquo'].dropna()
male = Chile.loc[Chile['sex'] == "M", 'statusquo'].dropna()

stat, pval = ttest_ind(female, male, equal_var = False)

print("t-statistic:", stat)
## t-statistic: 3.4785834945762018
print("p-value:", pval)
## p-value: 0.0005121107038059029

Comparison of two means: confidence interval (independent samples)
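
For the Chile data, an approximate 95% confidence interval for the difference in mean statusquo between females and males can be computed from the summary statistics above. A minimal sketch (the variable names are just illustrative):

import numpy as np
from scipy.stats import norm

# Summary statistics from the table above
mean_F, sd_F, n_F = 0.065706, 1.003212, 1368
mean_M, sd_M, n_M = -0.068355, 0.992803, 1315

est = mean_F - mean_M                     # estimated difference in means
se = np.sqrt(sd_F**2/n_F + sd_M**2/n_M)   # estimated standard error of the difference
z = norm.ppf(0.975)                       # approx. 1.96 for a 95% interval

print(est - z*se, est + z*se)             # roughly (0.059, 0.210)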

Comparison of two means: paired \(t\)-test (dependent samples)

Reaction time example

reaction = pd.read_csv("https://asta.math.aau.dk/datasets?file=reaction.txt", sep = "\t")
reaction.head(3)
##    student  reaction_time phone
## 0        1            604    no
## 1        2            556    no
## 2        3            540    no

Instead of doing the calculations by hand, we let the software perform the significance test (using a paired test, since our samples are dependent):

from scipy.stats import ttest_rel

yes = reaction[reaction['phone'] == "yes"]
no  = reaction[reaction['phone'] == "no"]
# Check that the two groups are paired, i.e. that they line up student by student
print(all(yes['student'].values == no['student'].values))
## True
stat, pval = ttest_rel(no["reaction_time"], yes["reaction_time"])
print("t-statistic:", stat)
## t-statistic: -5.456300665835772
print("p-value:", pval)
## p-value: 5.803405318112956e-06
Equivalently, we can perform a one-sample \(t\)-test on the differences:

from scipy.stats import ttest_1samp
diff = no["reaction_time"].values - yes["reaction_time"].values
stat, pval = ttest_1samp(diff, popmean = 0)
print("t-statistic:", stat)
## t-statistic: -5.456300665835772
print("p-value:", pval)
## p-value: 5.803405318112956e-06

Comparison of two proportions

Comparison of two proportions: Independent samples

Approximate test for comparing two proportions (independent samples)

WARNING: The approximation is only good when \(n_1\hat{\pi},\ n_1(1-\hat{\pi}),\ n_2\hat{\pi},\ n_2(1-\hat{\pi})\) are all greater than 5.
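
A minimal sketch of the approximate test of \(H_0: \pi_1 = \pi_2\), with hypothetical counts y1, y2 and sample sizes n1, n2, using the pooled estimate \(\hat{\pi} = (y_1+y_2)/(n_1+n_2)\) in the standard error (the function name is just illustrative):

import numpy as np
from scipy.stats import norm

def two_proportion_ztest(y1, n1, y2, n2):
    # Sample proportions and the pooled estimate under the null hypothesis
    p1, p2 = y1/n1, y2/n2
    p_pool = (y1 + y2)/(n1 + n2)
    # Standard error of p1 - p2 under the null hypothesis and the observed z-statistic
    se0 = np.sqrt(p_pool*(1 - p_pool)*(1/n1 + 1/n2))
    z = (p1 - p2)/se0
    # Two-sided p-value; norm.sf(x) is an accurate version of 1 - norm.cdf(x)
    pval = 2*norm.sf(abs(z))
    return z, pval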

Example: Approximate confidence interval and test for comparing proportions

We return to the Chile dataset. We make a new binary variable voteNo indicating whether or not the person intends to vote no (and we tell the software to treat it as a categorical grouping variable):

import numpy as np

Chile["vote"].value_counts()
## vote
## N    889
## Y    868
## U    588
## A    187
## Name: count, dtype: int64
Chile["vote"].value_counts(dropna = False)
## vote
## N      889
## Y      868
## U      588
## A      187
## NaN    168
## Name: count, dtype: int64
# Step 1: initialize with NA
voteNo = pd.Series([np.nan] * len(Chile), dtype = "object")
# Step 2: mark TRUE where vote == "N"
voteNo[Chile["vote"] == "N"] = True
# Step 3: mark FALSE where vote != "N" (but not NA)
voteNo[(Chile["vote"].notna()) & (Chile["vote"] != "N")] = False

Chile["voteNo"] = pd.Categorical(
    voteNo,
    categories=[True, False],
    ordered=False
)
Chile["voteNo"].value_counts(dropna = False)
## voteNo
## False    1643
## True      889
## NaN       168
## Name: count, dtype: int64
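
A more compact but equivalent construction (a sketch using pandas' where to keep the missing values):

Chile["voteNo"] = pd.Categorical(
    (Chile["vote"] == "N").where(Chile["vote"].notna()),  # True/False, NaN kept as NaN
    categories=[True, False]
)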

We study the association between the variables sex and voteNo:

tab = pd.crosstab(Chile["sex"], Chile["voteNo"], dropna = True)
tab
## voteNo  True  False
## sex                
## F        363    946
## M        526    697

This gives us all the ingredients needed in the hypothesis test:

Example: Approximate confidence interval (cont.)
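
From the table, the estimated proportions of intended 'no' voters are \(\hat{\pi}_F = 363/1309\) and \(\hat{\pi}_M = 526/1223\). A minimal sketch of the approximate 95% confidence interval for \(\pi_F - \pi_M\), using the unpooled standard error (the variable names are just illustrative):

import numpy as np
from scipy.stats import norm

# Counts of "no" voters and group sizes from the table above
y_F, n_F = 363, 363 + 946
y_M, n_M = 526, 526 + 697

p_F, p_M = y_F/n_F, y_M/n_M
se = np.sqrt(p_F*(1 - p_F)/n_F + p_M*(1 - p_M)/n_M)   # unpooled standard error
z = norm.ppf(0.975)                                   # approx. 1.96 for a 95% interval

print(p_F - p_M - z*se, p_F - p_M + z*se)   # roughly (-0.190, -0.116), as in the Wald interval below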

Example: \(p\)-value (cont.)
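
A sketch of the corresponding approximate test of \(H_0: \pi_F = \pi_M\), using the pooled estimate \(\hat{\pi} = (363 + 526)/(1309 + 1223)\) in the standard error:

import numpy as np
from scipy.stats import norm

y_F, n_F = 363, 1309
y_M, n_M = 526, 1223

p_F, p_M = y_F/n_F, y_M/n_M
p_pool = (y_F + y_M)/(n_F + n_M)                      # pooled estimate under the null hypothesis
se0 = np.sqrt(p_pool*(1 - p_pool)*(1/n_F + 1/n_M))    # standard error under the null hypothesis
z_obs = (p_F - p_M)/se0                               # roughly -8.05
pval = 2*norm.sf(abs(z_obs))                          # two-sided p-value, about 8.4e-16
print(z_obs, pval)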

Automatic calculation

from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import confint_proportions_2indep

Chile2 = Chile.dropna(subset=["voteNo"])

counts = Chile2.groupby("sex")["voteNo"].apply(lambda x: (x == True).sum())
nobs   = Chile2.groupby("sex")["voteNo"].count()

print("Counts (successes):\n", counts)
## Counts (successes):
##  sex
## F    363
## M    526
## Name: voteNo, dtype: int64
print("Totals (nobs):\n", nobs)
## Totals (nobs):
##  sex
## F    1309
## M    1223
## Name: voteNo, dtype: int64
print("sample estimates:\n")
## sample estimates:
print(counts/nobs)
## sex
## F    0.277311
## M    0.430090
## Name: voteNo, dtype: float64
# Two-sample proportion z-test
stat, pval = proportions_ztest(count=counts, nobs=nobs, value=0, alternative='two-sided')

ci_low, ci_high = confint_proportions_2indep(
    count1=counts.iloc[0], nobs1=nobs.iloc[0],
    count2=counts.iloc[1], nobs2=nobs.iloc[1],
    method="wald", alpha=0.05
)

print("p-value:", pval)
## p-value: 8.389098566796607e-16
print(f"95% CI for difference: ({ci_low:.3f}, {ci_high:.3f})")
## 95% CI for difference: (-0.190, -0.116)

Fisher’s exact test

Fisher's exact test does not rely on the normal approximation and can be used even when the counts are small. We apply it to the \(2\times 2\) table:

from scipy.stats import fisher_exact

oddsratio, pvalue = fisher_exact(tab)
print("Odds ratio:", oddsratio)
## Odds ratio: 0.5084667079317358
print("p-value:", pvalue)
## p-value: 1.0396837491279301e-15

Agresti: Overview of comparison of two groups