The ASTA team
In the dataset popularKids, we study the association between the factors Goals and Urban.Rural:
Urban.Rural: The students were selected from urban, suburban, and rural schools.
Goals: The students indicated whether good grades, athletic ability, or popularity was most important to them.
In total, 478 students from grades 4-6 were sampled.
Based on the sample we make a cross tabulation of the factors and get a so-called contingency table.
popKids <- read.delim("https://asta.math.aau.dk/datasets?file=PopularKids.txt")
library(mosaic)
tab <- tally(~Urban.Rural + Goals, data = popKids, margins = TRUE)
tab
## Goals
## Urban.Rural Grades Popular Sports Total
## Rural 57 50 42 149
## Suburban 87 42 22 151
## Urban 103 49 26 178
## Total 247 141 90 478
We can also compute the relative frequencies of Goals for each level of Urban.Rural, i.e. the sum in each row of the table is \(1\) (up to rounding):
## Goals
## Urban.Rural Grades Popular Sports Sum
## Rural 0.383 0.336 0.282 1.000
## Suburban 0.576 0.278 0.146 1.000
## Urban 0.579 0.275 0.146 1.000
## Total 0.517 0.295 0.188 1.000
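One way to compute such row proportions yourself is with base R's prop.table (a sketch; it omits the Sum row and column shown above):
tab <- tally(~Urban.Rural + Goals, data = popKids)
round(prop.table(tab, margin = 1), 3)   # each row of counts divided by its row total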
Here we will talk about the conditional distribution of Goals given Urban.Rural.
An important question could be: Is the distribution of Goals the same for each level of Urban.Rural?
There is (almost) no difference between urban and suburban, but it looks like rural is different.
Recall that two factors are independent when there is no difference between the population’s distributions of one factor given the levels of the other factor.
Otherwise the factors are said to be dependent.
If we e.g. have the following conditional population distributions of Goals given Urban.Rural:
## Goals
## Urban.Rural Grades Popular Sports
## Rural 0.5 0.3 0.2
## Suburban 0.5 0.3 0.2
## Urban 0.5 0.3 0.2
Then the factors Goals and Urban.Rural are independent.
We take a sample and “measure” the factors \(F_1\) and \(F_2\), e.g. Goals and Urban.Rural for a random child.
The hypothesis of interest today is: \[H_0: F_1 \mbox{ and } F_2 \mbox{ are independent, } \quad H_a: F_1 \mbox{ and } F_2 \mbox{ are dependent.}\]
The best estimate of the distribution of Goals is the relative frequencies in the sample:
tab <- tally(~Urban.Rural + Goals, data = popKids)
n <- margin.table(tab)
pctGoals <- round(margin.table(tab, 2) / n, 3)
pctGoals
## Goals
## Grades Popular Sports
## 0.517 0.295 0.188
If we assume independence, then this is also a guess of the conditional distributions of Goals given Urban.Rural.
The corresponding expected counts in the sample are then:
## Goals
## Urban.Rural Grades Popular Sports Sum
## Rural 77.0 (0.517) 44.0 (0.295) 28.1 (0.188) 149.0 (1.000)
## Suburban 78.0 (0.517) 44.5 (0.295) 28.4 (0.188) 151.0 (1.000)
## Urban 92.0 (0.517) 52.5 (0.295) 33.5 (0.188) 178.0 (1.000)
## Sum 247.0 (0.517) 141.0 (0.295) 90.0 (0.188) 478.0 (1.000)
We note that:
The relative frequency for a given column is the column total divided by the table total. For example, for Grades it is \(\frac{247}{478} = 0.517\).
The expected count in a given cell is then the column's relative frequency multiplied by the cell's row total. For example, for Rural and Grades: \(149\times 0.517 = 77.0\) (see the sketch after the tables below).
This can be summarized in the observed table:
tab
## Goals
## Urban.Rural Grades Popular Sports
## Rural 57 50 42
## Suburban 87 42 22
## Urban 103 49 26
and the corresponding expected table:
## Goals
## Urban.Rural Grades Popular Sports Sum
## Rural 77.0 44.0 28.1 149.0
## Suburban 78.0 44.5 28.4 151.0
## Urban 92.0 52.5 33.5 178.0
## Sum 247.0 141.0 90.0 478.0
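As a sketch (not part of the original notes), the whole expected table can be computed in one step as the outer product of the row totals and the column relative frequencies:
tab <- tally(~Urban.Rural + Goals, data = popKids)
expected <- outer(rowSums(tab), colSums(tab) / sum(tab))   # row total times column proportion
round(expected, 1)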
If these tables are “far from each other”, then we reject \(H_0\). We want to measure the distance via the Chi-squared test statistic:
\(X^2=\sum\frac{(f_o-f_e)^2}{f_e}\), where the sum runs over all cells in the table,
\(f_o\) is the observed frequency in a cell, and
\(f_e\) is the corresponding frequency in the expected table.
We have: \[X^2_{obs} = \frac{(57-77)^2}{77} + \ldots + \frac{(26-33.5)^2}{33.5}=18.8\]
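The same number can be computed directly in R; a minimal sketch, reusing the observed counts and the outer-product expected table from above:
observed <- tally(~Urban.Rural + Goals, data = popKids)
expected <- outer(rowSums(observed), colSums(observed) / sum(observed))
sum((observed - expected)^2 / expected)   # approximately 18.8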
Is this a large distance?
We want to test the hypothesis \(H_0\) of independence in a table with \(r\) rows and \(c\) columns:
We take a sample and calculate \(X^2_{obs}\) - the observed value of the test statistic.
p-value: Assume \(H_0\) is true. What is then the chance of obtaining a larger \(X^2\) than \(X^2_{obs}\), if we repeat the experiment?
This can be approximated by the \(\mathbf{\chi^2}\)-distribution with \(df=(r-1)(c-1)\) degrees of freedom.
For Goals and Urban.Rural we have \(r=c=3\), i.e. \(df=4\) and \(X^2_{obs}=18.8\), so the p-value is:
1 - pdist("chisq", 18.8, df = 4)
## [1] 0.0008603303
In R we can let the function chisq.test do all of these calculations for the test of independence between Goals and Urban.Rural:
tab <- tally(~ Urban.Rural + Goals, data = popKids)
testStat <- chisq.test(tab, correct = FALSE)
testStat
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 18.828, df = 4, p-value = 0.0008497
The expected counts under \(H_0\) are stored in the output:
testStat$expected
## Goals
## Urban.Rural Grades Popular Sports
## Rural 76.99372 43.95188 28.05439
## Suburban 78.02720 44.54184 28.43096
## Urban 91.97908 52.50628 33.51464
If we only have the table of counts (and not the raw data), we can enter it into R by hand:
data <- c(57, 87, 103, 50, 42, 49, 42, 22, 26)
tab <- matrix(data, nrow = 3, ncol = 3)  # the counts are filled in column by column
row.names(tab) <- c("Rural", "Suburban", "Urban")
colnames(tab) <- c("Grades", "Popular", "Sports")
tab
## Grades Popular Sports
## Rural 57 50 42
## Suburban 87 42 22
## Urban 103 49 26
chisq.test(tab)
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 18.828, df = 4, p-value = 0.0008497
The \(\chi^2\)-distribution with \(df\) degrees of freedom:
Is never negative. And \(X^2=0\) only happens if \(f_e = f_o\).
Has mean \(\mu=df\)
Has standard deviation \(\sigma=\sqrt{2df}\)
Is skewed to the right, but approaches a normal distribution when \(df\) grows.
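A quick simulation check of the mean and standard deviation for \(df=4\) (a sketch, not part of the original material):
sims <- rchisq(100000, df = 4)
c(mean = mean(sims), sd = sd(sims))   # close to df = 4 and sqrt(2*4) = 2.83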
For the Chi-squared statistic, \(X^2\), to be appropriate we require that all expected values satisfy \(f_e\geq 5\).
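For the popKids table this requirement can be checked directly from the expected counts (a sketch, reusing testStat from above):
all(testStat$expected >= 5)   # TRUE: all expected counts are well above 5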
Now we can summarize the ingredients in the Chi-squared test for independence.
If we reject the hypothesis of independence it can be of interest to identify the significant deviations.
In a given cell in the table, \(f_o-f_e\) is the deviation between data and the expected values under the null hypothesis.
We assume that \(f_e\geq 5\).
If \(H_0\) is true, then the standard error of \(f_o - f_e\) is given by \[se=\sqrt{f_e (1 - \textbf{row proportion}) (1 - \textbf{column proportion})}\]
The corresponding \(z\)-score \[z=\frac{f_o-f_e}{se}\] should in approximately 95% of the cells be between \(\pm 2\). Values above 3 or below -3 should only rarely occur if \(H_0\) is true.
In the popKids table, the cell Rural and Grades has \(f_e = 77.0\) and \(f_o = 57\). Here the column proportion is \(0.517\) and the row proportion is \(149/478=0.312\).
We can then calculate \[z = \frac{57-77}{\sqrt{77(1-0.517)(1-0.312)}} = -3.95\].
Compared to the null hypothesis there are way too few rural kids who find grades important.
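We can verify this \(z\)-score with a few lines of R (a sketch):
fo <- 57; fe <- 77.0
rowprop <- 149/478    # row proportion for Rural
colprop <- 247/478    # column proportion for Grades
(fo - fe) / sqrt(fe * (1 - rowprop) * (1 - colprop))   # approximately -3.95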
In summary: The standardized residuals allow for cell-by-cell (\(f_e\) vs \(f_o\)) comparison.
In R we can extract the standardized residuals from the output of chisq.test:
tab <- tally(~ Urban.Rural + Goals, data = popKids)
testStat <- chisq.test(tab, correct = FALSE)
testStat$stdres
## Goals
## Urban.Rural Grades Popular Sports
## Rural -3.9508449 1.3096235 3.5225004
## Suburban 1.7666608 -0.5484075 -1.6185210
## Urban 2.0865780 -0.7274327 -1.8186224
For a random sample of black males the General Social Survey in 1996 asked two questions:
Q1: What is your yearly income (income)?
Q2: How satisfied are you with your job (satisfaction)?
Both measurements are on an ordinal scale.
| income | VeryD | LittleD | ModerateS | VeryS |
|---|---|---|---|---|
| < 15k | 1 | 3 | 10 | 6 |
| 15-25k | 2 | 3 | 10 | 7 |
| 25-40k | 1 | 6 | 14 | 12 |
| > 40k | 0 | 1 | 9 | 11 |
We might do a chi-square test to see whether Q1 and Q2 are associated, but the test does not exploit the ordinality.
We shall consider a test that incorporates ordinality.
Consider a pair of respondents, where respondent 1 is below respondent 2 in relation to Q1.
If respondent 1 is also below respondent 2 in relation to Q2 then the pair is concordant.
If respondent 1 is above respondent 2 in relation to Q2, then the pair is discordant.
Let:
\(C=\) the number of concordant pairs in our sample.
\(D=\) the number of discordant pairs in our sample.
The gamma coefficient is then estimated by \[\hat{\gamma}=\frac{C-D}{C+D}\] (a small R sketch of counting \(C\) and \(D\) is given after the properties below).
Properties:
Gamma lies between \(-1\) and \(1\)
The sign tells whether the association is positive or negative
Large absolute values correspond to strong association
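To illustrate how \(C\) and \(D\) can be counted from a table with ordered rows and columns, here is a small sketch; the helper name concordance is made up for illustration and is not part of any package:
concordance <- function(tab) {
  nr <- nrow(tab); nc <- ncol(tab)
  C <- 0; D <- 0
  for (i in 1:nr) for (j in 1:nc) {
    # other observation in a higher row and a higher column: concordant pair
    if (i < nr && j < nc) C <- C + tab[i, j] * sum(tab[(i + 1):nr, (j + 1):nc])
    # other observation in a higher row but a lower column: discordant pair
    if (i < nr && j > 1) D <- D + tab[i, j] * sum(tab[(i + 1):nr, 1:(j - 1)])
  }
  c(C = C, D = D, gamma = (C - D) / (C + D))
}
Applied to the JobSat table below, this reproduces \(\hat{\gamma}=0.221\).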
The standard error \(se({\hat{\gamma}})\) on \({\hat{\gamma}}\) is complicated to calculate, so we leave that to software.
We can now determine a 95% confidence interval: \[{\hat{\gamma}}\pm 1.96 se({\hat{\gamma}})\] and if zero is contained in the interval, then there is no significant association, when we perform a test with a 5% significance level.
We use the R package vcdExtra, which has the function GKgamma for calculating gamma. It also has the dataset on job satisfaction and income built-in:
library(vcdExtra)
JobSat
## satisfaction
## income VeryD LittleD ModerateS VeryS
## < 15k 1 3 10 6
## 15-25k 2 3 10 7
## 25-40k 1 6 14 12
## > 40k 0 1 9 11
GKgamma(JobSat, level = 0.90)
## gamma : 0.221
## std. error : 0.117
## CI : 0.028 0.414
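Note that level = 0.90 requests a 90% confidence interval, so the critical value is qnorm(0.95) ≈ 1.645 rather than 1.96. The interval can be reproduced by hand from the estimate and standard error (a sketch):
0.221 + c(-1, 1) * qnorm(0.95) * 0.117   # approximately (0.028, 0.414)
Since zero is not in the interval, the data suggest a positive association between income and job satisfaction at the corresponding significance level.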
You have collected a sample and want to know whether the sample is representative of the people living in Hirtshals.
E.g. whether the distributions of gender, age, and profession in the sample do not differ significantly from those in Hirtshals.
You already know how to do that for binary variables like gender, but not if you e.g. have 6 age groups.
As an example we look at \(k\) groups, where data from Hjørring municipality tells us that the distribution in Hirtshals is given by the vector \[\pi = (\pi_1, \ldots, \pi_k),\] where \(\pi_i\) is the proportion belonging to group number \(i,\ i = 1, 2, \ldots, k\), in Hirtshals.
Consider the sample represented by the vector: \[O=(O_1, \ldots, O_k),\] where \(O_i\) is the observed number of individuals in group number \(i,\ \ i = 1, 2, \ldots, k\).
The total number of individuals: \[n = \sum_{i = 1}^k O_i.\]
The expected number of individuals in each group, if we have a sample from Hirtshals: \[ E_i = n \pi_i,\ i = 1, 2, \ldots, k \]
To test whether the sample follows this distribution we again use the Chi-squared test statistic \[X^2=\sum_{i=1}^k\frac{(O_i-E_i)^2}{E_i},\] which under \(H_0\) approximately follows a \(\chi^2\)-distribution with \(df = k-1\) degrees of freedom.
k <- 4
pi_vector <- c(0.3, 0.2, 0.25, 0.25)
O_vector <- c(74, 72, 40, 61)
n <- sum(O_vector)
E_vector <- n * pi_vector
E_vector
## [1] 74.10 49.40 61.75 61.75
Xsq = sum((O_vector - E_vector)^2 / E_vector)
Xsq
## [1] 18.00945
p_value <- 1 - pchisq(Xsq, df = k-1)
p_value
## [1] 0.0004378808
The same calculation can be done directly in R with chisq.test:
Xsq_test <- chisq.test(O_vector, p = pi_vector)
Xsq_test
##
## Chi-squared test for given probabilities
##
## data: O_vector
## X-squared = 18.009, df = 3, p-value = 0.0004379
Xsq_test$stdres
## [1] -0.01388487 3.59500891 -3.19602486 -0.11020775
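The standardized residuals can also be computed by hand: under \(H_0\) the standard error of \(O_i - E_i\) is \(\sqrt{E_i(1-\pi_i)}\) (a sketch, reusing the vectors defined above):
(O_vector - E_vector) / sqrt(E_vector * (1 - pi_vector))   # matches Xsq_test$stdres
Group 2 is clearly overrepresented and group 3 clearly underrepresented in the sample compared to the distribution in Hirtshals.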