The ASTA team
Suppose we study a population and we are interested in certain parameters of the population distribution, e.g. the mean μ and the standard deviation σ.
Based on a sample we can make a point estimate of the parameter. We have already seen the following examples:
We often supplement the point estimate with an interval estimate (also called a confidence interval). This is an interval around the point estimate, in which we are confident (to a certain degree) that the population parameter is located.
If we want to estimate the population mean μ we have several possibilities e.g.
Advantage of $X_T$: Very large/small observations have little effect, i.e. a few errors in the data set have practically no effect.
Disadvantage of $X_T$: If the distribution of the population is skewed, i.e. asymmetrical, then $X_T$ is biased, i.e. $E(X_T) \neq \mu$. This means that in the long run this estimator systematically over- or underestimates the value of $\mu$.
Generally we prefer that an estimator is unbiased, i.e. its expected value equals the true parameter value.
Recall that for a sample from a population with mean $\mu$, the sample mean $\bar{X}$ also has mean $\mu$. That is, $\bar{X}$ is an unbiased estimator of the population mean $\mu$.
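The difference between a biased and an unbiased estimator can be illustrated with a small simulation. This is only a sketch under stated assumptions: it assumes that $X_T$ denotes the average of the lower and upper sample quartiles (the definition is not given in the text above), and it uses an exponential population with mean 1 as an example of a skewed distribution. All variable names are illustrative.

```r
# Simulation sketch: compare the sample mean with an ASSUMED definition of X_T
# (average of the lower and upper sample quartiles) on a skewed population.
set.seed(1)
n <- 50        # sample size
reps <- 5000   # number of simulated samples

# Exponential population with rate 1: true mean is 1, distribution is right-skewed
sims <- replicate(reps, {
  x <- rexp(n, rate = 1)
  q <- quantile(x, probs = c(0.25, 0.75))
  c(xbar = mean(x), xt = mean(q))
})

# Long-run averages of the two estimators
avg_estimates <- rowMeans(sims)
print(avg_estimates)
```

Under these assumptions the long-run average of the sample mean should land close to the true mean 1, while the quartile-based estimator should settle at a clearly smaller value.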
Consider a population with population mean μ and standard deviation σ. We would like to make a 100(1−α)% confidence interval for μ.
Suppose we draw a random sample $X_1, \ldots, X_n$. As a point estimate for $\mu$ we use $\bar{X}$.
If the population follows a normal distribution or if $n \geq 30$, we may assume $\bar{X} \sim \mathrm{norm}(\mu, \sigma/\sqrt{n})$.
The z-score of $\bar{X}$ follows a standard normal distribution:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathrm{norm}(0, 1).$$
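For reference, the step from the z-score to an interval for $\mu$ can be written out as follows. This is a sketch assuming $\sigma$ is known; the symbol $z_{\alpha/2}$ is notation introduced here for the point with right tail probability $\alpha/2$ in the standard normal distribution.

```latex
\begin{align*}
1 - \alpha
  &= P\left(-z_{\alpha/2} \le Z \le z_{\alpha/2}\right) \\
  &= P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) \\
  &= P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)
\end{align*}
```

So for $100(1-\alpha)\%$ of all samples, $\mu$ lies in $\left[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}};\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$.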
The confidence interval was derived using the z-score $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$.
Problem: In practice $\sigma$ is typically unknown.
If we replace $\sigma$ by the sample standard deviation $S$, we get the t-score $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$.
Since $S$ is random with a certain variance, this causes $T$ to vary more than $Z$.
As a consequence, $T$ no longer follows a normal distribution, but a t-distribution with $n-1$ degrees of freedom.
In the situation where $\sigma$ is unknown, we use that $T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1)$.
We determine the critical t-value $t_{\alpha/2}$ such that $P(T > t_{\alpha/2}) = \alpha/2$. This implies by symmetry that $P(-t_{\alpha/2} \le T \le t_{\alpha/2}) = 1 - \alpha$.
By exactly the same computations as before, we find that for $100(1-\alpha)\%$ of all samples, $\mu$ lies in $\left[\bar{X} - t_{\alpha/2}\frac{S}{\sqrt{n}};\ \bar{X} + t_{\alpha/2}\frac{S}{\sqrt{n}}\right]$.
This interval is what we call the 100(1−α)% confidence interval for μ.
To apply the formula, we need to be able to compute the critical t-value $t_{\alpha/2}$, i.e. the value satisfying $P(T > t_{\alpha/2}) = \alpha/2$.
This can be done in R via the function `qdist`.
Note that we need the point with right tail probability $\alpha/2$, while R gives left tail probabilities. The right tail probability $\alpha/2$ corresponds to the left tail probability $1 - \alpha/2$.
So to find $t_{\alpha/2}$ with $\alpha = 0.05$ (corresponding to a 95% confidence level) in a t-distribution with 4 degrees of freedom, we type:
## [1] 2.776445
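The critical value above can be cross-checked in base R. The following is a minimal sketch that uses `qt` and `qnorm` from the `stats` package instead of `qdist`; the variable names are illustrative.

```r
# Critical t-value with right tail probability alpha/2, computed in base R
alpha <- 0.05
df <- 4

# Left tail probability 1 - alpha/2 corresponds to right tail probability alpha/2
t_crit <- qt(1 - alpha / 2, df = df)

# Same value, asking directly for the right tail
t_crit_right <- qt(alpha / 2, df = df, lower.tail = FALSE)

# Corresponding critical value in the standard normal distribution
z_crit <- qnorm(1 - alpha / 2)

print(t_crit)   # 2.776445, as in the output above
print(z_crit)
```

The t-value is larger than the corresponding z-value, which reflects that $T$ varies more than $Z$.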
We consider the dataset `mtcars`. We want to construct a 95% confidence interval for the population mean $\mu$ of the fuel consumption.
##   min     Q1 median   Q3  max     mean       sd  n missing
##  10.4 15.425   19.2 22.8 33.9 20.09062 6.026948 32       0
## [1] 2.039513
##
## One Sample t-test
##
## data: mpg
## t = 18.857, df = 31, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 17.91768 22.26357
## sample estimates:
## mean of x
## 20.09062
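The interval reported by the t-test can be reproduced step by step from the formula $\bar{X} \pm t_{\alpha/2} S/\sqrt{n}$. This is a sketch in base R, assuming the fuel consumption is the `mpg` column of `mtcars` (as the output above indicates); the variable names are illustrative.

```r
# 95% confidence interval for the mean of mpg, computed by hand and via t.test
x <- mtcars$mpg
n <- length(x)

se <- sd(x) / sqrt(n)              # estimated standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # critical t-value, 31 degrees of freedom

ci_manual <- mean(x) + c(-1, 1) * t_crit * se
ci_ttest <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_ttest)
```

Both approaches should agree with the interval 17.91768 to 22.26357 shown in the output above.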
We shall look at a built-in R dataset, `chickwts`.
Typing `?chickwts` yields a page with the following information:
An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. Newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. Their weights in grams after six weeks are given along with feed types.
`chickwts` is a data frame with 71 observations on 2 variables:
- `weight`: a numeric variable giving the chick weight.
- `feed`: a factor giving the feed type.

Calculate a confidence interval for the mean weight for each feed separately; the confidence interval runs from `lower` to `upper`, given by `mean` ± `tscore * se`:
library(mosaic) # provides favstats
cwei <- favstats(weight ~ feed, data = chickwts) # summary statistics per feed type
se <- cwei$sd / sqrt(cwei$n) # Standard errors
tscore <- qt(p = .975, df = cwei$n - 1) # t-scores for 2.5% right tail probability (qdist cannot take multiple df's)
cwei$lower <- cwei$mean - tscore * se
cwei$upper <- cwei$mean + tscore * se
cwei[, c("feed", "mean", "lower", "upper")]
## feed mean lower upper
## 1 casein 323.5833 282.6440 364.5226
## 2 horsebean 160.2000 132.5687 187.8313
## 3 linseed 218.7500 185.5610 251.9390
## 4 meatmeal 276.9091 233.3083 320.5099
## 5 soybean 246.4286 215.1754 277.6818
## 6 sunflower 328.9167 297.8875 359.9458
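As a cross-check, the same intervals can be obtained by running a one-sample t-test within each feed group. This is a sketch in base R only (no `mosaic` needed); the object name `ci_by_feed` is illustrative.

```r
# 95% confidence interval for the mean weight within each feed group
weights_by_feed <- split(chickwts$weight, chickwts$feed)

ci_by_feed <- t(sapply(weights_by_feed, function(w) {
  as.numeric(t.test(w, conf.level = 0.95)$conf.int)
}))
colnames(ci_by_feed) <- c("lower", "upper")

print(round(ci_by_feed, 4))
```

The rows should match the `lower` and `upper` columns in the table above.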
The confidence intervals can be plotted as horizontal error bars using `gf_errorbarh` (plot not shown).

We consider again the data set concerning votes in Chile.
We are interested in the unknown proportion p of females in the population of Chile.
The gender distribution in the sample is:
## sex
## F M
## 1379 1321
## sex
## F M
## 0.5107407 0.4892593
##
## 1-sample proportions test without continuity correction
##
## data: Chile$sex [with success = F]
## X-squared = 1.2459, df = 1, p-value = 0.2643
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4918835 0.5295675
## sample estimates:
## p
## 0.5107407
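For comparison, the interval can also be computed directly from the counts in the table above. This is a sketch assuming the textbook formula $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$; the helper `wald_ci` is a hypothetical name introduced here. The result can differ from the `prop.test` output in the later decimals, because `prop.test` obtains its interval by inverting the score test.

```r
# Confidence intervals for the proportion of females, from the observed counts
n_female <- 1379
n_male <- 1321
n <- n_female + n_male

p_hat <- n_female / n                  # point estimate of p
se <- sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error

# Hypothetical helper: normal-approximation interval at a given confidence level
wald_ci <- function(conf_level) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  p_hat + c(-1, 1) * z * se
}

ci95 <- wald_ci(0.95)
ci99 <- wald_ci(0.99)
print(ci95)
print(ci99)
```

A higher confidence level uses a larger critical value, so the 99% interval is wider than the 95% interval.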
The argument `correct = FALSE` is needed to make R use the “normal” formulas as on the slides and in the book. When `correct = TRUE` (the default), a mathematical correction which you have not learned about is applied and slightly different results are obtained.

We could also have computed a 99% confidence interval for the proportion of females in Chile. The corresponding critical z-value is `qdist("norm", 1 - 0.01/2)` = 2.576.

Suppose we are interested in the variance $\sigma^2$ of a population.
We draw a random sample $X_1, \ldots, X_n$ and use the sample variance $S^2$ as a point estimate of $\sigma^2$.
When the population distribution is normal, or $n \geq 30$, $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)$, where $\chi^2(n-1)$ is the chi-square distribution with $(n-1)$ degrees of freedom (see next slide).
The distribution $\chi^2(k)$ is called the chi-square distribution.
It is a continuous distribution on $(0, \infty)$.
It depends on the parameter $k$, called the degrees of freedom.
The degrees of freedom determine the shape of the distribution.
The mean value is k.
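To make a confidence interval for $\sigma^2$, we draw a random sample $X_1, \ldots, X_n$, compute $S^2$ and recall that
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1).$$
Let $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$ be the critical values in a $\chi^2$-distribution with $(n-1)$ degrees of freedom such that the right tail probabilities are $\alpha/2$ and $1-\alpha/2$, respectively.
Then $P\left(\chi^2_{1-\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2}\right) = 1 - \alpha$.
Isolating $\sigma^2$, this is equivalent to
$$P\left(\frac{(n-1)S^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right) = 1 - \alpha.$$
So we get the confidence interval for $\sigma^2$: $\left[\frac{(n-1)S^2}{\chi^2_{\alpha/2}};\ \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right]$.
A confidence interval for $\sigma$ can be found by taking square roots: $\left[\sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha/2}}};\ \sqrt{\frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}}\right]$.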
Note that these confidence intervals are not symmetric around the point estimate.
We again consider the dataset `mtcars` and construct a 95% confidence interval for the population variance $\sigma^2$ of the fuel consumption. We find the sample variance to be $6.026948^2 \approx 36.3$ using `favstats`.
##   min     Q1 median   Q3  max     mean       sd  n missing
##  10.4 15.425   19.2 22.8 33.9 20.09062 6.026948 32       0
The critical values are found using `qdist`. The degrees of freedom are $n-1$. The $\chi^2$-distribution is not symmetric, so we need to find both $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$.
## [1] 48.23189
## [1] 17.53874
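With the two critical values in hand, the interval follows from the formula $\left[(n-1)S^2/\chi^2_{\alpha/2};\ (n-1)S^2/\chi^2_{1-\alpha/2}\right]$. The following is a sketch in base R, using `qchisq` in place of `qdist` and assuming the fuel consumption is the `mpg` column of `mtcars`; the variable names are illustrative.

```r
# 95% confidence interval for the variance (and standard deviation) of mpg
x <- mtcars$mpg
n <- length(x)
s2 <- var(x)       # sample variance, about 36.3
alpha <- 0.05

# qchisq takes left tail probabilities, so the right tail probability
# alpha/2 corresponds to the left tail probability 1 - alpha/2
chi_upper <- qchisq(1 - alpha / 2, df = n - 1)  # 48.23189
chi_lower <- qchisq(alpha / 2, df = n - 1)      # 17.53874

ci_var <- c(lower = (n - 1) * s2 / chi_upper,
            upper = (n - 1) * s2 / chi_lower)
ci_sd <- sqrt(ci_var)  # interval for sigma by taking square roots

print(ci_var)
print(ci_sd)
```

Note that the larger critical value gives the lower limit, and that the resulting interval is not symmetric around the sample variance.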
When planning an experiment, one has to decide how large the sample size should be.