The ASTA team
Suppose we study a population and we are interested in certain parameters of the population distribution, e.g. the mean \(\mu\) and the standard deviation \(\sigma\).
Based on a sample we can make a point estimate of the parameter. We have already seen the following examples:
We often supplement the point estimate with an interval estimate (also called a confidence interval). This is an interval around the point estimate, in which we are confident (to a certain degree) that the population parameter is located.
If we want to estimate the population mean \(\mu\) we have several possibilities e.g.
Advantage of \(X_T\): Very large/small observations have little effect, i.e. it has practically no effect if there are a few errors in the data set.
Disadvantage of \(X_T\): If the distribution of the population is skewed, i.e. asymmetrical, then \(X_T\) is biased, i.e. \(E(X_T)\neq \mu\). This means that in the long run this estimator systematically over- or underestimates the value of \(\mu\).
Generally we prefer that an estimator is unbiased, i.e. its expected value equals the true parameter value.
Recall that for a sample from a population with mean \(\mu\), the sample mean \(\bar{X}\) also has mean \(\mu\). That is, \(\bar{X}\) is an unbiased estimator of the population mean \(\mu\).
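Unbiasedness can be illustrated with a small simulation. The following is only a sketch under assumed settings: the exponential population, the sample size of 10, the number of repetitions and the seed are all chosen here for illustration.

```r
# Simulation sketch: the sample mean is unbiased even for a skewed population.
set.seed(1)      # arbitrary seed, only for reproducibility
mu   <- 1        # true mean of an exponential population with rate 1
n    <- 10       # size of each sample
reps <- 10000    # number of simulated samples

# Draw many samples and store the sample mean of each one
xbar <- replicate(reps, mean(rexp(n, rate = 1)))

# The long-run average of the sample means should be close to mu
mean(xbar)
```

The average of the simulated sample means lands close to \(\mu = 1\), although each individual sample mean varies around it.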
Consider a population with population mean \(\mu\) and standard deviation \(\sigma\). We would like to make a \(100(1-\alpha)\%\) confidence interval for \(\mu\).
Suppose we draw a random sample \(X_1,\ldots,X_n\). As a point estimate for \(\mu\) we use \(\bar{X}\).
If the population follows a normal distribution or if \(n\geq 30\), we may assume \(\bar{X}\sim \texttt{norm}(\mu,\tfrac{\sigma}{\sqrt{n}})\).
The \(z\)-score of \(\bar{X}\) follows a standard normal distribution:
\[Z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim \texttt{norm(0,1)}.\]
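The confidence interval follows from the \(z\)-score in a few steps. The following is a sketch of the standard derivation, where \(z_{\alpha/2}\) (notation assumed here) denotes the point with right tail probability \(\alpha/2\) in the standard normal distribution. By symmetry,
\[P\left(-z_{\alpha/2}\leq \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \leq z_{\alpha/2}\right)=1-\alpha.\]
Isolating \(\mu\), this is equivalent to
\[P\left(\bar{X}-z_{\alpha/2}\tfrac{\sigma}{\sqrt{n}}\leq \mu \leq \bar{X}+z_{\alpha/2}\tfrac{\sigma}{\sqrt{n}}\right)=1-\alpha,\]
so when \(\sigma\) is known, a \(100(1-\alpha)\%\) confidence interval for \(\mu\) is
\[\left[\bar{X}-z_{\alpha/2}\tfrac{\sigma}{\sqrt{n}}; \bar{X} + z_{\alpha/2} \tfrac{\sigma}{\sqrt{n}}\right].\]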
The confidence interval was derived using the \(z\)-score \(Z= \tfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}\).
Problem: In practice \(\sigma\) is typically unknown.
If we replace \(\sigma\) by the sample standard deviation \(S\), we get the \(t\)-score \[T= \tfrac{\bar{X}-\mu}{S/\sqrt{n}}.\]
Since \(S\) is random with a certain variance, this causes \(T\) to vary more than \(Z\).
As a consequence, \(T\) no longer follows a normal distribution, but a \(t\)-distribution with \(n-1\) degrees of freedom.
In the situation where \(\sigma\) is unknown, we use that \[ T=\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim \texttt{t}(n-1).\]
We determine the critical \(t\)-value \(t_{\alpha/2}\) such that \(P(T>t_{\alpha/2}) = \alpha/2\). This implies by symmetry that \[ P(-t_{\alpha/2}\leq T \leq t_{\alpha/2})=1-\alpha.\]
By exactly the same computations as before, we find that for \(100(1-\alpha)\%\) of all samples, \(\mu\) lies in \[\left[\bar{X}-t_{\alpha/2}\tfrac{S}{\sqrt{n}}; \bar{X} + t_{\alpha/2} \tfrac{S}{\sqrt{n}}\right].\]
This interval is what we call the \(100(1-\alpha)\%\) confidence interval for \(\mu\).
To apply the formula, we need to be able to compute the critical \(t\)-value \(t_{\alpha/2}\), i.e. the value satisfying \(P(T>t_{\alpha/2})=\alpha/2\).
This can be done in R via the function qdist.
Note that we need the point with right tail probability \(\alpha/2\), while R gives the left tail probabilities. The right tail probability \(\alpha/2\) corresponds to the left tail probability \(1-\alpha/2\).
So to find \(t_{\alpha/2}\) with \(\alpha=0.05\) (corresponding to a \(95\%\) confidence level) in a \(t\)-distribution with \(4\) degrees of freedom, we type:
## [1] 2.776445
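As a minimal sketch, the same critical value can be obtained with the base R quantile function qt, used here instead of qdist so that no extra package is needed. The variable names are chosen for illustration.

```r
alpha <- 0.05

# Critical t-value: left tail probability 1 - alpha/2, 4 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = 4)
t_crit                           # 2.776445, as in the output above

# Corresponding critical value in the standard normal distribution
z_crit <- qnorm(1 - alpha / 2)
z_crit
```

The critical \(t\)-value is larger than the corresponding normal value, which reflects that \(T\) varies more than \(Z\).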
Consider the data set mtcars. We want to construct a \(95\%\) confidence interval for the population mean \(\mu\) of the fuel consumption (the variable mpg).
## min Q1 median Q3 max mean sd n missing
## 10.4 15.425 19.2 22.8 33.9 20.09062 6.026948 32 0
## [1] 2.039513
##
## One Sample t-test
##
## data: mpg
## t = 18.857, df = 31, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 17.91768 22.26357
## sample estimates:
## mean of x
## 20.09062
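The interval in the output can be reproduced directly from the formula \(\bar{X}\pm t_{\alpha/2}\tfrac{S}{\sqrt{n}}\). This is a sketch in base R; the variable names are chosen for illustration.

```r
x <- mtcars$mpg                  # fuel consumption
n <- length(x)                   # sample size, 32
xbar <- mean(x)                  # point estimate of mu
se <- sd(x) / sqrt(n)            # estimated standard error of the mean

# Critical t-value with right tail probability 0.025 and n - 1 = 31 df
t_crit <- qt(0.975, df = n - 1)

# 95% confidence interval from the formula
ci <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
ci

# The same interval as reported by t.test
t.test(x)$conf.int
```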
We shall look at a built-in R dataset, chickwts.
?chickwts yields a page with the following information:
An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. Newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. Their weights in grams after six weeks are given along with feed types.
chickwts is a data frame with 71 observations on 2 variables:
weight: a numeric variable giving the chick weight.
feed: a factor giving the feed type.
Calculate a confidence interval for the mean weight for each feed separately; the confidence interval runs from lower to upper, given by mean \(\pm\) tscore * se:
library(mosaic) # provides favstats()
cwei <- favstats(weight ~ feed, data = chickwts) # summary statistics for each feed type
se <- cwei$sd / sqrt(cwei$n) # Standard errors
tscore <- qt(p = .975, df = cwei$n - 1) # t-scores for 2.5% right tail probability (qdist cannot take multiple df's)
cwei$lower <- cwei$mean - tscore * se
cwei$upper <- cwei$mean + tscore * se
cwei[, c("feed", "mean", "lower", "upper")]
## feed mean lower upper
## 1 casein 323.5833 282.6440 364.5226
## 2 horsebean 160.2000 132.5687 187.8313
## 3 linseed 218.7500 185.5610 251.9390
## 4 meatmeal 276.9091 233.3083 320.5099
## 5 soybean 246.4286 215.1754 277.6818
## 6 sunflower 328.9167 297.8875 359.9458
The confidence intervals can be plotted using gf_errorbarh.
We consider again the data set concerning votes in Chile.
We are interested in the unknown proportion \(p\) of females in the population of Chile.
The gender distribution in the sample is:
## sex
## F M
## 1379 1321
## sex
## F M
## 0.5107407 0.4892593
##
## 1-sample proportions test without continuity correction
##
## data: Chile$sex [with success = F]
## X-squared = 1.2459, df = 1, p-value = 0.2643
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4918835 0.5295675
## sample estimates:
## p
## 0.5107407
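The output can be reproduced from the counts alone. The following sketch uses base R's prop.test with the counts from the table above (1379 females out of 1379 + 1321 = 2700), and assumes this is equivalent to the call on Chile$sex that produced the output. For comparison, it also computes the interval \(\hat{p}\pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\) by hand. To my understanding, prop.test reports a score-based interval, so the two agree closely but not exactly.

```r
x <- 1379                        # number of females in the sample
n <- 1379 + 1321                 # sample size
p_hat <- x / n                   # point estimate of p

# Interval reported by prop.test without continuity correction
res <- prop.test(x, n, correct = FALSE)
ci_prop <- as.numeric(res$conf.int)
ci_prop

# Large-sample interval computed by hand
z <- qnorm(0.975)
wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)
wald
```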
The argument correct = FALSE is needed to make R use the “normal” formulas as on the slides and in the book. When correct = TRUE (the default), a continuity correction, which you have not learned about, is applied and slightly different results are obtained.
We could also have computed a 99% confidence interval for the proportion of females in Chile. The corresponding critical \(z\)-value is qdist("norm", 1 - 0.01/2) = 2.576.
Suppose we are interested in the variance \(\sigma^2\) of a population.
We draw a random sample \(X_1,\ldots,X_n\) and use the sample variance \(S^2\) as a point estimate for \(\sigma^2\).
When the population distribution is normal, or \(n\geq 30\), \[\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1),\] where \(\chi^2(n-1)\) is the chi-square distribution with \((n-1)\) degrees of freedom (see next slide).
The distribution \(\chi^2(k)\) is called the chi-square distribution.
It is a continuous distribution on \((0,\infty)\).
It depends on a parameter \(k\) called the degrees of freedom.
The degrees of freedom determine the shape of the distribution.
The mean value is \(k\).
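The claim about the mean can be checked numerically. This is a small sketch that integrates \(x\) times the density, with \(k=5\) chosen arbitrarily for illustration.

```r
k <- 5                           # degrees of freedom, chosen for illustration

# Mean of the chi-square distribution: integral of x * density over (0, Inf)
mean_k <- integrate(function(x) x * dchisq(x, df = k),
                    lower = 0, upper = Inf)$value
mean_k                           # approximately equal to k
```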
To make a confidence interval for \(\sigma^2\), we draw a random sample \(X_1,\ldots,X_n\), compute \(S^2\) and recall that
\[\frac{(n-1)S^2}{\sigma^2}\sim \chi^2(n-1).\]
Let \(\chi^2_{\alpha/2}\) and \(\chi_{1-\alpha/2}^2\) be the critical values in a \(\chi^2\)-distribution with \((n-1)\) degrees of freedom such that the right tail probabilities are \(\alpha/2\) and \(1-\alpha/2\), respectively.
Then \[P\left(\chi^2_{1-\alpha/2} \leq \frac{(n-1)S^2}{\sigma^2} \leq\chi_{\alpha/2}^2\right)=1-\alpha.\]
Isolating \(\sigma^2\), this is equivalent to
\[P\left(\frac{(n-1)S^2}{\chi^2_{\alpha/2}} \leq \sigma^2 \leq \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right)=1-\alpha.\]
So we get the confidence interval for \(\sigma^2\): \[\left[\frac{(n-1)S^2}{\chi^2_{\alpha/2}} ; \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right].\]
A confidence interval for \(\sigma\) can be found by taking square roots: \[\left[\sqrt{\frac{(n-1)S^2}{\chi^2_{\alpha/2}} };\sqrt{ \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}}\right].\]
Note that these confidence intervals are not symmetric around the point estimate.
We return to the data set mtcars and construct a \(95\%\) confidence interval for the population variance \(\sigma^2\) of the fuel consumption. We find the sample variance to be \(6.026948^2\approx 36.3\) using favstats.
## min Q1 median Q3 max mean sd n missing
## 10.4 15.425 19.2 22.8 33.9 20.09062 6.026948 32 0
The critical values are found using qdist. The degrees of freedom are \(n-1\). The \(\chi^2\)-distribution is not symmetric, so we need to find both \(\chi^2_{\alpha/2}\) and \(\chi^2_{1-\alpha/2}\).
## [1] 48.23189
## [1] 17.53874
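Inserting the critical values into the formula gives the interval. The following is a sketch in base R, using qchisq in place of qdist; the variable names are chosen for illustration.

```r
x <- mtcars$mpg                  # fuel consumption
n <- length(x)                   # sample size, 32
s2 <- var(x)                     # sample variance, about 36.3
alpha <- 0.05

# Critical values with right tail probabilities alpha/2 and 1 - alpha/2
chi_upper <- qchisq(1 - alpha / 2, df = n - 1)   # 48.23189
chi_lower <- qchisq(alpha / 2, df = n - 1)       # 17.53874

# 95% confidence interval for the variance
ci_var <- c(lower = (n - 1) * s2 / chi_upper,
            upper = (n - 1) * s2 / chi_lower)
ci_var

# Taking square roots gives the interval for the standard deviation
sqrt(ci_var)
```

The larger critical value goes in the denominator of the lower limit, which is why the interval is not symmetric around \(S^2\).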
When planning an experiment, one has to decide how large the sample size should be.