The ASTA team
import pandas as pd
Chile = pd.read_csv("https://asta.math.aau.dk/datasets?file=Chile.txt", sep = "\t")
sex
, i.e. the gender distribution in the sample.## sex
## F 1379
## M 1321
## Name: count, dtype: int64
## sex
## F 0.510741
## M 0.489259
## Name: proportion, dtype: float64
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
counts = Chile['sex'].value_counts()
successes = counts["F"]
nobs = counts.sum()
stat, p_value = proportions_ztest(count = successes, nobs = nobs, value = 0.5)
ci_low, ci_high = proportion_confint(successes, nobs, alpha = 0.05, method = 'normal')
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
## 95% CI: (0.4919, 0.5296)
## sample estimate: 0.5107
## p-value: 0.2642
Compute for the Chile
data set the 99% and 95%-confidence intervals for the probability that a person is female:
qdist("norm", 1 - 0.01/2)
=2.576.qdist("norm", 1 - 0.05/2)
=1.96.prop.test
).The expression of the density function is of slightly complicated form and will not be stated here, instead the \(t\)-distribution is plotted below for \(df =1,2,10\) and \(\infty\).
## np.float64(2.7764451051977987)
We seek the quantile (i.e. value on the x-axis) such that we have a given right tail probability. This is the critical t-score associated with our desired level of confidence.
To get e.g. the \(t\)-score corresponding to a right tail probability of 2.5 % we have to look up the 97.5 % quantile using ppf
with p = 1 - 0.025
since ppf
looks at the area to the left hand side.
The degrees of freedom are determined by the sample size; in this example we just used df = 4 for illustration.
As the \(t\)-score giving a right probability of 2.5 % is 2.776 and the \(t\)-distribution is symmetric around 0, we have that an observation falls between -2.776 and 2.776 with probability 1 - 2 \(\cdot\) 0.025 = 95 % for a \(t\)-distribution with 4 degrees of freedom.
Ericksen
and want to construct a \(95\%\) confidence interval for the population mean \(\mu\) of the variable crime
.import numpy as np
Ericksen = pd.read_csv("https://asta.math.aau.dk/datasets?file=Ericksen.txt", sep = "\t")
stats = Ericksen['crime'].agg(
mean = 'mean',
std = 'std',
n = 'count'
)
stats
## mean 63.060606
## std 24.891073
## n 66.000000
## Name: crime, dtype: float64
## np.float64(1.9971379083920033)
I.e. we have
The confidence interval is \(\bar{y}\pm t_{crit}\frac{s}{\sqrt{n}} = (56.942 , 69.18)\)
All these calculations can be done automatically:
import pingouin as pg
res = pg.ttest(x = Ericksen['crime'], y = 0, confidence = 0.95)
res["mean"] = Ericksen['crime'].mean()
res[["mean", "T", "dof", "alternative", "p-val", "CI95%"]]
## mean T dof alternative p-val CI95%
## T-test 63.060606 20.581949 65 two-sided 3.564824e-30 [56.94, 69.18]
We shall look at a dataset chickwts
:
An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. Newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. Their weights in grams after six weeks are given along with feed types.
chickwts
is a data frame with 71 observations on 2 variables:
weight
: a numeric variable giving the chick weight.feed
: a factor giving the feed type.Calculate a confidence interval for the mean weight for each feed separately; the confidence interval is from lower
to upper
given by mean
\(\pm\)tscore * se
:
chickwts = pd.read_csv("https://asta.math.aau.dk/datasets?file=chickwts.txt", sep = "\t")
cwei = chickwts.groupby("feed")["weight"].agg(
['mean',
'std',
'count'
])
cwei["se"] = cwei["std"] / (cwei["count"])**(1/2)
cwei["tscore"] = t.ppf(1 - 0.025, df = cwei["count"] - 1)
cwei["lower"] = cwei["mean"] - cwei["tscore"] * cwei["se"]
cwei["upper"] = cwei["mean"] + cwei["tscore"] * cwei["se"]
cwei[["mean", "lower", "upper"]]
## mean lower upper
## feed
## casein 323.583333 282.644025 364.522642
## horsebean 160.200000 132.568738 187.831262
## linseed 218.750000 185.561021 251.938979
## meatmeal 276.909091 233.308259 320.509923
## soybean 246.428571 215.175378 277.681765
## sunflower 328.916667 297.887508 359.945825
import matplotlib.pyplot as plt
g = plt.errorbar(
x = cwei.index, # feed categories
y = cwei["mean"], # means
yerr = cwei["mean"] - cwei["lower"], # half-width of CI
fmt = 'o', # point marker
ecolor = 'black', # error bar color
capsize = 5 # end cap length
)
plt.xlabel("Feed")
plt.ylabel("Weight")
plt.title("Mean weight with 95% CI by feed")