The ASTA team
Picture of a “lot” of capacitors.
The word lot is used to identify several components produced in a single run.
We expect components from the same lot to be more similar.
Peter Koch has tested 269 of the capacitors in the displayed lot (one measurement for each).
Cap220=read.csv(url("https://asta.math.aau.dk/datasets?file=capacitor_lot_220_nF.txt"))[,1]
summary(Cap220)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 197.2 204.8 207.9 207.9 210.9 218.6
Last time we assumed log normality of the relative measurements: \[\ln\Big(\frac{\text{measuredValue}}{\text{nominalValue}}\Big) \sim \text{norm}(\mu,\sigma).\]
The data we considered last time did not allow us to check this assumption.
We have seen that normality can be checked with a qqplot (lecture 1.3, [WMMY] Sec. 8.8).
ln_Error=log(Cap220/220)
qqnorm(ln_Error,ylab="ln_Error")
qqline(ln_Error,lwd=2,col="red")
Normal Q-Q plot of ln_Error with the reference line added by qqline.
One can also make a test of the null-hypothesis \[H_0: \text{ the population has a normal distribution.}\]
There are several tests of normality.
Two of these are considered in [WMMY] Section 10.11: a test based on comparing two estimates of \(\sigma\) (Geary's test) and the \(\chi^2\) goodness-of-fit test.
From the sample we compute two estimates of the standard deviation, \[S_1=\sqrt{\frac{1}{n}\sum_{i=1}^n (X_i-\bar X)^2} \qquad \text{and} \qquad S_0=\sqrt{\frac{\pi}{2}}\cdot\frac{1}{n}\sum_{i=1}^n |X_i-\bar X|.\] If the population distribution is normal, both estimate \(\sigma\), and we expect that \[U=\frac{S_1}{S_0}\] is close to 1.
Under the null-hypothesis,
\[Z=\frac{\sqrt{n}(U-1)}{0.2661}\] is approximately standard normally distributed when \(n\) is large.
That is, with a significance level of 5%, we reject the null-hypothesis if \(|z_{obs}|> 1.96\).
We can do all the computations in R.
mln_E = mean(ln_Error)                            # sample mean of ln_Error
s1 = sqrt(mean((ln_Error - mln_E)^2))             # S1: root mean squared deviation
s0 = sqrt(pi/2) * mean(abs(ln_Error - mln_E))     # S0: scaled mean absolute deviation
u = s1/s0                                         # the ratio U
z_obs = sqrt(length(ln_Error)) * (u - 1)/0.2661   # the test statistic Z
z_obs
## [1] -1.383383
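Since \(|z_{obs}| = 1.38 < 1.96\), we do not reject the hypothesis of normality at the 5% level. Equivalently, one can convert \(z_{obs}\) into a two-sided p-value using the standard normal distribution; a small sketch:
2 * pnorm(-abs(z_obs))  # two-sided p-value; it exceeds 0.05, so H0 is not rejected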
Goodness of fit is a general method for investigating whether a sample comes from a specific distribution.
Before considering the test for normality, we consider a simpler example (see [WMMY] Sec. 10.11).
Suppose we roll a die. We have the null-hypothesis that the die is fair, i.e. the probabilities of the outcomes \((1,2,3,4,5,6)\) are \[(1/6,1/6,1/6,1/6,1/6,1/6).\]
Rolling the die 120 times, we expect the frequencies \[(20, 20, 20, 20, 20, 20)\]
Actually we observe the frequencies \[(20, 22, 17, 18, 19, 24)\]
The distance between observed and expected frequencies is measured by \[X^2=\sum\frac{\mbox{(ObservedFrequencies - ExpectedFrequencies)}^2}{\mbox{ExpectedFrequencies}}\]
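For the die we get \(X^2 = \frac{0+4+9+4+1+16}{20} = \frac{34}{20} = 1.7\). Under the null-hypothesis, \(X^2\) approximately follows a \(\chi^2\) distribution with \(6-1=5\) degrees of freedom, and since 1.7 is well below the 95% quantile of this distribution (printed below), we do not reject that the die is fair. A sketch of the computation in R (the variable names are ours):
obs <- c(20, 22, 17, 18, 19, 24)    # observed frequencies
expected <- rep(120/6, 6)           # expected frequencies for a fair die
sum((obs - expected)^2/expected)    # X^2 = 1.7
qchisq(0.95, df = 6 - 1)            # 95% quantile of the chi-square distribution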
## [1] 11.0705
We now test the null-hypothesis that ln_Error is a sample from a normal distribution. The observations are grouped into 10 bins chosen so that the area under the red population curve (the fitted normal density) is 0.1 in each bin. As the sample size is 269, the expected frequency in each bin is \(269\cdot 0.1=26.9\). The observed frequencies are:
## bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 bin10
## 25 37 25 19 28 30 21 25 25 34
The observed value of the test statistic is
## [1] 10.21933
Since the mean and the standard deviation are estimated from the data, the relevant \(\chi^2\) distribution has \(10-1-2=7\) degrees of freedom. The 5% critical value \(\chi^2_{0.95}(7)\) is
## [1] 14.06714
Since \(X^2\) is smaller than the critical value, we do not reject the null-hypothesis. We could also have used the p-value:
## [1] 0.1764812
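A sketch of how the numbers above can be reproduced in R from the observed bin frequencies (the construction of the bins themselves is not shown here):
obs <- c(25, 37, 25, 19, 28, 30, 21, 25, 25, 34)  # observed bin frequencies
expected <- 269 * 0.1                             # expected frequency 26.9 in each bin
X2 <- sum((obs - expected)^2/expected)            # X^2 = 10.21933
qchisq(0.95, df = 10 - 1 - 2)                     # 5% critical value 14.06714
1 - pchisq(X2, df = 10 - 1 - 2)                   # p-value, approximately 0.176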
There are many other tests of normality.
We mention one of the most commonly used tests: the Shapiro-Wilk test. It is available in base R. We do not treat the details, but the test statistic behaves somewhat like a correlation coefficient for the qq-plot.
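The output below comes from a call of the form:
shapiro.test(ln_Error)  # Shapiro-Wilk test of normality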
##
## Shapiro-Wilk normality test
##
## data: ln_Error
## W = 0.99255, p-value = 0.1971
The completely general model would be: \[\text{measuredValue} = \text{systematicError} + \text{lotError}\] \[\qquad \qquad + \text{componentError} + \text{measurementError}\]
In mathematical notation \[Y_{k,i,j} = \mu + L_k + C_{k,i} + \varepsilon_{k,i,j}\] where \(\mu\) is the systematic error, \(L_k\) is the error of lot \(k\), \(C_{k,i}\) is the error of component \(i\) in lot \(k\), and \(\varepsilon_{k,i,j}\) is the error of measurement \(j\) on that component.
The error terms are assumed to be independent and normally distributed with mean zero; in particular the component errors have standard deviation \(\sigma_c\) and the measurement errors have standard deviation \(\sigma_m\).
As we have one lot only, we cannot identify the variation between lots.
We only have one measurement on each component, so we cannot separate the component variation from the measurement variation either. The model for our data therefore reduces to (since \(k=1\) and \(j=1\) we omit them from the notation) \[Y_i = \mu + \mu_l + C_i + \varepsilon_i\] where \(\mu\) is the systematic measurement error, \(\mu_l\) is the lot error, \(C_i\) is the component error and \(\varepsilon_i\) is the measurement error.
In lecture 4.1 we developed a linear calibration to eliminate the systematic measurement error.
To remove the systematic measurement error, we apply this calibration to our new dataset.
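With the calibration coefficients ab (intercept and slope) from lecture 4.1, which are also used for the 470 nF lot below, the corrected errors and the estimated lot error are presumably computed as:
ln_Error_corrected <- (ln_Error - ab[1])/ab[2]  # remove the systematic measurement error
mean(ln_Error_corrected)                        # estimate of the lot error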
## [1] -0.02686793
That is, the systematic lot error is around -2.7%.
Estimate of \(\sigma_m^2+\sigma_c^2\)
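This is presumably computed as the empirical variance of the corrected values (the exact digits may depend on whether the denominator is \(n\) or \(n-1\)):
var(ln_Error_corrected)  # empirical variance of the corrected ln errors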
## [1] 0.0003892828
That is \(s_m^2+s_c^2=3.9\cdot 10^{-4}\).
In lecture 4.1 we estimated \(s_m^2 = 0.29\cdot 10^{-6}\), which is negligible, and hence \(s_c^2 = 3.9\cdot 10^{-4} - 0.29\cdot 10^{-6} \approx 3.9\cdot 10^{-4}\) and \[s_c = \sqrt{3.9\cdot 10^{-4}} \approx 0.02.\]
The 3 sigma limits for the corrected lot values are \[-2.7\% \pm 3\cdot 2.0 \% = [-8.7; 3.3]\%,\] which clearly respects the 10% tolerance.
cap470 <- read.table(url("https://asta.math.aau.dk/datasets?file=capacitor_lot_470_nF2.txt"))[, 1]
hist(cap470, breaks = 15, col = "greenyellow")
ln_Error <- log(cap470/470)
ln_Error_corrected <- (ln_Error - ab[1])/ab[2]  # apply the calibration from lecture 4.1
hist(ln_Error_corrected, breaks = 15, col = "gold")
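The smallest and largest corrected errors, presumably obtained with range():
range(ln_Error_corrected)  # minimum and maximum of the corrected relative errors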
## [1] -0.08888934 0.08323081
The histogram shows two clearly separated groups, suggesting that the sample contains capacitors from two different lots. We therefore assume that the corrected ln_Error values follow a mixture of two normal distributions with means \(\mu_1\) and \(\mu_2\), a common standard deviation \(\sigma\), and mixing proportion \(p\). So we have 4 unknown parameters: \((\mu_1,\mu_2,\sigma,p)\).
To estimate these, we leave the computations to the R package mclust.
library(mclust)
fit <- Mclust(ln_Error_corrected, G = 2, modelNames = "E")  # 2 components with equal ("E") variance
pr <- fit$parameters$pro[1]                                 # estimated mixing proportion p
pr
## [1] 0.728314
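The estimated means \(\mu_1\) and \(\mu_2\) are presumably extracted from the fitted object as:
fit$parameters$mean  # estimated means of the two components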
## 1 2
## -0.05174452 0.05406515
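The common standard deviation \(\sigma\) is presumably obtained from the fitted variance:
sqrt(fit$parameters$variance$sigmasq)  # estimated common standard deviation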
## [1] 0.01692654
The estimate of \(\sigma\) is 1.7%. For the 220 nF lot we estimated 2.0%, which is comparable.
Based on the 3 sigma limits, the lots do not completely respect the tolerance of 10% (for instance \(5.4\% + 3\cdot 1.7\% = 10.5\%\)). However, in the sample the minimum is -8.9% and the maximum is 8.3%, so all observed values are within the tolerance.
The difference in lot means is \(5.4\% - (-5.2\%) = 10.6\%\).
This indicates that the variation between lots is much greater than the variation within lots: the difference between lot means (10.6%) is more than six times the within-lot standard deviation (1.7%).
This is also clearly illustrated by the histogram/density plots.