Statistics and electronics - lecture 2

The ASTA team

Checking for log normality

Picture of a “lot” of capacitors.

The word lot is used to identify several components produced in a single run.

Here a run is a production series limited to a given time interval and with fixed production parameters.

Lot variation

Peter Koch has tested 269 of the capacitors in the displayed lot.

First of all, we will check the assumption that our measurements have a log-normal error.

# Read the 220 nF lot and compute the log-error relative to the nominal value
Cap220 <- read.csv(url("https://asta.math.aau.dk/datasets?file=capacitor_lot_220_nF.txt"))[, 1]
ln_Error <- log(Cap220/220)
# Normal quantile-quantile plot with a reference line
qqnorm(ln_Error, ylab = "ln_Error")
qqline(ln_Error, lwd = 2, col = "red")

Testing normality

The qq-plot (WMM, Section 8.8) supports normality of ln_Error.

There are several tests of normality.

Two of these are considered in WMM section 10.11:

Geary's test

Consider a sample \(X_1,\ldots,X_n\) and an estimate of \(\sigma\), the standard deviation of the population:

\[S_0=\sqrt{\frac{1}{n}\sum_{i=1}^n\left(X_i-\bar{X}\right)^2}.\]

\(S_0\) is always a good estimator of the population standard deviation \(\sigma\), no matter the form of the population distribution.

Next consider

\[S_1=\sqrt{\frac{\pi}{2}}\cdot\frac{1}{n}\sum_{i=1}^n\left|X_i-\bar{X}\right|.\]

This is also a good estimator of \(\sigma\) if the population is normal, but otherwise it will under- or overestimate \(\sigma\), depending on the form of the population distribution.

Geary's test

Hence, if the population is normal, we expect that

\[U=\frac{S_0}{S_1}\approx 1.\]

For large values of \(n\) a normal approximation yields that

\[Z=\frac{\sqrt{n}\,(U-1)}{0.2661}\approx \mathrm{N}(0,1),\]

that is, if \(-2\leq z_{obs}\leq 2\), we do not reject normality when testing at level 5%.

mln_E <- mean(ln_Error)
s0 <- sqrt(mean((ln_Error - mln_E)^2))          # S0: works for any distribution
s1 <- sqrt(pi/2) * mean(abs(ln_Error - mln_E))  # S1: consistent only under normality
u <- s0/s1
z_obs <- sqrt(length(ln_Error)) * (u - 1)/0.2661  # sd of U is approx. 0.2661/sqrt(n)
z_obs
## [1] -1.383384
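
As a small supplement, a two-sided p-value under the normal approximation can be computed as

2 * pnorm(-abs(z_obs))  # roughly 0.17 here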

Hence there is no evidence of non-normality.

Goodness of fit

This is a general method for investigating whether a sample follows a specific distribution.

The first example in WMM is concerned with the problem of whether a die is balanced.

That is, all sides have probability 1/6 of showing up.

Rolling the die 120 times, we expect each face to show up \(e_i=120\cdot\frac{1}{6}=20\) times.

Actually we observe frequencies \(o_1,\ldots,o_6\), which deviate from 20 due to randomness.

Distance measure between observed and expected frequencies:

\[\chi^2_{obs}=\sum_{i=1}^k\frac{(o_i-e_i)^2}{e_i}.\]

If the die is balanced, then \(\chi^2_{obs}\) is an observation from an approximate \(\chi^2\) distribution with \(k-1=5\) degrees of freedom, where \(k=6\) is the number of possible outcomes.

Goodness of fit

For the actual data:
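
A sketch of the computation, with observed counts assumed from the die example in WMM (the table itself is not reproduced in these notes):

observed_die <- c(20, 22, 17, 18, 19, 24)   # counts assumed from WMM's die example
expected_die <- rep(120/6, 6)               # 20 expected for each face
chisq_die <- sum((observed_die - expected_die)^2/expected_die)
chisq_die                                   # 1.7 with these counts
chisq.test(observed_die)                    # built-in equivalent (equal probabilities by default)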

critical_value <- qdist("chisq", .95, df = 5)  # qdist() is from the mosaic package

critical_value

At the 5% significance level the critical value is 11.07, so there is no evidence that the die is unbalanced.

Goodness of fit - normal distribution

We assume that ln_Error is a sample from a normal distribution and divide the population distribution into 10 bins with equal probabilities \(p_i=10\%\).

The number of bins could be changed, but as a rule of thumb the expected frequency in each bin should be at least 5.

m <- mean(ln_Error)
s <- sd(ln_Error)
# Bin limits: the 0%, 10%, ..., 100% quantiles of the fitted normal distribution
breaks <- qnorm((0:10)/10, m, s)

The area in each bin under the red population curve is 0.1, and as the sample size is 269 we obtain the expected frequency \(e_i=0.1\cdot 269=26.9\) in each bin.
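
The red population curve refers to a plot not reproduced here; a sketch of such a figure could be:

hist(ln_Error, breaks = "FD", col = "wheat", probability = TRUE,
     main = "ln_Error with fitted normal density")
curve(dnorm(x, m, s), add = TRUE, lwd = 2, col = "red")  # fitted normal density
abline(v = breaks[2:10], lty = 2)                        # interior bin limits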

Goodness of fit - normal distribution

Observed frequencies:

# Count the observations falling in each bin
observed <- table(cut(ln_Error, breaks))
names(observed) <- paste("bin", 1:10, sep = "")
observed
##  bin1  bin2  bin3  bin4  bin5  bin6  bin7  bin8  bin9 bin10 
##    25    37    25    19    28    30    21    25    25    34
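
For comparison with the expected frequency of 26.9, the two sets of counts can be put side by side:

expected <- rep(269/10, 10)  # 26.9 in each bin
rbind(observed = as.numeric(observed), expected = expected)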

\(\chi^2\) statistic:

# The expected frequency is 26.9 in every bin
chisq_obs <- sum((observed - 26.9)^2/26.9)
chisq_obs
## [1] 10.21933

The degrees of freedom equal the number of bins minus 3 (the number of estimated parameters plus 1), i.e. df = 10 - 3 = 7.

Goodness of fit - normal distribution

chisq_obs
## [1] 10.21933
critical_value <- qdist("chisq", .95, df = 7)

critical_value
## [1] 14.06714
p_value <- 1 - pchisq(chisq_obs, 7)
p_value
## [1] 0.1764812

We do not reject normality at level 5%.

Other tests of normality

As mentioned, there are multiple tests of normality.

We introduce one other test, the Shapiro-Wilk test, which is standard in R.

We do not treat the details, but the test statistic behaves somewhat like a correlation for the qq-plot: if the correlation is far from 1, we reject normality.

shapiro.test(ln_Error)
## 
##  Shapiro-Wilk normality test
## 
## data:  ln_Error
## W = 0.99255, p-value = 0.1971

With a p-value of 19.71%, we do not reject normality when testing at level 5%.

Sources of variation

In lecture 1 we discussed the variation originating from the measurement process, i.e. the systematic and the random measurement error.

Generally it is relevant to decompose the production variation into two components: the variation within lots and the variation between lots.

Sources of variation

As we have one lot only, we cannot identify the variation between lots.

Our actual data are thus composed of measurement error (systematic and random) and lot error (systematic and random).

Linear calibration

In lecture 1 we developed a linear calibration eliminating the systematic measurement error.

Applying this to the actual data yields

load("ab.RData")  # calibration coefficients estimated in lecture 1
ln_Error_corrected <- (ln_Error - ab[1])/ab[2]  # invert the calibration line
hist(ln_Error_corrected, breaks = "FD", col = "wheat")

Sources of variation

We are now left with a sample which has

\[\text{mean } \mu_l \quad\text{and}\quad \text{variance } \sigma_m^2+\sigma_l^2,\]

where we have assumed that the random measurement error and the random lot error are independent.

Estimate of \(\mu_l\)

myl <- mean(ln_Error_corrected)
myl
## [1] -0.02686793

That is, the systematic lot error is around -2.7%.

Estimate of variances

Estimate of \(\sigma_m^2+\sigma_l^2\)

var(ln_Error_corrected)
## [1] 0.0003892828

that is, \(s_m^2+s_l^2=3.9\cdot 10^{-4}\).

In lecture 1 we estimated \(s_m^2=0.29\cdot 10^{-6}\) and hence

\[s_l^2=3.9\cdot 10^{-4}-0.29\cdot 10^{-6}\approx 3.9\cdot 10^{-4},\quad\text{i.e.}\quad s_l\approx 0.0197.\]

3 sigma limits for the correct lot values:

\[\mu_l\pm 3 s_l\approx -0.027\pm 0.059=(-0.086,\ 0.032),\]

clearly respecting the 10% tolerance.
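
These numbers can be reproduced in R, quoting the measurement variance from lecture 1:

s_m2 <- 0.29e-06                        # measurement variance from lecture 1
s_l2 <- var(ln_Error_corrected) - s_m2  # isolate the lot variance
sqrt(s_l2)                              # lot standard deviation, about 0.0197
myl + c(-3, 3) * sqrt(s_l2)             # 3 sigma limits, about (-0.086, 0.032)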

Mixture of lots

Peter has also tested 311 capacitors with nominal value 470 nF.

cap470 <- read.table(url("https://asta.math.aau.dk/datasets?file=capacitor_lot_470_nF2.txt"))[, 1]
hist(cap470, breaks = 15, col = "greenyellow")

Consulting Peter, it turned out that his box of capacitors contained components from two different lots.

Transforming

We ln-transform and calibrate:

ln_Error <- log(cap470/470)
ln_Error_corrected <- (ln_Error-ab[1])/ab[2]
hist(ln_Error_corrected, breaks = 15, col = "gold")

range(ln_Error_corrected)
## [1] -0.08888934  0.08323081

Mixture model

We assume that the corrected ln_Error with probability \(p\) comes from lot 1, where it is \(\mathrm N(\mu_1,\sigma^2)\), and with probability \(1-p\) comes from lot 2, where it is \(\mathrm N(\mu_2,\sigma^2)\).

The population density is then the mixture

\[f(x)=p\,\varphi(x;\mu_1,\sigma)+(1-p)\,\varphi(x;\mu_2,\sigma),\]

where \(\varphi(\,\cdot\,;\mu,\sigma)\) denotes the normal density.

So we have 4 unknown parameters: \((\mu_1,\mu_2,\sigma,p)\).

The estimation of these we entrust to the R package mclust.

Fitting a mixture

library(mclust)
# Fit a 2-component mixture with "E"qual variances
fit <- Mclust(ln_Error_corrected, 2, "E")
pr <- fit$parameters$pro[1]  # estimated probability of lot 1
pr
## [1] 0.728314

The probability of coming from lot 1 is around 73%.

means <- fit$parameters$mean  # estimated lot means
means
##           1           2 
## -0.05174452  0.05406515
sigma <- sqrt(fit$parameters$variance$sigmasq)  # common standard deviation
sigma
## [1] 0.01692654
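
As a supplement, the fitted model also assigns each component to its most likely lot; a quick tabulation (output not shown) would be:

table(fit$classification)  # most likely lot membership for the 311 components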

Comparing model and data

hist(ln_Error_corrected, breaks = 15, col = "lightcyan", probability = TRUE,
     ylim = c(0, 18), main = "Histogram and population curve")
curve(pr*dnorm(x, means[1], sigma) + (1-pr)*dnorm(x, means[2], sigma),
      -.1, .1, add = TRUE, lwd = 2)

Concluding remarks

The estimate of \(\sigma\) was 1.7%; for the 220 nF lot we estimated 2.0%, which is comparable.

The 3 sigma limits \(\mu_1-3\sigma\approx -10.3\%\) and \(\mu_2+3\sigma\approx 10.5\%\) do not completely respect the 10% tolerance. However, in the sample the minimum is -8.9% and the maximum 8.3%.
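
These limits follow directly from the fitted parameters:

means + c(-3, 3) * sigma  # about -0.103 and 0.105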

This indicates that the variation between lots is much greater than the variation within lots.

This is also clearly illustrated by the histogram and density plot.