---
title: "Statistics and electronics - lecture 2"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    keep_tex: yes
    highlight: tango
    number_section: yes
    toc: yes

---

```{r, include = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
#options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic","VennDiagram"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
library(VennDiagram)
library(mosaic)

```
----

## Checking for log normality

```{r, fig.width=10, echo=FALSE, fig.align='center', fig.caption="Lot of capacitors with nominal value 220 NF and tolerance 10%"}
url <- "https://asta.math.aau.dk/static-files/asta/img/220nF_10procent.jpg"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(jpeg::readJPEG(z))
invisible(file.remove(z))
```

Picture of a "lot" of capacitors. 

The word lot is used to identify several components produced in a single run. 

Where a run is a production series limited to a given timeinterval and fixed production parameters.


## Lot variation

Peter Koch has tested 269 of the capacitors in the displayed lot.

First of all, we will check the assumption that our measurements have a log normal error.


```{r fig.dim=c(4,3)}
Cap220=read.csv(url("https://asta.math.aau.dk/datasets?file=capacitor_lot_220_nF.txt"))[,1]
ln_Error=log(Cap220/220)
qqnorm(ln_Error,ylab="ln_Error")
qqline(ln_Error,lwd=2,col="red")
```

## Testing normality

The qq-plot(WMM - section 8.8) supports normality of the ln_Error.

There are several tests of normality. 

Two of these are considered in WMM section 10.11:

- Gearys test
- goodness of fit


## Gearys test

Consider a sample $X_1,\ldots,X_n$ and an estimate of $\sigma$ -
the standard deviation of the population:

- $S_0=\sqrt{\frac{1}{n}\sum_i(X_i-\bar X)^2}$

$S_0$ is **always** a good estimator of the population standard deviation $\sigma$ -
no matter the form of the population distribution.

Next consider

- $S_1=\sqrt{\frac{\pi}{2}}\sum_i|X_i-\bar X|/n$

This is a good estimator of $\sigma$, **if** the population is normal. 
But otherwise, it will under- or overestimate $\sigma$ depending on the form of the population distribution.

## Gearys test

Hence we expect that

- $U=\frac{S_1}{S_0}$ should be close to one in case of normality.

For large values of $n$ a normal approximation yields that

- $Z=\frac{\sqrt{n}(U-1)}{0.2661}$ has a standard normal distribution **if** the sample is normal

that is, if $-2\leq z_{obs}\leq 2$, we do not reject normality, if we test on level 5%.


```{r }
mln_E=mean(ln_Error)
s1=sqrt(mean((ln_Error-mln_E)^2))
s0=sqrt(pi/2)*mean(abs(ln_Error-mln_E))
u=s1/s0
z_obs=sqrt(length(ln_Error))*(u-1)/0.2261
z_obs
```

Hence there is no evidence of non-normality.

## Goodness of fit

Is a general method for investigating whether a sample has a specific distribution.

The first example in WMM is concerned with the problem of whether a dice is balanced. 

That is, all sides have probability 1/6 of showing up.

Rolling the dice 120 times we expect

- ExpectedFrequency: (20, 20, 20, 20, 20, 20)

Actually we observe

- ObservedFrequency: (20, 22, 17, 18, 19, 24)



Distance measure between observed and expected:

- $X^2=\sum\frac{\mbox{(ObservedFrequencies - ExpectedFrequencies)}^2}{\mbox{ExpectedFrequencies}}$

**If** the dice is balanced then

- $X^2$ has a so-called **chi-square distribution** (WMM chapter 6.7) with df=k-1=5, degrees of freedom

where k=6 is the number of possible outcomes.

## Goodness of fit

For the actual data:

- $x^2_{obs}=1.7$ and we need to judge whether this is higher than expected. **If** the null hypothesis is true.

```{r fig.dim=c(4,2)}
critical_value <- qdist("chisq", .95, df = 5)
critical_value
```

At 5% significance the critical value is 11.07, so there is no evidence of unbalancedness.

## Goodness of fit - normal distribution

We assume that ln_Error is a sample from a normal distribution and divide the population distribution into 10 bins with equal probabilities p=10%. 

The number of bins could be changed.
It is required that the expected frequency should be at least 5.



```{r }
m <- mean(ln_Error)
s <- sd(ln_Error)
breaks <- qnorm((0:10)/10, m, s)
```

```{r fig.dim=c(6,4),echo=FALSE}
hist(ln_Error,breaks=c(min(ln_Error),breaks[2:10],max(ln_Error)),main="Histogram and population curve")
curve(dnorm(x,m,s),min(ln_Error),max(ln_Error),lwd=2,add=TRUE,col="red")
for(x in breaks[2:10]) lines(c(x,x),c(0,dnorm(x,m,s)),col="red",lwd=2)
```

Area in each bin of the red population curve is 0.1 and as sample size is 269 we obtain

- Expected_frequency is 26.9 in each bin

## Goodness of fit - normal distribution

Observed frequecies:

```{r }
observed <- table(cut(ln_Error, breaks))
names(observed) <- paste("bin", 1:10, sep = "")
observed
```

$X^2$ statistic:

```{r}
chisq_obs <- sum((observed-26.9)^2)/26.9
chisq_obs
```

The degrees of freedom is the number of bins minus 3 (number of parameters + 1), i.e. df = 10-3 = 7.

## Goodness of fit - normal distribution

```{r fig.dim=c(4,2)}
chisq_obs
critical_value <- qdist("chisq", .95, df = 7)
critical_value
p_value <- 1 - pchisq(chisq_obs, 7)
p_value
```

We do not reject normality at level 5%.

## Other tests of normality

As mentioned, there are multiple tests of normality.

We introduce one other test: Shapiro-Wilks.
It is standard in `R`.

We do not treat the details, 
but the test statistic is somewhat like a correlation for the qq-plot.
If the "correlation is far from 1", we reject normality.

```{r }
shapiro.test(ln_Error)
```
With p-value=19.71%, we do not reject normality, if we test on level 5%.

## Sources of variation

In lecture 1 we discussed

- systematic measurement error
- random measurement variation
- production variation

Generally it is relevant to decompose the production variation in 2 components:

- variation within lot, i.e. the variation around the lot mean
- variation between lots, i.e. the variation of the lot means.

## Sources of variation

As we have one lot only, we cannot identify the variation between lots.

Our actual data are thus composed of

- systematic measurement error - call it $\mu_m$
- systematic lot error - call it $\mu_l$
- standard deviation of measurement - call it $\sigma_m$
- standard deviation within lot - call it $\sigma_l$

## Linear calibration

In lecture 1 we developed a linear calibration eliminating the systematic measurement error.

Adopting this to the actual data yields


```{r fig.dim=c(6, 4)}
load("ab.RData")
ln_Error_corrected <- (ln_Error-ab[1])/ab[2]
hist(ln_Error_corrected, breaks = "FD", col = "wheat")
```

## Sources of variation

We are now left with a sample, which has

- mean $\mu_l$ and variance $\sigma_m^2+\sigma_l^2$

where we have assumed that the random measurement error and the random lot error are independent.

Estimate of $\mu_l$

```{r }
myl <- mean(ln_Error_corrected)
myl
```
That is, the systematic lot error is around -2.7%.

## Estimate of variances

Estimate of $\sigma_m^2+\sigma_l^2$

```{r }
var(ln_Error_corrected)
```

that is $s_m^2+s_l^2=$ 3.9e-04

In lecture 1 we estimated $s_m^2$ = 0.29e-06 and hence

- $s_l$ = sqrt(3.9e-04) = 2.0%.

3 sigma limits for the correct lot values:

- -2,7 $\pm$ 3*2.0 = [-8.7; 3.3]%

clearly respecting the 10% tolerance.

## Mixture of lots

Peter has also tested 311 capacitors with nominal value 470 nF

```{r fig.dim=c(6, 4)}
cap470 <- read.table(url("https://asta.math.aau.dk/datasets?file=capacitor_lot_470_nF2.txt"))[, 1]
hist(cap470, breaks = 15, col = "greenyellow")
```

Consulting Peter, it turned out, 
that his box of capacitors contained components from 2 different lots. 

## Transforming

We ln-transform and calibrate:

```{r fig.dim=c(6, 4)}
ln_Error <- log(cap470/470)
ln_Error_corrected <- (ln_Error-ab[1])/ab[2]
hist(ln_Error_corrected, breaks = 15, col = "gold")
```

```{r}
range(ln_Error_corrected)
```


## Mixture model

We assume that the ln_Error

- is normal with mean $\mu_1$ if the component is from lot 1
- is normal with mean $\mu_2$ if the component is from lot 2
- both distributions have variance $\sigma^2=\sigma_m^2+\sigma_l^2$
- the probability of coming from lot 1 is $p$

So we have 4 unknown parameters: $(\mu_1,\mu_2,\sigma,p)$.

How to estimate these, we entrust to the `R`-package `mclust`.

## Fitting a mixture

```{r }
library(mclust)
fit <- Mclust(ln_Error_corrected, 2 , "E")# 2 clusters; "E"qual variances
pr <- fit$parameters$pro[1]
pr
```
The chance of coming from lot1 is around 73%.

```{r }
means <- fit$parameters$mean
means
```

- The mean in lot 1 is around -5.2%
- The mean in lot 2 is around 5.4%

```{r }
sigma <- sqrt(fit$parameters$variance$sigmasq)
sigma
```

- $\sigma$ is around 1.7%

## Comparing model and data

```{r fig.dim=c(6, 4)}
hist(ln_Error_corrected,breaks=15,col="lightcyan",probability = TRUE,ylim=c(0,18),main="Histogram and population curve")
curve(pr*dnorm(x,means[1],sigma)+(1-pr)*dnorm(x,means[2],sigma),-.1,.1,add=TRUE,lwd=2)
```

## Concluding remarks

Estimate of $\sigma$ was 1.7%. In relation to the 220 nF lot we estimated 2.0%, which is comparable.

- 3 sigma limits for the correct lot 1 values: -5.2 $\pm$ 3*1.7=[-10.3;-0.1]%
- 3 sigma limits for the correct lot 2 values: 5.4 $\pm$ 3*1.7=[0.3;10.5]%

do not completely respect the tolerance 10%. However, in the sample the minimum is -8.9% and the maximum 8.3%.

- The difference in lot means is 5.4-(-5,2)=10.6%. 

This indicates that the variation between lots is much greater than the variation within lots.

Which is also clearly illustrated by the histogram/density plots. 

```{r echo=FALSE}
#x_low=means[1]-3*s
#x_high=means[2]+3*s
#x=seq(x_low,x_high,length.out = 250)
#y=pr*pnorm(x,means[1],sigma)+(1-pr)*pnorm(x,means[2],sigma)
#break_no=c()
#for (i in 1:14) break_no[i]=which.min(abs(y-i/15))
#pcum=c(0,y[break_no],1)
#expected=length(ln_Error)*diff(pcum)
#round(expected,1)
#breaks=c(-Inf,x[break_no],Inf)
#observed=table(cut(ln_Error_corrected,breaks))
#sum((observed-expected)^2/expected)
#significant
```

