---
title: "Statistics and electronics - lecture 1"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    keep_tex: yes
    highlight: tango
    number_section: yes
    toc: yes

---

```{r, include = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
#options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic","VennDiagram"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
library(VennDiagram)
library(mosaic)

```
----

## Sources of variation

We shall study 2 types of variation


- measurement variation due to random errors on a measuring device
- component variation due to random errors in the production proces

```{r, fig.width=10, echo=FALSE, fig.align='center', fig.caption="Capacitors with nominal value 47/150 nF and tolerance 1%"}
url <- "https://asta.math.aau.dk/static-files/asta/img/kondensatorer-forsog-nr2.jpg"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(jpeg::readJPEG(z))
invisible(file.remove(z))
```

## Data from Peter Koch

Peter has done 100 independent measurements of the capacity of 4 of the displayed capacitors and one additional. 
Nominal values are 47, 47, 100, 150, 150. All with stated tolerance of 1%.

```{r }
load(url("https://asta.math.aau.dk/datasets?file=cap_1pct.RData"))
head(capDat, 4)
```

Here we see the first 4 capacity measurements of the first capacitor with nominal value 47.

- Remark: The measured values are consistently below the nominal value minus the 1% tolerance: $47 - 0.47 = 46.53$. 

```{r}
table(capDat$sample)
```


## Transformation

Linearisation:

\begin{align}
  f(x) &\approx f(x_0) + f'(x_0)(x - x_0) \\
  \\
  x_0 &= 1 \\
  f(x) &= \log x \\
  f'(x) &= 1/x \\
  \\
  x    &= m/n \\
  \log \left ( \frac{m}{n} \right ) &\approx \log 1 + \frac{1}{1} \left ( \frac{m}{n} - 1 \right ) \\
  &= \frac{m - n}{n}
\end{align}

\vspace{1em}

$$
\log \left ( \frac{m}{n} \right ) \approx \frac{m - n}{n}
$$

```{r}
n <- 47
m <- seq(47-5*0.01*47, 47+5*0.01*47, length.out = 100)
plot(m, log(m/n), col = "red", type = "l")
lines(m, (m - n)/n, col = "blue", type = "l")
legend("topleft", legend = c("log(m/n)", "(m-n)/n"), lty = 1, col = c("red", "blue"))
```


## Transformation

$$
\log \left ( \frac{m}{n} \right ) \approx \frac{m - n}{n}
$$


Instead of the raw measurement we will consider:

$\mbox{lnError = ln(measuredValue/nominalValue)}$

Remark that by linear approximation:

$\mbox{lnError}\approx\mbox{measuredValue/nominalValue - 1 = }$ $\mbox{(measuredValue-nominalValue)/nominalValue}$

which is the error relative to the nominal value.

**I.e.: `lnError` can be interpreted as relative error.**

## Transformed data

```{r fig.dim=c(4,3)}
capDat = within(capDat, lnError <- log(capacity/nomval))
head(capDat, 2)
tail(capDat, 2)
```

- The resolution on Peters capacitance meter is with 2/1 decimal(s) in the 47/150 nF range, which means that only a limited number of different values(3-8) are observed. Meaning that box- or histogram-plots are noninformative.

## Model considerations

- The measurements are more than 2.7% below the nominal value.
This must be due to a systematic error on the meter.

In this case we have as earlier mentioned two further sources of error:

- $\mbox{ln(measuredValue / nominalValue) = systematicError + productionError + measurementError}$

## Statistical model

$$\mbox{ln(measuredValue / nominalValue) = systematicError + productionError + measurementError}$$

We formulate the model:

- $Y_{ij}=\mu+A_i+\varepsilon_{ij}$

where

- $Y_{ij}$ is the log error measurement
- $\mu$ is the systematic error on the meter
- $A_i$ is the random production error
- $\varepsilon_{ij}$ is the random measurement error
- $i=1,2,3,4,k=5$ is the number of the 5 samples
- $j=1,\ldots,n=100$ is the number of the observation in each sample

## Assumptions

This is the model treated in WMM chapter 13.11, where it is assumed that

- $A_i$ is normally distributed with mean 0 and variance $\sigma_\alpha^2$, which represents the production error
- $\varepsilon_{ij}$ is normally distributed with mean 0 and variance $\sigma^2$, which represents the measurement error

## Estimation of systematic error

The systematic error is simply estimated by the mean

- $\hat\mu=\bar y_{..}$

```{r fig.dim=c(4,3)}
muhat <- mean(capDat$lnError)
muhat
```

The meter systematically reports a value, which is estimated to be `r round(-100*muhat, 2)`% too low.

## Estimation of random error

Notation from WMM chapter 13.3:

- $SSA=n\sum_i (\bar y_{i.}-\bar y_{..})^2$ (related to production error)
- $SSE=\sum_{ij} ( y_{ij}-\bar y_{i.})^2$ (related to measurement error)

Theorem 13.4 states:

- $E(SSA)=(k-1)\sigma^2+n(k-1)\sigma_\alpha^2$
- $E(SSE)=k(n-1)\sigma^2$

## Fit

```{r}
fit <- lm(lnError ~ sample, data = capDat)
anova(fit)
```

where we read

```{r}
SS <- anova(fit)$`Sum Sq`
SSA <- SS[1]
SSE <- SS[2]
```

- SSA = `r format(SSA, digits = 3)` and SSE = `r format(SSE, digits = 3)`

## Solution

Solving the equations 

- `SSA = E(SSA)` and `SSE = E(SSE)`

yields

```{r, include=FALSE}
k <- length(unique(capDat$sample))
n <- nrow(capDat)/k
sigma2 <- SSE/(k*(n-1))
sigma2a <- (1/n)*(SSA/(k-1) - sigma2)

mult <- ceiling(min(log10(sigma2a), log10(sigma2)))

fmt <- function(x) {
  paste0(round(x*10^(-mult), 2), 
         "e", mult, "")
}
fmt_tex <- function(x) {
  paste0(round(x*10^(-mult), 2), 
         " \\times 10^{", mult, "}")
}
fmt(sigma2)
fmt_tex(sigma2)

fmt(sigma2a)
fmt_tex(sigma2a)
```

- $\hat\sigma_\alpha^2=\frac{1}{n}(\frac{SSA}{k-1}-\hat\sigma^2)$ = $`r fmt_tex(sigma2a)`$
- $\hat\sigma^2=\frac{SSE}{k(n-1)}$ = $`r fmt_tex(sigma2)`$


## Summing up

- the meter has an estimated systematic error of $`r round(100*muhat, 2)`$%
- the estimated standard error of the meter is $\sqrt{`r fmt_tex(sigma2)`}$ = `r round(100*sqrt(sigma2), 3)`%
- the estimated standard error of the production is $\sqrt{`r fmt_tex(sigma2a)`}$ = `r round(100*sqrt(sigma2a), 2)`%. 
So the 3-sigma limit is `r round(3*100*sqrt(sigma2a), 2)`%, which is in accordance with the tolerance of 1%. It should be noted that the estimate is insecure, as it is based on 4 degrees of freedom only.

The estimated variance on log error

- $`r fmt_tex(sigma2)` + `r fmt_tex(sigma2a)` = `r fmt_tex(sigma2 + sigma2a)`$

is clearly dominated by the production error.

## Test of no random effect

We have the possibility of testing the hypothesis

- $H_0$: $\sigma_\alpha=0$

This is equivalent to 

- $E(SSA/(k-1))=E(SSE/k/(n-1))=\sigma^2$

Under $H_0$ the statistic 

- $F=\frac{\frac{SSA}{k-1}}{\frac{SSE}{k(n-1)}}$

has an F-distribution with degrees of freedom $(k-1,k(n-1))$

In the actual case $f_{obs}=4067.4$, which is highly significant (p-value=0).

## Lognormal variation

In the preceeding we assumed normal errors after a log  transformation.

Let $X$ be a random variable and $Y=ln(X)$.


We say that $X$ has a lognormal distribution if $Y$ has a normal distribution with - say - mean $\mu$ and standard deviation $\sigma$.

Density plots:

```{r echo=FALSE, fig.dim = c(8, 6)}
curve(dlnorm(x,0,1),0,5,ylab="density",lwd=2,col=2)
curve(dlnorm(x,1,1),0,5,ylab="density",lwd=2,col=2,add=T,lty=2)
legend(x = "topright",
  legend = c(expression(paste("(",
    paste(mu),",",paste(sigma),") = (0, 1)")),expression(paste("(",
    paste(mu),",",paste(sigma),") = (1, 1)"))),
  lty = c(1, 2),col = c(2, 2),lwd = 2)     
```

## Moments of lognormal

If $Y=ln(X)$ has a normal distribution with mean $\mu$ and standard deviation $\sigma$, then Theorem 6.7 of WMM states:

- $E(X)=\exp(\mu+\sigma^2/2)$
- $Var(X)=\exp(2\mu+\sigma^2)(\exp(\sigma^2)-1)$

If we are interested in relative variation, it is common to look at the coefficient of variation

- $CV(X)=\frac{\sigma}{\mu}$

if e.g. CV=0.05 then 95% of our measurements are within

- $\mu\pm 2\sigma=\mu\pm 2*0.05\mu=\mu(1\pm 0.1)$

i.e. most observations are within 10% of the mean.

## CV of Lognormal

If $Y=ln(X)$ has a normal distribution with mean $\mu$ and standard deviation $\sigma$, we calculate CV for $X$ as

- $CV(X)=\frac{E(X)}{\sqrt{Var(X)}}=\sqrt{\exp(\sigma^2)-1}$

In Peter's data we estimated the variance of the log error to $`r fmt_tex(sigma2a)`$, which means that the estimated CV of the capacity measurement is


- CV = $\sqrt{\exp\left (`r fmt_tex(sigma2a)` \right ) -1} = `r round(100*(sqrt(exp(sigma2a) - 1)), 2)`$%.

i.e., **if** we correct for the systematic error of the meter, then our measurements are extremely precise.

## Linear calibration

In our previous analysis, we assumed, that the systematic error on the meter did not depend on nominal value.

To check this assumption consider the model

- $Y=\mbox{ln(measuredValue)}$ is a linear model of $x= \mbox{ln(nominalValue)}$
- $Y=\alpha+\beta x+\varepsilon$

where we have previously assumed slope($\beta$) equal to 1.

## Linear calibration fit

```{r}
fit <- lm(log(capacity) ~ log(nomval), data = capDat)
summary(fit)
```
The slope is more than close to 1. But is actually extremely significantly different from 1 (tvalue=3776.74 >>>> 3). 

Clearly, it is a bit dubious to assume a linear relationship, as we only have 3 nominal values. 

## Calibrated values

If we stick to the linear calibration model, it is sensible to correct our measured errors according to the calibration of the meter:

- $$\mbox{measuredError}=\alpha+\beta *\mbox{correctError}$$
- $$\mbox{correctError}=(\mbox{measuredError}-\alpha)/\beta$$
  
```{r}
ab = coef(fit)
ab
capDat$lnError_c = (capDat$lnError - ab[1])/ab[2]
```
  
```{r echo=FALSE}
save(ab, file = "ab.RData")
```

## Calibrated data

```{r}
head(capDat)
```

The calibrated data now shows that the production error on component s_1_nF47 is in the vicinity of 0.2%. Well below the tolerance 1%.