---
title: "Statistics and electronics - lecture 1"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    keep_tex: yes
    highlight: tango
    number_section: yes
    toc: yes

---

```{r, include = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
#options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic","VennDiagram"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
library(VennDiagram)
library(mosaic)

```
----

## Sources of variation

Capacitors come with a nominal value for the capacitance.

- When capacitance is measured, we do not get exactly the nominal value.

We shall study 2 sources of variation:

- measurement variation due to random errors on a measuring device
- component variation due to random errors in the production process

```{r, fig.width=10, echo=FALSE, fig.align='center', fig.caption="Capacitors with nominal value 47/150 nF and tolerance 1%"}
url <- "https://asta.math.aau.dk/static-files/asta/img/kondensatorer-forsog-nr2.jpg"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(jpeg::readJPEG(z))
invisible(file.remove(z))
```

## Data from Peter Koch

Peter has done 100 independent measurements of the capacitance of each 4 of the displayed capacitors and one additional.

 - Nominal values are 47, 47, 100, 150, 150 nF. 
 
 - All have a stated tolerance of 1%.

```{r }
load(url("https://asta.math.aau.dk/datasets?file=cap_1pct.RData"))
head(capDat, 4)
```

Here we see the first 4 measurements of the first capacitor with nominal value 47nF.

- Remark: The measured values are consistently below the nominal value minus the 1% tolerance: $47 - 0.47 = 46.53$. 

```{r}
table(capDat$sample)
```

## Relative errors

* Instead of considering the raw errors
$$\text{measuredValue - nominalValue},$$
we will consider the relative error
$$\frac{\text{measuredValue - nominalValue}}{\text{nominalValue}}.$$

* A tolerance of 0.01 means that the relative error should be within $\pm 0.01$.

## Approximation of the relative error

* Instead of looking at the relative error, we may look at the following approximation:
$$\text{lnError} = \ln\Big( \frac{\text{measuredValue}}{\text{nominalValue}} \Big)  \approx \ \frac{\text{measuredValue} - \text{nominalValue}}{\text{nominalValue}} $$

* This is illustrated below with a nominal value of $n=47$ and measured values of $47$ plus/minus $5\%$.
 
```{r}
n <- 47 
m <- seq(47-5*0.01*47, 47+5*0.01*47, length.out = 100) 
plot(m, log(m/n), col = "red", type = "l") 
lines(m, (m - n)/n, col = "blue", type = "l") 
legend("topleft", legend = c("log(m/n)", "(m-n)/n"), lty = 1, col = c("red", "blue"))
```


## Transformation of errors

* The approximation can be justified theoretically.

* Recall the linear approximation of a function:
$$   f(x) \approx f(x_0) + f'(x_0)(x - x_0) $$
* If we take 
\begin{align}
  x_0 &= 1 \\
  f(x) &= \ln x \\
  f'(x) &= 1/x,
\end{align}  
  we get 
  $$\ln(x) \approx \ln(x_0) +\frac{1}{x_0}\cdot(x-x_0) = x-1.$$
* Suppose $x=m/n$. Then  
$$
  \ln \left ( \frac{m}{n} \right ) \approx \frac{m}{n} - 1 = \frac{m - n}{n}
$$


## Transformed data

- We construct an extra `lnError` variable in the `capDat` dataset. 

```{r fig.dim=c(4,3)}
capDat = within(capDat, lnError <- log(capacity/nomval))
head(capDat, 2)
tail(capDat, 2)
```

- The resolution on Peters capacitance meter is with 1-2 decimal(s) in the 47/150 nF range, which means that only a limited number of different values(3-18) are observed for each capacitor. This means that box-plots and histograms are non-informative.

## Model considerations

- Let us have a look at a summary of the data:

```{r}
favstats(lnError~sample, data=capDat)
```


- All measurements are more than 2.3% below the nominal value.

- This must be due to a systematic error on the meter.



## Sources of variation

* We now have three sources of error:
  - Systematic errors of the measurement device
  - Production errors in the individual capacitors
  - Random measurement errors
  
* This leads us to consider the model
 $$\ln\Big(\frac{\text{measuredValue}}{\text{nominalValue}}\Big) = \text{systematicError} + \text{productionError} + \text{measurementError}. $$

## Statistical model

* We have the model:
$$\ln\Big(\frac{\text{measuredValue}}{\text{nominalValue}}\Big) = \text{systematicError} + \text{productionError} + \text{measurementError}$$

* We may write the model mathematically as
$$Y_{ij}=\mu+A_i+\varepsilon_{ij}$$
where
  - $Y_{ij}$ is the log error measurement ($j$th measurement from the $i$th capacitor)
  - $i=1,\ldots,k$ is the number of the capacitor
  - $j=1,\ldots,n$ is the number of the observation for that capacitor
  - $k=5$ is the total number of capacitors
  - $n=100$ is the number of repetitions for each capacitor
  - $\mu$ is the systematic error on the meter
  - $A_i$ is the random production error
  - $\varepsilon_{ij}$ is the random measurement error

* We make the following assumptions:
  - The production error $A_i$ is normally distributed with mean 0 and variance $\sigma_\alpha^2$,
  - The measurement error $\varepsilon_{ij}$ is normally distributed with mean 0 and variance $\sigma^2$.

* This is called a **random effects model**, see [WMMY] Chapter 13.11.

## Estimation of systematic error

* The systematic error is simply estimated by the sample mean
 $$\hat\mu=\bar y_{..}$$
  * The two dots indicate that we take the average over all observations from all capacitors.
```{r fig.dim=c(4,3)}
muhat <- mean(capDat$lnError)
muhat
```

* The meter systematically reports a value, which is estimated to be `r round(-100*muhat, 2)`% too low.

## Estimation of random error

* We now try to estimate the variance $\sigma_\alpha^2$ of the production error and the variance of the random measurement error $\sigma^2$.

* We need two types of sum of squares:

*  SSA (*sum of squares between groups*) measures how much the sample means for the individual capacitors $\bar y_{i.}$ deviate from the total sample mean $\bar y_{..}$
  $$SSA=n\sum_i (\bar y_{i.}-\bar y_{..})^2$$

  
* SSE (*sum of squares within groups*) measures how much the individual measurements deviate from the sample mean of the capacitor they were measured on:
  $$SSE=\sum_{ij} ( y_{ij}-\bar y_{i.})^2$$
  

* Intuitively, $SSA$ is closely related to the variance of the production error $\sigma_\alpha^2$, while $SSE$ is closely related to the variance of the random measurement error $\sigma^2$.


## Fit

* The sum of squares may be found from:

```{r}
fit <- lm(lnError ~ sample, data = capDat)
anova(fit)
```

* We can extract the sum of squares as follows

```{r}
SS <- anova(fit)$`Sum Sq`
SSA <- SS[1]
SSE <- SS[2]
SSA
SSE
```


## Solution

* One may show (see [WMMY] Theorem 13.4):

$$E(SSA)=(k-1)\sigma^2+n(k-1)\sigma_\alpha^2$$
$$E(SSE)=k(n-1)\sigma^2$$


* Using the approximations
$$E(SSA) \approx SSA, \qquad E(SSE) \approx SSE$$
we obtain the estimates
$$
\hat{\sigma}^2 = \frac{1}{(n-1)k} SSE = \frac{1}{99\cdot 5} \cdot 0.0001417  = 2.86 \cdot 10^{-7}$$
$$\hat{\sigma}^2_\alpha = \frac{1}{n(k-1)} SSA - \frac{\hat{\sigma}^2}{n} = \frac{1}{100\cdot (5-1)} \cdot 0.0046576 - \frac{2.86 \cdot 10^{-7}}{100}   = 1.16\cdot 10^{-5}$$

```{r, include=FALSE}
sigma2<- SSE/(99*5)
sigma2a<- SSA/(100*4) -sigma2/100
```


## Summing up

- The meter has an estimated systematic error of $\hat \mu = `r round(100*muhat, 2)`\%$.
- The estimated standard deviation of the meter is $\hat \sigma = \sqrt{2.86 \cdot 10^{-7}} = 0.0534\%$.
- The estimated standard deviation of the production error is $\hat \sigma_\alpha = \sqrt{1.16\cdot 10^{-5}} = 0.341 \%$. 
* Since $99.7\%$ (practically all) of all observations fall within $\pm 3\cdot \sigma_\alpha$ from $0$, we have that the production error falls within $$\pm 3 \cdot 0.341\% = 1.02 \%$$ of the nominal value, which is in accordance with the tolerance of 1%. 

* The total estimated variance of the log error is
 $$\hat \sigma_\alpha^2 + \hat \sigma^2 = 1.16\cdot 10^{-5} + 2.86 \cdot 10^{-7}=1.19\cdot 10^{-5}.$$
  * The variance is clearly dominated by the production error.

* Note that especially the estimate $\hat \sigma_\alpha$ is quite uncertain, since we only have  measurements from 5 capacitors.

## Test of no random effect

* We have the possibility of testing the hypothesis
$$H_0 : \sigma_\alpha=0.$$
* The formulas for $E(SSA)$ and $E(SSE)$ were
$$E(SSA)=(k-1)\sigma^2 + n(k-1)\sigma_\alpha^2$$
$$E(SSE)=k(n-1)\sigma^2.$$
* Under $H_0$, this is means that 
$$\frac{1}{k-1}E(SSA)=\frac{1}{k(n-1)}E(SSE)=\sigma^2.$$
* Under $H_0$, the $F$ statistic 
$$F_{obs}=\frac{\frac{SSA}{k-1}}{\frac{SSE}{k(n-1)}}$$
has an F-distribution with degrees of freedom $df_1=k-1$ and  $df_2=k(n-1)$.  
  * Large values are critical for the null-hypothesis.

* In the capacitor dataset $F_{obs}=4067.4$, which is highly significant (p-value close to 0).
  * Our capacitors do have some production errors.

## Coefficient of variation

* Let $X$ be a random variable with mean $\mu$ and standard deviation $\sigma$.
* If we are interested in relative variation, it is common to look at the **coefficient of variation**
$$CV(X)=\frac{\sigma}{\mu}$$
  * Standard deviation relative to the mean
  * Unit-free
* If $X$ is normal, then 95% of our measurements are within
$$\mu\pm 2\cdot \sigma=\mu\pm 2\cdot \mu\cdot CV(X)=\mu (1\pm 2\cdot CV(X)).$$
* If e.g.\ $CV(X)=0.05$, it means that $95\%$ of all observations are within $2\cdot 0.05 = 10\%$ of the mean.

## The lognormal  distribution 

* In the preceeding analysis, we assumed that the log-transformed errors had a normal distribution. 

* Let $X$ be a random variable and $Y=\ln(X)$.

* We say that $X$ has a **lognormal distribution** if $Y$ has a normal distribution with - say - mean $\mu$ and standard deviation $\sigma$.

* Here are some plots of the density of the lognormal distribution:

```{r echo=FALSE, fig.dim = c(8, 6)}
curve(dlnorm(x,0,1),0,5,ylab="density",lwd=2,col=2)
curve(dlnorm(x,1,1),0,5,ylab="density",lwd=2,col=2,add=T,lty=2)
legend(x = "topright",
  legend = c(expression(paste("(",
    paste(mu),",",paste(sigma),") = (0, 1)")),expression(paste("(",
    paste(mu),",",paste(sigma),") = (1, 1)"))),
  lty = c(1, 2),col = c(2, 2),lwd = 2)     
```




## Coefficient of variation for lognormal distribution

* Suppose $X$ has a log-normal distribution, so that $Y=\ln(X)$ has a normal distribution with mean $\mu$ and standard deviation $\sigma$. 
* Then the mean and variance are given by (Theorem 6.7 of [WMMY]):
$$E(X)=\exp(\mu+\sigma^2/2)$$
$$Var(X)=\exp(2\mu+\sigma^2)(\exp(\sigma^2)-1)$$

* The coefficient of variation is then
$$CV(X)=\frac{\sqrt{Var(X)}}{E(X)}= \frac{\sqrt{\exp(2\mu+\sigma^2)(\exp(\sigma^2)-1)}}{\exp(\mu+\sigma^2/2)}=\sqrt{\exp(\sigma^2)-1}$$

* In Peter's data we estimated the variance of the ln error to
 $\hat\sigma_\alpha^2 = 1.16\cdot 10^{-5},$
which means that the estimated CV of the capacity measurement is
 $$\widehat{CV}(X)=\sqrt{\exp\left (  1.16\cdot 10^{-5} \right ) -1} = 0.341\%.$$


## Linear calibration

* In our previous analysis, we assumed, that the systematic error on the meter did not depend on nominal value.
$$
\ln\Big(\frac{\text{measuredValue}}{\text{nominalValue}}\Big) = \text{meterError} + \text{randomError} 
$$

* To check this assumption consider the linear model
$$\ln(\text{measuredValue}) =\alpha+\beta \cdot \ln(\text{nominalValue})+\varepsilon.$$

* Note that the previously considered model corresponds to $\beta=1$.


## Linear calibration fit

* We fit the linear model:

```{r}
fit <- lm(log(capacity) ~ log(nomval), data = capDat)
summary(fit)
```

* The slope looks close to 1.

* We may test the null-hypothesis $H_0: \beta = 1$.
$$t_{obs} = \frac{1.0002636-1}{0.0002648} = 0.995.$$
This yields a p-value of around $32\%$.

  * It is a bit dubious to model a linear relationship with only 3 nominal values.
  * Also note that we have correlated measurements, since several measurements are made on the same capacitors. 

## Calibrated values

* If we stick to the linear calibration model, it is sensible to correct our measured errors according to the calibration of the meter.

* We have the model:
$$\text{measuredValue}=\alpha+\beta *\text{nominalValue}$$

* We compute the calibrated values
$$\text{calibratedValue}=(\text{measuredValue}-\alpha)/\beta$$

* We estimate the coefficients $\alpha$ and $\beta$ and calibrate the measurements.  
  
```{r}
ab = coef(fit)
ab
capDat$lnError_c = (capDat$lnError - ab[1])/ab[2]
head(capDat)
```
  
```{r echo=FALSE}
save(ab, file = "ab.RData")
```

