---
output:
  pdf_document: default
---

```{r , include=FALSE}
library(mosaic) # load the mosaic package for later use
library(jpeg)

```

# Exam exercise for Module 1: Wind speed distributions

```{r, fig.width=10, echo=FALSE, fig.align='center'}
# was ![](https://asta.math.aau.dk/static-files/asta/img/mills.jpg)
url <- "https://asta.math.aau.dk/static-files/asta/img/mills.jpg"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(jpeg::readJPEG(z))
invisible(file.remove(z))
```


In this workshop we consider a continuous probability distribution
called the Weibull distribution. Among other things, it is used to model
wind speed distributions.

Part I is a purely theoretical exercise which you should answer using
pen and paper only. For Part II and III we recommend that you answer the
exercises using Rmarkdown (you can simply use the exam Rmarkdown file as
a starting point).

There are many calculations of integrals in this execise. At the exam, you will not have the time to go through all of them in detail. It is enough to go through the main points.

# Part I: The Weibull distribution

The Weibull distribution depends on two parameters $k>0$ and $\lambda>0$. If
$X$ follows a Weibull distribution with parameters $k$ and $\lambda$, we
write $X\sim \texttt{weibull}(k,\lambda)$. In this case, $X$ has the
probability density function
$$f(x)=\begin{cases}\frac{k}{\lambda}\left( \frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^k},& x>0,\\
0,&x\leq 0.\end{cases}$$

The parameter $k$ is called the shape parameter, since it determines the
shape of the distribution, while $\lambda$ is called the scale
parameter, because it works by scaling the $x$-axis. The plots below
show the density of a Weibull distribution with $\lambda=1$ for
different values of $k$.

```{r, fig.width=10, echo=FALSE, fig.align='center'}
# was ![](https://asta.math.aau.dk/static-files/asta/img/weibullPlots.png)
url <- "https://asta.math.aau.dk/static-files/asta/img/weibullPlots.png"
z <- tempfile()
download.file(url, z, mode = "wb")
grid::grid.raster(png::readPNG(z))
invisible(file.remove(z))
```


1.  Suppose $X\sim \texttt{weibull}(k,\lambda)$. Find the distribution
    function of $X$. (Hint: Recall that the distribution function is
    defined by $F(x)=P(X\leq x).$ Write this as an integral and apply
    the substitution $u=(x/\lambda)^k$.)

2.  Show that the distribution function $F(x)$ satisfies
    $$\ln(-\ln(1-F(x)))=-k\ln(\lambda) + k\ln(x).$$

3.  Show that the mean value of $X$ is $\lambda\Gamma(1+\tfrac{1}{k})$.
    (Hint: Remember that the gamma function is given by
    $\Gamma(s)=\int_{0}^{\infty}u^{s-1}e^{-u} du.$)

4.  Show that the variance of $X$ is
    $\lambda^2(\Gamma(1+\tfrac{2}{k})-\Gamma(1+\tfrac{1}{k})^2)$ and
    find the standard deviation. (Hint: It may be convenient to use the
    formula $\text{Var}(X) = E(X^2)-E(X)^2$, see p. 26 in the slides from Lecture 1.)

# Part II: Wind speed measurements

In this part we consider a data set containing wind speed measurements
from a Danish weather station located at Sjælsmark. The data set
contains the wind speed measured at 12 noon every day of January in the
years 2001-2019. We first load the data set:

```{r}
speed<-read.delim("https://asta.math.aau.dk/datasets?file=windSpeed.txt",header=FALSE)[,1]
```

1.  Draw a histogram of the wind speed observations by editing the R
    chunk below. Explain how a histogram is constructed. Do you think
    the observations come from a normal distribution?

```{r}
# gf_histogram(~...,bins=25)
```

In the following we will convince ourselves that the data actually comes
from a Weibull distribution. We order the $n=589$ observations from
smallest to largest $$x_{(1)}\leq x_{(2)}\leq \dotsm \leq x_{(n)}.$$

2.  Argue that $F(x_{(i)})\approx \tfrac{i}{n}$ for $i=1,\ldots,n$. (Hint: How many observations are less than or equal to $x_{(i)}$?)

3.  Using Exercise 2 in Part I, argue that if the observations come from
    a $\texttt{weibull}(k,\lambda)$ distribution, then
    $$\ln(-\ln(1-\tfrac{i}{n}))\approx -k\ln(\lambda) + k\ln(x_{(i)}).$$

The code below computes a vector containing the values
$v_i=\ln(-\ln(1-\tfrac{i}{n}))$ and a vector containing the values
$u_i = \ln(x_{(i)}).$

```{r, message=FALSE}
n<-length(speed)
sortedSpeed<-sort(speed)
u<-log(sortedSpeed)
CDF<-(1:n)/n
v<-log(-log(1-CDF))
#gf_point(...~...) %>%gf_lm()
```

4.  Argue that the points $(u_i,v_i)$ should lie approximately on a
    straight line if the observations come from a
    $\texttt{weibull}(k,\lambda)$ distribution. Edit the code above to
    check that this is the case.

5.  The intercept and slope of the line can be found to be $-2.82$ and
    $1.78$, respectively. Use this to give estimates of the parameters
    $k$ and $\lambda$ of the model. Insert these values in the code
    below to plot the histogram together with the approximate density
    (`shape` is $k$ and `scale` is $\lambda$).

```{r}
#gf_dhistogram( ~ speed, bins = 25) %>%
#gf_dist("weibull", shape = ..., scale = ..., col = "red")
```

# Part III: Sample mean and the central limit theorem

In this last exercise, we investigate the distribution of the sample
mean when a random sample is taken from a population having a
$\texttt{weibull}(k,\lambda)$ distribution. We will use the values of
$k$ and $\lambda$ that you found in Part II, Exercise 5 to mimic a
sample of wind speed measurements.

1.  What is the mean and standard deviation in the population
    distribution? Use the calculations from Part I. (Hint: You can use
    the function `gamma()` in R to compute the gamma function.)

2.  Suppose that a sample consists of 30 observations from this
    distribution. We denote the sample mean by `x_bar`. Using the
    central limit theorem, answer the following questions:

-   What is the expected value of `x_bar`?

-   What is the standard deviation of `x_bar` (also called the standard
    error)?

-   What is the approximate distribution of `x_bar`?

The code below generates 30 independent realizations of a Weibull
distribution with parameters $k$ and $\lambda$. One may think of this of
as simulated random sample of 30 independent wind speed observations.

```{r}
# x<-rweibull(30, shape=..., scale = ... )
# mean(x)
```

4.  Insert the values of $k$ and $\lambda$ from Part II, Exercise 5 in
    the code. Run the command a few times. Is each sample mean close to
    what you expected?

Use `replicate` to repeat the sampling 500 times and save each mean
value in the vector `x_bar`:

```{r}
# x_bar <- replicate(500, mean(rweibull(30, shape=..., scale = ...) ))
```

4.  Calculate the mean and standard deviation of the values in `x_bar`.
    How do they match with what you expected?

5.  Make a QQ-plot to assess the distribution of `x_bar`. Does this look
    like what you would expect?
```{r}
#qqnorm(...)
#qqline(...)
```