The ASTA team
Let \(X\) and \(Y\) be two discrete random variables. The joint distribution of \(X\) and \(Y\) is given by their joint probability function \[ f(x,y) = P(X=x,Y=y).\]
We find the probability of \((X,Y)\in A\) by summing probabilities: \[P((X,Y)\in A) = \sum_{(x,y)\in A}f(x,y).\]
Example: We roll two dice and let \(X\) be the outcome of die 1 and \(Y\) be the outcome of die 2. Since all 36 combinations are equally likely, \[f(x,y)=P(X=x,Y=y) =\frac{1}{36},\qquad x,y=1,2,\ldots,6.\] We can now compute: \[P(X+Y=4) = f(1,3)+f(2,2)+f(3,1) = \frac{1}{36}+\frac{1}{36}+\frac{1}{36} =\frac{1}{12}. \]
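As a quick check, this probability can also be computed in R by enumerating all 36 equally likely outcomes (a small sketch, not part of the original example):

# Enumerate all 36 equally likely outcomes of the two dice
outcomes <- expand.grid(x = 1:6, y = 1:6)
# Proportion of outcomes with x + y = 4, i.e. P(X + Y = 4)
mean(outcomes$x + outcomes$y == 4)   # 3/36 = 1/12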
Let \((X,Y)\) be a pair of discrete variables with joint probability function \(f(x,y)\). The marginal probability function for \(X\) is found by \[f(x)=P(X=x) = \sum_y P(X=x,Y=y) = \sum_y f(x,y).\]
Similarly, the marginal probability function for \(Y\) is \[g(y) = \sum_x f(x,y).\]
We say that \(X\) and \(Y\) are independent if \[f(x,y) = f(x)g(y).\]
Note: Recalling the definition of the probability function, the independence condition says that \[f(x,y)=P(X=x,Y=y)=P(X=x)\cdot P(Y=y) = f(x)g(y),\] which corresponds to independence of the events \(\{X=x\}\) and \(\{Y=y\}\).
Example: We roll two dice and let \(X\) and \(Y\) be the outcome of die 1 and die 2, respectively. We found earlier that \(f(x,y)=\tfrac{1}{36}\) for \(x,y=1,2,\ldots,6\). From this we can find the marginal distribution of \(X\): \[f(x) = \sum_{y=1}^6 f(x,y) = \frac{1}{36}+\frac{1}{36}+\frac{1}{36}+\frac{1}{36}+\frac{1}{36}+\frac{1}{36} =\frac{1}{6}, \quad x=1,2,\ldots,6,\] as we would expect. Similarly, the marginal distribution of \(Y\) is \(g(y)=\tfrac{1}{6},\quad y=1,2,\ldots,6\). We can now check that the two dice are statistically independent: \[f(x,y) = \frac{1}{36} = \frac{1}{6}\cdot \frac{1}{6} = f(x)g(y).\]
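The marginals and the independence check can also be illustrated in R. A minimal sketch of the 6-by-6 joint table (not part of the original notes):

# Joint probability table: f(x, y) = 1/36 for all 36 combinations
joint <- matrix(1/36, nrow = 6, ncol = 6)
rowSums(joint)   # marginal f(x): 1/6 for each x
colSums(joint)   # marginal g(y): 1/6 for each y
# Independence: the joint table equals the product of the marginals
all.equal(joint, outer(rowSums(joint), colSums(joint)))   # TRUE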
Let \(X\) and \(Y\) be two continuous random variables. The joint distribution of \(X\) and \(Y\) is given by their joint density function \(f(x,y)\).
To find the probability of \((X,Y)\in A\), we integrate over \(A\): \[P((X,Y)\in A) = \iint_A f(x,y) dx dy.\]
Example: Suppose that \((X,Y)\) have the joint density \[f(x,y)=\begin{cases}1, & 0\leq x,y\leq 1\\ 0, & \text{otherwise.} \end{cases}\] Suppose we want to find the probability \(P(X+Y\leq 1)\). This means \((X,Y)\) should belong to the set \(A=\{(x,y): x+y\leq 1 \}\). Thus, \[P(X+Y\leq 1) =\iint_A f(x,y)\, dx dy =\int_0^1 \int_0^{1-x} 1\, dy\, dx \\ =\int_0^1 [y]_0^{1-x}\, dx =\int_0^1 (1-x)\, dx = [-\tfrac{1}{2}(1-x)^2]_0^1 =\tfrac{1}{2}. \]
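This probability can also be checked by simulation in R. A Monte Carlo sketch (not in the original example), using the fact that the joint density above describes two independent uniform variables on \([0,1]\):

# Monte Carlo approximation of P(X + Y <= 1) for the uniform joint density above
set.seed(1)
x <- runif(100000)
y <- runif(100000)
mean(x + y <= 1)   # close to the exact value 1/2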
The marginal densities of \(X\) and \(Y\) are found by integrating out the other variable: \[f(x)=\int_{-\infty}^\infty f(x,y)\, dy, \qquad g(y)=\int_{-\infty}^\infty f(x,y)\, dx.\]
For two random variables \(X\) and \(Y\), the dependence between them can be measured by their covariance. This is given by \[\sigma_{XY} = E((X-\mu_X)(Y-\mu_Y)) = \sum_{(x,y)} (x-\mu_X)(y-\mu_Y)f(x,y),\\
\sigma_{XY} = E((X-\mu_X)(Y-\mu_Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x-\mu_X)(y-\mu_Y)f(x,y)\,dx\,dy,\]
in the discrete and continuous case, respectively, where \(\mu_X\) and \(\mu_Y\) denote the means of \(X\) and \(Y\).
Properties:
\(\sigma_{XY}>0\) indicates that the values of \(X\) tend to be large when \(Y\) is large and \(X\) tends to be small when \(Y\) is small.
\(\sigma_{XY}<0\) indicates that the values of \(X\) tend to be large when \(Y\) is small and small when \(Y\) is large.
If \(X\) and \(Y\) are statistically independent, then \(\sigma_{XY}=0\).
If \(\sigma_{XY}=0\) it is not guaranteed that \(X\) and \(Y\) are independent!
Apart from this, the values of \(\sigma_{XY}\) are hard to interpret since they depend on the units that \(X\) and \(Y\) are measured in.
To obtain a unit free version of the covariance, we define the correlation coefficient \[\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X\sigma_Y}.\] This can be thought of as the covariance when \(X\) and \(Y\) are measured in standard deviation units.
Properties:
\(-1\leq \rho_{XY}\leq 1\).
\(\rho_{XY}=1\) means one of the variables is linearly determined by the other, say \(Y=a+bX\), where the slope \(b>0\).
\(\rho_{XY}=-1\) means one of the variables is linearly determined by the other, say \(Y=a+bX\), where the slope \(b<0\).
If \(X\) and \(Y\) are independent, then \(\rho_{XY}=0\). Again, one cannot conclude that \(X\) and \(Y\) are independent if \(\rho_{XY}=0\).
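A small simulated R sketch (the data and numbers here are made up, not from the lecture), illustrating that the covariance depends on the measurement units while the correlation does not:

set.seed(1)
x <- rnorm(1000)
y <- x + rnorm(1000)   # y depends positively on x
cov(x, y)              # positive covariance
cov(100 * x, y)        # rescaling x changes the covariance by a factor 100
cor(x, y)              # correlation between 0 and 1
cor(100 * x, y)        # ... and is unchanged by rescaling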
More on correlation in Module 3.
The binomial distribution: An experiment with two possible outcomes (success/failure) is repeated \(n\) times, the repetitions being independent of each other and each having probability \(p\) of success.
Let \(X\) be the number of successes. Then \(X\) can take the values \(0,1,\ldots,n\).
\(X\) follows a binomial distribution, denoted \(X \sim \text{binom}(n, p)\) (or \(\text{Bin}(n, p)\)).
Example: Flip a coin \(n\) times. In each flip, the probability of head is \(p=\tfrac{1}{2}\). Let \(X\) be the number of heads. Then \(X\sim \text{binom}(n, 1/2)\).
Example: We buy \(n\) items of the same type. Each has probability \(p\) of being defective. Let \(X\) be the number of defective items. Then \(X\sim \text{binom}(n, p)\).
The probability function of the binomial distribution is given below, where \(\binom{n}{x}\) denotes the binomial coefficient:
\[P(X=x) = \binom{n}{x}p^x(1-p)^{n-x}, \quad x=0,1,\ldots,n\]
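In R, binomial probabilities can be computed with the base functions dbinom (the probability function) and pbinom (cumulative probabilities). A small sketch with assumed values \(n=10\) and \(p=0.5\):

# P(X = 3) for X ~ binom(10, 0.5)
dbinom(3, size = 10, prob = 0.5)
# P(X <= 3) for X ~ binom(10, 0.5)
pbinom(3, size = 10, prob = 0.5)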
Graph of probability functions for binomial distributions with \(n = 10\):
The normal distribution is a continuous distribution with probability density function \[n(x;\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.\]
It depends on two parameters:
The mean \(\mu\)
The standard deviation \(\sigma\)
When a random variable \(Y\) follows a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), we write \(Y \sim \texttt{norm}(\mu,\sigma)\).
Interpretation of standard deviation:
The normal distribution with mean \(\mu=0\) and standard deviation \(\sigma=1\) is called the standard normal distribution.
If \(Y\sim \texttt{norm}(\mu,\sigma)\) then the corresponding \(z\)-score is \[Z=\frac{Y-\mu}{\sigma}\]
Interpretation: \(Z\) is the number of standard deviations that \(Y\) is away from the mean, where a negative value tells that we are below the mean.
We have that \(Z\sim \texttt{norm}(0,1)\), i.e. \(Z\) follows a standard normal distribution.
The probability that \(X\sim \texttt{norm}(\mu,\sigma)\) falls between \(a\) and \(b\) is given by \[P(a<X<b) =\int_a^b \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx. \]
This integral cannot be computed by hand!
To calculate normal probabilities in R we use the mosaic package. The first time you use the mosaic package, you need to install it. This is done via the command install.packages("mosaic"). Afterwards, the package is loaded with the library command: library(mosaic). This loads the mosaic package, which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the mosaic package, but you can safely ignore that.
To find the probability \(P(X\leq q)\) when \(X\sim \texttt{norm}(\mu,\sigma)\), we use the pdist function in R.
For instance with \(q=1\), \(\mu=0\) and \(\sigma=1\), we type
# For a standard normal distribution the probability of getting a value less than 1 is:
pdist("norm", q = 1, mean = 0, sd = 1)
## [1] 0.8413447
The output is always the probability of being to the left of \(q\), shown as the shaded area in the plot that pdist produces.
To get the probability of being to the right of \(q\), we compute
\[P(X>q) = 1-P(X\leq q) = 1-0.8413447 = 0.1586553.\]
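In R, this right-tail probability can be obtained by subtracting the pdist output from 1 (a small sketch using the same \(q\), \(\mu\) and \(\sigma\) as above):

# Probability of being to the right of q = 1 for a standard normal distribution
1 - pdist("norm", q = 1, mean = 0, sd = 1)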
We can also go in the other direction using qdist: given a probability \(p\), find the value \(z\) such that \(P(X\leq z) = p\) when \(X\sim \texttt{norm}(\mu,\sigma)\).
For instance with \(p=0.005\), \(\mu=0\) and \(\sigma =1\):

qdist("norm", p = 0.005, mean = 0, sd = 1)

## [1] -2.575829

and, for the complementary probability \(p=0.995\):

qdist("norm", p = 0.995, mean = 0, sd = 1)

## [1] 2.575829
In statistics, the word population refers to the collection of all the objects we are interested in.
Examples:
The Danish population
All possible outcomes of a lab experiment
A sample consists of finitely many elements selected randomly and independently of each other from the population.
Examples:
People selected for an opinion poll
The experiments we actually carried out
If we draw a random element from the population, the result will be a random variable \(X\) with a certain distribution.
When we sample, we draw \(n\) elements from the population independently of each other. This results in \(n\) independent random variables \(X_1,\ldots,X_n\), each having the same distribution as \(X\).
Sampling principles:
Independence: If you perform experiments in the lab, reusing parts of one experiment in the next might cause dependence between outcomes. If you measure the same quantity at a sequence of time points, measurements close in time are typically not independent.
Same distribution as the population: If we only go out and make weather measurements when the weather is good, our sample does not have the same distribution as measurements from any randomly selected day.
Note: We use capital letters \(X_1,\ldots,X_n\) to indicate that the elements of the sample are random and small letters \(x_1,\ldots,x_n\) to denote the values that are actually observed in the experiment. These values are called observations.
Statistical inference means drawing conclusions about the population based on the sample.
Typically, we want to draw conclusions about some parameters of the population, e.g. mean \(\mu\) and standard deviation \(\sigma\).
Note: The number of elements \(n\) in the sample is called the sample size. In general: the larger \(n\), the more precise conclusions we can draw about the population.
Consider an experiment with two possible outcomes, e.g. flipping a coin or testing whether a component is defective or not.
Call the two outcomes 0 and 1. We are interested in the probability \(p\) of getting the outcome 1.
Given a sample \(X_1,\ldots, X_n\), we estimate \(p\) by \[\hat{P} = \frac{\text{number of 1's among }X_1,\ldots,X_n}{n} = \frac{\sum_{i=1}^n X_i}{n}.\]
\(\hat{P}\) is a so-called summary statistic, i.e. a function of the sample.
Since \(\hat{P}\) is a function of the random sample \(X_1,\ldots, X_n\), \(\hat{P}\) is itself a random variable. Different samples may lead to different values of \(\hat{P}\).
\(E(\hat{P}) = p\).
\(\lim_{n\to \infty} \hat{P} = p\).
John Kerrich, a South African mathematician, was visiting Copenhagen when World War II broke out. Two days before he was scheduled to fly to England, the Germans invaded Denmark. Kerrich spent the rest of the war interned at a camp in Hald Ege near Viborg, Jutland. To pass the time he carried out a series of experiments in probability theory. In one, he tossed a coin 10,000 times.
The first 25 observations were (0 = tail, 1 = head): \[ 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,\ldots\]
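As a small illustration (not part of the original notes), the estimate \(\hat{p}\) based on these first 25 observations can be computed in R:

# Kerrich's first 25 tosses (0 = tail, 1 = head)
tosses <- c(0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0)
mean(tosses)   # proportion of heads: 13/25 = 0.52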
Plot of the empirical probability \(\hat{p}\) of getting a head against the number of tosses \(n\):
(The horizontal axis is on a log scale).
Suppose we are interested in the mean value \(\mu\) of a population and we have drawn a random sample \(X_1,\ldots, X_n\).
Based on the sample we estimate \(\mu\) by the sample mean, which is the average of all the elements
\[
\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i.
\]
Properties:
\(\bar{X}\) is random, as it depends on the random sample \(X_1,\ldots,X_n\). Different samples might result in different values of \(\bar{X}\).
\(E(\bar{X}) = \mu\).
\(\bar{X}\) has standard deviation \(\frac{\sigma}{\sqrt{n}}\), where \(\sigma\) is the population standard deviation. Note that increasing \(n\) decreases \(\frac{\sigma}{\sqrt{n}}\).
To distinguish between the standard deviation of the population and the standard deviation of \(\bar{X}\), we call the standard deviation of \(\bar{X}\) the standard error.
\(\lim_{n\to \infty} \bar{X} = \mu\).
When the population distribution is a normal distribution \(\texttt{norm}(\mu,\sigma)\), then \[\bar{X} \sim \texttt{norm}\left(\mu,\frac{\sigma}{\sqrt{n}}\right).\]
For any population distribution, the central limit theorem states that \(\bar{X} \approx \texttt{norm}\left(\mu,\frac{\sigma}{\sqrt{n}}\right)\) when the sample size \(n\) is large.
As a rule of thumb, \(n\) is large enough when \(n\geq 30\).
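A small simulation sketch of the central limit theorem (the exponential population and all numbers here are assumptions chosen for illustration, not from the notes): the population is far from normal, yet the distribution of \(\bar{X}\) with \(n=30\) has approximately the stated mean and standard error.

# Simulate 10,000 sample means, each based on n = 30 draws from an exponential(1) population
set.seed(1)
xbar <- replicate(10000, mean(rexp(30, rate = 1)))
mean(xbar)   # close to the population mean mu = 1
sd(xbar)     # close to sigma / sqrt(n) = 1 / sqrt(30), about 0.18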
Suppose we are interested in the variance \(\sigma^2\) of a population and we have drawn a random sample \(X_1,\ldots, X_n\).
Based on the sample we estimate the population variance \(\sigma^2\) by the sample variance, which is
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2.
\]
We estimate the population standard deviation \(\sigma\) by the sample standard deviation \[ S = \sqrt{S^2}. \]
Properties:
\(S^2\) is again a random variable.
\(E(S^2) = \sigma^2\).
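In R, the built-in functions var and sd use exactly these formulas (with division by \(n-1\)). A minimal check with made-up numbers:

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
var(x)                                    # sample variance S^2
sum((x - mean(x))^2) / (length(x) - 1)    # same value, computed from the definition
sd(x)                                     # sample standard deviation S = sqrt(S^2)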
We have that \(\bar{X}\approx \texttt{norm}(\mu,\frac{\sigma}{\sqrt{n}})\): exactly so when the population itself follows a normal distribution, and approximately, by the central limit theorem, when \(n\) is large.
The corresponding \(z\)-score \(Z= \tfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}\) follows a standard normal distribution \(\texttt{norm}(0,1)\).
Problem: We don’t know \(\sigma\).
We may instead insert the sample standard deviation \(S\) to get the \(t\)-score \[T= \tfrac{\bar{X}-\mu}{S/\sqrt{n}}.\]
Since \(S\) is random with a certain variance, this causes \(T\) to vary more than \(Z\).
As a consequence, \(T\) no longer follows a normal distribution, but a \(t\)-distribution with \(n-1\) degrees of freedom.
The density function has a somewhat complicated form and will not be stated here; instead, the \(t\)-distribution is plotted below for \(df = 1, 2, 10\) and \(\infty\).
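As a small illustration of the heavier tails (an assumed example, using the mosaic functions introduced earlier), the 97.5% quantile of a \(t\)-distribution with 9 degrees of freedom can be compared with that of the standard normal:

# 97.5% quantile of the t-distribution with df = 9 versus the standard normal
qdist("t", p = 0.975, df = 9)                 # about 2.26
qdist("norm", p = 0.975, mean = 0, sd = 1)    # about 1.96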