Probability

The ASTA team

Probability of events

The concept of probability

Actual experiment

import numpy as np
y_canteen = np.array([2, 5, 1, 6, 1, 1, 1, 1, 3, 4, 1, 2, 1, 2, 2, 2, 4, 2, 2, 5, 20, 2, 1, 1, 1, 1])
x_canteen = np.where(y_canteen > 2, 1, 0)
x_canteen
## array([0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0,
##        0, 0, 0, 0])
p_canteen = x_canteen.sum() / len(x_canteen)
p_canteen
## np.float64(0.2692307692307692)

Another experiment

np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 0]) 
## array([0, 0, 0, 1, 1, 1, 0, 1, 0, 0])

(The horizontal axis is on a log scale).

Definitions

We conduct the experiment \(n\) times. Let \(\#(A)\) denote how many times we observe the event \(A\).
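The count \(\#(A)\) divided by \(n\) is the relative frequency of \(A\), which is how `p_canteen` was computed above. A minimal sketch of the same calculation on simulated data, assuming a fair coin where \(A\) is "heads" (the seed, the number of repetitions and the names `count_A`, `rel_freq` and `running` are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(seed=1)       # seed fixed only for reproducibility
n = 10_000                                # number of repetitions (assumed)
outcomes = rng.integers(0, 2, size=n)     # 1 = event A observed, 0 = not observed

count_A = outcomes.sum()                  # #(A): how many times A was observed
rel_freq = count_A / n                    # relative frequency #(A) / n

# Relative frequency after 1, 2, ..., n repetitions
running = np.cumsum(outcomes) / np.arange(1, n + 1)
print(rel_freq)
```

With this many repetitions the relative frequency should land close to 0.5, and `running` shows how it settles as \(n\) grows.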

Theoretical probabilities of two events

Conditional probability

Example with magazine data:

import pandas as pd

magAds = pd.read_csv("https://asta.math.aau.dk/datasets?file=magazineAds.txt", sep = "\t")
magAds['words'] = pd.cut(
    magAds['WDS'],
    bins = [31, 72, 146, 230],
    include_lowest = True
)
magAds['education'] = pd.Categorical(
    magAds['GROUP'].map({1: 'high', 2: 'medium', 3: 'low'}),
    categories = ['high', 'medium', 'low'],
    ordered = True
)
tab = pd.crosstab(magAds['words'], magAds['education'])
tab
## education       high  medium  low
## words                            
## (30.999, 72.0]     4       6    5
## (72.0, 146.0]      5       6    8
## (146.0, 230.0]     9       6    5
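Conditional probabilities can be estimated from this table by dividing each count by the total of the row or column that is conditioned on. The sketch below re-enters the counts by hand so that it runs on its own; the names `cond_words_given_edu` and `cond_edu_given_words` are illustrative.

```python
import pandas as pd

# Counts copied from the cross-tabulation above (rows: words, columns: education)
tab = pd.DataFrame(
    {"high": [4, 5, 9], "medium": [6, 6, 6], "low": [5, 8, 5]},
    index=["(30.999, 72.0]", "(72.0, 146.0]", "(146.0, 230.0]"],
)

# Estimated P(words | education): divide each column by its column total
cond_words_given_edu = tab / tab.sum(axis=0)

# Estimated P(education | words): divide each row by its row total
cond_edu_given_words = tab.div(tab.sum(axis=1), axis=0)

print(cond_words_given_edu.round(3))
print(cond_edu_given_words.round(3))
```

For instance, 9 of the 18 ads in the high education group fall in the longest word interval, so the estimated conditional probability is 0.5. Passing `normalize = 'columns'` or `normalize = 'index'` to `pd.crosstab` should give the same tables directly from the data.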

Conditional probability and independence

Magazine data revisited

Discrete distribution

Example: Magazine data

# Table with the percentage of ads in each combination of the levels of 'words' and 'education'
tab_counts = pd.crosstab(magAds['words'], magAds['education'])
tab_percent = (tab_counts / tab_counts.sum().sum()) * 100
tab_percent.round(2)
## education        high  medium    low
## words                               
## (30.999, 72.0]   7.41   11.11   9.26
## (72.0, 146.0]    9.26   11.11  14.81
## (146.0, 230.0]  16.67   11.11   9.26
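Summing the joint table over rows or columns gives the marginal distributions, and the product of the marginals is what the joint table would look like if the two variables were independent. A small sketch under the assumption that the observed proportions are used as probabilities (the counts are re-entered by hand; `joint`, `p_words`, `p_edu` and `indep` are illustrative names):

```python
import numpy as np
import pandas as pd

# Counts copied from the cross-tabulation of words and education
counts = pd.DataFrame(
    {"high": [4, 5, 9], "medium": [6, 6, 6], "low": [5, 8, 5]},
    index=["(30.999, 72.0]", "(72.0, 146.0]", "(146.0, 230.0]"],
)

joint = counts / counts.to_numpy().sum()   # joint proportions, summing to 1
p_words = joint.sum(axis=1)                # marginal distribution of words
p_edu = joint.sum(axis=0)                  # marginal distribution of education

# Joint proportions that would hold if words and education were independent
indep = pd.DataFrame(
    np.outer(p_words, p_edu), index=joint.index, columns=joint.columns
)
print((joint * 100).round(2))
print((indep * 100).round(2))
```

Comparing the two printed tables gives an informal impression of how far the data are from independence.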

General discrete distribution

Example: 3 coin tosses

Distribution of general random variables

Probability distribution

Sample

We conduct an experiment \(n\) times, where the outcome of the \(i\)th experiment corresponds to a measurement of a random variable \(Y_i\), where we assume

Population parameters

                     Population   Sample
Mean                 \(\mu\)      \(\overline{y}\)
Standard deviation   \(\sigma\)   \(s\)

Distribution of a discrete random variable

Expected value (mean) for a discrete distribution

Example: number of heads in 3 coin flips

y (number of heads)   0     1     2     3
\(P(Y = y)\)          1/8   3/8   3/8   1/8
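The expected value of a discrete distribution is the probability-weighted sum \(\mu = \sum_y y\,P(Y = y)\). A minimal sketch of this calculation for the table above (the array names `y` and `p` are illustrative):

```python
import numpy as np

y = np.array([0, 1, 2, 3])          # possible numbers of heads
p = np.array([1, 3, 3, 1]) / 8      # P(Y = y) from the table

mu = np.sum(y * p)                  # expected value: sum of y * P(Y = y)
print(mu)                           # 1.5
```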

Variance and standard deviation for a discrete distribution

Example: number of heads in 3 coin flips

The distribution of the random variable ‘number of heads in 3 coin flips’ has variance \[ \sigma^2 = (0-1.5)^2\frac{1}{8} + (1-1.5)^2\frac{3}{8} + (2-1.5)^2 \frac{3}{8} + (3-1.5)^2 \frac{1}{8} = 0.75. \]

and standard deviation \[ \sigma = \sqrt{\sigma^2} = \sqrt{0.75} = 0.866. \]
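The same numbers can be checked numerically. A short sketch that applies \(\sigma^2 = \sum_y (y - \mu)^2 P(Y = y)\) to the coin-flip distribution (the variable names are illustrative):

```python
import numpy as np

y = np.array([0, 1, 2, 3])          # possible numbers of heads
p = np.array([1, 3, 3, 1]) / 8      # P(Y = y)

mu = np.sum(y * p)                  # expected value, 1.5
var = np.sum((y - mu) ** 2 * p)     # variance: probability-weighted squared deviations
sd = np.sqrt(var)                   # standard deviation
print(var, round(sd, 3))
```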

The binomial distribution

import matplotlib.pyplot as plt
from scipy.stats import binom
n = 10
p = 0.35
x = np.arange(0, n + 1)
pmf_values = binom.pmf(x, n, p)
plt.bar(x, pmf_values, width = 0.8, color = "skyblue", edgecolor = "black")
plt.xlim(-0.5, n + 0.5);
plt.xlabel("Number of successes");
plt.ylabel("Probability");
plt.title("binom(n = 10, prob = 0.35)");
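The number of heads in 3 flips of a fair coin is itself binomial with \(n = 3\) and \(p = 0.5\), so `binom` should reproduce the probabilities, mean and variance found by hand earlier. A small check (the names `pmf_coin`, `mean_coin` and `var_coin` are illustrative):

```python
import numpy as np
from scipy.stats import binom

# P(Y = 0), ..., P(Y = 3) for 3 flips of a fair coin
pmf_coin = binom.pmf(np.arange(4), 3, 0.5)

# Mean n*p and variance n*p*(1 - p) of the same distribution
mean_coin, var_coin = binom.stats(3, 0.5, moments="mv")

print(pmf_coin)
print(float(mean_coin), float(var_coin))
```

The printed probabilities should match 1/8, 3/8, 3/8, 1/8, with mean 1.5 and variance 0.75.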

Distribution of a continuous random variable

Density function

Increasing number of observations

Density shapes

Normal distribution

Reach of the normal distribution

Interpretation of standard deviation:
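For any normal distribution, the probability of falling within \(k\) standard deviations of the mean is the same regardless of \(\mu\) and \(\sigma\). A sketch of how these probabilities can be computed from the standard normal distribution (the dictionary name `within` is illustrative):

```python
from scipy.stats import norm

# P(mu - k*sigma < Y < mu + k*sigma) for k = 1, 2, 3
within = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within {k} standard deviation(s): {prob:.4f}")
```

This gives roughly 68%, 95% and 99.7% for one, two and three standard deviations.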

Normal \(z\)-score

Calculating probabilities in the standard normal distribution

# For a standard normal distribution the probability of getting a value less than 1 is:
from scipy.stats import norm
mean = 0
sd = 1
q = 1
left_prob = norm.cdf(q, loc = mean, scale = sd)
left_prob
## np.float64(0.8413447460685429)
x = np.linspace(mean - 4*sd, mean + 4*sd, 500)
y = norm.pdf(x, loc = mean, scale = sd)
plt.plot(x, y, color = 'black')
plt.fill_between(x, y, 0, where = (x <= q), color = 'skyblue', alpha = 0.5)
plt.xlabel('x')
plt.ylabel('Density')
plt.title(f'Normal Distribution (mean={mean}, sd={sd})\nP(X <= {q}) = {left_prob:.4f}')

So q = 1 is the 0.841-quantile (the 84.1th percentile) of the standard normal distribution.

right_prob = 1 - left_prob
right_prob.round(4)
## np.float64(0.1587)

Calculating \(z\)-values (quantiles) in the standard normal distribution

left_z = norm.ppf(q = 0.005, loc = 0, scale = 1)
left_z
## np.float64(-2.575829303548901)

right_z = norm.ppf(q = 1 - 0.005, loc = 0, scale = 1)
right_z
## np.float64(2.5758293035489004)

Example

The Stanford-Binet Intelligence Scale is calibrated to be approximately normal with mean 100 and standard deviation 16.

What is the 99th percentile of IQ scores?
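One way to answer this, assuming the normal model holds exactly, is to find the 0.99-quantile \(z\) of the standard normal distribution and convert it back to the IQ scale with \(y = \mu + z\sigma\). The sketch below also calls `norm.ppf` with `loc` and `scale` directly as a cross-check (the variable names are illustrative):

```python
from scipy.stats import norm

mu, sigma = 100, 16                    # mean and standard deviation of IQ scores

z = norm.ppf(0.99)                     # 0.99-quantile of the standard normal
iq_99 = mu + z * sigma                 # convert the z-score back to the IQ scale

# Same quantile computed directly on the IQ scale
iq_99_direct = norm.ppf(0.99, loc=mu, scale=sigma)

print(round(z, 3), round(iq_99, 1))
```

Under these assumptions the 99th percentile is an IQ of about 137.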

Distribution of sample statistic

Estimates and their variability

We are given a sample \(y_1,y_2,\ldots,y_n\).

Distribution of sample mean

Central limit theorem

Illustration of CLT
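A simulation can make the theorem concrete. The sketch below assumes a right-skewed exponential population with mean 1 and standard deviation 1, draws many samples of size \(n\), and looks at the distribution of the sample means; the population, the seed, `n` and `reps` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=2)    # seed fixed only for reproducibility
n = 50                                 # sample size (assumed)
reps = 20_000                          # number of simulated samples (assumed)

# Each row is one sample of size n from an exponential population (mean 1, sd 1)
samples = rng.exponential(scale=1.0, size=(reps, n))
means = samples.mean(axis=1)           # one sample mean per simulated sample

print(means.mean())                    # should be close to the population mean 1
print(means.std(ddof=1))               # should be close to 1 / sqrt(n)
print(1 / np.sqrt(n))
```

A histogram of `means` should look approximately normal even though the population is clearly skewed, and its spread shrinks as `n` is increased.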

Example