Introduction

Terminology

  • Supervised learning (“labeled” training data)
    • Classification
    • Regression
  • Unsupervised learning (describe hidden structure from “unlabeled” data)
    • PCA
    • Clustering (\(K\)-means, …)

Supervised learning

  • Regression
    • Explain/predict a number \(Y\) from covariates/predictors/features/explanatory variables
  • Classification
    • Now \(Y\) is not a number, but a qualitative variable
    • \(Y = \text{Eye color} \in \{ \text{green}, \text{blue}, \text{brown} \}\)
    • \(Y = \text{E-mail type} \in \{ \text{Spam}, \text{Not spam} \}\)
  • Supervised: Training data is labeled (we know \(Y\)!!)

Classification

  • Given a feature vector \(x\) and a qualitative response \(Y\) taking values in the set \(C\), the classification task is to build a function \(f(x)\) that takes as input the feature vector \(x\) and predicts its value for \(Y\); i.e. \(f(x) \in C\)
  • Often, we are interested in estimating the probabilities that \(X\) belongs to each category in \(C\)

Many methods for classification

  • Logistic regression
  • Classification (and regression) trees
  • Nearest Neighbours
  • Support Vector Machines (SVM)

Logistic regression

Intuition

Linear regression (ignoring error term): \[ y = \beta_0 + \beta_1 x \] Here, \(y \in (-\infty, \infty)\), unless \(\beta_1 = 0\).

Intuition

\[ \text{logit}(p) = \log\left ( \frac{p}{1 - p} \right ), \]

\(\text{logit}(p) \in (-\infty, \infty)\) for \(p \in (0, 1)\).

Go from \((-\infty, \infty)\) to \((0, 1)\) (and back).

\[\text{logit}(p) = x \Leftrightarrow p = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1 + \exp(-x)}\]
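
In R, this mapping and its inverse are available as qlogis() and plogis(); a quick illustration (not from the original slides):

qlogis(0.9)            # logit: log(0.9 / 0.1), approx. 2.197
plogis(2.197)          # inverse logit, approx. 0.9
plogis(qlogis(0.25))   # back where we started: 0.25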

Intuition

\(Y \in \{ 0, 1 \}\). Model \(P(Y = 1)\). Linear regression? No: \(\beta_0 + \beta_1 x\) is not restricted to \((0, 1)\), so we model the logit of \(P(Y = 1)\) instead.

Logistic regression (no additive error term): \[ \text{logit}(P(Y = 1)) = \beta_0 + \beta_1 x \] Here, \(\text{logit}(P(Y = 1)) \in (-\infty, \infty)\), unless \(\beta_1 = 0\), and \[ \text{logit}(P(Y = 1)) = \log\left ( \frac{P(Y = 1)}{1-P(Y = 1)} \right ), \] such that \[ P(Y = 1) = \frac{\exp(\beta_0 + \beta_1 x)}{1+\exp(\beta_0 + \beta_1 x)} = \frac{1}{1 + \exp \left ( - (\beta_0 + \beta_1 x ) \right )} \] and \(P(Y = 1) \in (0, 1)\).

In R: glm(y ~ x, family = binomial).
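
As a small illustration (simulated data with assumed values of \(\beta_0\) and \(\beta_1\), not from the slides), glm() should roughly recover the coefficients used to generate the data:

set.seed(1)                          # reproducibility
n <- 1000
x <- rnorm(n)
p <- plogis(-1 + 2 * x)              # P(Y = 1) with beta_0 = -1, beta_1 = 2
y <- rbinom(n, size = 1, prob = p)
coef(glm(y ~ x, family = binomial))  # estimates close to -1 and 2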

Iris Flowers

data(iris)
head(iris, n = 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Iris flowers

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Example

  • Logistic regression only handles a binary outcome (extensions exist: multinomial regression, nnet::multinom)

library(tidyverse)  # loads dplyr (%>%, mutate) and ggplot2

iris2 <- iris %>% 
  mutate(Virginica = (Species == "virginica"))

iris2 %>% 
  ggplot(aes(Petal.Length, Petal.Width, color = Species)) + 
  geom_point()

iris2 %>% 
  ggplot(aes(Petal.Length, Petal.Width, color = Virginica)) + 
  geom_point()


iris_fit <- glm(Virginica ~ Petal.Length + Petal.Width, 
                data = iris2, 
                family = binomial)
summary(iris_fit)

Call:
glm(formula = Virginica ~ Petal.Length + Petal.Width, family = binomial, 
    data = iris2)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.73752  -0.01028   0.00000   0.00415   1.89659  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -45.272     13.611  -3.326 0.000881 ***
Petal.Length    5.755      2.306   2.496 0.012575 *  
Petal.Width    10.447      3.756   2.782 0.005409 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 190.954  on 149  degrees of freedom
Residual deviance:  20.564  on 147  degrees of freedom
AIC: 26.564

Number of Fisher Scoring iterations: 10

NB! The type = "response" argument is important in the predict function. Otherwise, we get predictions on the link scale (which is the logit scale for logistic regression).

iris2 <- iris2 %>% 
  mutate(Prob_Virginica = predict(iris_fit, type = "response"))
iris2 %>% mutate(Virginica = factor(as.integer(Virginica))) %>% 
  ggplot(aes(x = Virginica, y = Prob_Virginica)) + geom_boxplot() + 
  labs(x = "Virginica", y = "P(Y = Virginica | Petal.Length, Petal.Width)")

Predictions
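
For instance, predicted probabilities of virginica for two hypothetical new flowers (made-up measurements, only for illustration):

new_flowers <- data.frame(Petal.Length = c(4.0, 5.5),
                          Petal.Width  = c(1.2, 2.0))
predict(iris_fit, newdata = new_flowers, type = "response")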

Decision boundary

Decision boundary and probability
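
The decision boundary at \(P(Y = 1) = 0.5\) is the line where \(\beta_0 + \beta_1 \cdot \text{Petal.Length} + \beta_2 \cdot \text{Petal.Width} = 0\). A sketch of how such a plot could be drawn (the figure on the slide may have been produced differently):

cf <- coef(iris_fit)
iris2 %>% 
  ggplot(aes(Petal.Length, Petal.Width, color = Virginica)) + 
  geom_point() + 
  # line where cf[1] + cf[2] * Petal.Length + cf[3] * Petal.Width = 0
  geom_abline(intercept = -cf[1] / cf[3], slope = -cf[2] / cf[3])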

Uncertainty
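
One common way to express uncertainty about the predicted probabilities (a sketch, assuming the usual approach of computing standard errors on the link scale and transforming back):

pred <- predict(iris_fit, se.fit = TRUE)  # link scale, with standard errors
iris2 <- iris2 %>% 
  mutate(Prob_lower = plogis(pred$fit - 1.96 * pred$se.fit),
         Prob_upper = plogis(pred$fit + 1.96 * pred$se.fit))
# approximate 95% intervals for P(Y = 1), per flower
head(select(iris2, Prob_Virginica, Prob_lower, Prob_upper), n = 3)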

ROC curve

The package plotROC has some nice features for ROC curves (see the package’s vignette).

The AUC computed for this ROC curve is 0.997.
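
A sketch of how the curve and the AUC could be computed with plotROC (assuming its geom_roc()/calc_auc() interface; see the vignette for the details and the exact aesthetics):

library(plotROC)

roc_plot <- iris2 %>% 
  ggplot(aes(d = as.integer(Virginica), m = Prob_Virginica)) + 
  geom_roc(n.cuts = 0) + 
  labs(x = "False positive fraction", y = "True positive fraction")
roc_plot
calc_auc(roc_plot)   # should be close to the 0.997 reported above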