Introduction

Terminology

  • Supervised learning (“labeled” training data)
    • Classification
    • Regression
  • Unsupervised learning (describe hidden structure from “unlabeled” data)
    • PCA
    • Clustering (\(K\)-means, …)

Supervised learning

  • Regression
    • Explain/predict a number \(Y\) from covariates/predictors/features/explanatory variables
  • Classification
    • Now \(Y\) is not a number, but a qualitative variable
    • \(Y = \text{Eye color} \in \{ \text{green}, \text{blue}, \text{brown} \}\)
    • \(Y = \text{E-mail type} \in \{ \text{Spam}, \text{Not spam} \}\)
  • Supervised: Training data is labeled (we know \(Y\)!!)

Classification

  • Given a feature vector \(x\) and a qualitative response \(Y\) taking values in the set \(C\), the classification task is to build a function \(f(x)\) that takes as input the feature vector \(x\) and predicts its value for \(Y\); i.e. \(f(x) \in C\)
  • Often, we are interested in estimating the probabilities that \(X\) belongs to each category in \(C\)

Many methods for classification

  • Logistic regression
  • Classification (and regression) trees
  • Nearest Neighbours
  • Support Vector Machines (SVM)

Logistic regression

Intuition

Linear regression (ignoring error term): \[ y = \beta_0 + \beta_1 x \] Here, \(y \in (-\infty, \infty)\), unless \(\beta_1 = 0\).

Intuition

\[ \text{logit}(p) = \log\left ( \frac{p}{1 - p} \right ), \]

\(\text{logit}(p) \in (-\infty, \infty)\) for \(p \in (0, 1)\).

Go from \((-\infty, \infty)\) to \((0, 1)\) (and back).

\[\text{logit}(p) = x \Leftrightarrow p = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1 + \exp(-x)}\]
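
In R, this mapping and its inverse are available as qlogis() and plogis(); a quick illustration (not from the original slides):

qlogis(0.9)            # logit: log(0.9 / 0.1), approx. 2.197
plogis(2.197)          # inverse logit, approx. 0.9
plogis(qlogis(0.25))   # back where we started: 0.25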

Intuition

\(Y \in \{ 0, 1 \}\). Model \(P(Y = 1)\). Linear regression? No: \(\beta_0 + \beta_1 x\) is not restricted to \((0, 1)\), so we model the logit of \(P(Y = 1)\) instead.

Logistic regression (no additive error term): \[ \text{logit}(P(Y = 1)) = \beta_0 + \beta_1 x \] Here, \(\text{logit}(P(Y = 1)) \in (-\infty, \infty)\), unless \(\beta_1 = 0\), and \[ \text{logit}(P(Y = 1)) = \log\left ( \frac{P(Y = 1)}{1-P(Y = 1)} \right ), \] such that \[ P(Y = 1) = \frac{\exp(\beta_0 + \beta_1 x)}{1+\exp(\beta_0 + \beta_1 x)} = \frac{1}{1 + \exp \left ( - (\beta_0 + \beta_1 x ) \right )} \] and \(P(Y = 1) \in (0, 1)\).

In R: glm(y ~ x, family = binomial).
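
As a small illustration (simulated data with assumed values of \(\beta_0\) and \(\beta_1\), not from the slides), glm() should roughly recover the coefficients used to generate the data:

set.seed(1)                          # reproducibility
n <- 1000
x <- rnorm(n)
p <- plogis(-1 + 2 * x)              # P(Y = 1) with beta_0 = -1, beta_1 = 2
y <- rbinom(n, size = 1, prob = p)
coef(glm(y ~ x, family = binomial))  # estimates close to -1 and 2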

Iris Flowers

data(iris)
head(iris, n = 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Iris flowers

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Example

  • Logistic regression only handles a binary outcome (extensions exist: multinomial regression, nnet::multinom)

library(tidyverse)  # loads dplyr (%>%, mutate) and ggplot2

iris2 <- iris %>% 
  mutate(Virginica = (Species == "virginica"))

iris2 %>% 
  ggplot(aes(Petal.Length, Petal.Width, color = Species)) + 
  geom_point()

iris2 %>% 
  ggplot(aes(Petal.Length, Petal.Width, color = Virginica)) + 
  geom_point()


iris_fit <- glm(Virginica ~ Petal.Length + Petal.Width, 
                data = iris2, 
                family = binomial)
summary(iris_fit)

Call:
glm(formula = Virginica ~ Petal.Length + Petal.Width, family = binomial, 
    data = iris2)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.73752  -0.01028   0.00000   0.00415   1.89659  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -45.272     13.611  -3.326 0.000881 ***
Petal.Length    5.755      2.306   2.496 0.012575 *  
Petal.Width    10.447      3.756   2.782 0.005409 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 190.954  on 149  degrees of freedom
Residual deviance:  20.564  on 147  degrees of freedom
AIC: 26.564

Number of Fisher Scoring iterations: 10

NB! The type = "response" argument is important in the predict function. Otherwise, we get predictions on the link scale (which is the logit scale for logistic regression).

iris2 <- iris2 %>% 
  mutate(Prob_Virginica = predict(iris_fit, type = "response"))
iris2 %>% mutate(Virginica = factor(as.integer(Virginica))) %>% 
  ggplot(aes(x = Virginica, y = Prob_Virginica)) + geom_boxplot() + 
  labs(x = "Virginica", y = "P(Y = Virginica | Petal.Length, Petal.Width)")

Predictions
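
For instance, predicted probabilities of virginica for two hypothetical new flowers (made-up measurements, only for illustration):

new_flowers <- data.frame(Petal.Length = c(4.0, 5.5),
                          Petal.Width  = c(1.2, 2.0))
predict(iris_fit, newdata = new_flowers, type = "response")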

Decision boundary

Decision boundary and probability
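
The decision boundary at \(P(Y = 1) = 0.5\) is the line where \(\beta_0 + \beta_1 \cdot \text{Petal.Length} + \beta_2 \cdot \text{Petal.Width} = 0\). A sketch of how such a plot could be drawn (the figure on the slide may have been produced differently):

cf <- coef(iris_fit)
iris2 %>% 
  ggplot(aes(Petal.Length, Petal.Width, color = Virginica)) + 
  geom_point() + 
  # line where cf[1] + cf[2] * Petal.Length + cf[3] * Petal.Width = 0
  geom_abline(intercept = -cf[1] / cf[3], slope = -cf[2] / cf[3])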

Uncertainty
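
One common way to express uncertainty about the predicted probabilities (a sketch, assuming the usual approach of computing standard errors on the link scale and transforming back):

pred <- predict(iris_fit, se.fit = TRUE)  # link scale, with standard errors
iris2 <- iris2 %>% 
  mutate(Prob_lower = plogis(pred$fit - 1.96 * pred$se.fit),
         Prob_upper = plogis(pred$fit + 1.96 * pred$se.fit))
# approximate 95% intervals for P(Y = 1), per flower
head(select(iris2, Prob_Virginica, Prob_lower, Prob_upper), n = 3)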

ROC curve

The package plotROC has some nice features for ROC curves (see the package’s vignette).

The AUC computed for this ROC curve is 0.997.
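
A sketch of how the curve and the AUC could be computed with plotROC (assuming its geom_roc()/calc_auc() interface; see the vignette for the details and the exact aesthetics):

library(plotROC)

roc_plot <- iris2 %>% 
  ggplot(aes(d = as.integer(Virginica), m = Prob_Virginica)) + 
  geom_roc(n.cuts = 0) + 
  labs(x = "False positive fraction", y = "True positive fraction")
roc_plot
calc_auc(roc_plot)   # should be close to the 0.997 reported above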