Many methods for classification.
Linear regression (ignoring error term): \[ y = \beta_0 + \beta_1 x \] Here, \(y \in (-\infty, \infty)\), unless \(\beta_1 = 0\).
\[ \text{logit}(p) = \log\left ( \frac{p}{1 - p} \right ), \]
\(\text{logit}(p) \in (-\infty, \infty)\) for \(p \in (0, 1)\).
Go from \((-\infty, \infty)\) to \((0, 1)\) (and back).
\[\text{logit}(p) = x \Leftrightarrow p = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1 + \exp(-x)}\]
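The two directions of this mapping can be checked numerically. The sketch below writes both formulas out by hand (the helper names `logit` and `inv_logit` are illustrative, not from the slides) and compares them with base R's `qlogis()` and `plogis()`, which compute the same maps.

```r
# logit and its inverse, written out from the formulas above
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))

p <- c(0.1, 0.5, 0.9)
x <- logit(p)            # values on the whole real line
p_back <- inv_logit(x)   # mapped back into (0, 1)

# Base R's logistic quantile and distribution functions give the same results
same_logit <- isTRUE(all.equal(x, qlogis(p)))
same_inverse <- isTRUE(all.equal(p_back, plogis(x)))
```

In practice `qlogis()` and `plogis()` are usually preferable to hand-written versions, since they are already available in base R.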
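\(Y \in \{ 0, 1 \}\). Model \(P(Y = 1)\). Linear regression? Not directly: its predictions range over \((-\infty, \infty)\), but a probability must lie in \([0, 1]\).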
Logistic regression (ignoring error term): \[ \text{logit}(P(Y = 1)) = \beta_0 + \beta_1 x \] Here, \(\text{logit}(P(Y = 1)) \in (-\infty, \infty)\), unless \(\beta_1 = 0\), and \[ \text{logit}(P(Y = 1)) = \log\left ( \frac{P(Y = 1)}{1-P(Y = 1)} \right ), \] such that \[ P(Y = 1) = \frac{\exp(\beta_0 + \beta_1 x)}{1+\exp(\beta_0 + \beta_1 x)} = \frac{1}{1 + \exp \left ( - (\beta_0 + \beta_1 x ) \right )} \] and \(P(Y = 1) \in (0, 1)\).
In R: glm(y ~ x, family = binomial)
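To see what this call does, here is a minimal sketch on simulated data. The sample size, the seed, and the "true" values `beta0 = -1` and `beta1 = 2` are assumptions chosen for illustration only; with a reasonably large sample the estimates should land close to them.

```r
set.seed(1)

n <- 5000
beta0 <- -1   # assumed true intercept (illustration only)
beta1 <- 2    # assumed true slope (illustration only)

x <- rnorm(n)
p <- plogis(beta0 + beta1 * x)       # P(Y = 1) from the logistic model
y <- rbinom(n, size = 1, prob = p)   # binary outcomes drawn with those probabilities

sim_fit <- glm(y ~ x, family = binomial)
est <- coef(sim_fit)                 # estimates of beta0 and beta1
```

Comparing `est` with `c(beta0, beta1)` gives a quick sanity check that the model recovers the parameters used to generate the data.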
head(iris, n = 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
Iris flowers
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
library(dplyr)    # %>% and mutate()
library(ggplot2)  # ggplot()
iris2 <- iris %>%
  mutate(Virginica = (Species == "virginica"))
# assumed layer: scatter plot of the petal measurements
iris2 %>%
  ggplot(aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point()
iris2 %>%
  ggplot(aes(Petal.Length, Petal.Width, color = Virginica)) +
  geom_point()
iris_fit <- glm(Virginica ~ Petal.Length + Petal.Width,
data = iris2,
family = binomial)
summary(iris_fit)
Call:
glm(formula = Virginica ~ Petal.Length + Petal.Width, family = binomial,
data = iris2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.73752 -0.01028 0.00000 0.00415 1.89659
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -45.272 13.611 -3.326 0.000881 ***
Petal.Length 5.755 2.306 2.496 0.012575 *
Petal.Width 10.447 3.756 2.782 0.005409 **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 190.954 on 149 degrees of freedom
Residual deviance: 20.564 on 147 degrees of freedom
AIC: 26.564
Number of Fisher Scoring iterations: 10
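The estimates turn into a probability through the inverse logit. For a hypothetical flower with petal length 5.0 and petal width 1.8 (values chosen for illustration), the rounded estimates give a linear predictor of about \(-45.272 + 5.755 \cdot 5.0 + 10.447 \cdot 1.8 \approx 2.31\), and hence \(P(Y = 1) \approx 0.91\). The sketch below repeats the calculation with the unrounded coefficients and compares it with `predict()`. It uses base R's `transform()` instead of `mutate()` so that it runs without extra packages; `glm()` may warn that some fitted probabilities are numerically 0 or 1.

```r
# Same model as in the slides, rebuilt with base R only
iris2 <- transform(iris, Virginica = Species == "virginica")
iris_fit <- glm(Virginica ~ Petal.Length + Petal.Width,
                data = iris2,
                family = binomial)

b <- coef(iris_fit)

# Hypothetical flower (illustrative values, not a row of the data set)
flower <- data.frame(Petal.Length = 5.0, Petal.Width = 1.8)

# Linear predictor on the logit scale, then the inverse logit
eta <- b[["(Intercept)"]] +
  b[["Petal.Length"]] * flower$Petal.Length +
  b[["Petal.Width"]] * flower$Petal.Width
p_manual <- 1 / (1 + exp(-eta))

# The same probability from predict()
p_predict <- predict(iris_fit, newdata = flower, type = "response")
```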
NB! The argument type = "response" is important in the predict function. Otherwise, we get predictions on the "link" scale (which is the logit scale for logistic regression).
iris2 <- iris2 %>%
mutate(Prob_Virginica = predict(iris_fit, type = "response"))
iris2 %>% mutate(Virginica = factor(as.integer(Virginica))) %>%
ggplot(aes(x = Virginica, y = Prob_Virginica)) + geom_boxplot() +
labs(x = "Virginica", y = "P(Y = Virginica | Petal.Length, Petal.Width)")
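The remark about `type = "response"` can be checked directly. This is a minimal sketch that refits the model with base R only (an assumption made so the block is self-contained) and shows that the default predictions live on the logit scale, while applying `plogis()` to them reproduces the probabilities.

```r
iris2 <- transform(iris, Virginica = Species == "virginica")
iris_fit <- glm(Virginica ~ Petal.Length + Petal.Width,
                data = iris2,
                family = binomial)

eta_hat <- predict(iris_fit)                     # default: type = "link" (logit scale)
p_hat <- predict(iris_fit, type = "response")    # probability scale

range(eta_hat)   # not restricted to [0, 1]
range(p_hat)     # within [0, 1]

# Applying the inverse logit to the link-scale predictions gives the probabilities
same <- isTRUE(all.equal(plogis(eta_hat), p_hat))
```

Forgetting `type = "response"` therefore does not raise an error; it silently returns log-odds, which is why the boxplot would look very different.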
Decision boundary
Decision boundary and probability
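With a cutoff of \(P(Y = 1) = 0.5\) (an assumed threshold; other cutoffs shift the line), the decision boundary is where the logit equals zero: \(\beta_0 + \beta_1 \cdot \text{Petal.Length} + \beta_2 \cdot \text{Petal.Width} = 0\). Solving for petal width gives a straight line, which with the rounded estimates reported earlier has intercept \(45.272 / 10.447 \approx 4.33\) and slope \(-5.755 / 10.447 \approx -0.55\). A sketch of that calculation, using base R only:

```r
iris2 <- transform(iris, Virginica = Species == "virginica")
iris_fit <- glm(Virginica ~ Petal.Length + Petal.Width,
                data = iris2,
                family = binomial)

b <- coef(iris_fit)

# Petal.Width = intercept + slope * Petal.Length on the 0.5 boundary
boundary_intercept <- -b[["(Intercept)"]] / b[["Petal.Width"]]
boundary_slope <- -b[["Petal.Length"]] / b[["Petal.Width"]]

# Points on that line should get a predicted probability of 0.5
on_line <- data.frame(Petal.Length = c(4, 5, 6))
on_line$Petal.Width <- boundary_intercept + boundary_slope * on_line$Petal.Length
p_on_line <- predict(iris_fit, newdata = on_line, type = "response")
```

The pair `boundary_intercept` and `boundary_slope` could be passed to `geom_abline()` to draw the line on top of the scatter plot.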
ROC curve
The package plotROC has some nice features for ROC curves (see the package’s vignette).
The AUC for this ROC curve is 0.997.
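The AUC can also be computed without extra packages. The sketch below is not the plotROC approach used for the slides; it swaps in the rank-based (Mann-Whitney) formula, which gives the probability that a randomly chosen virginica flower gets a higher fitted probability than a randomly chosen non-virginica flower.

```r
iris2 <- transform(iris, Virginica = Species == "virginica")
iris_fit <- glm(Virginica ~ Petal.Length + Petal.Width,
                data = iris2,
                family = binomial)

p_hat <- predict(iris_fit, type = "response")
y <- iris2$Virginica

r <- rank(p_hat)      # ties receive average ranks
n_pos <- sum(y)       # number of virginica flowers
n_neg <- sum(!y)      # number of other flowers

# Rank-sum form of the area under the ROC curve
auc <- (sum(r[y]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Because the formula only uses ranks, the same value is obtained from the link-scale predictions as from the probabilities.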