---
title: "Logistic Regression"
author: The ASTA team
output:
  slidy_presentation:
    highlight: tango
    theme: cerulean
    fig_caption: false
    css: "https://asta.math.aau.dk/course/asta/2018-1/?file=lecture_style.css"
  pdf_document:
    toc: yes
    highlight: tango
    number_section: true
    fig_caption: false
---

```{r include=FALSE}
library(mosaic)
```

# Introduction to logistic regression
## Binary response

 - We consider a binary response $y$ with outcome $1$ or
    $0$. This might be a code indicating whether a person is able or
    unable to perform a given task.
 - Furthermore, we are given an explanatory variable $x$, which
    is numeric, e.g.\ age.
 - We shall study models for $$P(y=1\, |\, x)$$ i.e.\ the probability
    that a person of age $x$ is able to complete the task.
 - We shall see methods for determining whether or not age
    actually influences the probability, i.e. is $y$ independent of
    $x$?


## A linear model

$$P(y=1\, |\, x)=\alpha+\beta x$$ is simple, but often
    inappropiate. If $\beta$ 
    is positive and $x$ sufficiently large, then the probability
    exceeds $1$.

# Simple logistic regression
## Logistic model

Instead we consider the **odds** that the person is able to complete the task
$$\mathtt{Odds}(y=1\, |\, x)=\frac{P(y=1\, |\, x)}{P(y=0\, |\, x)}=
  \frac{P(y=1\, |\, x)}{1-P(y=1\, |\, x)}$$
which can have any positive value.

**The logistic model** is defined as:
$$\mathtt{logit}(P(y=1\, |\, x))=\log(\mathtt{Odds}(y=1\, |\, x))=\alpha+\beta x$$ 
The function $\mathtt{logit}(p)=\log(\frac{p}{1-p})$
- i.e. **log of odds** - is termed **the logistic transformation**.

Remark that log odds can be any number, where zero corresponds to 
$P(y=1\,|\, x)=0.5$. Solving $\alpha+\beta x=0$ shows that at age
$x_0=-\alpha/\beta$ you have fifty-fifty chance of solving the task.

## Logistic transformation

- The function `logit()` (remember to load `mosaic` first) can be used to calculate the logistic transformation:
```{r}
p <- seq(0.1, 0.9, by = 0.2)
p
l <- logit(p)
l
```

- The inverse logistic transformation `ilogit()` applied to the transformed values can recover the original probabilities:
```{r}
ilogit(l)
```


## Odds-ratio

Interpretation of $\beta$:

What happens to odds, if we increase age by 1 year?

Consider the so-called **odds-ratio**:
$$\frac{\mathtt{Odds}(y=1\, |\, x+1)}{\mathtt{Odds}(y=1\, |\, x)}=
\frac{\exp(\alpha+\beta(x+1))}{\exp(\alpha+\beta x)}=\exp(\beta)$$
where we see, that $\exp(\beta)$ equals the odds for age $x+1$
relative to odds at age $x$.

This means that when age increase by 1 year, then the relative
change in odds is given by $100(\exp(\beta)-1)\%$.

## Simple logistic regression

```{r echo=FALSE,fig.width=6,fig.height=3}
logit <- function(p) log(p/(1-p))
Ilogit <- function(x,a,b) exp(a+b*x)/(1+exp(a+b*x))
y <- logit((1:100)/101)
x <- c((y-0.5)/10,-(y+.5)/3)
x <- seq(min(x),max(x),l=100)
plot(x,sapply(x,Ilogit,a=.5,b=10),axes=F,xlab="x",ylab="P(y=1 | x)",
type="l",ylim=c(-.1,1.1), lwd=3,cex.lab=1.5,cex.main=2,mex=1.5,
main="Logistic curves")
lines(x,sapply(x,Ilogit,a=-.5,b=-3),col="red",lwd=3)
axis(2, at=c(0, .5, 1), labels=T, pos=-1.9)
axis(1,at=range(x),labels=F,pos=-.1)
abline(h = .5, lty = 2)
abline(v = -.5/3, lty = 2, col = "red")
text(-.5/3, -.2, labels = expression(-alpha/beta), col = "red", xpd = TRUE, pos = 2)
abline(v = -.5/10, lty = 2)
text(-.5/10, -.2, labels = expression(-alpha/beta), xpd = TRUE, pos = 4)
```

Examples of logistic curves. The black curve has a positive
$\beta$-value (=10), whereas the red has a negative
$\beta$ (=-3).

In addition we note that:

 - Increasing the absolute value of $\beta$ yields a steeper curve.
 - When $P(y=1\, |\, x)=\frac{1}{2}$ then logit is zero, i.e.\ $\alpha+\beta
    x=0$.

This means that at age $x=-\frac{\alpha}{\beta}$ you have 50\% chance
to perform the task.

## Example: Credit card data

We shall investigate if income is a good
predictor of whether or not you have a credit card.

 - Data structure: For each level of income, we let
    `n` denote the number of persons with that income, and
    `credit` how many of these that carries a credit card.
   
```{r }
creInc <- read.csv("https://asta.math.aau.dk/datasets?file=income-credit.csv")
```

```{r }
head(creInc)
```

## Example: Fitting the model

```{r}
modelFit <- glm(cbind(credit,n-credit) ~ Income, data = creInc, family = binomial)
```

 - `cbind` gives a matrix with two column vectors: `credit` and
    `n-credit`, where the latter
    is the vector counting the number of persons without a credit
    card.
 - The response has the form
    `cbind(credit,n-credit)`.
 - We need to use the function `glm` (generalized linear model).
 - The argument `family=binomial` tells the function that the data has binomial
   variation. Leaving out this argument will lead `R` to believe that data
   follows a normal distribution - as with `lm`.

 - The function `coef` extracts the coefficients (estimates of parameters)
from the model summary:
    
```{r}
coef(summary(modelFit))
```

## Test of no effect

```{r }
coef(summary(modelFit))
```
Our model for dependence of odds of having a credit card related to
income($x$) is
$$\mathtt{logit}(x)=\alpha+\beta x$$
The hypothesis of no relation between income and ability to obtain a
credit card corresponds to
$$H_0:\ \ \beta=0$$
with the alternative $\beta\not=0$.
Inspecting the summary reveals that $\hat{\beta}=0.1054$ is more
than 4 standard errors away from zero.

----

With a z-score equal to 4.03 we get the tail probability
```{r fig.width=6,fig.height=3}
ptail <- 2*(1-pdist("norm",4.03,xlim=c(-5,5)))
ptail
```

Which is very significant - as reflected by the p-value.

## Confidence interval for odds ratio

From the summary:

 - $\hat{\beta}=0.10541$ and hence $\exp(\hat{\beta})-1=0.11$.
    If income
    increases by 1000 euro, then odds increases by 11\%.
 - Standard error on $\hat{\beta}$ is $0.02616$ and hence a 95\%
    confidence interval for log-odds ratio is
    $\hat{\beta}\pm 1.96\times
    0.02616=(0.054;0,157)$.
 - Corresponding interval for odds ratio:
    $\exp((0.054;0,157))=(1.056;1.170)$,

    i.e.\ the increase in odds is - with confidence 95\% - between
   5.6\% and 17\%.
 
## Plot of model predictions against actual data

```{r fig.width=6,fig.height=5,echo=FALSE}
creInc$pobs <- creInc$credit/creInc$n
newdat <- data.frame(Income = 10:70)
newdat$pred <- predict(modelFit, type = "response", newdata = newdat)
gf_point(pobs ~ Income, data = creInc) %>% 
  gf_line(pred ~ Income, data = newdat, col="red", lwd=1.5) %>%
  gf_labs(xlab="Income") %>% 
  gf_labs(ylab="Probability of credit card") %>%
  gf_labs(title="Expected (red line) and observed (black dots) probabilities")
```

- Tendency is fairly clear and very significant.
- Due to low sample size at some income levels, the deviations are quite large.

# Multiple logistic regression
## Several numeric predictors

We generalize the model to the case, where we have $k$ predictors
$x_1,x_2,\ldots,x_k$. Where some might be dummies for a factor. 
$$\mathtt{logit}(P(y=1\, |\, x_1,x_2,\ldots,x_k))=
\alpha+\beta_1 x_1+\cdots+\beta_k x_k$$
Interpretation of $\beta$-values is unaltered: If we fix
$x_2,\ldots,x_k$ and increase $x_1$ by one unit, then the relative
change in odds is given by $\exp(\beta_1)-1$.

## Example

Wisconsin Breast Cancer Database covers 683 observations of 10
variables in relation to examining tumors in the breast.

 - Nine clinical variables with a score between 0 and 10.
 - The binary variable `Class` with levels
   `benign/malignant`.
 - By default `R` orders the levels lexicografically and chooses the
   first level as reference ($y=0$). Hence `benign` is reference, and we
   model odds of `malignant`.
   
We shall work with only 4 of the predictors, where two of these have
been discretized.

----

```{r }
BC <- read.table("https://asta.math.aau.dk/datasets?file=BC0.dat",header=TRUE)
head(BC)
```

## Global test of no effects

First we fit the model `mainEffects` with main effect of all predictors - remember
the notaion `~ .` for all predictors. Then we fit the model `noEffects`
with no predictors.
```{r }
mainEffects <- glm(Class~., data=BC, family=binomial)
noEffects <- glm(Class~1, data=BC, family=binomial)
```
First we want to test, whether there is any effect of the predictors, i.e
the nul hypothesis
$$H_0:\ \ \beta_1=\beta_2=\beta_3=\beta_4=\beta_5=0$$

## Example

Similarly to `lm` we can use the function `anova` to compare  `mainEffects` and
`noEffects`. Only difference is that we need to tell the function that the test is
a chi-square test and not an F-test.

```{r }
anova(noEffects, mainEffects, test="Chisq")
```
`mainEffects` is a much better model.

The test statistic is the Deviance (749.29), which should be small. 

It is evaluated in a
chi-square with 5 (the number 
of parameters equal to zero under the nul hypothesis) degrees of
freedom.

The 95%-critical value for the $\chi^2(5)$ distribution is 11.07 and
the p-value is in practice zero. 

## Test of influence of a given predictor

```{r }
round(coef(summary(mainEffects)),4)
```

For each predictor $p$ can we test the hypothesis:
$$H_0:\ \ \beta_p=0$$

- Looking at the `z`-values, there is a clear effect of all 5
  predictors.
  Which of course is also supported by the p-values.
- Is it relevant to include interactions?

## Model selection by stepwise selection

We extend the model to `BIG` including interactions. And then perform a
so-called **stepwise selection**:

```{r }
BIG <- glm(Class~.^2, data=BC, family=binomial)
final <- step(BIG, k=log(dim(BC)[1]), trace=0)
round(coef(summary(final)), 4)
```

 - `step`: Stepwise removal of "insignificant" predictors from `BIG` (our model including all interactions).
 - Choise of `k=log(dim(BC)[1])` corresponds to the so-called BIC
   (Bayesian Information Criterion), which we shall not treat in detail. 
   Just note that when `k` increases, we gradually obtain a simpler model, i.e. the number of 
   predictors decrease.
 - If `trace=1`, you will see all steps in the iterative process.
 - We end up with a model including one interaction.

## Prediction and classification

```{r }
BC$pred <- round(predict(final,type="response"),3)
```

 - We add the column `pred` to our dataframe `BC`. 
 - `pred` is the final model's estimate of the probability of `malignant`.


```{r }
head(BC[,c("Class","pred")])
```
Not good for patients 2 and 4.


----

We may classify by `round(BC$pred)`:

 - 0 to denote `benign`
 - 1 to denote `malignant`


```{r }
tally(~ Class + round(pred), data = BC)
```
23 patients are misclassified.



----


```{r }
sort(BC$pred[BC$Class=="malignant"])[1:5]
```

There is a malignant woman with a predicted probability of malignancy,
which is only 8.4%. 

If we assign all women with predicted probability of malignancy above
5% to further investigation, then we catch all malignant.


```{r }
tally(~ Class + I(pred>.05), data = BC)
```

The expense is that the number of false positive increases from 12
to 39. 

----

```{r }
tally(~ Class + I(pred>.1), data = BC)
```
- If we instead set the alarm to 10%, then the number of false positives
decreases from 39 to 26. 
- But at the expense of 2 false negative.
