9 Logistic Regression*

This chapter introduces the popular classification method logistic regression.

Packages Used in this Chapter

In addition to R’s base functionality, this chapter uses the packages glmnet (regularized logistic regression), caret (classification performance), and palmerpenguins (data for the exercises).
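One way to install any of these packages that are still missing is the following minimal sketch (base R only):

pkgs <- c("glmnet", "caret", "palmerpenguins")
pkgs_install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(pkgs_install) > 0) install.packages(pkgs_install)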

9.1 Introduction

Logistic regression contains the word regression, but it is actually a probabilistic statistical classification model that predicts a binary outcome (as a probability) given a set of features. It is a very powerful model that can be fit very quickly, and it is one of the first classification models you should try on new data.

Logistic regression can be thought of as a linear regression with the log odds ratio (logit) of the binary outcome as the dependent variable:

\[\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots\]

# the logit function maps a probability in (0, 1) to the real line
logit <- function(p) log(p / (1 - p))

x <- seq(0, 1, length.out = 100)
plot(x, logit(x), type = "l")
abline(v = 0.5, lty = 2)  # a probability of 0.5 corresponds to a logit of 0
abline(h = 0, lty = 2)

This is equivalent to modeling the probability of the outcome \(p\) by

\[ p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots}} = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots)}}\]
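The right-hand side is the logistic (sigmoid) function, the inverse of the logit, which maps any real number to a probability in \((0, 1)\). A quick sketch to visualize it, mirroring the logit plot above:

sigmoid <- function(z) 1 / (1 + exp(-z))
z <- seq(-10, 10, length.out = 100)
plot(z, sigmoid(z), type = "l")
abline(h = 0.5, lty = 2)  # a logit of 0 corresponds to a probability of 0.5
abline(v = 0, lty = 2)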

9.2 Data Preparation

Load and shuffle the data. We also add a useless random variable to see whether variable selection removes it.

data(iris)
set.seed(100) # for reproducibility

x <- iris[sample(1:nrow(iris)),]
x <- cbind(x, useless = rnorm(nrow(x)))

We turn Species into a binary classification problem: we will classify whether a flower is of species virginica.

x$virginica <- x$Species == "virginica"
x$Species <- NULL
plot(x, col = x$virginica + 1)  # color points by class (black = FALSE, red = TRUE)

9.3 Create a Logistic Regression Model

Logistic regression is a generalized linear model (GLM) with logit as the link function and a binomial error model.

model <- glm(virginica ~ .,
  family = binomial(logit), data=x)
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred

About the warning: glm.fit: fitted probabilities numerically 0 or 1 occurred means that the data is possibly linearly separable. In that case, the maximum likelihood estimates are not well defined and the coefficient estimates can become very large.
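One way to see this in the fitted model is to inspect the in-sample probabilities; values that are numerically 0 or 1 indicate (near) separation:

summary(fitted(model))  # fitted probabilities for the training data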

model
## 
## Call:  glm(formula = virginica ~ ., family = binomial(logit), data = x)
## 
## Coefficients:
##  (Intercept)  Sepal.Length   Sepal.Width  Petal.Length  
##      -41.649        -2.531        -6.448         9.376  
##  Petal.Width       useless  
##       17.696         0.098  
## 
## Degrees of Freedom: 149 Total (i.e. Null);  144 Residual
## Null Deviance:       191 
## Residual Deviance: 11.9  AIC: 23.9

Next, we check which features are significant.

summary(model)
## 
## Call:
## glm(formula = virginica ~ ., family = binomial(logit), data = x)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -41.649     26.556   -1.57    0.117  
## Sepal.Length   -2.531      2.458   -1.03    0.303  
## Sepal.Width    -6.448      4.794   -1.34    0.179  
## Petal.Length    9.376      4.763    1.97    0.049 *
## Petal.Width    17.696     10.632    1.66    0.096 .
## useless         0.098      0.807    0.12    0.903  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.954  on 149  degrees of freedom
## Residual deviance:  11.884  on 144  degrees of freedom
## AIC: 23.88
## 
## Number of Fisher Scoring iterations: 12

The AIC (Akaike information criterion) shown above can be used for model selection, where a smaller value indicates a better model.
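The AIC of a fitted model can also be extracted directly:

AIC(model)  # matches the AIC reported in the summary above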

9.4 Stepwise Variable Selection

model2 <- step(model, data = x)
## Start:  AIC=23.88
## virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + 
##     useless
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
##                Df Deviance  AIC
## - useless       1     11.9 21.9
## - Sepal.Length  1     13.2 23.2
## <none>                11.9 23.9
## - Sepal.Width   1     14.8 24.8
## - Petal.Width   1     22.4 32.4
## - Petal.Length  1     25.9 35.9
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## 
## Step:  AIC=21.9
## virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
##                Df Deviance  AIC
## - Sepal.Length  1     13.3 21.3
## <none>                11.9 21.9
## - Sepal.Width   1     15.5 23.5
## - Petal.Width   1     23.8 31.8
## - Petal.Length  1     25.9 33.9
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## 
## Step:  AIC=21.27
## virginica ~ Sepal.Width + Petal.Length + Petal.Width
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1
## occurred
##                Df Deviance  AIC
## <none>                13.3 21.3
## - Sepal.Width   1     20.6 26.6
## - Petal.Length  1     27.4 33.4
## - Petal.Width   1     31.5 37.5
summary(model2)
## 
## Call:
## glm(formula = virginica ~ Sepal.Width + Petal.Length + Petal.Width, 
##     family = binomial(logit), data = x)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    -50.53      23.99   -2.11    0.035 *
## Sepal.Width     -8.38       4.76   -1.76    0.079 .
## Petal.Length     7.87       3.84    2.05    0.040 *
## Petal.Width     21.43      10.71    2.00    0.045 *
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.954  on 149  degrees of freedom
## Residual deviance:  13.266  on 146  degrees of freedom
## AIC: 21.27
## 
## Number of Fisher Scoring iterations: 12

The estimates (\(\beta_0, \beta_1, \dots\)) are log-odds and can be converted into odds using \(\exp(\beta)\). A negative coefficient means that the odds go down with an increase in the value of the predictor, while a positive coefficient increases the odds. In this case, the odds of seeing a virginica iris go down with Sepal.Width and increase with the other two predictors.
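For example, the coefficients of model2 can be converted into odds ratios with exp():

exp(coef(model2))  # values < 1 decrease the odds, values > 1 increase them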

9.5 Calculate the Response

Note: Here we perform in-sample evaluation, predicting on the same data that was used to fit the model. To get an estimate of the generalization error, you should use a separate test set or cross-validation!
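A minimal sketch of a holdout split (the 80/20 split and the names train_id, model_train, and pr_test are only illustrative; see the Holdout Method in Chapter 3):

train_id <- sample(nrow(x), size = round(0.8 * nrow(x)))  # illustrative 80% training sample
model_train <- glm(virginica ~ ., family = binomial(logit), data = x[train_id, ])
pr_test <- predict(model_train, x[-train_id, ], type = "response")  # predictions on the held-out rows

For simplicity, we continue here with in-sample predictions.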

pr <- predict(model2, x, type="response")
round(pr, 2)
##  102  112    4   55   70   98  135    7   43  140   51   25 
## 1.00 1.00 0.00 0.00 0.00 0.00 0.86 0.00 0.00 1.00 0.00 0.00 
##    2   68  137   48   32   85   91  121   16  116   66  146 
## 0.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 1.00 
##   93   45   30  124  126   87   95   97  120   29   92   31 
## 0.00 0.00 0.00 0.98 1.00 0.00 0.00 0.00 0.93 0.00 0.00 0.00 
##   54   41  105  113   24  142  143   63   65    9  150   20 
## 0.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 0.00 0.00 0.96 0.00 
##   14   78   88    3   36   27   46   59   96   69   47  147 
## 0.00 0.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.20 0.00 1.00 
##  129  136   12  141  130   56   22   82   53   99    5   44 
## 1.00 1.00 0.00 1.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
##   28   52  139   42   15   57   75   37   26  110  100  149 
## 0.00 0.00 0.67 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 
##  132  107   35   58  127  111  144   86  114   71  123  119 
## 1.00 0.60 0.00 0.00 0.92 1.00 1.00 0.00 1.00 0.28 1.00 1.00 
##   18    8  128   83  138   19  115   23   89   62   80  104 
## 0.00 0.00 0.82 0.00 1.00 0.00 1.00 0.00 0.00 0.00 0.00 1.00 
##   40   17   94  133   60   81  118  125  122   49  148   61 
## 0.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00 1.00 0.00 1.00 0.00 
##   10  109  106   72   13   77   79   39  134   84   67  117 
## 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.16 0.79 0.00 1.00 
##  108  101  103   76    1   50  131   90   34   38    6   64 
## 1.00 1.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 
##   33  145   74   11   21   73 
## 0.00 1.00 0.00 0.00 0.00 0.32
hist(pr, breaks = 20)
hist(pr[x$virginica == TRUE], col = "red", breaks = 20, add = TRUE)  # overlay the virginica cases in red

9.6 Check Classification Performance

We calculate the predicted class by checking if the probability is larger than 0.5.

pred <- pr > .5

Now we can create a confusion table and calculate the accuracy.

tbl <- table(actual = x$virginica, predicted = pred)
tbl
##        predicted
## actual  FALSE TRUE
##   FALSE    98    2
##   TRUE      1   49
sum(diag(tbl)) / sum(tbl)  # accuracy
## [1] 0.98

We can also use caret’s more advanced function confusionMatrix(). Our code above uses logical vectors, but for caret, we need to make sure that both the reference and the predictions are coded as factors.

caret::confusionMatrix(
  reference = factor(x$virginica, levels = c(TRUE, FALSE)), 
  data = factor(pr>.5, levels = c(TRUE, FALSE)))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction TRUE FALSE
##      TRUE    49     2
##      FALSE    1    98
##                                         
##                Accuracy : 0.98          
##                  95% CI : (0.943, 0.996)
##     No Information Rate : 0.667         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.955         
##                                         
##  Mcnemar's Test P-Value : 1             
##                                         
##             Sensitivity : 0.980         
##             Specificity : 0.980         
##          Pos Pred Value : 0.961         
##          Neg Pred Value : 0.990         
##              Prevalence : 0.333         
##          Detection Rate : 0.327         
##    Detection Prevalence : 0.340         
##       Balanced Accuracy : 0.980         
##                                         
##        'Positive' Class : TRUE          
## 

We see that the model performs well with a very high accuracy and kappa value.

9.7 Regularized Logistic Regression

glmnet fits generalized linear models (including logistic regression) with regularization via penalized maximum likelihood. The regularization parameter \(\lambda\) is a hyperparameter, and glmnet can use cross-validation to find an appropriate value for it. glmnet does not have a formula interface, so we have to supply a matrix X with the predictors and a vector y with the response.

library(glmnet)
## Loaded glmnet 4.1-8
X <- as.matrix(x[, 1:5])
y <- x$virginica

fit <- cv.glmnet(X, y, family = "binomial")
fit
## 
## Call:  cv.glmnet(x = X, y = y, family = "binomial") 
## 
## Measure: Binomial Deviance 
## 
##      Lambda Index Measure     SE Nonzero
## min 0.00164    59   0.126 0.0456       5
## 1se 0.00664    44   0.167 0.0422       3
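The cross-validation results can be visualized to see how the binomial deviance changes with \(\lambda\):

plot(fit)  # vertical lines mark lambda.min and lambda.1se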

There are several selection rules for \(\lambda\). Here we look at the coefficients of the logistic regression using lambda.1se, the \(\lambda\) that gives the most regularized model such that the cross-validated error is still within one standard error of the minimum cross-validated error.

coef(fit, s = fit$lambda.1se)
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                   s1
## (Intercept)  -16.961
## Sepal.Length   .    
## Sepal.Width   -1.766
## Petal.Length   2.197
## Petal.Width    6.820
## useless        .

A dot means 0. We see that the predictors Sepal.Length and useless are not used in the prediction, giving a model similar to the one found by stepwise variable selection above.
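For comparison, the coefficients of the less regularized model selected by lambda.min can be extracted the same way; the cv.glmnet output above (Nonzero = 5) indicates that it keeps all five predictors:

coef(fit, s = fit$lambda.min)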

A predict() function is provided. We need to specify which regularization strength s to use and that we want to predict a class label.

predict(fit, newx = X[1:5,], s = fit$lambda.1se, type = "class")
##     s1     
## 102 "TRUE" 
## 112 "TRUE" 
## 4   "FALSE"
## 55  "FALSE"
## 70  "FALSE"
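Predicted probabilities instead of class labels can be obtained with type = "response":

predict(fit, newx = X[1:5, ], s = fit$lambda.1se, type = "response")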

glmnet supports many more types of generalized linear models. Examples can be found in the article An Introduction to glmnet.

9.8 Exercises

We will again use the Palmer penguin data for the exercises.

library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm
##   <chr>   <chr>              <dbl>         <dbl>
## 1 Adelie  Torgersen           39.1          18.7
## 2 Adelie  Torgersen           39.5          17.4
## 3 Adelie  Torgersen           40.3          18  
## 4 Adelie  Torgersen           NA            NA  
## 5 Adelie  Torgersen           36.7          19.3
## 6 Adelie  Torgersen           39.3          20.6
## # ℹ 4 more variables: flipper_length_mm <dbl>,
## #   body_mass_g <dbl>, sex <chr>, year <dbl>

Create an R markdown document that performs the following:

  1. Create a test and a training data set (see section Holdout Method in Chapter 3).
  2. Create a logistic regression using the training set to predict the variable sex.
  3. Use stepwise variable selection. What variables are selected?
  4. What do the parameters for each of the selected features tell you?
  5. Predict the sex of the penguins in the test set. Create a confusion table, calculate the accuracy, and discuss how well the model works.