4 Classification: Alternative Techniques

Install the packages used in this chapter:

pkgs <- sort(c('tidyverse', 'caret', 'RWeka', 'lattice', 'scales',
  'e1071', 'MASS', 'nnet', 'rpart', 'C50', 'randomForest', 'keras',
  'mlbench'
  ))

pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)

The packages used for this chapter are: C50 (M. Kuhn and Quinlan 2023), caret (M. Kuhn 2023), e1071 (Meyer et al. 2023), keras (Allaire and Chollet 2023), lattice (Sarkar 2023), MASS (B. Ripley 2023a), mlbench (Leisch and Dimitriadou 2023), nnet (B. Ripley 2023b), randomForest (Breiman et al. 2022), rpart (Therneau and Atkinson 2022), RWeka (Hornik 2023), scales (Wickham and Seidel 2022), and tidyverse (Wickham 2023b).

We will use tidyverse to prepare the data.

Show fewer digits

options(digits=3)

4.1 Introduction

Many different classification algorithms have been proposed in the literature. In this chapter, we will apply some of the more popular methods.

4.2 Training and Test Data

We will use the Zoo dataset which is included in the R package mlbench (you may have to install it). The Zoo dataset contains 17 (mostly logical) variables for 101 different animals as a data frame with 17 columns (hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, catsize, type). We convert the data frame into a tidyverse tibble (optional).

library(tidyverse)

data(Zoo, package="mlbench")
Zoo <- as_tibble(Zoo)
Zoo
## # A tibble: 101 × 17
##    hair  feathers eggs  milk  airborne aquatic predator
##    <lgl> <lgl>    <lgl> <lgl> <lgl>    <lgl>   <lgl>   
##  1 TRUE  FALSE    FALSE TRUE  FALSE    FALSE   TRUE    
##  2 TRUE  FALSE    FALSE TRUE  FALSE    FALSE   FALSE   
##  3 FALSE FALSE    TRUE  FALSE FALSE    TRUE    TRUE    
##  4 TRUE  FALSE    FALSE TRUE  FALSE    FALSE   TRUE    
##  5 TRUE  FALSE    FALSE TRUE  FALSE    FALSE   TRUE    
##  6 TRUE  FALSE    FALSE TRUE  FALSE    FALSE   FALSE   
##  7 TRUE  FALSE    FALSE TRUE  FALSE    FALSE   FALSE   
##  8 FALSE FALSE    TRUE  FALSE FALSE    TRUE    FALSE   
##  9 FALSE FALSE    TRUE  FALSE FALSE    TRUE    TRUE    
## 10 TRUE  FALSE    FALSE TRUE  FALSE    FALSE   FALSE   
## # ℹ 91 more rows
## # ℹ 10 more variables: toothed <lgl>, backbone <lgl>,
## #   breathes <lgl>, venomous <lgl>, fins <lgl>,
## #   legs <int>, tail <lgl>, domestic <lgl>,
## #   catsize <lgl>, type <fct>

We will use the package caret to make preparing training sets and building classification (and regression) models easier. A great cheat sheet for caret is available online.

Multi-core support can be used for cross-validation. Note: It is commented out here because it does not work with rJava used in RWeka below.

##library(doMC, quietly = TRUE)
##registerDoMC(cores = 4)
##getDoParWorkers()

Test data is not used in the model building process and needs to be set aside purely for testing the model after it is completely built. Here we use 80% of the data for training.

library(caret)

inTrain <- createDataPartition(y = Zoo$type, p = .8)[[1]]
Zoo_train <- Zoo |> slice(inTrain)
Zoo_test <- Zoo |> slice(-inTrain)
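
createDataPartition() performs stratified sampling on the class variable, so the class proportions in the training and the test set should be similar. As an optional sanity check, we can count the animals per class in both sets.

Zoo_train |> count(type)
Zoo_test |> count(type)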

4.3 Fitting Different Classification Models to the Training Data

Create a fixed sampling scheme (10 folds) so we can compare the fitted models later.

train_index <- createFolds(Zoo_train$type, k = 10)

The fixed folds are used in train() with the argument trControl = trainControl(method = "cv", indexOut = train_index). If you don’t need fixed folds, then remove indexOut = train_index from the code below.

For help with building models in caret see: ? train

Note: Be careful if you have many NA values in your data. train() and cross-validation may fail in some cases. If that is the case, then you can remove features (columns) which have many NAs, omit rows with NAs using na.omit(), or use imputation to replace the missing values with reasonable estimates (e.g., the feature mean or values imputed via kNN). Highly imbalanced datasets are also problematic since there is a chance that a fold does not contain examples of each class, leading to a hard-to-understand error message.
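
A small sketch of these options follows. The Zoo data used here has no missing values, so this is shown only for illustration; Zoo_no_na and the commented train() call are hypothetical.

## Only a sketch; not needed for the Zoo data, which has no NAs.
## Option 1: drop features with more than 50% missing values.
Zoo_no_na <- Zoo_train |> select(where(~ mean(is.na(.x)) < .5))

## Option 2: drop rows that contain any NA.
Zoo_no_na <- na.omit(Zoo_train)

## Option 3: let caret impute missing values during training
## (kNN imputation; na.action = na.pass keeps rows with NAs).
## train(type ~ ., data = Zoo_train, method = "knn",
##   preProcess = "knnImpute", na.action = na.pass)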

4.3.1 Conditional Inference Tree (Decision Tree)

ctreeFit <- Zoo_train |> train(type ~ .,
  method = "ctree",
  data = _,
    tuneLength = 5,
    trControl = trainControl(method = "cv", indexOut = train_index))
ctreeFit
## Conditional Inference Tree 
## 
## 83 samples
## 16 predictors
##  7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 75, 76, 74, 74, 76, 74, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  Accuracy  Kappa
##   0.010         0.808     0.746
##   0.255         0.808     0.746
##   0.500         0.808     0.746
##   0.745         0.808     0.746
##   0.990         0.808     0.746
## 
## Accuracy was used to select the optimal model
##  using the largest value.
## The final value used for the model was mincriterion
##  = 0.99.
plot(ctreeFit$finalModel)

4.3.2 C4.5 Decision Tree

C45Fit <- Zoo_train |> train(type ~ .,
  method = "J48",
  data = _,
    tuneLength = 5,
    trControl = trainControl(method = "cv", indexOut = train_index))
C45Fit
## C4.5-like Trees 
## 
## 83 samples
## 16 predictors
##  7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 76, 73, 74, 74, 76, 76, ... 
## Resampling results across tuning parameters:
## 
##   C      M  Accuracy  Kappa
##   0.010  1  0.978     0.971
##   0.010  2  0.978     0.971
##   0.010  3  0.978     0.971
##   0.010  4  0.907     0.879
##   0.010  5  0.918     0.893
##   0.133  1  0.978     0.971
##   0.133  2  0.978     0.971
##   0.133  3  0.978     0.971
##   0.133  4  0.907     0.879
##   0.133  5  0.918     0.893
##   0.255  1  0.989     0.985
##   0.255  2  0.989     0.985
##   0.255  3  0.978     0.971
##   0.255  4  0.907     0.879
##   0.255  5  0.918     0.893
##   0.378  1  0.989     0.985
##   0.378  2  0.989     0.985
##   0.378  3  0.978     0.971
##   0.378  4  0.907     0.879
##   0.378  5  0.918     0.893
##   0.500  1  0.989     0.985
##   0.500  2  0.989     0.985
##   0.500  3  0.978     0.971
##   0.500  4  0.907     0.879
##   0.500  5  0.918     0.893
## 
## Accuracy was used to select the optimal model
##  using the largest value.
## The final values used for the model were C = 0.255
##  and M = 1.
C45Fit$finalModel
## J48 pruned tree
## ------------------
## 
## feathersTRUE <= 0
## |   milkTRUE <= 0
## |   |   toothedTRUE <= 0
## |   |   |   airborneTRUE <= 0
## |   |   |   |   predatorTRUE <= 0
## |   |   |   |   |   legs <= 2: mollusc.et.al (2.0)
## |   |   |   |   |   legs > 2: insect (2.0)
## |   |   |   |   predatorTRUE > 0: mollusc.et.al (6.0)
## |   |   |   airborneTRUE > 0: insect (5.0)
## |   |   toothedTRUE > 0
## |   |   |   finsTRUE <= 0
## |   |   |   |   aquaticTRUE <= 0: reptile (3.0)
## |   |   |   |   aquaticTRUE > 0
## |   |   |   |   |   eggsTRUE <= 0: reptile (1.0)
## |   |   |   |   |   eggsTRUE > 0: amphibian (4.0)
## |   |   |   finsTRUE > 0: fish (11.0)
## |   milkTRUE > 0: mammal (33.0)
## feathersTRUE > 0: bird (16.0)
## 
## Number of Leaves  :  10
## 
## Size of the tree :   19

4.3.3 K-Nearest Neighbors

Note: kNN uses Euclidean distance, so the data should be standardized (scaled) first. Here legs is a count with a much larger range, while all other variables are between 0 and 1. Scaling can be directly performed as preprocessing in train using the parameter preProcess = "scale".
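
As a quick optional check, we can look at the range of legs in the training data to see why scaling matters here.

Zoo_train |> summarize(min_legs = min(legs), max_legs = max(legs))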

knnFit <- Zoo_train |> train(type ~ .,
  method = "knn",
  data = _,
  preProcess = "scale",
    tuneLength = 5,
  tuneGrid=data.frame(k = 1:10),
    trControl = trainControl(method = "cv", indexOut = train_index))
knnFit
## k-Nearest Neighbors 
## 
## 83 samples
## 16 predictors
##  7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## Pre-processing: scaled (16) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 74, 73, 76, 74, 74, 75, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy  Kappa
##    1  1.000     1.000
##    2  0.978     0.971
##    3  0.967     0.957
##    4  0.943     0.926
##    5  0.965     0.954
##    6  0.916     0.891
##    7  0.883     0.850
##    8  0.872     0.835
##    9  0.883     0.848
##   10  0.908     0.881
## 
## Accuracy was used to select the optimal model
##  using the largest value.
## The final value used for the model was k = 1.
knnFit$finalModel
## 1-nearest neighbor model
## Training set outcome distribution:
## 
##        mammal          bird       reptile 
##            33            16             4 
##          fish     amphibian        insect 
##            11             4             7 
## mollusc.et.al 
##             8

4.3.4 PART (Rule-based classifier)

rulesFit <- Zoo_train |> train(type ~ .,
  method = "PART",
  data = _,
  tuneLength = 5,
  trControl = trainControl(method = "cv", indexOut = train_index))
rulesFit
## Rule-Based Classifier 
## 
## 83 samples
## 16 predictors
##  7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 77, 72, 77, 74, 74, 73, ... 
## Resampling results across tuning parameters:
## 
##   threshold  pruned  Accuracy  Kappa
##   0.010      yes     0.965     0.955
##   0.010      no      0.988     0.984
##   0.133      yes     0.965     0.955
##   0.133      no      0.988     0.984
##   0.255      yes     0.965     0.955
##   0.255      no      0.988     0.984
##   0.378      yes     0.965     0.955
##   0.378      no      0.988     0.984
##   0.500      yes     0.965     0.955
##   0.500      no      0.988     0.984
## 
## Accuracy was used to select the optimal model
##  using the largest value.
## The final values used for the model were threshold
##  = 0.5 and pruned = no.
rulesFit$finalModel
## PART decision list
## ------------------
## 
## feathersTRUE <= 0 AND
## milkTRUE > 0: mammal (33.0)
## 
## feathersTRUE > 0: bird (16.0)
## 
## toothedTRUE <= 0 AND
## airborneTRUE <= 0 AND
## predatorTRUE > 0: mollusc.et.al (6.0)
## 
## toothedTRUE <= 0 AND
## legs > 2: insect (7.0)
## 
## finsTRUE > 0: fish (11.0)
## 
## toothedTRUE > 0 AND
## aquaticTRUE <= 0: reptile (3.0)
## 
## aquaticTRUE > 0 AND
## venomousTRUE <= 0: amphibian (3.0)
## 
## aquaticTRUE <= 0: mollusc.et.al (2.0)
## 
## : reptile (2.0/1.0)
## 
## Number of Rules  :   9

4.3.5 Linear Support Vector Machines

svmFit <- Zoo_train |> train(type ~.,
  method = "svmLinear",
  data = _,
    tuneLength = 5,
    trControl = trainControl(method = "cv", indexOut = train_index))
svmFit
## Support Vector Machines with Linear Kernel 
## 
## 83 samples
## 16 predictors
##  7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 73, 75, 75, 74, 74, 76, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   1         1    
## 
## Tuning parameter 'C' was held constant at a value of 1
svmFit$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 44 
## 
## Objective Function Value : -0.143 -0.198 -0.148 -0.175 -0.0945 -0.104 -0.19 -0.0814 -0.154 -0.0917 -0.115 -0.177 -0.568 -0.104 -0.15 -0.119 -0.0478 -0.083 -0.123 -0.148 -0.58 
## Training error : 0

4.3.6 Random Forest

randomForestFit <- Zoo_train |> train(type ~ .,
  method = "rf",
  data = _,
    tuneLength = 5,
    trControl = trainControl(method = "cv", indexOut = train_index))
randomForestFit
## Random Forest 
## 
## 83 samples
## 16 predictors
##  7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 74, 76, 75, 74, 73, 76, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa
##    2    0.976     0.968
##    5    0.976     0.968
##    9    0.976     0.968
##   12    0.965     0.954
##   16    0.976     0.969
## 
## Accuracy was used to select the optimal model
##  using the largest value.
## The final value used for the model was mtry = 2.
randomForestFit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 3.61%
## Confusion matrix:
##               mammal bird reptile fish amphibian
## mammal            33    0       0    0         0
## bird               0   16       0    0         0
## reptile            0    0       2    1         1
## fish               0    0       0   11         0
## amphibian          0    0       0    0         4
## insect             0    0       0    0         0
## mollusc.et.al      0    0       0    0         0
##               insect mollusc.et.al class.error
## mammal             0             0       0.000
## bird               0             0       0.000
## reptile            0             0       0.500
## fish               0             0       0.000
## amphibian          0             0       0.000
## insect             7             0       0.000
## mollusc.et.al      1             7       0.125

4.3.7 Gradient Boosted Decision Trees (xgboost)

xgboostFit <- Zoo_train |> train(type ~ .,
  method = "xgbTree",
  data = _,
  tuneLength = 5,
  trControl = trainControl(method = "cv", indexOut = train_index),
  tuneGrid = expand.grid(
    nrounds = 20,
    max_depth = 3,
    colsample_bytree = .6,
    eta = 0.1,
    gamma=0,
    min_child_weight = 1,
    subsample = .5
  ))
xgboostFit
## eXtreme Gradient Boosting 
## 
## 83 samples
## 16 predictors
##  7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 76, 75, 75, 74, 76, 74, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   0.976     0.969
## 
## Tuning parameter 'nrounds' was held constant at
##  a value of 1
## Tuning parameter 'subsample' was
##  held constant at a value of 0.5
xgboostFit$finalModel
## ##### xgb.Booster
## raw: 111.3 Kb 
## call:
##   xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, 
##     gamma = param$gamma, colsample_bytree = param$colsample_bytree, 
##     min_child_weight = param$min_child_weight, subsample = param$subsample), 
##     data = x, nrounds = param$nrounds, num_class = length(lev), 
##     objective = "multi:softprob")
## params (as set within xgb.train):
##   eta = "0.1", max_depth = "3", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsample = "0.5", num_class = "7", objective = "multi:softprob", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
## # of features: 16 
## niter: 20
## nfeatures : 16 
## xNames : hairTRUE feathersTRUE eggsTRUE milkTRUE airborneTRUE aquaticTRUE predatorTRUE toothedTRUE backboneTRUE breathesTRUE venomousTRUE finsTRUE legs tailTRUE domesticTRUE catsizeTRUE 
## problemType : Classification 
## tuneValue :
##    nrounds max_depth eta gamma colsample_bytree
## 1      20         3 0.1     0              0.6
##   min_child_weight subsample
## 1                1       0.5
## obsLevels : mammal bird reptile fish amphibian insect mollusc.et.al 
## param :
##  list()

4.3.8 Artificial Neural Network

nnetFit <- Zoo_train |> train(type ~ .,
  method = "nnet",
  data = _,
    tuneLength = 5,
    trControl = trainControl(method = "cv", indexOut = train_index),
  trace = FALSE)
nnetFit
## Neural Network 
## 
## 83 samples
## 16 predictors
##  7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 75, 77, 72, 75, 75, 76, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy  Kappa
##   1     0e+00  0.694     0.558
##   1     1e-04  0.807     0.728
##   1     1e-03  0.892     0.852
##   1     1e-02  0.825     0.766
##   1     1e-01  0.727     0.633
##   3     0e+00  0.954     0.939
##   3     1e-04  0.989     0.986
##   3     1e-03  0.989     0.986
##   3     1e-02  0.989     0.986
##   3     1e-01  0.989     0.986
##   5     0e+00  0.939     0.917
##   5     1e-04  0.965     0.954
##   5     1e-03  0.989     0.986
##   5     1e-02  0.989     0.986
##   5     1e-01  0.989     0.986
##   7     0e+00  0.989     0.986
##   7     1e-04  0.989     0.986
##   7     1e-03  0.989     0.986
##   7     1e-02  1.000     1.000
##   7     1e-01  0.989     0.986
##   9     0e+00  0.989     0.986
##   9     1e-04  0.989     0.986
##   9     1e-03  0.989     0.986
##   9     1e-02  0.989     0.986
##   9     1e-01  1.000     1.000
## 
## Accuracy was used to select the optimal model
##  using the largest value.
## The final values used for the model were size = 7
##  and decay = 0.01.
nnetFit$finalModel
## a 16-7-7 network with 175 weights
## inputs: hairTRUE feathersTRUE eggsTRUE milkTRUE airborneTRUE aquaticTRUE predatorTRUE toothedTRUE backboneTRUE breathesTRUE venomousTRUE finsTRUE legs tailTRUE domesticTRUE catsizeTRUE 
## output(s): .outcome 
## options were - softmax modelling  decay=0.01

4.4 Comparing Models

Collect the performance metrics from the models trained on the same data.

resamps <- resamples(list(
  ctree = ctreeFit,
  C45 = C45Fit,
  SVM = svmFit,
  KNN = knnFit,
  rules = rulesFit,
  randomForest = randomForestFit,
  xgboost = xgboostFit,
  NeuralNet = nnetFit
    ))
resamps
## 
## Call:
## resamples.default(x = list(ctree = ctreeFit, C45
##  = C45Fit, SVM = svmFit, KNN = knnFit, rules
##  = rulesFit, randomForest = randomForestFit, xgboost
##  = xgboostFit, NeuralNet = nnetFit))
## 
## Models: ctree, C45, SVM, KNN, rules, randomForest, xgboost, NeuralNet 
## Number of resamples: 10 
## Performance metrics: Accuracy, Kappa 
## Time estimates for: everything, final model fit

Calculate summary statistics.

summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: ctree, C45, SVM, KNN, rules, randomForest, xgboost, NeuralNet 
## Number of resamples: 10 
## 
## Accuracy 
##               Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## ctree        0.750   0.778  0.778 0.808   0.851 0.889
## C45          0.889   1.000  1.000 0.989   1.000 1.000
## SVM          1.000   1.000  1.000 1.000   1.000 1.000
## KNN          1.000   1.000  1.000 1.000   1.000 1.000
## rules        0.875   1.000  1.000 0.988   1.000 1.000
## randomForest 0.875   1.000  1.000 0.976   1.000 1.000
## xgboost      0.875   1.000  1.000 0.976   1.000 1.000
## NeuralNet    1.000   1.000  1.000 1.000   1.000 1.000
##              NA's
## ctree           0
## C45             0
## SVM             0
## KNN             0
## rules           0
## randomForest    0
## xgboost         0
## NeuralNet       0
## 
## Kappa 
##               Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## ctree        0.673   0.701  0.721 0.746   0.794 0.857
## C45          0.850   1.000  1.000 0.985   1.000 1.000
## SVM          1.000   1.000  1.000 1.000   1.000 1.000
## KNN          1.000   1.000  1.000 1.000   1.000 1.000
## rules        0.837   1.000  1.000 0.984   1.000 1.000
## randomForest 0.833   1.000  1.000 0.968   1.000 1.000
## xgboost      0.833   1.000  1.000 0.969   1.000 1.000
## NeuralNet    1.000   1.000  1.000 1.000   1.000 1.000
##              NA's
## ctree           0
## C45             0
## SVM             0
## KNN             0
## rules           0
## randomForest    0
## xgboost         0
## NeuralNet       0
library(lattice)
bwplot(resamps, layout = c(3, 1))

Perform inference about differences between models. For each metric, all pairwise differences are computed and tested to assess if the difference is equal to zero. By default, Bonferroni correction for multiple comparisons is used. Differences are shown in the upper triangle and p-values are in the lower triangle.

difs <- diff(resamps)
difs
## 
## Call:
## diff.resamples(x = resamps)
## 
## Models: ctree, C45, SVM, KNN, rules, randomForest, xgboost, NeuralNet 
## Metrics: Accuracy, Kappa 
## Number of differences: 28 
## p-value adjustment: bonferroni
summary(difs)
## 
## Call:
## summary.diff.resamples(object = difs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##              ctree    C45      SVM      KNN     
## ctree                 -0.18095 -0.19206 -0.19206
## C45          0.000109          -0.01111 -0.01111
## SVM          3.49e-05 1.000000           0.00000
## KNN          3.49e-05 1.000000 NA               
## rules        5.75e-05 1.000000 1.000000 1.000000
## randomForest 0.000126 1.000000 1.000000 1.000000
## xgboost      0.001617 1.000000 1.000000 1.000000
## NeuralNet    3.49e-05 1.000000 NA       NA      
##              rules    randomForest xgboost  NeuralNet
## ctree        -0.17956 -0.16845     -0.16845 -0.19206 
## C45           0.00139  0.01250      0.01250 -0.01111 
## SVM           0.01250  0.02361      0.02361  0.00000 
## KNN           0.01250  0.02361      0.02361  0.00000 
## rules                  0.01111      0.01111 -0.01250 
## randomForest 1.000000               0.00000 -0.02361 
## xgboost      1.000000 1.000000              -0.02361 
## NeuralNet    1.000000 1.000000     1.000000          
## 
## Kappa 
##              ctree    C45       SVM       KNN      
## ctree                 -0.238879 -0.253879 -0.253879
## C45          6.67e-05           -0.015000 -0.015000
## SVM          2.17e-05 1.00000              0.000000
## KNN          2.17e-05 1.00000   NA                 
## rules        3.92e-05 1.00000   1.00000   1.00000  
## randomForest 8.30e-05 1.00000   1.00000   1.00000  
## xgboost      0.00135  1.00000   1.00000   1.00000  
## NeuralNet    2.17e-05 1.00000   NA        NA       
##              rules     randomForest xgboost  
## ctree        -0.237552 -0.222212    -0.222927
## C45           0.001327  0.016667     0.015952
## SVM           0.016327  0.031667     0.030952
## KNN           0.016327  0.031667     0.030952
## rules                   0.015340     0.014626
## randomForest 1.00000                -0.000714
## xgboost      1.00000   1.00000               
## NeuralNet    1.00000   1.00000      1.00000  
##              NeuralNet
## ctree        -0.253879
## C45          -0.015000
## SVM           0.000000
## KNN           0.000000
## rules        -0.016327
## randomForest -0.031667
## xgboost      -0.030952
## NeuralNet

All models perform similarly well except ctree (the differences in the first row are negative and the p-values in the first column are < .05, indicating that the null hypothesis of a difference of 0 can be rejected).
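
The differences can also be visualized with the lattice dotplot method that caret provides for diff.resamples objects (an optional sketch; it shows the pairwise accuracy differences with confidence intervals).

dotplot(difs, metric = "Accuracy")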

4.5 Applying the Chosen Model to the Test Data

Most models do similarly well on the data. Here we choose the random forest model.

pr <- predict(randomForestFit, Zoo_test)
pr
##  [1] mammal        mollusc.et.al bird         
##  [4] mammal        insect        bird         
##  [7] mammal        mollusc.et.al mammal       
## [10] mammal        bird          bird         
## [13] fish          mammal        fish         
## [16] mammal        bird          mammal       
## 7 Levels: mammal bird reptile fish ... mollusc.et.al

Calculate the confusion matrix for the held-out test data.

confusionMatrix(pr, reference = Zoo_test$type)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      mammal bird reptile fish amphibian
##   mammal             8    0       0    0         0
##   bird               0    4       1    0         0
##   reptile            0    0       0    0         0
##   fish               0    0       0    2         0
##   amphibian          0    0       0    0         0
##   insect             0    0       0    0         0
##   mollusc.et.al      0    0       0    0         0
##                Reference
## Prediction      insect mollusc.et.al
##   mammal             0             0
##   bird               0             0
##   reptile            0             0
##   fish               0             0
##   amphibian          0             0
##   insect             1             0
##   mollusc.et.al      0             2
## 
## Overall Statistics
##                                         
##                Accuracy : 0.944         
##                  95% CI : (0.727, 0.999)
##     No Information Rate : 0.444         
##     P-Value [Acc > NIR] : 1.08e-05      
##                                         
##                   Kappa : 0.922         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: mammal Class: bird
## Sensitivity                  1.000       1.000
## Specificity                  1.000       0.929
## Pos Pred Value               1.000       0.800
## Neg Pred Value               1.000       1.000
## Prevalence                   0.444       0.222
## Detection Rate               0.444       0.222
## Detection Prevalence         0.444       0.278
## Balanced Accuracy            1.000       0.964
##                      Class: reptile Class: fish
## Sensitivity                  0.0000       1.000
## Specificity                  1.0000       1.000
## Pos Pred Value                  NaN       1.000
## Neg Pred Value               0.9444       1.000
## Prevalence                   0.0556       0.111
## Detection Rate               0.0000       0.111
## Detection Prevalence         0.0000       0.111
## Balanced Accuracy            0.5000       1.000
##                      Class: amphibian Class: insect
## Sensitivity                        NA        1.0000
## Specificity                         1        1.0000
## Pos Pred Value                     NA        1.0000
## Neg Pred Value                     NA        1.0000
## Prevalence                          0        0.0556
## Detection Rate                      0        0.0556
## Detection Prevalence                0        0.0556
## Balanced Accuracy                  NA        1.0000
##                      Class: mollusc.et.al
## Sensitivity                         1.000
## Specificity                         1.000
## Pos Pred Value                      1.000
## Neg Pred Value                      1.000
## Prevalence                          0.111
## Detection Rate                      0.111
## Detection Prevalence                0.111
## Balanced Accuracy                   1.000

4.6 Deep Learning with keras/tensorflow

The keras package needs the packages reticulate and tensorflow. To install keras you need to (see the code sketch after this list):

  1. have a working Python installation,
  2. install the keras R package install.packages("keras"), and
  3. install the tensorflow/keras Python modules with library(keras); install_keras()
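
A minimal sketch of these steps (run once; step 1, the working Python installation, depends on your system):

## run once to set up keras
install.packages("keras")
library(keras)
install_keras()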

Prepare the data. All data needs to be in a matrix containing only numeric/integer values. The class variable needs to be one-hot encoded, and the keras function to_categorical() expects a class index starting with 0 instead of 1.

library(keras)

X <- Zoo_train |> 
  select(!type) |> 
  mutate(across(everything(), as.integer)) |> 
  as.matrix()
head(X)
##      hair feathers eggs milk airborne aquatic predator
## [1,]    1        0    0    1        0       0        1
## [2,]    0        0    1    0        0       1        1
## [3,]    1        0    0    1        0       0        1
## [4,]    1        0    0    1        0       0        1
## [5,]    1        0    0    1        0       0        0
## [6,]    1        0    0    1        0       0        0
##      toothed backbone breathes venomous fins legs tail
## [1,]       1        1        1        0    0    4    0
## [2,]       1        1        0        0    1    0    1
## [3,]       1        1        1        0    0    4    0
## [4,]       1        1        1        0    0    4    1
## [5,]       1        1        1        0    0    4    1
## [6,]       1        1        1        0    0    4    1
##      domestic catsize
## [1,]        0       1
## [2,]        0       0
## [3,]        0       1
## [4,]        0       1
## [5,]        0       1
## [6,]        1       1
y <- Zoo_train |> 
  pull("type") |> 
  {\(x)(as.integer(x) - 1L)}() |>   ## make index start with 0
  to_categorical()
head(y)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    1    0    0    0    0    0    0
## [2,]    0    0    0    1    0    0    0
## [3,]    1    0    0    0    0    0    0
## [4,]    1    0    0    0    0    0    0
## [5,]    1    0    0    0    0    0    0
## [6,]    1    0    0    0    0    0    0
X_test <- Zoo_test |> 
  select(!type) |> 
  mutate(across(everything(), as.integer)) |> 
  as.matrix()
y_test <- Zoo_test |> 
  pull("type") |> 
  {\(x)(as.integer(x) - 1L)}() |>
  to_categorical()

Deep learning uses a large set of hyper-parameters. Choices are the activation function, number of layers, number of units per layer, the optimizer, and the type of regularization used.

We define here a simple network with a single dense hidden layer. Note that in keras, we have to specify the shape of the input (done below via input_shape when the sequential model is created). The output of the classifier is a categorical class value, therefore we use the softmax activation function for the output layer. For training, we specify categorical cross-entropy as the loss function and accuracy as an additional metric. Also, L2 regularization is used for the weights of the hidden layer to reduce overfitting.

model <- keras_model_sequential(input_shape = c(ncol(X)), 
              name = "single_hidden_layer_classifier") |>
  layer_dense(units = 10, activation = 'relu', 
              kernel_regularizer=regularizer_l2(l=0.01), name = "hidden1") |>
  layer_dense(units = ncol(y), activation = 'softmax', name = "output")

model
## Model: "single_hidden_layer_classifier"
## _______________________________________________________
##  Layer (type)           Output Shape          Param #  
## =======================================================
##  hidden1 (Dense)        (None, 10)            170      
##  output (Dense)         (None, 7)             77       
## =======================================================
## Total params: 247
## Trainable params: 247
## Non-trainable params: 0
## _______________________________________________________
model <- model |>  
  compile(loss = 'categorical_crossentropy', 
          optimizer = 'adam', 
          metrics = 'accuracy')

For model training, we need to specify the batch size and the number of training epochs. The fitting process can also set aside a fraction of the training data for validation to provide an estimate of the generalization loss/accuracy during training.

history <- model |>
  fit(
    X, 
    y,
    batch_size = 10,
    epochs = 100,
    validation_split = .2
  )

plot(history)

To create predictions from the model, we have to convert the predicted class probabilities back to class labels.

class_labels <- levels(Zoo_train |> pull(type))

pr <- predict(model, X_test) |> 
  apply(MARGIN = 1, FUN = which.max)
pr <- factor(pr, labels = class_labels, levels = seq_along(class_labels))

pr
##  [1] mammal        mollusc.et.al bird         
##  [4] mammal        insect        bird         
##  [7] mammal        mollusc.et.al mammal       
## [10] mammal        bird          bird         
## [13] fish          mammal        fish         
## [16] mammal        bird          mammal       
## 7 Levels: mammal bird reptile fish ... mollusc.et.al
confusionMatrix(data = pr, 
                reference = Zoo_test$type)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      mammal bird reptile fish amphibian
##   mammal             8    0       0    0         0
##   bird               0    4       1    0         0
##   reptile            0    0       0    0         0
##   fish               0    0       0    2         0
##   amphibian          0    0       0    0         0
##   insect             0    0       0    0         0
##   mollusc.et.al      0    0       0    0         0
##                Reference
## Prediction      insect mollusc.et.al
##   mammal             0             0
##   bird               0             0
##   reptile            0             0
##   fish               0             0
##   amphibian          0             0
##   insect             1             0
##   mollusc.et.al      0             2
## 
## Overall Statistics
##                                         
##                Accuracy : 0.944         
##                  95% CI : (0.727, 0.999)
##     No Information Rate : 0.444         
##     P-Value [Acc > NIR] : 1.08e-05      
##                                         
##                   Kappa : 0.922         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: mammal Class: bird
## Sensitivity                  1.000       1.000
## Specificity                  1.000       0.929
## Pos Pred Value               1.000       0.800
## Neg Pred Value               1.000       1.000
## Prevalence                   0.444       0.222
## Detection Rate               0.444       0.222
## Detection Prevalence         0.444       0.278
## Balanced Accuracy            1.000       0.964
##                      Class: reptile Class: fish
## Sensitivity                  0.0000       1.000
## Specificity                  1.0000       1.000
## Pos Pred Value                  NaN       1.000
## Neg Pred Value               0.9444       1.000
## Prevalence                   0.0556       0.111
## Detection Rate               0.0000       0.111
## Detection Prevalence         0.0000       0.111
## Balanced Accuracy            0.5000       1.000
##                      Class: amphibian Class: insect
## Sensitivity                        NA        1.0000
## Specificity                         1        1.0000
## Pos Pred Value                     NA        1.0000
## Neg Pred Value                     NA        1.0000
## Prevalence                          0        0.0556
## Detection Rate                      0        0.0556
## Detection Prevalence                0        0.0556
## Balanced Accuracy                  NA        1.0000
##                      Class: mollusc.et.al
## Sensitivity                         1.000
## Specificity                         1.000
## Pos Pred Value                      1.000
## Neg Pred Value                      1.000
## Prevalence                          0.111
## Detection Rate                      0.111
## Detection Prevalence                0.111
## Balanced Accuracy                   1.000
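
The one-hot encoded y_test created earlier can also be used to let keras report the loss and accuracy on the test data directly (an optional sketch).

## evaluate loss and accuracy on the held-out test data
model |> evaluate(X_test, y_test)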

4.8 More Information on Classification with R