Introduction

We use caret to compare several classification techniques on the forensic glass dataset, which has 214 examples, 6 classes and 9 inputs.

We compare the following classification techniques: linear discriminant analysis (LDA), mixture discriminant analysis (MDA), and several multilayer perceptron (MLP) implementations (nnet/avNNet, mlpWeightDecayML and mlpKerasDecay).

Caret Package Source

The caret package is constantly being updated with new methods. The complete source is available on GitHub: https://github.com/topepo/caret/tree/master/pkg/caret/R.

It is also helpful to examine the R scripts provided for testing each method, available at https://github.com/topepo/caret/tree/master/RegressionTests/Code.
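From within R, caret can also report which tuning parameters a given method exposes, via modelLookup():

library(caret)
#list the tuning parameters caret exposes for a given method
modelLookup("mda")
modelLookup("avNNet")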

Forensic Glass Classification

The forensic glass dataset is widely used to illustrate ML classification algorithms since it is a challenging classification problem with 6 classes and 9 input variables. There are \(n=214\) observations.

Forensic Glass Classification Inputs

Variable Name   Description
RI              refractive index
Na              sodium
Mg              magnesium
Al              aluminium
Si              silicon
K               potassium
Ca              calcium
Ba              barium
Fe              iron

The output variable Type indicates which one of the following six types of glass the fragment belongs to:

  • window float glass (WinF: 70)
  • window non-float glass (WinNF: 76)
  • vehicle window glass (Veh: 17)
  • containers (Con: 13)
  • tableware (Tabl: 9)
  • vehicle headlamps (Head: 29)

The distribution across classes is shown below:
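The counts can be reproduced with a couple of lines (assuming the glass data frame is already loaded, e.g. via data(Glass, package = "mlbench")):

cat("Frequency Count of Each Class\n")
table(glass$Type)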

## Frequency Count of Each Class
##  1  2  3  5  6  7 
## 70 76 17 13  9 29

There is modest class imbalance, which we ignore in our analysis.

LDA: Linear Discriminant Analysis

LDA has no tuning parameters, but we still run the repeated cross-validation so that its resamples can be compared with the other methods later.

library(caret)
library(doParallel) #for makeCluster() and registerDoParallel()
library(dplyr)      #for mutate()
#glass is assumed already loaded, e.g. data(Glass, package="mlbench"); glass <- Glass
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
#no tuning but generates cv replicates for comparisons
set.seed(333477)
ctrl <- trainControl(method="repeatedcv", number=10, repeats=7)
glass2 <- mutate(glass, Type=factor(Type))
ans_lda <- train(Type ~ .,
                method="lda",
                data=glass2,
                trControl=ctrl)
ans_lda
## Linear Discriminant Analysis 
## 
## 214 samples
##   9 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 7 times) 
## Summary of sample sizes: 191, 193, 194, 192, 192, 193, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6309734  0.4825856

MDA: Mixture Discriminant Analysis

set.seed(333477)
ans_mda <- train(Type ~ .,
             trControl=ctrl,
             data=glass2,
             method="mda")
ans_mda
## Mixture Discriminant Analysis 
## 
## 214 samples
##   9 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 7 times) 
## Summary of sample sizes: 191, 193, 194, 192, 192, 193, ... 
## Resampling results across tuning parameters:
## 
##   subclasses  Accuracy   Kappa    
##   2           0.6515240  0.5161948
##   3           0.6797852  0.5561867
##   4           0.7080175  0.5982440
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was subclasses = 4.
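To see where the misclassifications occur, caret can average the confusion matrices over the resamples:

#cross-validated confusion matrix, averaged over the 70 resamples
confusionMatrix(ans_mda)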

MLP: Multilayer Perceptron

There are a few implementations of the MLP architecture.

Model Averaged Neural Network, method = ‘avNNet’

This method uses the underlying engine nnet::nnet(), adapted to average several fitted networks and optionally include bagging. See help(avNNet) for more details.
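Conceptually, avNNet fits a small committee of nnet models and averages their class probabilities. Here is a minimal sketch of the idea (the committee size of 5 and the size/decay values are illustrative assumptions, not caret's defaults):

library(nnet)
set.seed(1)
#fit 5 networks, each on a bootstrap resample of the data (bag = TRUE)
fits <- lapply(1:5, function(i) {
  idx <- sample(nrow(glass2), replace = TRUE)
  nnet(Type ~ ., data = glass2[idx, ], size = 5, decay = 0.1,
       trace = FALSE, skip = TRUE, maxit = 200)
})
#average the class probabilities across the committee and pick the winner
probs <- Reduce(`+`, lapply(fits, predict, newdata = glass2, type = "raw")) / 5
pred  <- factor(colnames(probs)[max.col(probs)])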

Tuning parameters:

  • size (#Hidden Units)
  • decay (Weight Decay)
  • bag (Bagging)

Run vanilla version

The options trace=FALSE and skip=TRUE are passed through to nnet::nnet().

set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
#the vanilla run uses the plain nnet engine with caret's default bootstrap resampling
ans_avnnet0 <- train(x=X, y=y,
                method="nnet",
                trace=FALSE, skip=TRUE)
stopCluster(cl)
totTime_avnnet0 <- proc.time()[3] -  startTime #<10 sec
ans_avnnet0
## Neural Network 
## 
## 214 samples
##   9 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   1     0e+00  0.6064198  0.4567436
##   1     1e-04  0.6169150  0.4710546
##   1     1e-01  0.6236058  0.4739848
##   3     0e+00  0.6103005  0.4654721
##   3     1e-04  0.6156357  0.4681011
##   3     1e-01  0.6398639  0.4976732
##   5     0e+00  0.5989586  0.4476061
##   5     1e-04  0.6234196  0.4807651
##   5     1e-01  0.6426619  0.5033888
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.

Tune avNNet

set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
Grid <- expand.grid(
 size = seq(7, 25, 2), 
 decay = c(0.001, 0.01, 0.1, 0.2, 0.3), #in [0, 1),
 bag = TRUE
)
#Remark: bag is not really a tuning parameter!
ans_avNNet <- train(x=X, y=y,
                method="avNNet",
                trace=FALSE, skip=TRUE, entropy=TRUE,
                tuneGrid = Grid,
                trControl = ctrl)
stopCluster(cl)
totTimeavNNet <- proc.time()[3] -  startTime #about 4 min
indOpt <- rownames(ans_avNNet$bestTune)
ans_avNNet$results[indOpt,]
##    size decay  bag  Accuracy     Kappa AccuracySD   KappaSD
## 42   23  0.01 TRUE 0.6767395 0.5455375 0.08401128 0.1191703
ans_avNNet$bestTune
##    size decay  bag
## 42   23  0.01 TRUE
Regularization Path

The plot suggests the tuning parameters size = 23 and decay = 0.01 are adequate.

plot(ans_avNNet)

Variable Importance Plot
plot(varImp(ans_avNNet))

Multi-Layer Perceptron, multiple layers, method = ‘mlpWeightDecayML’

This method uses the R package RSNNS, which supports up to 3 hidden layers.

The tuning parameters are self-explanatory:

  • layer1 (#Hidden Units layer 1)
  • layer2 (#Hidden Units layer 2)
  • layer3 (#Hidden Units layer 3)
  • decay (Weight Decay)
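For reference, here is a rough sketch of driving the underlying engine directly; the layer sizes and the learnFuncParams ordering are assumptions based on the SNNS documentation, and caret handles all of this for us:

library(RSNNS)
yMat <- decodeClassLabels(y)                    #one-hot encode the class factor
fit <- mlp(as.matrix(X), yMat,
           size = c(13, 1, 1),                  #three hidden layers
           learnFunc = "BackpropWeightDecay",
           learnFuncParams = c(0.2, 0.2, 0, 0), #learning rate, weight decay (assumed order)
           maxit = 100)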

Run vanilla version

set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
ans_RSNNS0 <- train(x=X, y=y, method='mlpWeightDecayML')
stopCluster(cl)
totTime_RSNNS0 <- proc.time()[3] -  startTime #<10 sec
ans_RSNNS0
## Multi-Layer Perceptron, multiple layers 
## 
## 214 samples
##   9 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ... 
## Resampling results across tuning parameters:
## 
##   layer1  decay  Accuracy   Kappa
##   1       0e+00  0.3302787  0    
##   1       1e-04  0.3161720  0    
##   1       1e-01  0.3153336  0    
##   3       0e+00  0.3192728  0    
##   3       1e-04  0.3199692  0    
##   3       1e-01  0.3184675  0    
##   5       0e+00  0.3245901  0    
##   5       1e-04  0.3211871  0    
##   5       1e-01  0.3001275  0    
## 
## Tuning parameter 'layer2' was held constant at a value of 0
## 
## Tuning parameter 'layer3' was held constant at a value of 0
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were layer1 = 1, layer2 = 0, layer3
##  = 0 and decay = 0.

This method doesn’t seem promising: Kappa = 0 across the whole grid means the fitted networks do no better than always predicting the most frequent class.

Tune ‘mlpWeightDecayML’

We explore a wider range of tuning parameters but, for speed, keep the default resampling of 25 bootstrap replicates rather than repeated cross-validation.

set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
Grid <- expand.grid(
 layer1 =  seq(1, 25, 6), 
 layer2 =  seq(1, 25, 6),
 layer3 =  seq(1, 25, 6),
 decay = c(0.01, 0.1, 0.2, 0.3) #in [0, 1)
)
ans_RSNNS <- train(x=X, y=y, method='mlpWeightDecayML', tuneGrid = Grid)
stopCluster(cl)
totTime_RSNNS <- proc.time()[3] -  startTime #about 5 minutes
indOpt <- rownames(ans_RSNNS$bestTune)
ans_RSNNS$results[indOpt,]
##     layer1 layer2 layer3 decay  Accuracy Kappa AccuracySD KappaSD
## 203     13      1      1   0.2 0.3500116     0 0.04893104       0
ans_RSNNS$bestTune
##     layer1 layer2 layer3 decay
## 203     13      1      1   0.2
Regularization Path
plot(ans_RSNNS)

MLP with Keras, method = ‘mlpKerasDecay’

For a tutorial, see https://keras.rstudio.com/.

Tuning parameters:

  • size (#Hidden Units)
  • lambda (L2 Regularization)
  • batch_size (Batch Size)
  • lr (Learning Rate)
  • rho (Rho)
  • decay (Learning Rate Decay)
  • activation (Activation Function)
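These parameters map onto a small sequential keras model with a single hidden layer, roughly like the sketch below (not caret's exact internal code; the illustrated values are assumptions):

library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 5, activation = "relu", input_shape = 9,  #size, activation
              kernel_regularizer = regularizer_l2(1e-4)) %>%    #lambda
  layer_dense(units = 6, activation = "softmax")                #one unit per class
model %>% compile(loss = "categorical_crossentropy",
                  optimizer = optimizer_rmsprop(lr = 0.001, rho = 0.9, decay = 0.01),
                  metrics = "accuracy")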

Run vanilla version - with data leakage problem

set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- scale(glass[,1:9]) #scaling the full data before resampling leaks information into the CV folds
y <- factor(glass[,10])
ans_keras0 <- train(x=X, y=y, method='mlpKerasDecay')
stopCluster(cl)
totTime_keras0 <- proc.time()[3] -  startTime #45 sec
ans_keras0

Run vanilla version - avoiding data leakage

Use caret::train(..., preProc = c("center", "scale")) for honest cross-validation!

set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
ans_keras0 <- train(x=X, y=y, method='mlpKerasDecay', preProc = c("center", "scale"))
stopCluster(cl)
totTime_keras0 <- proc.time()[3] -  startTime #45 sec
ans_keras0

The out-of-the-box performance is very poor, but it can be improved with tuning.

We see the default activation function is “relu”.

Note: it is essential to standardize the inputs, since keras and tensorflow do not do this for you!

Many statistical procedures, such as lm() and lda(), do not need rescaling because their answers are mathematically equivalent with or without it. For methods where scale matters, such as principal component regression and penalized regression, rescaling to avoid giving undue importance to the measurement scale is usually the default. In those cases we do not need train(..., preProc = c("center", "scale")).
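In effect, preProc makes train() redo the centering and scaling inside every resample. A sketch of what happens for a single split (train_idx here is a hypothetical vector of analysis-set row indices):

pp   <- preProcess(X[train_idx, ], method = c("center", "scale")) #estimated on the analysis set only
X_tr <- predict(pp, X[train_idx, ])
X_te <- predict(pp, X[-train_idx, ]) #held-out rows scaled with analysis-set statistics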

Tuning

set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
Grid <- expand.grid(size=seq(3000,8000,1000),
   lambda=c(0.0001),
   batch_size=c(214),
   lr=c(0.001, .005),
   rho=0.9,
   decay=c(0.001, 0.01, 0.1),
   activation="relu"
   )
ans_keras <- train(x=X, y=y, method='mlpKerasDecay', verbose=1, epochs = 20, 
                  tuneGrid=Grid, trControl = ctrl, 
                   preProc = c("center", "scale"))
stopCluster(cl)
totTime_keras <- proc.time()[3] -  startTime #few minutes
indOpt <- rownames(ans_keras$bestTune)
ans_keras$results[indOpt,]
##    size lambda batch_size    lr rho decay activation  Accuracy     Kappa
## 26 7000  1e-04        214 0.001 0.9  0.01       relu 0.7082813 0.5896957
##    AccuracySD   KappaSD
## 26 0.08089785 0.1143908
ans_keras$bestTune
##    size lambda batch_size    lr rho decay activation
## 26 7000  1e-04        214 0.001 0.9  0.01       relu

With honest cross-validation, our accuracy estimate appears to be somewhat lower.

Regularization Path
plot(ans_keras)

Variable Importance
plot(varImp(ans_keras))

Resampling Comparisons
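The summaries below pool the cross-validation resamples of the four fitted models. A sketch of the calls, matching the Call: lines printed in the output:

out <- resamples(list(LDA = ans_lda, MDA = ans_mda,
                      nnet = ans_avNNet, keras = ans_keras))
summary(out)       #resample distributions of Accuracy and Kappa
summary(diff(out)) #paired differences with Bonferroni-adjusted p-values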

## 
## Call:
## summary.resamples(object = out)
## 
## Models: LDA, MDA, nnet, keras 
## Number of resamples: 70 
## 
## Accuracy 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LDA   0.4285714 0.5714286 0.6363636 0.6309734 0.6956522 0.8500000    0
## MDA   0.4761905 0.6221805 0.7142857 0.7080175 0.7826087 0.9500000    0
## nnet  0.4545455 0.6363636 0.6818182 0.6767395 0.7142857 0.9047619    0
## keras 0.5454545 0.6666667 0.7142857 0.7082813 0.7619048 0.8636364    0
## 
## Kappa 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LDA   0.1897106 0.3951954 0.4926213 0.4825856 0.5723198 0.7849462    0
## MDA   0.2712934 0.4778685 0.6098922 0.5982440 0.7062565 0.9293286    0
## nnet  0.2625698 0.4827125 0.5472996 0.5455375 0.6037736 0.8679245    0
## keras 0.3452381 0.5175239 0.5902428 0.5896957 0.6736448 0.8119658    0

## 
## Call:
## summary.diff.resamples(object = diff(out))
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##       LDA       MDA        nnet       keras     
## LDA             -0.0770441 -0.0457660 -0.0773079
## MDA   2.158e-10             0.0312780 -0.0002638
## nnet  0.06220   0.66767               -0.0315419
## keras 6.261e-05 1.00000    0.01497              
## 
## Kappa 
##       LDA       MDA       nnet      keras    
## LDA             -0.115658 -0.062952 -0.107110
## MDA   1.491e-11            0.052706  0.008548
## nnet  0.07485   0.34485             -0.044158
## keras 7.882e-05 1.00000   0.01768