We use the caret package to compare classification techniques on a dataset with 214 examples, 6 classes, and 9 inputs. The techniques compared are linear discriminant analysis (LDA), mixture discriminant analysis (MDA), single-hidden-layer neural networks (nnet and the bagged variant avNNet), and multilayer perceptrons fitted with RSNNS and with keras.
The caret package is constantly being updated with new methods. The complete source is available on GitHub (https://github.com/topepo/caret/tree/master/pkg/caret/R). It is also helpful to examine the R scripts provided for testing each method, available at https://github.com/topepo/caret/tree/master/RegressionTests/Code.
The forensic glass dataset is widely used to illustrate ML classification algorithms since it is a challenging classification problem with 6 classes and 9 input variables. There are \(n=214\) observations.
Variable | Description
---|---
RI | refractive index
Na | sodium
Mg | magnesium
Al | aluminium
Si | silicon
K | potassium
Ca | calcium
Ba | barium
Fe | iron
The output variable `Type` indicates which one of the following six types of glass the fragment belongs to:

Type | Glass
---|---
1 | building windows, float processed
2 | building windows, non-float processed
3 | vehicle windows, float processed
5 | containers
6 | tableware
7 | headlamps

(Type 4, vehicle windows, non-float processed, does not occur in this dataset.)
The distribution across classes is shown below:
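One way to obtain the data and reproduce the counts below is via the mlbench package (a sketch; the `glass` data frame used in this chapter may instead have been read from a file):

library(mlbench)
data(Glass)
glass <- Glass     #214 rows: 9 measurements plus Type
table(glass$Type)  #frequency count of each class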
## Frequency Count of Each Class
## 1 2 3 5 6 7
## 70 76 17 13 9 29
There is modest class imbalance, which we ignore in our analysis.
LDA has no tuning parameters, but we run cross-validation anyway to generate the resamples needed for comparing methods.
library(caret)
library(doParallel)  #attaches parallel (makeCluster) and foreach
library(dplyr)       #for mutate()
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
#no tuning but generates cv replicates for comparisons
set.seed(333477)
ctrl <- trainControl(method="repeatedcv", number=10, repeats=7)
glass2 <- mutate(glass, Type=factor(Type))
ans_lda <- train(Type ~ .,
method="lda",
data=glass2,
trControl=ctrl)
ans_lda
## Linear Discriminant Analysis
##
## 214 samples
## 9 predictor
## 6 classes: '1', '2', '3', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 7 times)
## Summary of sample sizes: 191, 193, 194, 192, 192, 193, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6309734 0.4825856
set.seed(333477)
ans_mda <- train(Type ~ .,
trControl=ctrl,
data=glass2,
method="mda")
ans_mda
## Mixture Discriminant Analysis
##
## 214 samples
## 9 predictor
## 6 classes: '1', '2', '3', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 7 times)
## Summary of sample sizes: 191, 193, 194, 192, 192, 193, ...
## Resampling results across tuning parameters:
##
## subclasses Accuracy Kappa
## 2 0.6515240 0.5161948
## 3 0.6797852 0.5561867
## 4 0.7080175 0.5982440
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was subclasses = 4.
There are a few implementations of the MLP architecture.
This method uses the underlying engine nnet::nnet() but is adapted to include bagging; see help(avNNet) for more details.
Tuning parameters: `size` (#Hidden Units), `decay` (Weight Decay), and `bag` (Bagging).
The options `trace=FALSE` and `skip=TRUE` are passed through to nnet::nnet().
set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
ans_avnnet0 <- train(x=X, y=y,
method="nnet",
trace=FALSE, skip=TRUE)
stopCluster(cl)
totTime_avnnet0 <- proc.time()[3] - startTime #<10 sec
ans_avnnet0
## Neural Network
##
## 214 samples
## 9 predictor
## 6 classes: '1', '2', '3', '5', '6', '7'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.6064198 0.4567436
## 1 1e-04 0.6169150 0.4710546
## 1 1e-01 0.6236058 0.4739848
## 3 0e+00 0.6103005 0.4654721
## 3 1e-04 0.6156357 0.4681011
## 3 1e-01 0.6398639 0.4976732
## 5 0e+00 0.5989586 0.4476061
## 5 1e-04 0.6234196 0.4807651
## 5 1e-01 0.6426619 0.5033888
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
Grid <- expand.grid(
size = seq(7, 25, 2),
decay = c(0.001, 0.01, 0.1, 0.2, 0.3), #in [0, 1),
bag = TRUE
)
#Remark: bag is not really a tuning parameter!
ans_avNNet <- train(x=X, y=y,
method="avNNet",
trace=FALSE, skip=TRUE, entropy=TRUE,
tuneGrid = Grid,
trControl = ctrl)
stopCluster(cl)
totTimeavNNet <- proc.time()[3] - startTime #about 4 min
indOpt <- rownames(ans_avNNet$bestTune)
ans_avNNet$results[indOpt,]
## size decay bag Accuracy Kappa AccuracySD KappaSD
## 42 23 0.01 TRUE 0.6767395 0.5455375 0.08401128 0.1191703
ans_avNNet$bestTune
## size decay bag
## 42 23 0.01 TRUE
The plot suggests the tuning parameters size = 23 and decay = 0.01 are adequate.
plot(ans_avNNet)
plot(varImp(ans_avNNet))
This method uses the R package RSNNS, which supports up to three hidden layers.
The tuning parameters are self-explanatory: `layer1`, `layer2`, `layer3` (the number of hidden units in each layer) and `decay` (weight decay).
set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
ans_RSNNS0 <- train(x=X, y=y, method='mlpWeightDecayML')
stopCluster(cl)
totTime_RSNNS0 <- proc.time()[3] - startTime #<10 sec
ans_RSNNS0
## Multi-Layer Perceptron, multiple layers
##
## 214 samples
## 9 predictor
## 6 classes: '1', '2', '3', '5', '6', '7'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ...
## Resampling results across tuning parameters:
##
## layer1 decay Accuracy Kappa
## 1 0e+00 0.3302787 0
## 1 1e-04 0.3161720 0
## 1 1e-01 0.3153336 0
## 3 0e+00 0.3192728 0
## 3 1e-04 0.3199692 0
## 3 1e-01 0.3184675 0
## 5 0e+00 0.3245901 0
## 5 1e-04 0.3211871 0
## 5 1e-01 0.3001275 0
##
## Tuning parameter 'layer2' was held constant at a value of 0
##
## Tuning parameter 'layer3' was held constant at a value of 0
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were layer1 = 1, layer2 = 0, layer3
## = 0 and decay = 0.
This method doesn't look promising: Kappa is 0 for every tuning setting, meaning the fitted models do no better than chance agreement.
We explore a wider range of tuning parameters, but for speed we keep the default resampling of 25 bootstrap replicates.
set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
Grid <- expand.grid(
layer1 = seq(1, 25, 6),
layer2 = seq(1, 25, 6),
layer3 = seq(1, 25, 6),
decay = c(0.01, 0.1, 0.2, 0.3) #in [0, 1)
)
ans_RSNNS <- train(x=X, y=y, method='mlpWeightDecayML', tuneGrid = Grid)
stopCluster(cl)
totTime_RSNNS <- proc.time()[3] - startTime #about 5 minutes
indOpt <- rownames(ans_RSNNS$bestTune)
ans_RSNNS$results[indOpt,]
## layer1 layer2 layer3 decay Accuracy Kappa AccuracySD KappaSD
## 203 13 1 1 0.2 0.3500116 0 0.04893104 0
ans_RSNNS$bestTune
## layer1 layer2 layer3 decay
## 203 13 1 1 0.2
plot(ans_RSNNS)
For a tutorial on keras in R, see https://keras.rstudio.com/.
Tuning parameters: `size` (#Hidden Units), `lambda` (L2 Regularization), `batch_size` (Batch Size), `lr` (Learning Rate), `rho` (Rho), `decay` (Learning Rate Decay), and `activation` (Activation Function).
set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
ans_keras0 <- train(x=X, y=y, method='mlpKerasDecay')
stopCluster(cl)
totTime_keras0 <- proc.time()[3] - startTime #45 sec
ans_keras0
Use `caret::train(..., preProc = c("center", "scale"))` for honest cross-validation!
set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
ans_keras0 <- train(x=X, y=y, method='mlpKerasDecay', preProc = c("center", "scale"))
stopCluster(cl)
totTime_keras0 <- proc.time()[3] - startTime #45 sec
ans_keras0
The out-of-the-box performance is very poor, but it can be improved with tuning.
We see the default activation function is "relu".
Note: it is essential to standardize the inputs, since keras/tensorflow does not do this for you!
Many statistical algorithms such as lm() and lda() don't need rescaling, since the answers are mathematically equivalent with or without it. In other cases, rescaling makes sense to avoid giving undue importance to the measurement scale, as in principal component regression and penalized regression; there, rescaling is the default behaviour of the fitting functions, so we again don't need train(..., preProc = c("center", "scale")).
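As a minimal illustration of this invariance (a sketch on made-up data; the variable names are hypothetical), centering and scaling the inputs changes the lm() coefficients but not the fitted values:

set.seed(1)
d <- data.frame(x = rnorm(30), z = rnorm(30))
d$y <- 1 + 2*d$x - d$z + rnorm(30, sd=0.1)
f1 <- lm(y ~ x + z, data=d)                        #raw inputs
d2 <- data.frame(scale(d[, c("x","z")]), y = d$y)  #centered and scaled inputs
f2 <- lm(y ~ x + z, data=d2)
all.equal(unname(fitted(f1)), unname(fitted(f2)))  #TRUE: identical fits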
set.seed(11133387) #require same seeds!
startTime <- proc.time()[3] #elapsed time
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
X <- glass[,1:9]
y <- factor(glass[,10])
Grid <- expand.grid(size=seq(3000,8000,1000),
lambda=c(0.0001),
batch_size=c(214),
lr=c(0.001, .005),
rho=0.9,
decay=c(0.001, 0.01, 0.1),
activation="relu"
)
ans_keras <- train(x=X, y=y, method='mlpKerasDecay', verbose=1, epochs = 20,
tuneGrid=Grid, trControl = ctrl,
preProc = c("center", "scale"))
stopCluster(cl)
totTime_keras <- proc.time()[3] - startTime #few minutes
indOpt <- rownames(ans_keras$bestTune)
ans_keras$results[indOpt,]
## size lambda batch_size lr rho decay activation Accuracy Kappa
## 26 7000 1e-04 214 0.001 0.9 0.01 relu 0.7082813 0.5896957
## AccuracySD KappaSD
## 26 0.08089785 0.1143908
ans_keras$bestTune
## size lambda batch_size lr rho decay activation
## 26 7000 1e-04 214 0.001 0.9 0.01 relu
With honest cross-validation (centering and scaling redone inside each resample), the estimated Accuracy is somewhat lower.
plot(ans_keras)
plot(varImp(ans_keras))
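Finally, the tuned models are compared on their resampling distributions. A sketch of how output like the following can be produced with caret::resamples() (the object name `out` matches the Call shown below; the exact model list is inferred from the row labels):

out <- resamples(list(LDA=ans_lda, MDA=ans_mda,
                      nnet=ans_avNNet, keras=ans_keras))
summary(out)
summary(diff(out))  #pairwise differences, Bonferroni-adjusted p-values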
##
## Call:
## summary.resamples(object = out)
##
## Models: LDA, MDA, nnet, keras
## Number of resamples: 70
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.4285714 0.5714286 0.6363636 0.6309734 0.6956522 0.8500000 0
## MDA 0.4761905 0.6221805 0.7142857 0.7080175 0.7826087 0.9500000 0
## nnet 0.4545455 0.6363636 0.6818182 0.6767395 0.7142857 0.9047619 0
## keras 0.5454545 0.6666667 0.7142857 0.7082813 0.7619048 0.8636364 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.1897106 0.3951954 0.4926213 0.4825856 0.5723198 0.7849462 0
## MDA 0.2712934 0.4778685 0.6098922 0.5982440 0.7062565 0.9293286 0
## nnet 0.2625698 0.4827125 0.5472996 0.5455375 0.6037736 0.8679245 0
## keras 0.3452381 0.5175239 0.5902428 0.5896957 0.6736448 0.8119658 0
##
## Call:
## summary.diff.resamples(object = diff(out))
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## Accuracy
## LDA MDA nnet keras
## LDA -0.0770441 -0.0457660 -0.0773079
## MDA 2.158e-10 0.0312780 -0.0002638
## nnet 0.06220 0.66767 -0.0315419
## keras 6.261e-05 1.00000 0.01497
##
## Kappa
## LDA MDA nnet keras
## LDA -0.115658 -0.062952 -0.107110
## MDA 1.491e-11 0.052706 0.008548
## nnet 0.07485 0.34485 -0.044158
## keras 7.882e-05 1.00000 0.01768
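The resampling distributions can also be compared graphically with caret's lattice methods (a sketch, assuming the `out` object above):

bwplot(out, metric="Accuracy")  #box-and-whisker plot of Accuracy per model
dotplot(diff(out))              #intervals for the pairwise differences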