Introduction

Many datasets in medicine and insurance pose difficulties in class prediction due to extreme imbalance in the classes. Chapter 16 of Applied Predictive Modeling discusses strategies for dealing with this problem. In the case of binary classification, one strategy is to use ROC Curve analysis to select a more useful cutoff than 0.5.

Poorly Calibrated Probabilities

We noted that the probabilities generated via the softmax or logistic function by classifiers such as Support Vector Machines (SVM) and Multilayer Perceptrons (MLP) have a Bayesian interpretation. But to be useful in practice, these probabilities must be well-calibrated. Fawcett (2006) reports an interesting example involving a Naive Bayes (NB) classifier. As we know, the NB classifier makes very strong independence assumptions, so it is perhaps not surprising that the probabilities it generates may not be realistic. In Fawcett's example, the NB classifier produced the posterior probabilities shown in the table below:

library(knitr)
library(kableExtra)
p <- c(0.99999, 0.99999, 0.99993, 0.99986, 0.99964, 0.99955, 0.68139, 0.50961, 0.48880, 0.44951)
Y <- factor(c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0), levels=c(1, 0), labels=c("Yes", "No"))
tb <- data.frame(Y=Y, prediction=p)  # a data frame (not a list) renders as a single table
kable(tb, caption="Fawcett - Not Well-Calibrated Probabilities") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Fawcett - Not Well-Calibrated Probabilities

Y     prediction
Yes   0.99999
Yes   0.99999
Yes   0.99993
Yes   0.99986
Yes   0.99964
Yes   0.99955
No    0.68139
No    0.50961
No    0.48880
No    0.44951

The ROC curve is shown below. We see that AUC=1: every Yes is scored above every No, so in terms of ranking this is a perfect classifier!

library(pROC)
ROCFawcett <- roc(Y, p)
plot(ROCFawcett)

But predicting with the default cutoff of 0.5 produces the confusion matrix:

YH <- factor(ifelse(p>0.5, "Yes", "No"), levels=c("Yes", "No"))
table(Y, YH, dnn=list("Truth", "Predicted"))
##      Predicted
## Truth Yes No
##   Yes   6  0
##   No    2  2

So the misclassification rate is 2/10, which is not perfect! Here sensitivity = 100% but specificity = 50%.
However, simply raising the cutoff to 0.8 results in perfect classification.

YH <- ifelse(p>0.8, "Yes", "No")
table(Y, YH, dnn=list("Truth", "Predicted"))
##      Predicted
## Truth No Yes
##   Yes  0   6
##   No   4   0

Many ML classifiers, such as the Naive Bayes classifier in the example above, produce "posterior probabilities" that are useful for ranking and provide a measure of relative plausibility, but they are not well-calibrated probabilities!
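One remedy, listed among the techniques below, is to recalibrate the scores. A minimal sketch of Platt-style recalibration: regress the true class on the logit of the raw scores with a logistic regression, and use the fitted values as calibrated probabilities. On this tiny, perfectly separated example glm() warns and the refit degenerates, so the sketch only illustrates the mechanics; with real data the refit should use held-out scores.

# Platt-style recalibration (illustrative sketch only; assumed, not from
# the original note). With realistic data, fit on held-out scores.
calib <- glm(I(Y == "Yes") ~ qlogis(p), family = binomial)
pCal <- predict(calib, type = "response")  # recalibrated probabilities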

A formal method of using the ROC curve to choose an optimal cutoff will be discussed later in this note.

Training and Contest Test Data

The Caravan dataset was used in a data mining contest, the CoIL Challenge 2000. The contest details, results and many submissions are described here: (http://www.liacs.nl/~putten/library/cc2000/)

The dataset itself is described here: (http://www.liacs.nl/~putten/library/cc2000/data.html)

TICDATA2000.txt: Dataset to train and validate prediction models and build a description (5822 customer records). Each record consists of 86 attributes, containing sociodemographic data (attributes 1-43) and product ownership (attributes 44-86). The sociodemographic data is derived from zip codes. All customers living in areas with the same zip code have the same sociodemographic attributes.
Attribute 86, “CARAVAN:Number of mobile home policies”, is the target variable.

TICEVAL2000.txt: Dataset for predictions (4000 customer records). It has the same format as TICDATA2000.txt, except that the target is missing. Participants were supposed to predict which customers had purchased a policy. Accurate prediction is valuable because in practice customer characteristics can be used to identify which customers are likely to buy insurance.

TICTGTS2000.txt Targets for the evaluation set.

The Caravan dataset available in the ISLR package is the TICDATA2000 data used for the contest. There are \(n=5822\) observations and \(p=85\) inputs. The output variable Purchase is Yes or No according to whether or not the customer purchased insurance. We randomly partition this data into two parts, using 70% for training and 30% for testing.
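The splitting code is not shown in the note; a plausible sketch (assumed, including the seed and the reordering of the factor levels to Yes/No) that produces training and test sets of the sizes reported below:

library(ISLR)   # Caravan data
library(caret)  # createDataPartition() gives a stratified split
Caravan$Purchase <- factor(Caravan$Purchase, levels = c("Yes", "No"))
set.seed(1441414113)  # assumed seed
iTr <- createDataPartition(Caravan$Purchase, p = 0.7, list = FALSE)
XyTr <- Caravan[iTr, ]
XyTe <- Caravan[-iTr, ]
yTr <- XyTr$Purchase
yTe <- XyTe$Purchase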

The challenges with the dataset include:

  • with 85 predictors, it is relatively high-dimensional
  • the dependence between the inputs and the output is weak
  • extreme class imbalance: only about 6% of the customers are Yes

If only 6% are Yes, then simply predicting No for all new customers has an expected accuracy of about 94%.
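We can check this base rate directly on the test set:

mean(yTe == "No")  # accuracy of the trivial all-No rule, about 0.94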

Summary of Techniques for Dealing with Extreme Imbalance

  • recalibrate the probabilities
  • use the ROC curve (e.g., pROC's 'closest.topleft' rule) to choose a better cutoff
  • select the top-ranked 10 or 20 customers, etc.
  • adjust the priors
  • adjust the case weights
  • adjust the tuning parameters
  • stratified sampling with caret::upSample() or downSample() (see the sketch after this list)
  • instead of Accuracy or AUC, cost-sensitive training with rpart() or C5.0() or another suitable algorithm
  • feature selection may be helpful
  • LASSO regression does feature selection and regularization
  • simple parsimonious methods such as logistic regression or LDA may do better than more complex methods such as RF
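As a concrete example of the stratified-sampling item above, here is a minimal sketch (assumed, not part of the original analysis) using caret::downSample() on the training set:

# Down-sample the majority class so the two classes are balanced;
# upSample() works analogously in the other direction.
set.seed(1)
XyDown <- downSample(x = XyTr[, setdiff(names(XyTr), "Purchase")],
                     y = XyTr$Purchase, yname = "Purchase")
table(XyDown$Purchase)  # equal counts of Yes and No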

Logistic Regression and ROC Curve Analysis

We use 7 replications of 10-fold CV; this takes less than 30 seconds using 16 cores.
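The summary function ROCACC used below is not a caret built-in, and its definition is not shown in this note. A plausible definition (an assumption on our part) that reproduces the ROC/Sens/Spec/Accuracy/Kappa columns of the resampling output simply combines caret's twoClassSummary() and defaultSummary():

library(caret)
# Assumed definition of ROCACC: twoClassSummary() supplies ROC, Sens and
# Spec; defaultSummary() supplies Accuracy and Kappa.
ROCACC <- function(data, lev = NULL, model = NULL) {
  c(twoClassSummary(data, lev, model), defaultSummary(data, lev, model))
}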

library(doParallel)  # attaches parallel (for makeCluster) and foreach
set.seed(1441414113)
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
startTime <- proc.time()[3] #elapsed time
#
ctrl <- trainControl(method="repeatedcv", number=10, repeats=7, classProbs = TRUE, summaryFunction = ROCACC)
ansLor <- train(Purchase ~ ., data = XyTr,
               method = "glm", family=binomial,
               trControl = ctrl,
               metric = "ROC")
totTimeLor <- proc.time()[3] -  startTime #about 23 sec w 16 cores
stopCluster(cl)
ansLor
## Generalized Linear Model 
## 
## 4076 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 7 times) 
## Summary of sample sizes: 3669, 3668, 3668, 3668, 3667, 3669, ... 
## Resampling results:
## 
##   ROC       Sens        Spec      Accuracy   Kappa     
##   0.721223  0.01752381  0.995266  0.9367393  0.02177975

From the summary we see that the specificity and accuracy are very high, but the sensitivity, the most important criterion in this application, is extremely low. Our main goal is to predict which customers are likely to buy insurance, and the low sensitivity tells us that, at the default 0.5 cutoff, the predictions have little power to identify them.

The ROC curve does indicate that the model is capable of better-than-random predictions. At the default cutoff, however, it mostly just tells us that very few customers purchase insurance!

yHTr <- predict(ansLor, XyTr, type = "prob")[,1]  # column 1 is P(Yes) since levels are c("Yes", "No")
yHTe <- predict(ansLor, XyTe, type = "prob")[,1]
LorROC <- roc(yTe, yHTe)
LorROC
## 
## Call:
## roc.default(response = yTe, predictor = yHTe)
## 
## Data: yHTe in 104 controls (yTe Yes) > 1642 cases (yTe No).
## Area under the curve: 0.7222
plot(LorROC, main="Performance of Logistic Regression on Test Data")

CM <- confusionMatrix(predict(ansLor, XyTe), yTe)
CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes    1    4
##        No   103 1638
##                                           
##                Accuracy : 0.9387          
##                  95% CI : (0.9264, 0.9495)
##     No Information Rate : 0.9404          
##     P-Value [Acc > NIR] : 0.6431          
##                                           
##                   Kappa : 0.013           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0096154       
##             Specificity : 0.9975639       
##          Pos Pred Value : 0.2000000       
##          Neg Pred Value : 0.9408386       
##              Prevalence : 0.0595647       
##          Detection Rate : 0.0005727       
##    Detection Prevalence : 0.0028637       
##       Balanced Accuracy : 0.5035897       
##                                           
##        'Positive' Class : Yes             
## 

The 'closest.topleft' rule selects the cutoff whose point on the ROC curve is closest to the top-left corner, i.e., the one minimizing \((1-\text{sensitivity})^2 + (1-\text{specificity})^2\), balancing good performance on both criteria. With this cutoff we correctly identify about 63% of the customers who buy insurance (sensitivity), while specificity drops to about 71%.

lorThresh <- coords(LorROC, x = "best", ret="threshold", best.method="closest.topleft")
plot(LorROC, print.thres = lorThresh, type = "S",
     print.thres.pattern = "%.3f (Spec = %.2f, Sens = %.2f)",
     print.thres.cex = .8, legacy.axes = TRUE)

The confusion matrix corresponding to the 'closest.topleft' cutoff is shown below.

yHTeBest <- factor(ifelse(yHTe>lorThresh, "Yes", "No"), levels=c("Yes", "No"))
CMBest <- confusionMatrix(yHTeBest, yTe)
CMBest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes   66  480
##        No    38 1162
##                                           
##                Accuracy : 0.7033          
##                  95% CI : (0.6813, 0.7247)
##     No Information Rate : 0.9404          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1145          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.63462         
##             Specificity : 0.70767         
##          Pos Pred Value : 0.12088         
##          Neg Pred Value : 0.96833         
##              Prevalence : 0.05956         
##          Detection Rate : 0.03780         
##    Detection Prevalence : 0.31271         
##       Balanced Accuracy : 0.67114         
##                                           
##        'Positive' Class : Yes             
## 

Random Forest

We investigate the use of Random Forest for prediction with the Caravan data, using 10-fold cross-validation with 7 replications to select the tuning parameter mtry. The best ROC was achieved with mtry=15, but sensitivity improved steadily up to mtry=85, i.e., using all the predictors. Since sensitivity is the most important criterion in this application, we could also tune for it directly (see the sketch after the resampling results below).

set.seed(1441414113)
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
startTime <- proc.time()[3] #elapsed time
#
ctrl <- trainControl(method="repeatedcv", number=10, repeats=7, classProbs = TRUE, summaryFunction = ROCACC)
ansRF <- train(Purchase ~ ., data = XyTr,
               method = "rf",
               trControl = ctrl,
               ntree = 1000,
               tuneLength = 7,
               metric = "ROC")
totTimeRF <- proc.time()[3] -  startTime #about 16 min w 16 cores
stopCluster(cl)
ansRF
## Random Forest 
## 
## 4076 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 7 times) 
## Summary of sample sizes: 3669, 3668, 3668, 3668, 3667, 3669, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens        Spec       Accuracy   Kappa     
##    2    0.6858877  0.00000000  1.0000000  0.9401388  0.00000000
##   15    0.7003884  0.05197619  0.9873995  0.9314095  0.06003764
##   29    0.6990951  0.06135714  0.9836344  0.9284309  0.06561795
##   43    0.6956308  0.06426190  0.9819566  0.9270289  0.06567929
##   57    0.6938865  0.06952381  0.9803532  0.9258362  0.06968623
##   71    0.6932205  0.06954762  0.9788993  0.9244695  0.06661373
##   85    0.6929595  0.07188095  0.9781908  0.9239436  0.06856909
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 15.
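Since ROCACC also returns a Sens column, one could ask caret to select mtry by sensitivity instead of ROC. A sketch (assumed, not run in the original note):

# Tune mtry for sensitivity directly; mtry cannot exceed the 85 predictors.
ctrlS <- trainControl(method = "repeatedcv", number = 10, repeats = 7,
                      classProbs = TRUE, summaryFunction = ROCACC)
ansRFSens <- train(Purchase ~ ., data = XyTr, method = "rf",
                   trControl = ctrlS, ntree = 1000,
                   tuneGrid = data.frame(mtry = c(15, 43, 71, 85)),
                   metric = "Sens")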

The test-set ROC performance of RF (AUC = 0.726) is very similar to that of logistic regression (AUC = 0.722).

yHTr <- predict(ansRF, XyTr, type = "prob")[,1]
yHTe <- predict(ansRF, XyTe, type = "prob")[,1]
rfROC <- roc(yTe, yHTe)
rfROC
## 
## Call:
## roc.default(response = yTe, predictor = yHTe)
## 
## Data: yHTe in 104 controls (yTe Yes) > 1642 cases (yTe No).
## Area under the curve: 0.7256
plot(rfROC, main="Performance of RF on Test Dataset")

The sensitivity is slightly better but still very unsatisfactory.

CM <- confusionMatrix(predict(ansRF, XyTe), yTe)
CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes    3   13
##        No   101 1629
##                                           
##                Accuracy : 0.9347          
##                  95% CI : (0.9221, 0.9458)
##     No Information Rate : 0.9404          
##     P-Value [Acc > NIR] : 0.8555          
##                                           
##                   Kappa : 0.0347          
##  Mcnemar's Test P-Value : 3.691e-16       
##                                           
##             Sensitivity : 0.028846        
##             Specificity : 0.992083        
##          Pos Pred Value : 0.187500        
##          Neg Pred Value : 0.941618        
##              Prevalence : 0.059565        
##          Detection Rate : 0.001718        
##    Detection Prevalence : 0.009164        
##       Balanced Accuracy : 0.510464        
##                                           
##        'Positive' Class : Yes             
## 

Using the 'closest.topleft' cutoff we achieve much better sensitivity (66%), even slightly better than logistic regression (63%).

rfThresh <- coords(rfROC, x = "best", ret="threshold",
                   best.method="closest.topleft")
plot(rfROC, print.thres = rfThresh, type = "S",
     print.thres.pattern = "%.3f (Spec = %.2f, Sens = %.2f)",
     print.thres.cex = .8, legacy.axes = TRUE)

yHTeBest <- factor(ifelse(yHTe>rfThresh, "Yes", "No"), levels=c("Yes", "No"))
CMBest <- confusionMatrix(yHTeBest, yTe)
CMBest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes   69  444
##        No    35 1198
##                                           
##                Accuracy : 0.7257          
##                  95% CI : (0.7041, 0.7465)
##     No Information Rate : 0.9404          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1383          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.66346         
##             Specificity : 0.72960         
##          Pos Pred Value : 0.13450         
##          Neg Pred Value : 0.97161         
##              Prevalence : 0.05956         
##          Detection Rate : 0.03952         
##    Detection Prevalence : 0.29381         
##       Balanced Accuracy : 0.69653         
##                                           
##        'Positive' Class : Yes             
## 

Regularized Random Forest

We briefly explored a global regularization method but it doesn’t seem to help.

method = "RRFglobal"

Tuning parameters:

  • mtry (#Randomly Selected Predictors)
  • coefReg (Regularization Value)

Required packages: RRF

We briefly explore this method using caret's default of 25 bootstrap resamples and the default settings for the tuning parameters.

set.seed(1441414113)
ncores <- parallel::detectCores()
cl <- makeCluster(ncores)
registerDoParallel(cl)
startTime <- proc.time()[3] #elapsed time
#
ctrl <- trainControl(classProbs = TRUE, summaryFunction = ROCACC) # vanilla: default bootstrap resampling (25 reps)
ansRRF <- train(Purchase ~ ., data = XyTr,
               method = "RRFglobal",
               trControl = ctrl,
               ntree = 1000,
               #tuneGrid =  expand.grid(mtry=seq(3, 123, 40), coefReg=c(0.01, 0.1, 0.3, 1)),
               metric = "ROC")
totTimeRRF <- proc.time()[3] -  startTime #about 15 min w 16 cores
stopCluster(cl)
ansRRF
## Regularized Random Forest 
## 
## 4076 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 4076, 4076, 4076, 4076, 4076, 4076, ... 
## Resampling results across tuning parameters:
## 
##   mtry  coefReg  ROC        Sens        Spec       Accuracy   Kappa     
##    2    0.010    0.6882491  0.08175341  0.9775146  0.9244964  0.08041220
##    2    0.505    0.6871190  0.07963653  0.9778192  0.9246492  0.07789480
##    2    1.000    0.6823090  0.07211541  0.9762936  0.9227879  0.06463349
##   43    0.010    0.6944883  0.08507596  0.9772065  0.9243934  0.08361815
##   43    0.505    0.6857537  0.07912707  0.9763573  0.9232467  0.07424340
##   43    1.000    0.6833671  0.07033226  0.9762976  0.9226826  0.06242819
##   85    0.010    0.6895868  0.09148491  0.9758699  0.9235136  0.08883031
##   85    0.505    0.6907761  0.08761700  0.9752292  0.9226651  0.08214386
##   85    1.000    0.6838077  0.07129142  0.9763493  0.9227860  0.06386061
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 43 and coefReg = 0.01.
yHTr <- predict(ansRRF, XyTr, type = "prob")[,1]
yHTe <- predict(ansRRF, XyTe, type = "prob")[,1]
rfROC <- roc(yTe, yHTe)
rfROC
## 
## Call:
## roc.default(response = yTe, predictor = yHTe)
## 
## Data: yHTe in 104 controls (yTe Yes) > 1642 cases (yTe No).
## Area under the curve: 0.6858
plot(rfROC, main="Performance of `RRF` on Test Dataset")

CM <- confusionMatrix(predict(ansRRF, XyTe), yTe)
CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes    4   24
##        No   100 1618
##                                           
##                Accuracy : 0.929           
##                  95% CI : (0.9159, 0.9406)
##     No Information Rate : 0.9404          
##     P-Value [Acc > NIR] : 0.9787          
##                                           
##                   Kappa : 0.0363          
##  Mcnemar's Test P-Value : 1.637e-11       
##                                           
##             Sensitivity : 0.038462        
##             Specificity : 0.985384        
##          Pos Pred Value : 0.142857        
##          Neg Pred Value : 0.941793        
##              Prevalence : 0.059565        
##          Detection Rate : 0.002291        
##    Detection Prevalence : 0.016037        
##       Balanced Accuracy : 0.511923        
##                                           
##        'Positive' Class : Yes             
## 
rfThresh <- coords(rfROC, x = "best", ret="threshold",
                   best.method="closest.topleft")
plot(rfROC, print.thres = rfThresh, type = "S",
     print.thres.pattern = "%.3f (Spec = %.2f, Sens = %.2f)",
     print.thres.cex = .8, legacy.axes = TRUE)

yHTeBest <- factor(ifelse(yHTe>rfThresh, "Yes", "No"), levels=c("Yes", "No"))
CMBest <- confusionMatrix(yHTeBest, yTe)
CMBest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes   72  574
##        No    32 1068
##                                           
##                Accuracy : 0.6529          
##                  95% CI : (0.6301, 0.6753)
##     No Information Rate : 0.9404          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0996          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.69231         
##             Specificity : 0.65043         
##          Pos Pred Value : 0.11146         
##          Neg Pred Value : 0.97091         
##              Prevalence : 0.05956         
##          Detection Rate : 0.04124         
##    Detection Prevalence : 0.36999         
##       Balanced Accuracy : 0.67137         
##                                           
##        'Positive' Class : Yes             
## 

LASSO Logistic Regression

Feature selection may be helpful in understanding this data, especially for the second part of the contest, which asked for a description of the likely buyers. We show how to use LASSO logistic regression to select the most important predictors.

library(glmnet)  # LASSO-penalized logistic regression
startTime <- proc.time()[3] #elapsed time
xtr <- model.matrix(Purchase ~ ., data=XyTr)  # keeps an intercept column; glmnet fits its own intercept anyway
ansL1 <- glmnet(xtr, yTr, alpha=1, family="binomial")
set.seed(77711153)
ans_cv <- cv.glmnet(xtr, y=yTr, family="binomial", alpha=1)
plot(ans_cv)
lambdaHat <- ans_cv$lambda.1se
abline(v=log(lambdaHat), lwd=2.5, col=scales::alpha("magenta", 0.4))

xte <- model.matrix(Purchase ~ ., data=XyTe)
yHat <- factor(predict(ansL1, newx=xte, type="class", s=lambdaHat)[,1], levels=c("Yes", "No"))
totTimeL1 <- proc.time()[3] -  startTime #about 41 sec w 16 cores
# Note: confusionMatrix() expects (data = predictions, reference = truth)
confusionMatrix(yHat, yTe)$table
##           Reference
## Prediction  Yes   No
##        Yes    0    0
##        No   104 1642
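At lambda.1se the fitted model predicts No for every test customer: accuracy equals the base rate of about 94%, sensitivity is 0, and specificity is 1. As with the other models, the predicted probabilities are more useful for ranking than the default class labels; a sketch (assumed, not part of the original note):

# Rank test customers by the LASSO probabilities; with levels c("Yes", "No"),
# type = "response" returns P(No), and roc() orients the curve automatically.
pHTe <- predict(ans_cv, newx = xte, type = "response", s = lambdaHat)[, 1]
lassoROC <- roc(yTe, pHTe)
auc(lassoROC)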
library(dplyr)  # for tibble() and filter()
b <- as.matrix(coef(ans_cv))  # coefficients at the default s = "lambda.1se"
# With levels c("Yes", "No"), glmnet models the log-odds of the second
# level, "No"; negative coefficients mark predictors associated with buying.
tibble(predictor=rownames(b), bHat=c(b)) %>%
  filter(bHat != 0)
## # A tibble: 9 x 2
##   predictor        bHat
##   <chr>           <dbl>
## 1 (Intercept)  3.13    
## 2 MOPLLAAG     0.0321  
## 3 MINKGEM     -0.000621
## 4 MKOOPKLA    -0.0167  
## 5 PWAPART     -0.0818  
## 6 PPERSAUT    -0.0812  
## 7 PBRAND      -0.0325  
## 8 APERSAUT    -0.0493  
## 9 APLEZIER    -1.18