Using Caret for Predictive Modeling in R: A Guide


Caret: A Cornucopia of Functions For Doing Predictive Analytics In R

Dr. Sebastian Petry

Published on 6.7.2017

Updated on 8.5.2025

Data Science & AI


R is one of the most popular open source programming languages for predictive analytics. One of its upsides is the abundance of modeling choices provided by more than 10,000 user-created packages on the Comprehensive R Archive Network (CRAN). On the downside, package-specific syntax choices (a much bigger problem in R than in, e.g., Python) impede the adoption of new models. The caret package attempts to streamline the process of creating predictive models by providing a uniform interface to various training and prediction functions. Caret’s data preparation, feature selection and model tuning functionalities facilitate the process of building and evaluating predictive models. This blog post focuses on model tuning and selection and shows how to tackle common model building challenges with caret.

Between And Within-Model Selection

There is no gold standard: No model beats every other model in every situation. As data scientists, we have to justify our modelling choices. We select between different models by training them and comparing their performances. The same reasoning applies to parameter choices that are not determined by the optimization process: We select within all models of a class by comparing the performance of models fitted with different parameter values.

With caret’s train function we can quickly estimate several different models. As an illustration, we create a predictive model to predict the defaults of credit card holders, based on personal information and their previous payment history.[1]

To select a model we create a small training set, estimate different models for several resamples, compare their performance and deploy the best one.

Data Preprocessing

Caret provides simple functions for creating balanced splits of the data, preserving the overall distribution of the dependent variable. Stratified random sampling is advisable if one class has a disproportionately small frequency compared to others.

library(caret)

set.seed(11)

# reading data and preparation
customer_data <- read.csv("credit_default_prepared.csv", sep = ";", stringsAsFactors = TRUE)

# data partition
rows_train <- createDataPartition(y = customer_data$DEFAULT, 
                                  p = 0.3, list = FALSE)
training <- customer_data[rows_train,  ]
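To make the idea of a stratified split concrete, here is a minimal base R sketch that samples 30% of the rows separately within each class. It runs on simulated toy data (not the credit data set), the helper name stratified_rows is hypothetical, and caret’s createDataPartition() may differ in its internal details.

```r
# Base R sketch of a stratified split on simulated toy data.
set.seed(11)
toy <- data.frame(
  x = rnorm(1000),
  DEFAULT = factor(sample(c("no", "yes"), 1000, replace = TRUE,
                          prob = c(0.8, 0.2)))
)

# Hypothetical helper: sample a share p of the row indices within each class.
stratified_rows <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

rows_toy <- stratified_rows(toy$DEFAULT, p = 0.3)
train_toy <- toy[rows_toy, ]

# Class shares in the full data and in the training subset should be close.
share_full <- prop.table(table(toy$DEFAULT))
share_train <- prop.table(table(train_toy$DEFAULT))
print(round(rbind(full = share_full, train = share_train), 3))
```

Because sampling happens per class, the rare class keeps roughly the same share in the training subset as in the full data.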


A simple but powerful model for binary variables is logistic regression, which we will assess against random forests and boosting. Hence, we will do a between-model selection in which we pit logistic regression against two different ensemble models. Random forests estimate and aggregate several classification trees on bootstrapped subsamples while decorrelating individual trees by means of sampling the set of included predictors. The original boosting algorithm aggregates classification trees too, but estimates them in a sequential fashion:

It starts with equal weights for each observation. In every iteration, the weights of misclassified observations increase and the ones of correctly classified observations decrease. In the next iteration, the updated weights are utilized to concentrate on the correct prediction of previously misclassified observations. The final prediction is obtained by weighting each tree’s prediction based on model accuracy.
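The reweighting step described above can be sketched in a few lines of base R. The labels and predictions are made up, and the formula for the tree’s vote (alpha) follows the common AdaBoost convention, which is an assumption here since the text does not spell it out. Note also that the gbm method used later implements gradient boosting, which works differently in detail.

```r
# One AdaBoost-style reweighting step on toy labels (values are made up).
y_true <- c(1, 1, -1, -1, 1, -1, 1, -1, 1, 1)
y_pred <- c(1, 1, -1, -1, -1, -1, 1, 1, 1, 1)  # observations 5 and 8 are wrong

n <- length(y_true)
w <- rep(1 / n, n)                    # start with equal weights
miss <- y_pred != y_true

err <- sum(w[miss])                   # weighted error of this tree
alpha <- 0.5 * log((1 - err) / err)   # vote of this tree in the final ensemble

# Increase the weights of misclassified observations, decrease the others.
w_new <- w * exp(ifelse(miss, alpha, -alpha))
w_new <- w_new / sum(w_new)           # renormalise so the weights sum to one

print(round(w_new, 4))
```

After the update, the two misclassified observations carry half of the total weight, so the next tree is pushed to get them right.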

Both ensemble methods include tuning parameters we have to decide upon (within-model selection). For the random forest model, we will only consider the number of predictor variables sampled as candidates at each split (mtry) as a tuning parameter and search through parameter values 4 to 14. Table 1 depicts the tuning parameters for boosting:

[Table 1: Tuning parameters for boosting]

Expanding The Search Grid

By default, train uses a minimal search grid of three values for each tuning parameter. With base R’s expand.grid() function we create a Cartesian product of eligible parameter values and pass it to the train function:

# parameter grid random forest
grid_rf <- expand.grid(mtry = 4:14)

# parameter grid boosting
grid_bo <- expand.grid(interaction.depth = 1:3,
                    n.trees = seq(80, 100, by = 20),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = 10)
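As a quick check of what the Cartesian product contains, the following sketch rebuilds both grids and counts their rows. It only uses expand.grid() from base R, so it runs without caret.

```r
# Rebuild the two grids and inspect their size.
grid_rf <- expand.grid(mtry = 4:14)
grid_bo <- expand.grid(interaction.depth = 1:3,
                       n.trees = seq(80, 100, by = 20),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)

nrow(grid_rf)     # 11 candidate values for mtry
nrow(grid_bo)     # 3 * 2 * 2 * 1 = 12 parameter combinations
head(grid_bo, 4)  # first few combinations
```

Every row of a grid is one candidate model, so the grid size directly drives the computation time of the tuning step.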

Resampling Techniques And Performance Measures

One approach for model tuning is to fit models with different tuning parameter values to many resampled versions of the training set, estimate their performance and determine the final parameters based on a performance metric.
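The loop that caret runs internally can be imitated by hand. The sketch below is a simplified stand-in under stated assumptions: it uses simulated data, treats the polynomial degree of a logistic regression as a hypothetical tuning parameter, and scores by accuracy rather than ROC.

```r
# Manual 5-fold cross-validation over a small tuning grid (toy data).
set.seed(11)
n <- 300
toy <- data.frame(x = runif(n, -2, 2))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x^2 - 1.5))

k <- 5
fold_id <- sample(rep(1:k, length.out = n))  # random fold assignment
degrees <- 1:3                               # hypothetical tuning grid

cv_accuracy <- sapply(degrees, function(d) {
  fold_acc <- sapply(1:k, function(f) {
    # fit on all folds except f, evaluate on fold f
    fit <- glm(y ~ poly(x, d), data = toy[fold_id != f, ], family = binomial)
    prob <- predict(fit, newdata = toy[fold_id == f, ], type = "response")
    mean((prob > 0.5) == (toy$y[fold_id == f] == 1))
  })
  mean(fold_acc)                             # average over the k folds
})
names(cv_accuracy) <- degrees

best_degree <- degrees[which.max(cv_accuracy)]
print(cv_accuracy)
print(best_degree)
```

The final parameter is simply the grid value with the best average performance across the held-out folds.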

Choices for the resampling technique and performance measure can be set with the trainControl() function. For this illustration, we do a 5-fold cross-validation and measure model performance by the area under the ROC curve (the default setting in caret is the bootstrap and overall accuracy):

ctrl <- trainControl(
  method = "cv",
  number = 5,
  summaryFunction = twoClassSummary,
  classProbs = TRUE
)
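For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal rank-based sketch in base R follows; the function name auc_rank and the toy scores are illustrative, and this is not how twoClassSummary computes the value internally.

```r
# Rank-based area under the ROC curve (Mann-Whitney formulation).
auc_rank <- function(score, is_positive) {
  r <- rank(score)                 # average ranks handle tied scores
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.10, 0.40, 0.35, 0.80)          # predicted default probabilities
is_default <- c(FALSE, FALSE, TRUE, TRUE)   # observed outcomes

auc_rank(score, is_default)  # 0.75: three of four positive/negative pairs are ordered correctly
```

Unlike overall accuracy, this measure does not depend on a classification threshold, which makes it a natural choice when one class is rare.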

Model Training

The code below shows how to train several different models with caret, substituting only the specific method and the previously specified parameter grid. Caret selects the optimal parameter choice and provides a summary with “print(object_name)”, where “object_name” is the name of the object returned by the train function:

# random forest
set.seed(11)
fit_rf <- train(DEFAULT ~ .,
                data = training,
                method = "rf",
                ntree = 5,  # passed on to randomForest(); the argument is ntree, not ntrees
                metric = "ROC",
                trControl = ctrl,
                tuneGrid = grid_rf)

# boosting
set.seed(11)
fit_bo <- train(DEFAULT ~ .,
                data = training,
                method = "gbm",
                metric = "ROC",
                verbose = FALSE,
                trControl = ctrl,
                tuneGrid = grid_bo)

# performance plot random forest
plot(fit_rf, main = "Random forest ROC for several mtry values")

Plotting the train object displays a performance profile for the tuning parameters. As an example, figure 1 shows the area under the ROC curve of the random forest as a function of the number of randomly selected predictors. Choosing six predictors randomly at every split yields the model with the highest ROC value.

[Figure 1: Random forest ROC for several mtry values]

Building a good logistic regression model requires great care (e.g. with respect to feature and variable selection). Here we take an arbitrary shortcut by selecting the union of the five most important variables from our boosting and random forest models as predictors. In doing so, we introduce caret’s helper function varImp(), which provides us with the ranked importance of the independent variables:

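# logistic regression: variable selection
# return the names of the n most important variables of a fitted model
top_n <- function(x, n = 5){
  imp <- varImp(x)$importance
  # order by the first importance column
  row.names(imp)[order(imp[, 1], decreasing = TRUE)][1:n]
}

# union of the top variables of both ensemble models
lg_par <- unique(as.vector(sapply(list(fit_bo, fit_rf), top_n)))

# design matrix restricted to the selected columns
mt <- model.matrix(~., data = training)[, lg_par]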

# logistic regression
set.seed(11)
fit_lr <- train(mt, training$DEFAULT,
                method = "glm",
                family = binomial,
                metric = "ROC",
                trControl = ctrl)
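The selection logic used above (take the top variables of each model, then form the union) can be tried out without any fitted models. The importance scores and variable names in this sketch are invented for illustration, and top_names is a hypothetical helper.

```r
# Union of the top-n variables from two made-up importance rankings.
imp_model_1 <- c(x1 = 100, x2 = 60, x3 = 20, x4 = 5)
imp_model_2 <- c(x3 = 100, x1 = 80, x5 = 30, x2 = 10)

# Hypothetical helper: names of the n largest importance scores.
top_names <- function(imp, n = 2) {
  names(sort(imp, decreasing = TRUE))[seq_len(n)]
}

selected <- unique(c(top_names(imp_model_1), top_names(imp_model_2)))
selected  # "x1" "x2" "x3"
```

Because the rankings overlap, the union is usually smaller than the sum of the two top-n lists.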

Model Comparison

We have just estimated over 120 models and can now visually compare the distribution of their performance. The resamples() function allows us to collect, analyse and visualise the resampling results from our data set. Figure 2 summarises the performance distribution of the optimal models over the different folds. Boosting and random forest outperform the ad-hoc logistic regression model. Both ensemble methods perform comparably, but the variance of the boosting algorithm appears to be smaller. For our case study, the average boosting model (ROC: 0.7807) slightly outperforms the average random forest model (ROC: 0.7802).

# Model comparison
comp <- resamples(list(Logistic = fit_lr, RandomForest = fit_rf, Boosting = fit_bo))
summary(comp)

# plot performance comparison
bwplot(comp, metric="ROC", main = "Model Comparison: RF vs Boosting vs LR")
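The figure of "over 120 models" mentioned above can be reproduced with simple arithmetic. The sketch assumes that every parameter combination evaluated on every fold counts as one model and that each train call adds one final refit on the full training set; the number of fits caret actually performs internally may differ.

```r
# Back-of-the-envelope count of the models evaluated in this post.
n_folds <- 5
n_rf <- 11   # mtry values 4 to 14
n_bo <- 12   # boosting grid: 3 * 2 * 2 * 1 combinations
n_lr <- 1    # logistic regression has no tuning grid

resample_fits <- n_folds * (n_rf + n_bo + n_lr)  # 120
final_fits <- 3                                  # one refit per train call
total_fits <- resample_fits + final_fits         # 123
total_fits
```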

A Cautionary Note On Utilising Caret

The strongest argument for using caret is that it streamlines the model building process, so you can focus on more important modelling decisions: What kind of models could work for your problem at hand? What is the right performance measure? What is the right choice for a parameter space? Caret simplifies model building and selection significantly, but you still need to think:

Not just choose some models and parameter search spaces to plug into caret’s functions, but the right ones.
Even more importantly, get the feature engineering right, and tailor it to each of the models you want to use.

[1] The data source is a preprocessed data set from Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480 (http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients). Contact XYC for the prepared version of the data set.
