Note: If you’re new to caret, I suggest learning tidymodels instead (http://www.rebeccabarter.com/blog/2020-03-25_machine_learning/). Tidymodels is essentially caret’s successor. Don’t worry though, your caret code will still work!
Older note: This tutorial was based on an older version of the abalone data that had a binary `old` variable rather than a numeric `age` variable. It has been modified lightly so that it uses a manual `old` variable (is the abalone older than 10 or not) and ignores the numeric `age` variable.

Materials prepared by Rebecca Barter. Package developed by Max Kuhn.
An interactive Jupyter Notebook version of this tutorial can be found at https://github.com/rlbarter/STAT-215A-Fall-2017/tree/master/week11. Feel free to download it and use it for your own learning or teaching adventures!
R has a large number of packages for machine learning (ML), which is great, but also quite frustrating since each package was designed independently and has very different syntax, inputs, and outputs.
This means that if you want to do machine learning in R, you have to learn a large number of separate methods.
Recognizing this, Max Kuhn (at the time working in drug discovery at Pfizer, now at RStudio) put together a single package for performing any machine learning method you like. This package is called `caret`. Caret stands for Classification And Regression Training. Apparently caret has little to do with our orange friend, the carrot.

Not only does caret allow you to run a plethora of ML methods, it also provides tools for auxiliary techniques such as:
- Data preparation (imputation, centering/scaling data, removing correlated predictors, reducing skewness)
- Data splitting
- Variable selection
- Model evaluation
An extensive vignette for caret can be found here: https://topepo.github.io/caret/index.html
A simple view of caret: the default `train` function
To implement your machine learning model of choice using caret you will use the `train` function. The types of modeling options available are many and are listed here: https://topepo.github.io/caret/available-models.html. In the example below, we will use the ranger implementation of random forest to predict whether abalone are “old” or not based on a bunch of physical properties of the abalone (sex, height, weight, diameter, etc.). The abalone data came from the UCI Machine Learning Repository (we split the data into a training and test set).

First we load the data into R:
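A minimal sketch of the loading step (the file name `abalone_train.csv` and the object name `abalone_train` are assumptions, since the original code chunk isn't shown here):

```r
# Load the abalone training data (file name is an assumption)
abalone_train <- read.csv("abalone_train.csv")
head(abalone_train)
```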
It looks like we have 3,759 abalone:
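A quick check of the number of rows (the 3,759 figure comes from the text above):

```r
# Confirm the number of abalone in the training set
nrow(abalone_train)
#> 3759
```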
Time to fit a random forest model using caret. Anytime we want to fit a model using `train` we tell it which model to fit by providing a formula for the first argument (`as.factor(old) ~ .` means that we want to model `old` as a function of all of the other variables). Then we need to provide a method (we specify `'ranger'` to implement random forest).
By default, the `train` function without any other arguments re-runs the model over 25 bootstrap samples and across 3 options of the tuning parameter (the tuning parameter for `ranger` is `mtry`, the number of randomly selected predictors at each split in the tree).
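A sketch of what this call might look like (the `abalone_train` object name carries over from the assumed loading step above):

```r
library(caret)

# Fit a random forest via the ranger method; with no other arguments,
# train() uses 25 bootstrap resamples and tries 3 values of mtry
set.seed(1)  # resampling is random, so set a seed for reproducibility
rf_fit <- train(as.factor(old) ~ .,
                data = abalone_train,
                method = "ranger")
rf_fit
```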
Testing the model on an independent test set is equally simple using the built-in `predict` function.
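For example (the test-set object name `abalone_test` is an assumption):

```r
# Predict the class of each test-set abalone and summarize accuracy
abalone_pred <- predict(rf_fit, newdata = abalone_test)
confusionMatrix(abalone_pred, as.factor(abalone_test$old))
```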
We have now seen how to fit a model along with the default resampling implementation (bootstrapping) and parameter selection. While this is great, there are many more things we could do with caret.
Pre-processing (`preProcess`)
There are a number of pre-processing steps that are easily implemented by caret. Several stand-alone functions from caret target specific issues that might arise when setting up the model. These include:

- `dummyVars`: creating dummy variables from categorical variables with multiple categories
- `nearZeroVar`: identifying zero- and near-zero-variance predictors (these may cause issues when subsampling)
- `findCorrelation`: identifying correlated predictors
- `findLinearCombos`: identifying linear dependencies between predictors
In addition to these individual functions, there is also the `preProcess` function, which can be used to perform more common tasks such as centering and scaling, imputation, and transformation. `preProcess` takes in a data frame to be processed and a method, which can be any of “BoxCox”, “YeoJohnson”, “expoTrans”, “center”, “scale”, “range”, “knnImpute”, “bagImpute”, “medianImpute”, “pca”, “ica”, “spatialSign”, “corr”, “zv”, “nzv”, and “conditionalX”.
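As a sketch, centering, scaling, and knn-imputing the predictors might look like this (object names are assumptions):

```r
# Estimate the pre-processing transformation from the training data,
# then apply it; preProcess() only transforms numeric columns
pp <- preProcess(abalone_train, method = c("center", "scale", "knnImpute"))
abalone_train_pp <- predict(pp, newdata = abalone_train)
```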
Data splitting (`createDataPartition` and `groupKFold`)
Generating subsets of the data is easy with the `createDataPartition` function. While this function can be used to simply generate training and testing sets, it can also be used to subset the data while respecting important groupings that exist within the data.

First, we show an example of performing general sample splitting to generate 10 different 80% subsamples.
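A sketch of that call (assuming the `old` column of `abalone_train` as the outcome):

```r
# 10 different 80% subsamples, stratified by the outcome;
# list = FALSE returns a matrix of row indices (one column per split)
train_index <- createDataPartition(abalone_train$old,
                                   p = 0.8,
                                   list = FALSE,
                                   times = 10)
head(train_index)
```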
While the above is incredibly useful, it is also very easy to do using a for loop. Not so exciting.
Something that IS more exciting is the ability to do k-fold cross-validation that respects groupings in the data. The `groupKFold` function does just that!

As an example, let's consider the following made-up abalone groups, so that each sequential set of 5 abalone that appear in the dataset are in the same group. For simplicity we will only consider the first 50 abalone.
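One way to construct such groups (the `abalone_grouped` name is hypothetical):

```r
# Assign group 1 to abalone 1-5, group 2 to abalone 6-10, and so on
abalone_grouped <- cbind(abalone_train[1:50, ], group = rep(1:10, each = 5))
head(abalone_grouped, 10)
```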
The following code performs 10-fold cross-validation while respecting the groups in the abalone data. That is, abalone from the same group must always appear in the same fold together.
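A sketch using the grouped data from above:

```r
# Build folds in which each group of 5 abalone stays together
grouped_folds <- groupKFold(abalone_grouped$group, k = 10)
```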
Resampling options (`trainControl`)
One of the most important parts of training ML models is tuning parameters. You can use the `trainControl` function to specify a number of parameters (including sampling parameters) in your model. The object that is output from `trainControl` will be provided as an argument for `train`.
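A minimal sketch, swapping the default bootstrap for 10-fold cross-validation (object names are assumptions):

```r
# Set up 10-fold cross-validation and pass it to train() via trControl
fit_control <- trainControl(method = "cv", number = 10)
rf_fit <- train(as.factor(old) ~ .,
                data = abalone_train,
                method = "ranger",
                trControl = fit_control)
```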
We could instead use our grouped folds (rather than random CV folds) by assigning the `index` argument of `trainControl` to be `grouped_folds`.
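For example:

```r
# Use the grouped folds from groupKFold() as the resampling indices
group_fit_control <- trainControl(index = grouped_folds, method = "cv")
```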
You can also pass functions to `trainControl` that would have otherwise been passed to `preProcess`.
Model parameter tuning options (`tuneGrid`)
You can specify your own tuning grid for model parameters using the `tuneGrid` argument of the `train` function. For example, you can define a grid of parameter combinations.
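A sketch for the ranger method (its tuning parameters in caret are `mtry`, `splitrule`, and `min.node.size`; the particular values below are illustrative assumptions):

```r
# Define a grid of candidate parameter combinations for ranger
rf_grid <- expand.grid(mtry = c(2, 3, 4, 5),
                       splitrule = c("gini", "extratrees"),
                       min.node.size = c(1, 3, 5))

# Refit, evaluating every combination in the grid by cross-validation
rf_fit <- train(as.factor(old) ~ .,
                data = abalone_train,
                method = "ranger",
                trControl = fit_control,
                tuneGrid = rf_grid)
```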
This tutorial has only scratched the surface of all of the options in the caret package. To find out more, see the extensive vignette https://topepo.github.io/caret/index.html.