How to Do Cross Validation in R: Simple Guide with Example
In R, you can perform cross validation using the train function from the caret package by setting method = "cv" in trainControl() for k-fold cross validation. This method splits your data into k parts, trains the model on k-1 parts, and tests on the remaining part, repeating the process k times so that each part serves as the test set once. The k scores are then averaged to estimate model performance.
Syntax
The main syntax for cross validation in R using the caret package involves creating a control object with trainControl() and then training the model with train().
- trainControl(method = "cv", number = k): sets up k-fold cross validation.
- train(formula, data, method, trControl): trains the model using the control settings.
```r
library(caret)

control <- trainControl(method = "cv", number = 5)  # 5-fold CV
model <- train(Species ~ ., data = iris, method = "rpart", trControl = control)
```
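To make the k-fold procedure concrete, the loop below is a minimal hand-written sketch of what happens conceptually: assign rows to folds, train on k-1 folds, and score the held-out fold. It uses only base R and rpart, and the names fold_id and fold_accuracy are illustrative. It is not how caret is implemented; caret also handles details this sketch leaves out, such as stratifying folds by class and tuning cp.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(123)   # fold assignment is random, so fix the seed
k <- 5
n <- nrow(iris)

# Assign each row to one of k folds at random (150 rows / 5 folds = 30 rows each)
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_accuracy <- numeric(k)
for (i in seq_len(k)) {
  train_rows <- iris[fold_id != i, ]  # k - 1 folds used for training
  test_rows  <- iris[fold_id == i, ]  # the remaining fold is held out

  fit  <- rpart(Species ~ ., data = train_rows, method = "class")
  pred <- predict(fit, newdata = test_rows, type = "class")

  # Share of held-out rows classified correctly
  fold_accuracy[i] <- mean(pred == test_rows$Species)
}

fold_accuracy        # one accuracy value per held-out fold
mean(fold_accuracy)  # cross-validated accuracy estimate
```

Averaging the per-fold scores gives a single estimate in which every row has been used for testing exactly once.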
Example
This example shows how to perform 5-fold cross validation on the iris dataset using a decision tree model (rpart). It trains the model and prints the accuracy results from cross validation.
```r
library(caret)

set.seed(123)  # for reproducibility
control <- trainControl(method = "cv", number = 5)
model <- train(Species ~ ., data = iris, method = "rpart", trControl = control)
print(model)
```
Output
```
CART

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 120, 120, 120, 120, 120
Resampling results across tuning parameters:

  cp    Accuracy   Kappa
  0.01  0.9533333  0.92
  0.05  0.9533333  0.92
  0.1   0.9466667  0.91

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01.
```
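Beyond the printed summary, the fitted object stores the resampling results, which can be easier to work with programmatically. The sketch below refits the same model and reads three components of the object returned by train(): results, bestTune, and resample. The exact numbers depend on the seed and package versions, so no expected values are shown.

```r
library(caret)

set.seed(123)
control <- trainControl(method = "cv", number = 5)
model <- train(Species ~ ., data = iris, method = "rpart", trControl = control)

# One row per candidate cp: mean Accuracy and Kappa across folds, plus their SDs
model$results

# The cp value that was selected
model$bestTune

# Per-fold Accuracy and Kappa for the selected cp (one row per fold)
model$resample

# Averaging the per-fold accuracies gives the accuracy reported for the selected cp
mean(model$resample$Accuracy)
```

Looking at model$resample shows how much accuracy varies from fold to fold, which the averaged table alone hides.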
Common Pitfalls
Common mistakes when doing cross validation in R include:
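- Not calling set.seed() before train(), so the fold assignments (and the reported accuracy) change between runs.
- Using the wrong method in trainControl() (e.g., forgetting to specify "cv" for cross validation).
- Not passing the trControl argument to train(). Resampling still runs, but with caret's default of 25 bootstrap resamples rather than k-fold cross validation.
- Confusing cross validation with a simple train-test split.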
```r
library(caret)

# Wrong: without trControl, train() uses its default bootstrap resampling, not k-fold CV
model_wrong <- train(Species ~ ., data = iris, method = "rpart")

# Right: include trControl with CV
control <- trainControl(method = "cv", number = 5)
model_right <- train(Species ~ ., data = iris, method = "rpart", trControl = control)
```
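One way to check what train() does when trControl is omitted is to inspect the defaults of trainControl() and the control settings stored on the fitted model. The sketch below does both; default_control and model_default are illustrative names.

```r
library(caret)

# Defaults used when no arguments are given to trainControl()
default_control <- trainControl()
default_control$method  # "boot"
default_control$number  # 25

# A model trained without trControl records the resampling method it used
set.seed(123)
model_default <- train(Species ~ ., data = iris, method = "rpart")
model_default$control$method  # "boot"
```

So leaving out trControl does not skip resampling; it replaces k-fold cross validation with bootstrap resampling, which is why the summary printed for such a model does not say "Cross-Validated".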
Quick Reference
Summary of key parameters for cross validation with caret:
| Parameter | Description | Example |
|---|---|---|
| method | Type of resampling method | "cv" for k-fold cross validation |
| number | Number of folds for CV | 5 or 10 |
| trainControl() | Function to set resampling method | trainControl(method = "cv", number = 5) |
| train() | Function to train model with CV | train(Species ~ ., data = iris, method = "rpart", trControl = control) |
| set.seed() | Set seed for reproducibility | set.seed(123) |
Key Takeaways
Use the caret package's trainControl with method = "cv" to perform k-fold cross validation.
Always set a seed with set.seed() to get reproducible results.
Pass the trainControl object to train() via the trControl argument; otherwise train() falls back to its default bootstrap resampling instead of k-fold cross validation.
Cross validation helps evaluate model performance more reliably than a single train-test split.
Common mistakes include forgetting trControl or using the wrong method parameter.