How to Use tidymodels in R: Simple Guide with Examples
To use tidymodels in R, first load the package with library(tidymodels). Then split your data, define a recipe for preprocessing, create a model specification, and fit the model using a workflow. This framework keeps each machine learning step explicit and consistent.
Syntax
The basic steps to use tidymodels include:
- Load package: library(tidymodels)
- Model specification: Define the model type and engine, e.g., linear_reg() %>% set_engine("lm")
- Recipe: Preprocess data with recipe()
- Data splitting: Use initial_split() to create training and testing sets
- Workflow: Combine model and recipe with workflow()
- Fit model: Use fit() on the workflow with training data
```r
library(tidymodels)

# `dataset` and `target` are placeholders for your data frame and outcome column

# Split data
split <- initial_split(dataset)
training_data <- training(split)
testing_data <- testing(split)

# Define model
model <- linear_reg() %>%
  set_engine("lm")

# Create recipe
rec <- recipe(target ~ ., data = training_data) %>%
  step_normalize(all_predictors())

# Create workflow
wf <- workflow() %>%
  add_model(model) %>%
  add_recipe(rec)

# Fit model
fitted_model <- fit(wf, data = training_data)
```
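To see what a recipe does before it goes into a workflow, it can be prepared and applied by hand. The sketch below is illustrative only: it uses the built-in mtcars data rather than the dataset and target placeholders, and it checks two arbitrarily chosen predictors (wt and hp). Inside a workflow, fit() handles this step for you.

```r
library(tidymodels)

# Recipe on mtcars: predict mpg from all other columns, normalizing predictors
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors())

# prep() estimates each predictor's mean and standard deviation from the data
prepped <- prep(rec, training = mtcars)

# bake(new_data = NULL) returns the preprocessed version of the training data
baked <- bake(prepped, new_data = NULL)

# Normalized predictors should have mean ~0 and standard deviation ~1
round(colMeans(baked[, c("wt", "hp")]), 10)
sapply(baked[, c("wt", "hp")], sd)
```

The outcome column mpg is left on its original scale because step_normalize() is applied only to all_predictors().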
Example
This example shows how to build a linear regression model predicting mpg from the mtcars dataset using tidymodels. It includes data splitting, recipe creation, model specification, workflow setup, and fitting the model.
```r
library(tidymodels)

# Load data
data(mtcars)

# Split data (the seed makes the split reproducible)
set.seed(123)
split <- initial_split(mtcars, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)

# Define recipe
rec <- recipe(mpg ~ ., data = train_data) %>%
  step_normalize(all_predictors())

# Define model
model <- linear_reg() %>%
  set_engine("lm")

# Create workflow
wf <- workflow() %>%
  add_model(model) %>%
  add_recipe(rec)

# Fit model
fitted <- fit(wf, data = train_data)

# Show the summary of the underlying lm fit
fitted %>%
  extract_fit_engine() %>%
  summary()
```
Output
The figures below appear to come from a fit on all 32 rows of mtcars without normalization (32 observations minus 11 coefficients gives the 21 residual degrees of freedom shown). Running the code above on the 80% training split with normalized predictors will print different coefficients and fewer degrees of freedom.
```
Call:
lm(formula = mpg ~ ., data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9415 -1.6009 -0.1821  1.0509  5.8543

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337    18.7179   0.657    0.518
cyl         -0.11144     1.0450  -0.107    0.916
disp        -0.01912     0.0096  -1.991    0.059 .
hp          -0.02148     0.0218  -0.985    0.337
drat         0.78711     1.6350   0.481    0.635
wt          -3.71530     1.8944  -1.961    0.063 .
qsec         0.82104     0.7308   1.123    0.274
vs           0.31776     2.1045   0.151    0.881
am           2.52023     2.0567   1.225    0.234
gear         0.65541     1.4933   0.439    0.665
carb        -0.19942     0.8287  -0.241    0.812

Residual standard error: 2.593 on 21 degrees of freedom
Multiple R-squared: 0.869, Adjusted R-squared: 0.806
F-statistic: 13.93 on 10 and 21 DF, p-value: 9.109e-07
```
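The example creates test_data but stops at fitting. A natural next step is to score the held-out rows. The following is a minimal sketch under that assumption: it rebuilds the same workflow, calls predict() on the test set, and summarizes the errors with metrics() from yardstick (loaded as part of tidymodels). The names preds, results, and test_metrics are illustrative.

```r
library(tidymodels)

# Same split as in the example
set.seed(123)
split <- initial_split(mtcars, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)

# Same recipe, model, and workflow as in the example
rec <- recipe(mpg ~ ., data = train_data) %>%
  step_normalize(all_predictors())
model <- linear_reg() %>%
  set_engine("lm")
wf <- workflow() %>%
  add_model(model) %>%
  add_recipe(rec)

fitted <- fit(wf, data = train_data)

# predict() applies the trained recipe to the new data, then the model
preds <- predict(fitted, new_data = test_data)

# Put the true values next to the predictions
results <- bind_cols(test_data %>% select(mpg), preds)

# Default regression metrics: RMSE, R-squared, and MAE
test_metrics <- metrics(results, truth = mpg, estimate = .pred)
test_metrics
```

Because the normalization statistics are estimated from train_data only, the test rows are scaled with the training means and standard deviations, which avoids leaking test information into the model.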
Common Pitfalls
Common mistakes when using tidymodels include:
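- Not setting a seed with set.seed() before splitting data, so the split (and every result that depends on it) changes from run to run.
- Forgetting to add both the model and the recipe to the workflow.
- Fitting the bare model specification instead of the workflow, which skips the preprocessing steps defined in the recipe.
- Choosing an engine that does not support the model type, or relying on the default engine without realizing it.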
Always check that your workflow includes all parts and that data is properly split.
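The seed pitfall is easy to check directly. This sketch, with illustrative names split_a, split_b, and split_c, splits mtcars three times: twice after resetting the same seed, and once without resetting it.

```r
library(tidymodels)

set.seed(123)
split_a <- initial_split(mtcars, prop = 0.8)

set.seed(123)
split_b <- initial_split(mtcars, prop = 0.8)

# No seed reset here, so this split continues the random number stream
split_c <- initial_split(mtcars, prop = 0.8)

# Same seed, same training rows
identical(training(split_a), training(split_b))

# Without resetting the seed, the training rows will almost always differ
identical(training(split_a), training(split_c))
```

The first comparison returns TRUE because both splits start from the same random state. The second depends on whatever state the generator is in, which is why unseeded splits are not reproducible.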
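The difference between fitting a bare model and fitting a workflow looks like this:
```r
library(tidymodels)

# Wrong: fitting the model specification directly, without a workflow
model <- linear_reg() %>%
  set_engine("lm")
fit(model, mpg ~ ., data = mtcars)  # Runs, but skips the recipe's preprocessing

# Right: use a workflow that bundles the recipe with the model
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors())
wf <- workflow() %>%
  add_model(model) %>%
  add_recipe(rec)
fitted <- fit(wf, data = mtcars)
```
Quick Reference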
Here is a quick summary of key tidymodels functions:
| Function | Purpose |
|---|---|
| library(tidymodels) | Load the tidymodels package |
| initial_split() | Split data into training and testing sets |
| recipe() | Define preprocessing steps for data |
| linear_reg(), rand_forest(), etc. | Specify model type |
| set_engine() | Choose the computational engine for the model |
| workflow() | Combine model and recipe into one object |
| fit() | Train the model on training data |
| predict() | Make predictions on new data |
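As a rough illustration of how set_engine() and predict() fit together, the sketch below lists the engines registered for linear_reg() and fits the same model type with two of them. It assumes a reasonably recent parsnip in which the "glm" engine is available for linear_reg(). The formula mpg ~ wt + hp and the object names are chosen only for illustration.

```r
library(tidymodels)

# List the engines parsnip has registered for linear regression
engines <- show_engines("linear_reg")
engines

# Same model type, different engine: only set_engine() changes
spec_lm <- linear_reg() %>%
  set_engine("lm")
spec_glm <- linear_reg() %>%
  set_engine("glm")  # assumption: available in your parsnip version

fit_lm <- fit(spec_lm, mpg ~ wt + hp, data = mtcars)
fit_glm <- fit(spec_glm, mpg ~ wt + hp, data = mtcars)

# predict() returns a tibble with a .pred column for either engine
predict(fit_lm, new_data = head(mtcars, 3))
predict(fit_glm, new_data = head(mtcars, 3))
```

Other engines listed by show_engines() may need an extra package installed before fit() will run.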
Key Takeaways
- Load tidymodels and split your data before modeling.
- Use recipes to preprocess data and workflows to combine steps.
- Always specify model type and engine clearly.
- Fit models using workflows to keep preprocessing and modeling together.
- Set a random seed for reproducible data splits.