How to Use Random Forest Regressor in Python with sklearn
Use RandomForestRegressor from sklearn.ensemble by creating an instance, fitting it with training data using fit(), and then predicting with predict(). This model combines many decision trees to improve accuracy for regression tasks.
Syntax
The basic syntax to use RandomForestRegressor involves importing it, creating an instance with optional parameters, fitting it to training data, and predicting new values.
- RandomForestRegressor(): Creates the model. You can set parameters like n_estimators (number of trees) and random_state (for reproducibility).
- fit(X_train, y_train): Trains the model on features X_train and target values y_train.
- predict(X_test): Predicts target values for new data X_test.
```python
from sklearn.ensemble import RandomForestRegressor

# Create model with 100 trees
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict new values
predictions = model.predict(X_test)
```
Example
This example shows how to train a random forest regressor on a simple dataset and predict values. It also prints the mean squared error to check accuracy.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create a regression dataset
X, y = make_regression(n_samples=100, n_features=4, noise=0.2, random_state=1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on test data
predictions = model.predict(X_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse:.3f}")
```
Output
Mean Squared Error: 0.042
Common Pitfalls
Common mistakes when using RandomForestRegressor include:
- Not splitting data into training and testing sets, which hides overfitting and leaves no way to evaluate the model.
- Using default parameters without tuning, which may not give the best results.
- Feeding data with missing values or categorical variables without preprocessing.
- Confusing regression with classification models.
Always preprocess data, split it properly, and consider tuning parameters like n_estimators and max_depth.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Example data so the snippet runs on its own
X, y = make_regression(n_samples=100, n_features=4, noise=0.2, random_state=1)

# Wrong: using all data for training
# model.fit(X, y)  # No test set, no evaluation

# Right: split data first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Then predict and evaluate
predictions = model.predict(X_test)
```
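Rather than hand-picking parameter values, you can search over a small grid with cross-validation. This is a minimal sketch using sklearn's GridSearchCV; the particular grid values shown are arbitrary choices, not recommendations.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split

# Example data (grid values below are illustrative, not tuned recommendations)
X, y = make_regression(n_samples=200, n_features=4, noise=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Search over a small grid of tree counts and depths with 3-fold CV
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set
print(search.best_params_)
```

GridSearchCV refits the best model on all of X_train, so search.predict(X_test) works directly afterwards.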
Quick Reference
| Parameter | Description | Default |
|---|---|---|
| n_estimators | Number of trees in the forest | 100 |
| max_depth | Maximum depth of each tree | None (nodes expand until leaves are pure or smaller than min_samples_split) |
| random_state | Seed for reproducibility | None |
| min_samples_split | Minimum samples to split a node | 2 |
| min_samples_leaf | Minimum samples at a leaf node | 1 |
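The parameters in the table above can be combined to constrain tree growth and reduce overfitting. This sketch just shows where each one goes; the specific values are illustrative, not tuned.

```python
from sklearn.ensemble import RandomForestRegressor

# Constrain tree growth using the parameters from the table
# (values here are illustrative, not recommendations)
model = RandomForestRegressor(
    n_estimators=200,       # more trees, more stable averages
    max_depth=10,           # cap tree depth
    min_samples_split=4,    # require 4 samples before splitting a node
    min_samples_leaf=2,     # require 2 samples in every leaf
    random_state=42,        # reproducible results
)

print(model.get_params()["max_depth"])
```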
Key Takeaways
- Import and create RandomForestRegressor, then fit it with training data using fit().
- Use predict() to get predictions on new data after training.
- Always split data into training and testing sets to evaluate model performance.
- Tune parameters like n_estimators and max_depth for better results.
- Preprocess data to handle missing values and categorical features before training.
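The preprocessing step can be wired into a single sklearn Pipeline so imputation and encoding happen automatically before the forest sees the data. This is a minimal sketch on a made-up toy dataset; the column names ("size", "city") and values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with a missing numeric value and a categorical column
df = pd.DataFrame({
    "size": [1200, 1500, None, 900],
    "city": ["A", "B", "A", "C"],
})
y = [200, 260, 240, 150]

# Impute missing numbers, one-hot encode categories
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Chain preprocessing and the regressor into one estimator
pipe = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipe.fit(df, y)
print(pipe.predict(df.head(1)))
```

Because the pipeline is one estimator, the same preprocessing is applied consistently at both fit and predict time.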