How to Use Random Forest Classifier in sklearn with Python
Use RandomForestClassifier from sklearn.ensemble by creating an instance, fitting it to training data with fit(), and predicting with predict(). The model builds many decision trees and combines their predictions (by averaging the trees' class probabilities) for better accuracy.

Syntax
The basic syntax to use RandomForestClassifier involves importing it, creating an instance with optional parameters, fitting it to training data, and then predicting new data.
- RandomForestClassifier(): Creates the model. You can set parameters like n_estimators (number of trees) and random_state (for reproducibility).
- fit(X_train, y_train): Trains the model on features X_train and labels y_train.
- predict(X_test): Predicts labels for new data X_test.
```python
from sklearn.ensemble import RandomForestClassifier

# Create the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict new data
predictions = model.predict(X_test)
```
Example
This example shows how to train a random forest classifier on the Iris dataset, predict labels on test data, and print the accuracy score.
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on test data
predictions = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
```
Output
Accuracy: 1.00
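Beyond hard labels, the trained model can report per-class probabilities, which reflect how the forest combines its trees: predict_proba() averages the probability estimates across all trees. A short sketch that reuses the same Iris setup as the example above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load and split the Iris data the same way as in the example above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Per-class probabilities, averaged across the 100 trees
proba = model.predict_proba(X_test)
print(proba.shape)  # one row per test sample, one column per class
```

Each row of the returned array sums to 1, so you can inspect how confident the forest is before committing to a label.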
Common Pitfalls
Common mistakes when using RandomForestClassifier include:
- Not splitting data into training and testing sets, which leads to overfitting and misleading accuracy.
- Using default parameters without tuning, which might not give the best results.
- Forgetting to set random_state for reproducibility.
- Passing data with missing values without preprocessing, causing errors.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Wrong: Using all data for training and testing
iris = load_iris()
X = iris.data
y = iris.target
model = RandomForestClassifier()
model.fit(X, y)
predictions = model.predict(X)

# Right: Split data before training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
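For the missing-values pitfall, one common fix is to impute before fitting. A minimal sketch using scikit-learn's SimpleImputer on a small hypothetical feature matrix (in older scikit-learn versions, fitting on NaNs raises a ValueError; recent versions may tolerate them, but imputing keeps behavior predictable):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing values (np.nan)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 1, 0, 1])

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_clean = imputer.fit_transform(X)

model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_clean, y)
print(model.predict(X_clean))
```

In a real pipeline, fit the imputer on the training set only and reuse it to transform the test set, to avoid leaking test-set statistics.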
Quick Reference
| Parameter | Description | Default |
|---|---|---|
| n_estimators | Number of trees in the forest | 100 |
| criterion | Function to measure quality of split ('gini' or 'entropy') | 'gini' |
| max_depth | Maximum depth of each tree | None (nodes expanded until pure) |
| random_state | Seed for reproducibility | None |
| max_features | Number of features to consider when looking for best split | 'sqrt' ('auto' in older versions) |
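Parameters like n_estimators and max_depth from the table above can be tuned with a cross-validated grid search. A minimal sketch using GridSearchCV on the Iris data (the grid values here are illustrative, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hypothetical parameter grid; useful ranges depend on your dataset
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 3],
}

# Evaluate every combination with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

search.best_estimator_ then holds a forest refit on all the data with the winning combination.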
Key Takeaways
- Create a RandomForestClassifier instance and fit it with training data using fit().
- Always split your data into training and testing sets to avoid overfitting.
- Set random_state for reproducible results.
- Tune parameters like n_estimators and max_depth for better performance.
- Preprocess data to handle missing values before training.