MLOps · How-To · Beginner · 3 min read

How to Use Random Forest Classifier in sklearn with Python

Use RandomForestClassifier from sklearn.ensemble: create an instance, train it with fit(), and generate predictions with predict(). The model builds many decision trees on random subsets of the data and combines their predictions, which usually gives better accuracy than a single tree.
📐 Syntax

The basic syntax to use RandomForestClassifier involves importing it, creating an instance with optional parameters, fitting it to training data, and then predicting new data.

  • RandomForestClassifier(): Creates the model. You can set parameters like n_estimators (number of trees) and random_state (for reproducibility).
  • fit(X_train, y_train): Trains the model on features X_train and labels y_train.
  • predict(X_test): Predicts labels for new data X_test.
```python
from sklearn.ensemble import RandomForestClassifier

# Create the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict new data
predictions = model.predict(X_test)
```
💻 Example

This example shows how to train a random forest classifier on the Iris dataset, predict labels on test data, and print the accuracy score.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on test data
predictions = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
```
Output
Accuracy: 1.00
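The fitted model offers more than hard labels. A short sketch continuing the example above shows two useful attributes: predict_proba() returns per-class probabilities, and feature_importances_ reports how much each feature contributed to the trees' splits.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Class probabilities: one row per test sample, one column per class
proba = model.predict_proba(X_test)

# Impurity-based importance of each of the four iris features (sums to 1)
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The probabilities are the fraction of trees voting for each class, which is handy when you need a confidence score rather than a single label.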
⚠️ Common Pitfalls

Common mistakes when using RandomForestClassifier include:

  • Not splitting data into training and testing sets, so the model is evaluated on data it has already seen, giving a misleadingly optimistic accuracy.
  • Using default parameters without tuning, which might not give the best results.
  • Forgetting to set random_state for reproducibility.
  • Passing data with missing values without preprocessing, causing errors.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Wrong: Using all data for training and testing
iris = load_iris()
X = iris.data
y = iris.target
model = RandomForestClassifier()
model.fit(X, y)
predictions = model.predict(X)

# Right: Split data before training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
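For the missing-values pitfall, one common fix is to add an imputation step before the classifier. A minimal sketch using sklearn's SimpleImputer in a pipeline (the simulated NaNs and the median strategy are illustrative choices, not requirements):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Simulate missing values: blank out a few entries in the first feature
rng = np.random.default_rng(0)
X = X.copy()
X[rng.integers(0, len(X), size=10), 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pipeline: fill NaNs with the column median, then fit the forest
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
pipe.fit(X_train, y_train)
print(f"Accuracy: {pipe.score(X_test, y_test):.2f}")
```

Wrapping both steps in a pipeline also ensures the imputer is fitted only on the training data, avoiding leakage from the test set.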
📊 Quick Reference

| Parameter | Description | Default |
| --- | --- | --- |
| n_estimators | Number of trees in the forest | 100 |
| criterion | Function to measure quality of a split ('gini' or 'entropy') | 'gini' |
| max_depth | Maximum depth of each tree | None (nodes expanded until pure) |
| random_state | Seed for reproducibility | None |
| max_features | Number of features to consider when looking for the best split | 'sqrt' (formerly 'auto', removed in newer scikit-learn versions) |
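The parameters above can be tuned systematically rather than guessed. A minimal sketch using GridSearchCV with cross-validation (the grid values here are arbitrary illustrations, not recommended settings):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Small illustrative grid over two parameters from the table above
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,  # 5-fold cross-validation for each parameter combination
)
search.fit(X, y)

print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

Cross-validated search fits one model per parameter combination per fold, so keep the grid small for large datasets or use RandomizedSearchCV instead.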

Key Takeaways

  • Create a RandomForestClassifier instance and train it with fit().
  • Always split your data into training and testing sets so accuracy is measured on unseen data.
  • Set random_state for reproducible results.
  • Tune parameters such as n_estimators and max_depth for better performance.
  • Handle missing values during preprocessing before training.