How to Use Gradient Boosting Classifier in Python with sklearn
Use GradientBoostingClassifier from sklearn.ensemble to create a gradient boosting model. Fit it on training data with fit(), then predict with predict() or predict_proba().

Syntax
The GradientBoostingClassifier is imported from sklearn.ensemble. You create an instance by specifying parameters like n_estimators (number of trees), learning_rate (step size), and max_depth (tree depth). Use fit(X, y) to train the model on features X and labels y. Use predict(X) to get class predictions.
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
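The summary above also mentions predict_proba(), which returns per-class probabilities rather than hard labels. A minimal sketch, fitting on the full Iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X, y)

# One row per sample, one column per class; each row sums to 1
proba = model.predict_proba(X[:2])
print(proba.shape)  # (2, 3) for the 3 iris classes
```

This is useful when you need a confidence score or a custom decision threshold instead of the default argmax label.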
Example
This example shows how to train a Gradient Boosting Classifier on the Iris dataset, then evaluate accuracy on test data.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create model
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```
Output
Accuracy: 0.98
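A single train/test split can be noisy on a dataset as small as Iris. As a hedge, cross-validation averages accuracy over several splits; a short sketch using sklearn's cross_val_score:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
model = GradientBoostingClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: fit on 4 folds, score on the held-out fold,
# and repeat, giving a more stable accuracy estimate than one split
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.2f}")
```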
Common Pitfalls
- Overfitting: Using too many trees or very deep trees can cause the model to memorize training data and perform poorly on new data.
- Learning rate too high: A large learning rate can make training unstable and reduce accuracy.
- Not scaling features: Although gradient boosting is less sensitive to feature scaling, very different scales can still affect performance.
- Ignoring random_state: Not setting random_state can lead to non-reproducible results.
```python
from sklearn.ensemble import GradientBoostingClassifier

# Wrong: too many estimators and no random_state
model_wrong = GradientBoostingClassifier(n_estimators=1000)

# Right: balanced estimator count and a fixed random_state
model_right = GradientBoostingClassifier(n_estimators=100, random_state=42)
Quick Reference
| Parameter | Description | Default |
|---|---|---|
| n_estimators | Number of boosting stages (trees) | 100 |
| learning_rate | Step size shrinkage to prevent overfitting | 0.1 |
| max_depth | Maximum depth of each tree | 3 |
| random_state | Seed for reproducibility | None |
| subsample | Fraction of samples used for fitting each tree | 1.0 |
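The parameters in the table above interact, so they are usually tuned together rather than one at a time. A minimal sketch using GridSearchCV with a small, illustrative grid (real searches would cover wider ranges):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid over the parameters from the quick reference table
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Note that lowering learning_rate typically requires raising n_estimators to compensate, which is why the two are searched jointly.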
Key Takeaways
- Use GradientBoostingClassifier from sklearn.ensemble to build gradient boosting models.
- Tune n_estimators, learning_rate, and max_depth to balance accuracy and overfitting.
- Always split data into training and testing sets to evaluate model performance.
- Set random_state for reproducible results.
- Beware of overfitting with too many trees or very deep trees.