
How to Use Gradient Boosting Classifier in Python with sklearn

Use GradientBoostingClassifier from sklearn.ensemble to create a gradient boosting model. Fit it on training data with fit(), then predict with predict() or predict_proba().
📐 Syntax

The GradientBoostingClassifier is imported from sklearn.ensemble. Create an instance by specifying parameters such as n_estimators (number of boosting stages), learning_rate (step size), and max_depth (depth of each tree). Call fit(X, y) to train on features X and labels y, predict(X) to get class labels, and predict_proba(X) to get class probabilities.

python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
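
When you need probabilities rather than hard labels, predict_proba() returns one column per class. A minimal sketch, fitting on the Iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative fit; any fitted classifier works the same way
X, y = load_iris(return_X_y=True)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X, y)

# predict_proba returns an (n_samples, n_classes) array;
# each row sums to 1
proba = model.predict_proba(X[:2])
print(proba.shape)
```

The column order matches model.classes_, which is useful when mapping probabilities back to label names.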
💻 Example

This example shows how to train a Gradient Boosting Classifier on the Iris dataset, then evaluate accuracy on test data.

python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create model
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Output
Accuracy: 0.98
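
Accuracy alone can hide per-class weaknesses. A short sketch using classification_report, assuming the same train/test split as above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Precision, recall, and F1 broken down per class
report = classification_report(y_test, model.predict(X_test),
                               target_names=iris.target_names)
print(report)
```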
⚠️ Common Pitfalls

  • Overfitting: Using too many trees or very deep trees can cause the model to memorize training data and perform poorly on new data.
  • Learning rate too high: A large learning rate can make training unstable and reduce accuracy.
  • Not scaling features: Although gradient boosting is less sensitive to feature scaling, very different scales can still affect performance.
  • Ignoring random_state: Not setting random_state can lead to non-reproducible results.
python
from sklearn.ensemble import GradientBoostingClassifier

# Risky: 1000 default-depth stages with no random_state; prone to
# overfitting and not reproducible across runs
model_wrong = GradientBoostingClassifier(n_estimators=1000)

# Better: a moderate number of stages and a fixed random_state
model_right = GradientBoostingClassifier(n_estimators=100, random_state=42)
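
One way to keep a large n_estimators without overfitting is sklearn's built-in early stopping, via n_iter_no_change and validation_fraction. A minimal sketch on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Stop adding trees once the internal validation score fails to
# improve for 10 consecutive stages
model = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound, rarely reached
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,
    random_state=42,
)
model.fit(X, y)

# n_estimators_ is the number of stages actually fitted
print(model.n_estimators_)
```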
📊 Quick Reference

Parameter | Description | Default
n_estimators | Number of boosting stages (trees) | 100
learning_rate | Step size shrinkage to prevent overfitting | 0.1
max_depth | Maximum depth of each tree | 3
random_state | Seed for reproducibility | None
subsample | Fraction of samples used for fitting each tree | 1.0
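
Setting subsample below 1.0 fits each tree on a random fraction of the rows (stochastic gradient boosting), which can reduce variance. A hedged sketch, again using Iris for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Each boosting stage sees a random 80% sample of the training rows
model = GradientBoostingClassifier(
    n_estimators=100,
    subsample=0.8,
    random_state=42,
)
model.fit(X, y)

# Training accuracy (proper evaluation still needs a held-out set)
print(model.score(X, y))
```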

Key Takeaways

  • Use GradientBoostingClassifier from sklearn.ensemble to build gradient boosting models.
  • Tune n_estimators, learning_rate, and max_depth to balance accuracy and overfitting.
  • Always split data into training and testing sets to evaluate model performance.
  • Set random_state for reproducible results.
  • Beware of overfitting with too many trees or very deep trees.