How to Use Decision Tree Classifier in sklearn with Python
Use DecisionTreeClassifier from sklearn.tree by creating an instance, fitting it with training data using fit(), and making predictions with predict(). This lets you classify data based on learned decision rules.
Syntax
The basic syntax to use a Decision Tree Classifier in sklearn involves importing the class, creating an object, training it with data, and then predicting new data labels.
- DecisionTreeClassifier(): Creates the model object.
- fit(X_train, y_train): Trains the model on features X_train and labels y_train.
- predict(X_test): Predicts labels for new data X_test.
python
from sklearn.tree import DecisionTreeClassifier

# Create the classifier
clf = DecisionTreeClassifier()

# Train the classifier
clf.fit(X_train, y_train)

# Predict new data
predictions = clf.predict(X_test)
Example
This example shows how to train a Decision Tree Classifier on the Iris dataset and predict the species of test samples. It prints the predicted labels and the accuracy score.
python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on test data
predictions = clf.predict(X_test)

# Print predictions and accuracy
print("Predicted labels:", predictions)
print("Accuracy:", accuracy_score(y_test, predictions))
Output
Predicted labels: [1 0 2 1 1 0 0 2 1 1 0 2 2 0 0 2 0 2 2 1 0 0 2 2 1 0 1 0 2 1 1 0 0 2 1 2 0 2 0 1 1 2 0 2 1 0]
Accuracy: 1.0
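Since the classifier works by learning decision rules, you can also inspect those rules directly. As a small sketch (the max_depth=2 setting is just an illustrative choice to keep the tree readable), sklearn.tree.export_text prints the learned rules as text:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a shallow tree on the Iris dataset so the rules stay short
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Print the learned decision rules as indented if/else text
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each line of the output shows a threshold test on one feature (for the Iris data, the petal measurements dominate the splits), with leaf lines showing the predicted class.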
Common Pitfalls
Common mistakes when using Decision Tree Classifier include:
- Not splitting data into training and testing sets, which leads to overfitting and misleading accuracy.
- Forgetting to set random_state for reproducible results.
- Using default parameters without tuning, which can cause overfitting or underfitting.
- Passing data with wrong shapes or types causes errors.
python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

# Wrong: fitting on all data and predicting on the same data (overfitting)
iris = load_iris()
X, y = iris.data, iris.target
clf = DecisionTreeClassifier()
clf.fit(X, y)
pred = clf.predict(X)
print("Accuracy on training data:", (pred == y).mean())

# Right: split data before training
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
pred_test = clf.predict(X_test)
print("Accuracy on test data:", (pred_test == y_test).mean())
Output
Accuracy on training data: 1.0
Accuracy on test data: 0.9555555555555556
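The shape pitfall from the list above is worth a quick sketch too: predict() expects a 2D array of shape (n_samples, n_features), so passing a single sample as a 1D array raises a ValueError. The fix is to reshape it into a single-row 2D array (the sample values below are illustrative measurements of one iris flower):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)

sample = np.array([5.1, 3.5, 1.4, 0.2])  # one flower, 1D shape (4,)

# Wrong: a 1D array raises ValueError because predict() expects a 2D array
try:
    clf.predict(sample)
except ValueError as e:
    print("Error:", e)

# Right: reshape to (1, -1) so it becomes a single-row 2D array of shape (1, 4)
print("Prediction:", clf.predict(sample.reshape(1, -1)))
```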
Quick Reference
Here is a quick summary of key methods and parameters for DecisionTreeClassifier:
| Method/Parameter | Description |
|---|---|
| DecisionTreeClassifier() | Creates the decision tree model with optional parameters like max_depth, random_state. |
| fit(X, y) | Trains the model on feature matrix X and target vector y. |
| predict(X) | Predicts class labels for samples in X. |
| max_depth | Limits the depth of the tree to prevent overfitting. |
| random_state | Sets seed for reproducible results. |
| accuracy_score(y_true, y_pred) | Computes accuracy of predictions. |
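To see the max_depth parameter from the table in action, here is a small sketch comparing an unrestricted tree to a depth-limited one on the same split (the value 3 is just an illustrative choice, not a recommended setting; in practice you would tune it, for example with cross-validation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare an unrestricted tree with a depth-limited one
for depth in [None, 3]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"max_depth={depth}: test accuracy={acc:.3f}, actual depth={clf.get_depth()}")
```

Limiting the depth caps how many splits the tree can make, which trades a little training-set fit for a simpler, less overfit model.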
Key Takeaways
Always split your data into training and testing sets before fitting the model.
Use DecisionTreeClassifier from sklearn.tree with fit() and predict() methods.
Set random_state for reproducible results.
Tune parameters like max_depth to avoid overfitting.
Check accuracy with sklearn.metrics.accuracy_score after prediction.