How to Use Decision Tree Classifier in sklearn with Python
Use DecisionTreeClassifier from sklearn.tree by creating an instance, fitting it with training data using fit(), and making predictions with predict(). This lets you classify data based on learned decision rules.
Syntax
The basic syntax to use a Decision Tree Classifier in sklearn involves importing the class, creating an object, training it with data, and then predicting new data labels.
- DecisionTreeClassifier(): Creates the model object.
- fit(X_train, y_train): Trains the model on features X_train and labels y_train.
- predict(X_test): Predicts labels for new data X_test.
python
from sklearn.tree import DecisionTreeClassifier

# Create the classifier
clf = DecisionTreeClassifier()

# Train the classifier
clf.fit(X_train, y_train)

# Predict new data
predictions = clf.predict(X_test)
Example
This example shows how to train a Decision Tree Classifier on the Iris dataset and predict the species of test samples. It prints the predicted labels and the accuracy score.
python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on test data
predictions = clf.predict(X_test)

# Print predictions and accuracy
print("Predicted labels:", predictions)
print("Accuracy:", accuracy_score(y_test, predictions))
Output
Predicted labels: [1 0 2 1 1 0 0 2 1 1 0 2 2 0 0 2 0 2 2 1 0 0 2 2 1 0 1 0 2 1 1 0 0 2 1 2 0 2 0 1 1 2 0 2 1 0]
Accuracy: 1.0
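Since the classifier works by learning decision rules, you can also inspect those rules directly. As a small sketch (the max_depth=2 setting is just an illustrative choice to keep the tree readable), sklearn.tree.export_text prints the learned rules as text:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a shallow tree on the Iris dataset so the rules stay short
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Print the learned decision rules as indented if/else text
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each line of the output shows a threshold test on one feature (for the Iris data, the petal measurements dominate the splits), with leaf lines showing the predicted class.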
Common Pitfalls
Common mistakes when using Decision Tree Classifier include:
- Not splitting data into training and testing sets, which leads to overfitting and misleading accuracy.
- Forgetting to set random_state for reproducible results.
- Using default parameters without tuning, which can cause overfitting or underfitting.
- Passing data with wrong shapes or types causes errors.
python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

# Wrong: fitting on all data and predicting on the same data (overfitting)
iris = load_iris()
X, y = iris.data, iris.target
clf = DecisionTreeClassifier()
clf.fit(X, y)
pred = clf.predict(X)
print("Accuracy on training data:", (pred == y).mean())

# Right: split data before training
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
pred_test = clf.predict(X_test)
print("Accuracy on test data:", (pred_test == y_test).mean())
Output
Accuracy on training data: 1.0
Accuracy on test data: 0.9555555555555556
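The shape pitfall from the list above is worth a quick sketch too: predict() expects a 2D array of shape (n_samples, n_features), so passing a single sample as a 1D array raises a ValueError. The fix is to reshape it into a single-row 2D array (the sample values below are illustrative measurements of one iris flower):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)

sample = np.array([5.1, 3.5, 1.4, 0.2])  # one flower, 1D shape (4,)

# Wrong: a 1D array raises ValueError because predict() expects a 2D array
try:
    clf.predict(sample)
except ValueError as e:
    print("Error:", e)

# Right: reshape to (1, -1) so it becomes a single-row 2D array of shape (1, 4)
print("Prediction:", clf.predict(sample.reshape(1, -1)))
```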
Quick Reference
Here is a quick summary of key methods and parameters for DecisionTreeClassifier:
| Method/Parameter | Description |
|---|---|
| DecisionTreeClassifier() | Creates the decision tree model with optional parameters like max_depth, random_state. |
| fit(X, y) | Trains the model on feature matrix X and target vector y. |
| predict(X) | Predicts class labels for samples in X. |
| max_depth | Limits the depth of the tree to prevent overfitting. |
| random_state | Sets seed for reproducible results. |
| accuracy_score(y_true, y_pred) | Computes accuracy of predictions. |
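To see the max_depth parameter from the table in action, here is a small sketch comparing an unrestricted tree to a depth-limited one on the same split (the value 3 is just an illustrative choice, not a recommended setting; in practice you would tune it, for example with cross-validation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare an unrestricted tree with a depth-limited one
for depth in [None, 3]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"max_depth={depth}: test accuracy={acc:.3f}, actual depth={clf.get_depth()}")
```

Limiting the depth caps how many splits the tree can make, which trades a little training-set fit for a simpler, less overfit model.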
Key Takeaways
Always split your data into training and testing sets before fitting the model.
Use DecisionTreeClassifier from sklearn.tree with fit() and predict() methods.
Set random_state for reproducible results.
Tune parameters like max_depth to avoid overfitting.
Check accuracy with sklearn.metrics.accuracy_score after prediction.