How to Use Decision Tree Regressor in Python with sklearn
Use DecisionTreeRegressor from sklearn.tree to create a model, train it with fit() on your data, and predict new values with predict(). This simple model splits the data into regions to predict continuous values.
Syntax
The basic syntax to use DecisionTreeRegressor involves importing the class, creating an instance, fitting it to training data, and then predicting new values.
- DecisionTreeRegressor(): Creates the model object.
- fit(X, y): Trains the model on features X and target y.
- predict(X_new): Predicts target values for new features X_new.
```python
from sklearn.tree import DecisionTreeRegressor

# Create model
model = DecisionTreeRegressor()

# Train model
model.fit(X_train, y_train)

# Predict new values
predictions = model.predict(X_test)
```
Example
This example shows how to train a decision tree regressor on a simple dataset and predict values. It demonstrates model creation, training, and prediction with printed results.
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Sample data: X is the feature, y is the target
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([1.1, 1.9, 3.0, 3.9, 5.1, 6.1, 7.0, 7.9, 9.1, 10.2])

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict on test data
predictions = model.predict(X_test)

# Print predictions and actual values
print("Predictions:", predictions)
print("Actual values:", y_test)

# Calculate and print mean squared error
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse:.3f}")
```
Output
Predictions: [7.9 1.1 5.1]
Actual values: [9.1 1.9 6.1]
Mean Squared Error: 1.027
Note that the predictions differ from the actual values: the tree can only return target values seen during training, so held-out points are mapped to the nearest leaf's value.
Common Pitfalls
Common mistakes when using DecisionTreeRegressor include:
- Not splitting data into training and testing sets, which leaves overfitting undetected.
- Using default parameters without tuning, which may lead to overly complex trees.
- Feeding data with wrong shapes (e.g., 1D arrays instead of 2D for features).
- Ignoring random state for reproducibility.
Always check your data shape and consider setting random_state for consistent results.
```python
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Wrong: 1D array for features
X_wrong = np.array([1, 2, 3, 4])  # Should be 2D
y = np.array([1.1, 1.9, 3.0, 3.9])

model = DecisionTreeRegressor()
# This will raise an error:
# model.fit(X_wrong, y)

# Correct shape: one column per feature
X_correct = X_wrong.reshape(-1, 1)
model.fit(X_correct, y)
```
Quick Reference
Key parameters and methods for DecisionTreeRegressor:
| Parameter/Method | Description |
|---|---|
| max_depth | Limits the depth of the tree to prevent overfitting. |
| min_samples_split | Minimum samples required to split a node. |
| random_state | Seed for reproducible results. |
| fit(X, y) | Train the model with features X and target y. |
| predict(X_new) | Predict target values for new data X_new. |
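The parameters above can be passed together when constructing the model. A minimal sketch (the dataset values here are illustrative, not from the example above):

```python
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Small illustrative dataset
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.0, 5.9, 7.1, 8.0])

# Constrain tree growth to reduce overfitting
model = DecisionTreeRegressor(max_depth=3, min_samples_split=2, random_state=42)
model.fit(X, y)

print("Tree depth:", model.get_depth())  # never exceeds max_depth
print("Prediction for X=4.5:", model.predict([[4.5]]))
```

Limiting max_depth trades some training accuracy for a simpler tree that generalizes better.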
Key Takeaways
- Use DecisionTreeRegressor from sklearn.tree to model continuous target variables.
- Always reshape feature data to 2D arrays before training the model.
- Split data into training and testing sets to evaluate model performance.
- Set random_state for reproducible results.
- Tune parameters like max_depth to avoid overfitting.
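The last takeaway can be sketched with a cross-validated grid search over max_depth; the candidate depths and the synthetic data below are illustrative assumptions:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
import numpy as np

# Synthetic noisy linear data: 20 samples, single feature
X = np.arange(1, 21).reshape(-1, 1)
y = X.ravel() + np.random.RandomState(0).normal(0, 0.5, 20)

# Search candidate depths with 5-fold cross-validation
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [2, 3, 4, 5, None]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

print("Best max_depth:", grid.best_params_["max_depth"])
```

The best depth depends on the data; cross-validation picks the value that minimizes held-out error rather than training error.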