0
0
ML Pythonml~20 mins

Random forest in depth in ML Python - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - Random forest in depth
Problem:We want to classify flowers in the Iris dataset using a random forest model.
Current Metrics:Training accuracy: 100%, Validation accuracy: 75%
Issue:The model is overfitting: training accuracy is perfect but validation accuracy is much lower.
Your Task
Reduce overfitting by tuning the random forest hyperparameters to achieve validation accuracy above 85% while keeping training accuracy below 95%.
You can only change random forest parameters like number of trees, max depth, and minimum samples per leaf.
Do not change the dataset or use other models.
Hint 1
Hint 2
Hint 3
Solution
ML Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Create random forest with tuned hyperparameters
model = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_split=5, random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)

# Calculate accuracy
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
Limited max_depth to 5 to reduce tree complexity.
Increased min_samples_split to 5 to prevent splitting on very small samples.
Increased number of trees to 100 for stable predictions.
Results Interpretation

Before tuning: Training accuracy was 100%, validation accuracy was 75%. The model memorized training data but did not generalize well.

After tuning: Training accuracy dropped to 93.33%, validation accuracy improved to 90%. The model generalizes better with less overfitting.

Limiting tree depth and requiring more samples to split nodes helps random forests avoid overfitting and improves validation accuracy.
Bonus Experiment
Try adding bootstrap sampling and changing the max_features parameter to see if validation accuracy improves further.
💡 Hint
Set bootstrap=True and experiment with max_features='sqrt' or 'log2' to reduce correlation between trees.