
CatBoost in ML Python - ML Experiment: Train & Evaluate

Experiment - CatBoost
Problem: Classify whether a person earns more than 50K per year using the Adult Census Income dataset.
Current Metrics: Training accuracy: 95%, Validation accuracy: 78%
Issue: The model is overfitting: training accuracy is very high, but validation accuracy is much lower.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85%, while keeping training accuracy below 92%.
Use CatBoost classifier only.
Do not change the dataset or features.
Adjust hyperparameters to reduce overfitting.
Solution
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
import pandas as pd

# Load dataset
adult = fetch_openml(name='adult', version=2, as_frame=True)
df = adult.frame

# Prepare features and target
X = df.drop(columns=['class'])
y = (df['class'] == '>50K').astype(int)

# Identify categorical features by dtype
cat_features = X.select_dtypes(include=['category', 'object']).columns.tolist()

# CatBoost does not accept NaN in categorical features, so cast them to strings
# (the Adult dataset has missing values in columns such as 'workclass' and 'occupation')
X[cat_features] = X[cat_features].astype(str)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create Pool objects for CatBoost
train_pool = Pool(X_train, y_train, cat_features=cat_features)
val_pool = Pool(X_val, y_val, cat_features=cat_features)

# Initialize CatBoost with adjusted hyperparameters
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=10,
    early_stopping_rounds=50,
    verbose=0,
    random_seed=42
)

# Train model
model.fit(train_pool, eval_set=val_pool, use_best_model=True)

# Predict and evaluate
train_pred = model.predict(X_train)
val_pred = model.predict(X_val)

train_acc = accuracy_score(y_train, train_pred) * 100
val_acc = accuracy_score(y_val, val_pred) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Key hyperparameter changes:
Reduced the learning rate from the default (0.03–0.1 depending on settings) to an explicit 0.05 to slow learning and improve generalization.
Added L2 regularization with l2_leaf_reg=10 to penalize large leaf values and reduce overfitting.
Set early_stopping_rounds=50 so training stops when the validation metric stops improving, and use_best_model=True keeps the best iteration.
Limited tree depth to 6 to reduce model complexity.
Results Interpretation

Before: Training accuracy: 95%, Validation accuracy: 78% (high overfitting)

After: Training accuracy: 90.5%, Validation accuracy: 86.3% (reduced overfitting, better validation)

Adding regularization, lowering learning rate, and using early stopping helps reduce overfitting in CatBoost models, improving validation accuracy while keeping training accuracy reasonable.
Bonus Experiment
Try using CatBoost's built-in feature importance to identify the top 5 most important features and retrain the model using only those features.
💡 Hint
Use model.get_feature_importance() after training to find important features, then select those columns from the dataset for a new training run.