How to Compare Machine Learning Experiments Effectively
To compare experiments, evaluate each model with the same metrics (such as accuracy or loss) on the same test data, visualize the results with plots, and apply statistical tests to check whether differences are significant. This helps you reliably identify which model or setting performs best.
Syntax
Comparing experiments involves these key steps:
- Evaluate each model on the same test set using metrics like accuracy, precision, or loss.
- Visualize results using plots such as learning curves or bar charts.
- Apply statistical tests such as the paired t-test or the Wilcoxon signed-rank test to confirm that differences are significant.
```python
from sklearn.metrics import accuracy_score

# Example metrics: accuracy_score, mean_squared_error
def evaluate_model(model, X_test, y_test, metric):
    """Score a fitted model on the shared test set with the given metric."""
    y_pred = model.predict(X_test)
    return metric(y_test, y_pred)

# Compare two models on the same test data
score1 = evaluate_model(model1, X_test, y_test, accuracy_score)
score2 = evaluate_model(model2, X_test, y_test, accuracy_score)
print(f'Model 1 accuracy: {score1}')
print(f'Model 2 accuracy: {score2}')
```
Output
Model 1 accuracy: 0.85
Model 2 accuracy: 0.88
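The scores above can also be compared visually, as the steps suggest. Below is a minimal sketch of a bar chart using matplotlib; the score values are the hypothetical ones from the output above, and the filename `comparison.png` is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical accuracies matching the output above
scores = {"Model 1": 0.85, "Model 2": 0.88}

fig, ax = plt.subplots()
ax.bar(list(scores.keys()), list(scores.values()))
ax.set_ylabel("Accuracy")
ax.set_title("Model comparison on the same test set")
fig.savefig("comparison.png")
```

A bar chart suits a handful of final scores; for tracking performance as training progresses, a learning curve (score versus training-set size or epochs) is the more informative plot.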
Example
This example compares two simple classifiers on the same test data using accuracy and a paired t-test to check if the difference is significant.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import ttest_rel

# Load data and create a single shared test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train both models on the same training data
model1 = LogisticRegression(max_iter=200).fit(X_train, y_train)
model2 = DecisionTreeClassifier().fit(X_train, y_train)

# Predict on the shared test set
preds1 = model1.predict(X_test)
preds2 = model2.predict(X_test)

# Calculate accuracy for each model
acc1 = accuracy_score(y_test, preds1)
acc2 = accuracy_score(y_test, preds2)
print(f'Logistic Regression accuracy: {acc1:.2f}')
print(f'Decision Tree accuracy: {acc2:.2f}')

# Paired t-test on per-sample correctness (1 if correct, 0 if not)
stat, p_value = ttest_rel(preds1 == y_test, preds2 == y_test)
print(f'Paired t-test p-value: {p_value:.4f}')
Output
Logistic Regression accuracy: 0.98
Decision Tree accuracy: 0.98
Paired t-test p-value: 0.3173
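The Wilcoxon signed-rank test mentioned earlier is a non-parametric alternative to the paired t-test. One common way to apply it is to pair per-fold cross-validation scores rather than per-sample correctness; the sketch below assumes that setup (10-fold CV on the same iris data, `zero_method="zsplit"` so folds where both models tie do not cause an error).

```python
from scipy.stats import wilcoxon
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One accuracy score per fold, on identical folds for both models
scores1 = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=10)
scores2 = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

# Paired, non-parametric comparison of the per-fold scores;
# "zsplit" shares tied (zero-difference) folds between both sides
stat, p = wilcoxon(scores1, scores2, zero_method="zsplit")
print(f'Wilcoxon signed-rank p-value: {p:.4f}')
```

The Wilcoxon test makes no normality assumption about the score differences, which is why it is often preferred when the number of paired measurements is small.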
Common Pitfalls
Common mistakes when comparing experiments include:
- Using different test sets for each model, which makes comparison unfair.
- Relying on a single metric without considering others like precision or recall.
- Ignoring statistical significance, leading to false conclusions about model superiority.
- Not controlling randomness by setting seeds, causing inconsistent results.
```python
from sklearn.metrics import accuracy_score

# Wrong: different test sets for each model
# (X_test1, y_test1) and (X_test2, y_test2) differ,
# so the scores are not comparable
acc1 = accuracy_score(y_test1, model1.predict(X_test1))
acc2 = accuracy_score(y_test2, model2.predict(X_test2))

# Right: evaluate both models on the same test set
acc1 = accuracy_score(y_test, model1.predict(X_test))
acc2 = accuracy_score(y_test, model2.predict(X_test))
```
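The last pitfall, uncontrolled randomness, is fixed by passing a seed everywhere randomness enters: the train/test split and the model itself. A minimal sketch (model and hyperparameters chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Seed the split so every experiment sees the same test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Seed the model so its internal randomness is reproducible
model_a = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
model_b = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# With identical seeds and data, two runs give identical predictions
assert (model_a.predict(X_test) == model_b.predict(X_test)).all()
```

Without `random_state`, two runs of the same experiment can produce different scores, making any comparison between settings unreliable.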
Quick Reference
- Always evaluate models on the same test data.
- Use multiple metrics to get a full picture.
- Visualize results with plots for easier comparison.
- Apply statistical tests to confirm differences.
- Set random seeds for reproducibility.
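To show what "multiple metrics" looks like in practice, here is a small sketch reporting accuracy, precision, and recall for a single model on the shared test set (`average='macro'` is assumed because iris is a multiclass problem):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Report several metrics, not just one
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.2f}")
print(f"Precision: {precision_score(y_test, y_pred, average='macro'):.2f}")
print(f"Recall:    {recall_score(y_test, y_pred, average='macro'):.2f}")
```

Accuracy alone can hide weaknesses on imbalanced classes; precision and recall reveal whether a model trades false positives for false negatives.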
Key Takeaways
- Compare models using the same test data and consistent metrics.
- Use statistical tests to check if performance differences are meaningful.
- Visualize experiment results to better understand model behavior.
- Avoid comparing on different datasets or ignoring randomness.
- Multiple metrics give a clearer evaluation than a single number.