
A/B testing models in ML Python - ML Experiment: Train & Evaluate

Experiment - A/B testing models
Problem: You have two different machine learning models for predicting customer churn. You want to find out which model performs better on new data.
Current Metrics: Model A accuracy: 78%, Model B accuracy: 80%
Issue: It's unclear whether Model B's higher accuracy is statistically significant or just due to chance. You need a method to compare the models fairly.
Your Task
Perform A/B testing to compare Model A and Model B on the same test dataset and determine which model is better with statistical confidence.
Use the same test dataset for both models.
Do not retrain models; only evaluate and compare.
Use accuracy as the main metric.
Solution
Python
import numpy as np
from sklearn.metrics import accuracy_score
from statsmodels.stats.contingency_tables import mcnemar

# Simulated test labels and model predictions
np.random.seed(42)
true_labels = np.random.randint(0, 2, size=1000)

# Model A predictions with 78% accuracy
model_a_preds = true_labels.copy()
flip_indices_a = np.random.choice(1000, size=220, replace=False)
model_a_preds[flip_indices_a] = 1 - model_a_preds[flip_indices_a]

# Model B predictions with 80% accuracy
model_b_preds = true_labels.copy()
flip_indices_b = np.random.choice(1000, size=200, replace=False)
model_b_preds[flip_indices_b] = 1 - model_b_preds[flip_indices_b]

# Calculate accuracies
acc_a = accuracy_score(true_labels, model_a_preds)
acc_b = accuracy_score(true_labels, model_b_preds)

# Prepare contingency table for McNemar's test
# b = number of samples Model A got right but Model B got wrong
# c = number of samples Model B got right but Model A got wrong
b = np.sum((model_a_preds == true_labels) & (model_b_preds != true_labels))
c = np.sum((model_b_preds == true_labels) & (model_a_preds != true_labels))

# The exact test only uses the off-diagonal (discordant) counts b and c,
# so the diagonal cells (both-correct, both-wrong) can be left as zero
contingency_table = [[0, b], [c, 0]]

# Perform McNemar's test
result = mcnemar(contingency_table, exact=True)

print(f"Model A accuracy: {acc_a:.2%}")
print(f"Model B accuracy: {acc_b:.2%}")
print(f"McNemar's test statistic: {result.statistic}")
print(f"p-value: {result.pvalue}")

if result.pvalue < 0.05:
    print("The difference between models is statistically significant.")
else:
    print("No significant difference between models.")
Simulated predictions for two models with different accuracies on the same test set.
Calculated accuracy for both models.
Used McNemar's test to statistically compare paired predictions.
Printed results with clear interpretation.
Results Interpretation

Before: Model A accuracy 78%, Model B accuracy 80% - unclear if the difference is meaningful.

After: McNemar's test gives an exact p-value for the paired difference. In this simulation the discordant counts b and c differ by only 20 (out of roughly 330 discordant pairs), so the p-value lands around 0.3 - well above 0.05. A 2-point accuracy gap is not statistically significant on a 1,000-sample test set.

A/B testing with statistical tests helps confirm if one model truly outperforms another beyond random chance.
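Under the hood, the exact McNemar's test is just a two-sided binomial test on the discordant pairs: under the null hypothesis that both models err equally often, b follows Binomial(b + c, 0.5). A minimal sketch that reproduces the p-value by hand with scipy (the counts b = 156, c = 176 are illustrative values of the magnitude this simulation tends to produce, not exact output):

```python
from scipy.stats import binom

# Discordant counts (illustrative values; compute b and c from your own
# paired predictions as in the solution above)
b, c = 156, 176          # A-right/B-wrong vs. B-right/A-wrong
n = b + c                # total discordant pairs
k = min(b, c)            # exact-test statistic

# Exact two-sided McNemar p-value: 2 * P(X <= min(b, c)) with
# X ~ Binomial(n, 0.5), capped at 1.0
p_value = min(1.0, 2 * binom.cdf(k, n, 0.5))
print(f"Exact McNemar p-value: {p_value:.4f}")
```

This is the same computation statsmodels performs internally when `exact=True`, which is why passing zeros on the diagonal of the contingency table is harmless.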
Bonus Experiment
Try comparing models using a different metric like F1-score and perform a paired bootstrap test for significance.
💡 Hint
Calculate F1-scores for both models on the test set and use bootstrap resampling to estimate confidence intervals and p-values.
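One way to attempt the bonus: recompute F1 for both models on paired bootstrap resamples of the test set, then read a confidence interval and an approximate p-value off the distribution of the difference. This is a sketch under the same simulated-data setup as the solution; the resample count of 2,000 and the seed are arbitrary choices:

```python
import numpy as np
from sklearn.metrics import f1_score

# Re-create the simulated data from the solution above
np.random.seed(42)
true_labels = np.random.randint(0, 2, size=1000)
model_a_preds = true_labels.copy()
flip_a = np.random.choice(1000, size=220, replace=False)
model_a_preds[flip_a] = 1 - model_a_preds[flip_a]
model_b_preds = true_labels.copy()
flip_b = np.random.choice(1000, size=200, replace=False)
model_b_preds[flip_b] = 1 - model_b_preds[flip_b]

observed_diff = (f1_score(true_labels, model_b_preds)
                 - f1_score(true_labels, model_a_preds))

# Paired bootstrap: resample test examples (with replacement) and
# score BOTH models on the same resampled indices each time
n_boot = 2000
rng = np.random.default_rng(0)
diffs = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, 1000, size=1000)
    diffs[i] = (f1_score(true_labels[idx], model_b_preds[idx])
                - f1_score(true_labels[idx], model_a_preds[idx]))

ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
# Two-sided p-value: twice the smaller tail mass around zero
p_value = 2 * min(np.mean(diffs <= 0), np.mean(diffs >= 0))

print(f"Observed F1 difference (B - A): {observed_diff:.4f}")
print(f"95% bootstrap CI: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"Approximate two-sided p-value: {p_value:.3f}")
```

If the 95% interval contains zero, the F1 difference is not significant at the 5% level - the same conclusion the interval and the p-value should give for paired data like this.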