How to Run an A/B Test for Models: Simple Steps and Example
To run an A/B test for models, split your users or data randomly into two groups: one group uses model A and the other uses model B. Collect performance metrics such as accuracy or conversion rate from both groups, then compare the metrics statistically to decide which model performs better.
Syntax
An A/B test for models involves these key steps:
- Split data/users: Randomly assign samples to group A or group B.
- Deploy models: Use model A for group A and model B for group B.
- Collect metrics: Measure performance such as accuracy, click-through rate, or revenue.
- Compare results: Use statistical tests (e.g., a t-test) to check whether differences are significant.
```python
def ab_test(data, model_a, model_b, metric_func):
    # Split data randomly into two equal groups
    group_a = data.sample(frac=0.5, random_state=1)
    group_b = data.drop(group_a.index)

    # Get each model's predictions on its own group
    preds_a = model_a.predict(group_a.drop('label', axis=1))
    preds_b = model_b.predict(group_b.drop('label', axis=1))

    # Calculate the metric for each group
    metric_a = metric_func(group_a['label'], preds_a)
    metric_b = metric_func(group_b['label'], preds_b)
    return metric_a, metric_b
```
Example
This example shows how to run an A/B test comparing two simple classifiers on the Iris dataset using accuracy as the metric.
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd

# Load data
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['label'] = iris.target

# Split data randomly into two groups
group_a = data.sample(frac=0.5, random_state=42)
group_b = data.drop(group_a.index)

# Train model A on group A
X_a = group_a.drop('label', axis=1)
y_a = group_a['label']
model_a = DecisionTreeClassifier(random_state=42).fit(X_a, y_a)

# Train model B on group B
X_b = group_b.drop('label', axis=1)
y_b = group_b['label']
model_b = LogisticRegression(max_iter=200).fit(X_b, y_b)

# Predict on the opposite group to simulate an A/B test
preds_a = model_a.predict(X_b)
preds_b = model_b.predict(X_a)

# Calculate accuracy
acc_a = accuracy_score(y_b, preds_a)
acc_b = accuracy_score(y_a, preds_b)
print(f"Model A accuracy on group B: {acc_a:.2f}")
print(f"Model B accuracy on group A: {acc_b:.2f}")
```
Output
Model A accuracy on group B: 0.89
Model B accuracy on group A: 0.91
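A small accuracy gap like the one above could be due to chance, which is why the steps earlier call for a statistical test. As a hedged sketch (the per-sample correctness arrays below are simulated with made-up rates, not taken from the Iris example), a two-sample t-test on 0/1 correctness indicators looks like this:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulated per-sample correctness (1 = correct prediction, 0 = wrong)
# for two hypothetical groups of 500 users each
correct_a = rng.binomial(1, 0.89, size=500)
correct_b = rng.binomial(1, 0.91, size=500)

# Two-sample t-test: is the difference in mean accuracy significant?
t_stat, p_value = ttest_ind(correct_a, correct_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("Difference could be due to chance")
```

If the p-value is above your significance threshold (commonly 0.05), you should not conclude that one model is better, even if its measured metric is higher.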
Common Pitfalls
- Non-random splitting: Not randomly assigning users or data can bias results.
- Small sample size: Too few samples can make results unreliable.
- Ignoring statistical significance: Without proper tests, differences in metrics may be due to chance.
- Data leakage: Training and testing on overlapping data leads to over-optimistic results.
```python
# Wrong: splitting by order (not random)
data_sorted = data.sort_values('label')
group_a_wrong = data_sorted.iloc[:len(data) // 2]
group_b_wrong = data_sorted.iloc[len(data) // 2:]

# Right: random split
from sklearn.model_selection import train_test_split
group_a_right, group_b_right = train_test_split(data, test_size=0.5, random_state=42)
```
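The small-sample pitfall can be addressed up front with a rough power calculation. This is a sketch using the standard normal-approximation formula for comparing two proportions; the 10% baseline and 12% target conversion rates are illustrative numbers, not from this article:

```python
from scipy.stats import norm

def samples_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate samples needed per group to detect a change
    from rate p1 to rate p2 with a two-sided test on proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for significance level
    z_beta = norm.ppf(power)            # critical value for desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

# Detecting a lift from a 10% to a 12% conversion rate
n = samples_per_group(0.10, 0.12)
print(f"Need about {n:.0f} samples per group")
```

Note how quickly the required sample size grows as the expected lift shrinks: halving the detectable difference roughly quadruples the samples needed.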
Quick Reference
Tips for running A/B tests on models:
- Always randomize group assignment.
- Use clear, relevant metrics for your goal.
- Collect enough data for reliable results.
- Apply statistical tests to confirm significance.
- Monitor user experience to avoid negative impact.
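The first tip, randomized group assignment, is often implemented in production by hashing a stable user ID, so each user always sees the same model across sessions. A minimal sketch; the experiment name and bucketing scheme here are assumptions for illustration:

```python
import hashlib

def assign_group(user_id: str, experiment: str = "model_ab_v1") -> str:
    """Deterministically assign a user to group A or B by hashing their ID."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# The same user always lands in the same group
assert assign_group("user-42") == assign_group("user-42")

# Over many users, the split is roughly 50/50
groups = [assign_group(f"user-{i}") for i in range(1000)]
print(f"A: {groups.count('A')}, B: {groups.count('B')}")
```

Including the experiment name in the hash means a new experiment reshuffles users independently of previous ones.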
Key Takeaways
- Randomly split users or data into groups to fairly compare models.
- Measure clear metrics and use statistical tests to confirm differences.
- Avoid data leakage and ensure enough sample size for reliable results.
- Deploy models simultaneously to prevent time-based bias.
- Monitor both performance and user impact during the test.