How to Do Cohort Analysis in Python: Simple Guide with Example

To do cohort analysis in Python, use the pandas library to group users by their first activity date (cohort) and analyze their behavior over time. Create a cohort column, calculate time periods since the cohort start, then aggregate metrics like retention or revenue by cohort and period.

📐

Syntax

Cohort analysis in Python typically involves these steps:

Create a cohort column to mark the first activity date for each user.
Calculate the time difference (e.g., months) between each event and the cohort start.
Group data by cohort and time period to aggregate metrics.

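This relies on pandas groupby() and transform() to find each user's first activity date, and on dt.to_period() to turn dates into months that can be subtracted from each other.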

python

import pandas as pd

# Example syntax outline
# 1. Assign cohort based on first activity
# 2. Calculate period since cohort
# 3. Group by cohort and period

# df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')
# df['period'] = (df['date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)
# cohort_data = df.groupby(['cohort', 'period']).agg({'user_id': 'nunique'}).reset_index()
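The same three steps apply to metrics other than user counts. As a minimal sketch, assuming a hypothetical orders table with an amount column (neither the table nor the column appears elsewhere in this guide), revenue per cohort and period could be summed like this:

```python
import pandas as pd

# Hypothetical order data; the 'amount' column is an assumption for illustration
orders = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'date': pd.to_datetime([
        '2023-01-05', '2023-02-07', '2023-01-20', '2023-03-02', '2023-02-11'
    ]),
    'amount': [20.0, 35.0, 15.0, 10.0, 50.0],
})

# Steps 1 and 2: cohort month and months elapsed since the cohort month
orders['cohort'] = orders.groupby('user_id')['date'].transform('min').dt.to_period('M')
orders['period'] = (orders['date'].dt.to_period('M') - orders['cohort']).apply(lambda x: x.n)

# Step 3: sum revenue instead of counting users
revenue_by_cohort = (
    orders.groupby(['cohort', 'period'])['amount']
    .sum()
    .unstack(fill_value=0.0)  # periods with no orders show as 0.0
)

print(revenue_by_cohort)
```

Only the aggregation in step 3 changes; the cohort and period columns are built exactly as in the outline above.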

💻

Example

This example shows how to perform a simple cohort analysis to find user retention by month using pandas.

python

import pandas as pd

# Sample data: user_id and their activity dates
data = {
    'user_id': [1, 1, 2, 2, 3, 3, 3],
    'date': [
        '2023-01-01', '2023-02-01',
        '2023-01-15', '2023-03-01',
        '2023-02-10', '2023-03-10', '2023-04-10'
    ]
}

# Create DataFrame
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

# Step 1: Assign cohort as the first activity month per user
df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')

# Step 2: Calculate months since cohort
df['period'] = (df['date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)

# Step 3: Count unique users per cohort and period
cohort_counts = df.groupby(['cohort', 'period'])['user_id'].nunique().reset_index()

# Step 4: Pivot for easier reading
cohort_pivot = cohort_counts.pivot(index='cohort', columns='period', values='user_id')

print(cohort_pivot.fillna(0).astype(int))

Output

period   0  1  2
cohort
2023-01  2  1  1
2023-02  1  1  1
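Raw counts are hard to compare when cohorts differ in size, so a common next step is to express each cell as a share of the cohort's starting size. The sketch below rebuilds the table from the same sample data and divides every row by its period 0 count. Treating period 0 as the cohort size is an assumption that holds here because every user is active in their first month by definition.

```python
import pandas as pd

# Same sample data as the example above
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 3],
    'date': pd.to_datetime([
        '2023-01-01', '2023-02-01',
        '2023-01-15', '2023-03-01',
        '2023-02-10', '2023-03-10', '2023-04-10',
    ]),
})

df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')
df['period'] = (df['date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)

# Unique users per cohort and period, one row per cohort
cohort_pivot = (
    df.groupby(['cohort', 'period'])['user_id']
    .nunique()
    .unstack(fill_value=0)
)

# Cohort size = number of users active in period 0
cohort_size = cohort_pivot[0]

# Divide each row by its cohort size to get retention rates between 0 and 1
retention = cohort_pivot.div(cohort_size, axis=0)

print(retention.round(2))
```

Period 0 is always 1.0 in such a table; the later columns show which fraction of the cohort was still active that many months after joining.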

⚠️

Common Pitfalls

Common mistakes when doing cohort analysis in Python include:

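Not converting the date column to datetime first: the .dt accessor raises an AttributeError on a column of strings.
Using raw dates instead of periods (such as months) for cohorts, which gives every first-activity day its own cohort and leaves most groups too small to compare.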
Counting total events instead of unique users, which can inflate metrics.
Forgetting to fill missing values after pivoting, leading to confusing NaN results.

python

import pandas as pd

data = {'user_id': [1, 1, 2], 'date': ['2023-01-01', '2023-02-01', '2023-01-15']}
df = pd.DataFrame(data)

# Wrong: skipping the datetime conversion
# df['cohort'] = df.groupby('user_id')['date'].transform('min')  # still strings
# Calling .dt.to_period('M') on that result raises AttributeError

# Right: convert the date column first, then derive the cohort month
df['date'] = pd.to_datetime(df['date'])
df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')
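The third pitfall, counting events instead of unique users, is easy to miss because both versions run without error. A small sketch with a hypothetical event log, in which user 1 is active three times in the same month, shows the difference:

```python
import pandas as pd

# Hypothetical log: user 1 appears three times in January, user 2 once
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'date': pd.to_datetime(['2023-01-01', '2023-01-10', '2023-01-20', '2023-01-15']),
})

events['cohort'] = events.groupby('user_id')['date'].transform('min').dt.to_period('M')
events['period'] = (events['date'].dt.to_period('M') - events['cohort']).apply(lambda x: x.n)

grouped = events.groupby(['cohort', 'period'])['user_id']

event_counts = grouped.count()    # counts rows, so repeat visits are included
user_counts = grouped.nunique()   # counts each user once per cohort and period

print(event_counts.iloc[0])  # 4 events
print(user_counts.iloc[0])   # 2 users
```

For retention the second number is the one to use; the first measures activity volume, which is a different question.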

📊

Quick Reference

Cohort Analysis Steps Cheat Sheet:

df['date'] = pd.to_datetime(df['date']) — Convert dates to datetime.
df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M') — Assign cohort month.
df['period'] = (df['date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n) — Calculate months since cohort.
df.groupby(['cohort', 'period'])['user_id'].nunique() — Count unique users per cohort and period.
pivot() — Reshape data for readability.
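The steps in the cheat sheet can also be wrapped into one reusable function. The helper below is a sketch only: the name build_cohort_table and its parameters are hypothetical, not part of pandas, and it assumes one row per user event with a user column and a date column.

```python
import pandas as pd

def build_cohort_table(df, user_col='user_id', date_col='date', freq='M'):
    """Return a cohort x period table of unique-user counts (hypothetical helper)."""
    out = df.copy()  # leave the caller's DataFrame untouched

    # Convert dates to datetime
    out[date_col] = pd.to_datetime(out[date_col])

    # Assign the cohort as the period of each user's first activity
    out['cohort'] = out.groupby(user_col)[date_col].transform('min').dt.to_period(freq)

    # Number of periods elapsed since the cohort period
    out['period'] = (out[date_col].dt.to_period(freq) - out['cohort']).apply(lambda x: x.n)

    # Count unique users and reshape, filling empty cells with 0
    counts = out.groupby(['cohort', 'period'])[user_col].nunique()
    return counts.unstack(fill_value=0)

# Usage with the sample data from the example section (dates given as strings)
sample = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 3],
    'date': [
        '2023-01-01', '2023-02-01',
        '2023-01-15', '2023-03-01',
        '2023-02-10', '2023-03-10', '2023-04-10',
    ],
})
table = build_cohort_table(sample)
print(table)
```

Here unstack(fill_value=0) does the reshaping and the zero-filling in one call, which is one possible alternative to pivot() followed by fillna(0).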

✅

Key Takeaways

Use pandas to assign cohorts by users' first activity date converted to periods.

Calculate time periods since cohort to track user behavior over time.

Aggregate unique users per cohort and period for accurate retention metrics.

Always convert date columns to datetime before calculations.

Pivot the grouped data for clear cohort retention tables.