How to Do Cohort Analysis in Python: Simple Guide with Example
To do cohort analysis in Python, use the pandas library to group users by their first activity date (the cohort) and analyze their behavior over time. Create a cohort column, calculate the time periods since the cohort start, then aggregate metrics like retention or revenue by cohort and period.
Syntax
Cohort analysis in Python typically involves these steps:
- Create a cohort column to mark the first activity date for each user.
- Calculate the time difference (e.g., months) between each event and the cohort start.
- Group data by cohort and time period to aggregate metrics.
This uses pandas functions like groupby(), transform(), and date operations.
python
import pandas as pd

# Example syntax outline
# 1. Assign cohort based on first activity
# 2. Calculate period since cohort
# 3. Group by cohort and period

# df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')
# df['period'] = (df['date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)
# cohort_data = df.groupby(['cohort', 'period']).agg({'user_id': 'nunique'}).reset_index()
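The same grouping can aggregate other metrics, such as the revenue mentioned earlier. The following is a minimal sketch, not a definitive implementation: the purchase log and its revenue column are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical purchase log; the 'revenue' column is an assumption for illustration
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'date': pd.to_datetime(['2023-01-05', '2023-02-07', '2023-01-20', '2023-01-25']),
    'revenue': [10.0, 20.0, 5.0, 15.0],
})

# Cohort = month of each user's first purchase
df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')

# Months elapsed since the cohort month
df['period'] = (df['date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)

# Sum revenue per cohort and period instead of counting users
revenue_by_cohort = df.groupby(['cohort', 'period'])['revenue'].sum().reset_index()
print(revenue_by_cohort)
```

Only the final aggregation changes; the cohort and period columns are built exactly as in the outline.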
Example
This example shows how to perform a simple cohort analysis to find user retention by month using pandas.
python
import pandas as pd

# Sample data: user_id and their activity dates
data = {
    'user_id': [1, 1, 2, 2, 3, 3, 3],
    'date': [
        '2023-01-01', '2023-02-01', '2023-01-15', '2023-03-01',
        '2023-02-10', '2023-03-10', '2023-04-10'
    ]
}

# Create DataFrame
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

# Step 1: Assign cohort as the first activity month per user
df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')

# Step 2: Calculate months since cohort
df['period'] = (df['date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)

# Step 3: Count unique users per cohort and period
cohort_counts = df.groupby(['cohort', 'period'])['user_id'].nunique().reset_index()

# Step 4: Pivot for easier reading
cohort_pivot = cohort_counts.pivot(index='cohort', columns='period', values='user_id')

print(cohort_pivot.fillna(0).astype(int))
Output
period   0  1  2
cohort
2023-01  2  1  1
2023-02  1  1  1
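Raw counts are hard to compare when cohorts differ in size. The following is a minimal sketch, assuming the same sample data as the example above, that turns the counts into retention rates by dividing each cohort's row by its period 0 value. Treating period 0 as the cohort size is an assumption that holds here because every user is active in their first month by construction.

```python
import pandas as pd

# Same sample data as the example above
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 3],
    'date': pd.to_datetime([
        '2023-01-01', '2023-02-01', '2023-01-15', '2023-03-01',
        '2023-02-10', '2023-03-10', '2023-04-10',
    ]),
})
df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')
df['period'] = (df['date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)

# Unique users per cohort (rows) and period (columns)
cohort_pivot = (
    df.groupby(['cohort', 'period'])['user_id'].nunique()
      .unstack('period')
      .fillna(0)
)

# Cohort size = users active in period 0 (their first month)
cohort_size = cohort_pivot[0]

# Divide every period column by the cohort size, row by row
retention = cohort_pivot.div(cohort_size, axis=0)
print(retention.round(2))
```

Each row now starts at 1.0, so later periods read directly as the fraction of the cohort still active.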
Common Pitfalls
Common mistakes when doing cohort analysis in Python include:
- Not converting dates to datetime type, causing errors in date calculations.
- Using raw dates instead of periods (like months) for cohorts, which makes grouping harder.
- Counting total events instead of unique users, which can inflate metrics.
- Forgetting to fill missing values after pivoting, leading to confusing NaN results.
python
import pandas as pd

data = {'user_id': [1, 1, 2], 'date': ['2023-01-01', '2023-02-01', '2023-01-15']}
df = pd.DataFrame(data)

# Wrong: Not converting date to datetime
# df['cohort'] = df.groupby('user_id')['date'].transform('min')  # stays a string, so .dt accessors fail later

# Right: Convert date first
df['date'] = pd.to_datetime(df['date'])
df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')
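The last two pitfalls in the list can be illustrated the same way. This is a minimal sketch on a hypothetical activity log (the data is invented for illustration): it contrasts count() with nunique() and shows where NaN appears after reshaping.

```python
import pandas as pd

# Hypothetical log: user 1 is active twice in January, user 2 starts in February
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'date': pd.to_datetime(['2023-01-01', '2023-01-20', '2023-02-03', '2023-02-10']),
})
df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')
df['period'] = (df['date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)

grouped = df.groupby(['cohort', 'period'])['user_id']

# Wrong for retention: count() tallies events, so user 1 is counted twice in period 0
event_counts = grouped.count()

# Right: nunique() counts each user once per cohort and period
user_counts = grouped.nunique()

# The 2023-02 cohort has no period 1 activity, so the reshaped table holds NaN there
pivot = user_counts.unstack('period')
print(pivot)

# Fill the gap with 0 and restore integer counts
print(pivot.fillna(0).astype(int))
```

Filling with 0 is reasonable when a missing cell means "no activity observed"; for periods that have not happened yet, leaving NaN in place may be the more honest choice.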
Quick Reference
Cohort Analysis Steps Cheat Sheet:
- Convert dates to datetime: df['date'] = pd.to_datetime(df['date'])
- Assign cohort month: df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')
- Calculate months since cohort: df['period'] = (df['date'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)
- Count unique users per cohort and period: df.groupby(['cohort', 'period'])['user_id'].nunique()
- Reshape data for readability: pivot()
Key Takeaways
Use pandas to assign cohorts by users' first activity date converted to periods.
Calculate time periods since cohort to track user behavior over time.
Aggregate unique users per cohort and period for accurate retention metrics.
Always convert date columns to datetime before calculations.
Pivot the grouped data for clear cohort retention tables.