0
0
Pandasdata~7 mins

Cross-tabulation advanced usage in Pandas

Choose your learning style9 modes available
Introduction

Cross-tabulation helps us see how two or more categories relate by counting their combinations. Advanced usage lets us explore data deeper with percentages, multiple variables, and custom calculations.

You want to compare sales counts by product category and region together.
You need to see the percentage distribution of survey answers by age group.
You want to add totals or margins to your category counts for better summary.
You want to analyze relationships between three or more categorical variables.
You want to apply custom aggregation like mean or sum instead of just counts.
Syntax
Pandas
import pandas as pd

# Basic cross-tab syntax with advanced options
pd.crosstab(index=data['Category1'],
            columns=data['Category2'],
            values=data['NumericColumn'],  # optional for aggregation
            aggfunc='sum',                  # aggregation function
            margins=True,                   # add totals row and column
            normalize='index',              # normalize counts by row
            dropna=False)                   # include missing values

index and columns define the categories to cross-tabulate.

values and aggfunc let you aggregate numeric data instead of just counting.

Examples
Simple count of how many males and females prefer A or B.
Pandas
import pandas as pd

data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Preference': ['A', 'B', 'A', 'B']
})

# Basic count cross-tab
ct = pd.crosstab(index=data['Gender'], columns=data['Preference'])
print(ct)
Sum of scores for each gender and preference combination.
Pandas
import pandas as pd

data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Preference': ['A', 'B', 'A', 'B'],
    'Score': [10, 20, 30, 40]
})

# Sum scores by gender and preference
ct = pd.crosstab(index=data['Gender'], columns=data['Preference'], values=data['Score'], aggfunc='sum')
print(ct)
Shows proportions of preferences per gender with totals included.
Pandas
import pandas as pd

data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Preference': ['A', 'B', 'A', 'B']
})

# Add totals and normalize by row
ct = pd.crosstab(index=data['Gender'], columns=data['Preference'], margins=True, normalize='index')
print(ct)
Cross-tab with two index categories to see combined effects.
Pandas
import pandas as pd

data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Preference': ['A', 'B', 'A', 'B'],
    'AgeGroup': ['Young', 'Young', 'Old', 'Old']
})

# Cross-tab with multiple index levels
ct = pd.crosstab(index=[data['Gender'], data['AgeGroup']], columns=data['Preference'])
print(ct)
Sample Program

This program creates a survey dataset with age groups, gender, satisfaction levels, and scores. It then shows three advanced cross-tabulations:

  • Count of satisfaction by age group and gender with totals.
  • Average score by age group and gender for each satisfaction level with totals.
  • Row-wise percentage distribution of satisfaction by age group with totals.
Pandas
import pandas as pd

# Create sample data
survey_data = pd.DataFrame({
    'AgeGroup': ['Young', 'Young', 'Middle', 'Middle', 'Old', 'Old', 'Young', 'Old'],
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'Satisfaction': ['High', 'Low', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low'],
    'Score': [8, 5, 7, 9, 4, 6, 9, 3]
})

print("Original Data:\n", survey_data)

# Cross-tab: count of Satisfaction by AgeGroup and Gender
ct_count = pd.crosstab(index=[survey_data['AgeGroup'], survey_data['Gender']],
                       columns=survey_data['Satisfaction'],
                       margins=True,
                       dropna=False)
print("\nCount Cross-tab with Totals:\n", ct_count)

# Cross-tab: average Score by AgeGroup and Gender for each Satisfaction level
ct_avg_score = pd.crosstab(index=[survey_data['AgeGroup'], survey_data['Gender']],
                           columns=survey_data['Satisfaction'],
                           values=survey_data['Score'],
                           aggfunc='mean',
                           margins=True,
                           dropna=False)
print("\nAverage Score Cross-tab with Totals:\n", ct_avg_score)

# Cross-tab: normalized counts by AgeGroup (row-wise percentage)
ct_normalized = pd.crosstab(index=survey_data['AgeGroup'],
                            columns=survey_data['Satisfaction'],
                            normalize='index',
                            margins=True)
print("\nNormalized Cross-tab by AgeGroup (row %):\n", ct_normalized)
OutputSuccess
Important Notes

Time complexity depends on data size; typically O(n) to scan data once.

Space complexity is O(k*m) where k and m are unique categories in index and columns.

Common mistake: forgetting to use values and aggfunc when aggregating numeric data.

Use margins=True to add totals; use normalize to get proportions instead of counts.

Summary

Advanced cross-tabulation lets you count, aggregate, and normalize data across multiple categories.

You can add totals and handle multiple index levels for deeper insights.

Use values and aggfunc to summarize numeric data, not just counts.