Cross-tabulation helps us see how two or more categories relate by counting their combinations. Advanced usage lets us explore data deeper with percentages, multiple variables, and custom calculations.
Cross-tabulation advanced usage in Pandas
import pandas as pd # Basic cross-tab syntax with advanced options pd.crosstab(index=data['Category1'], columns=data['Category2'], values=data['NumericColumn'], # optional for aggregation aggfunc='sum', # aggregation function margins=True, # add totals row and column normalize='index', # normalize counts by row dropna=False) # include missing values
index and columns define the categories to cross-tabulate.
values and aggfunc let you aggregate numeric data instead of just counting.
import pandas as pd data = pd.DataFrame({ 'Gender': ['Male', 'Female', 'Female', 'Male'], 'Preference': ['A', 'B', 'A', 'B'] }) # Basic count cross-tab ct = pd.crosstab(index=data['Gender'], columns=data['Preference']) print(ct)
import pandas as pd data = pd.DataFrame({ 'Gender': ['Male', 'Female', 'Female', 'Male'], 'Preference': ['A', 'B', 'A', 'B'], 'Score': [10, 20, 30, 40] }) # Sum scores by gender and preference ct = pd.crosstab(index=data['Gender'], columns=data['Preference'], values=data['Score'], aggfunc='sum') print(ct)
import pandas as pd data = pd.DataFrame({ 'Gender': ['Male', 'Female', 'Female', 'Male'], 'Preference': ['A', 'B', 'A', 'B'] }) # Add totals and normalize by row ct = pd.crosstab(index=data['Gender'], columns=data['Preference'], margins=True, normalize='index') print(ct)
import pandas as pd data = pd.DataFrame({ 'Gender': ['Male', 'Female', 'Female', 'Male'], 'Preference': ['A', 'B', 'A', 'B'], 'AgeGroup': ['Young', 'Young', 'Old', 'Old'] }) # Cross-tab with multiple index levels ct = pd.crosstab(index=[data['Gender'], data['AgeGroup']], columns=data['Preference']) print(ct)
This program creates a survey dataset with age groups, gender, satisfaction levels, and scores. It then shows three advanced cross-tabulations:
- Count of satisfaction by age group and gender with totals.
- Average score by age group and gender for each satisfaction level with totals.
- Row-wise percentage distribution of satisfaction by age group with totals.
import pandas as pd # Create sample data survey_data = pd.DataFrame({ 'AgeGroup': ['Young', 'Young', 'Middle', 'Middle', 'Old', 'Old', 'Young', 'Old'], 'Gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'], 'Satisfaction': ['High', 'Low', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low'], 'Score': [8, 5, 7, 9, 4, 6, 9, 3] }) print("Original Data:\n", survey_data) # Cross-tab: count of Satisfaction by AgeGroup and Gender ct_count = pd.crosstab(index=[survey_data['AgeGroup'], survey_data['Gender']], columns=survey_data['Satisfaction'], margins=True, dropna=False) print("\nCount Cross-tab with Totals:\n", ct_count) # Cross-tab: average Score by AgeGroup and Gender for each Satisfaction level ct_avg_score = pd.crosstab(index=[survey_data['AgeGroup'], survey_data['Gender']], columns=survey_data['Satisfaction'], values=survey_data['Score'], aggfunc='mean', margins=True, dropna=False) print("\nAverage Score Cross-tab with Totals:\n", ct_avg_score) # Cross-tab: normalized counts by AgeGroup (row-wise percentage) ct_normalized = pd.crosstab(index=survey_data['AgeGroup'], columns=survey_data['Satisfaction'], normalize='index', margins=True) print("\nNormalized Cross-tab by AgeGroup (row %):\n", ct_normalized)
Time complexity depends on data size; typically O(n) to scan data once.
Space complexity is O(k*m) where k and m are unique categories in index and columns.
Common mistake: forgetting to use values and aggfunc when aggregating numeric data.
Use margins=True to add totals; use normalize to get proportions instead of counts.
Advanced cross-tabulation lets you count, aggregate, and normalize data across multiple categories.
You can add totals and handle multiple index levels for deeper insights.
Use values and aggfunc to summarize numeric data, not just counts.