Cross-Tabulation with crosstab() in Python Data Analysis - Time & Space Complexity
We want to understand how the time to create a cross-tabulation table grows as the data size increases.
Specifically, how does the crosstab() function handle larger datasets?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

data = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Type': ['X', 'Y', 'X', 'Y', 'X', 'Y']
})
result = pd.crosstab(data['Category'], data['Type'])
```
This code creates a table counting how many times each Category and Type pair appears.
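To make the counting concrete, here is the same snippet with the result printed (a minimal, self-contained sketch of the code above):

```python
import pandas as pd

data = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Type': ['X', 'Y', 'X', 'Y', 'X', 'Y']
})
result = pd.crosstab(data['Category'], data['Type'])

# The table shows, for example, that the pair (A, X) appears twice
# and the pair (C, X) never appears, so its cell is 0.
print(result)
```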
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Scanning each row in the data once to count pairs.
- How many times: Exactly once per row, so n times for n rows.
As the number of rows grows, the function counts pairs by checking each row once.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks |
| 100 | About 100 checks |
| 1000 | About 1000 checks |
Pattern observation: The work grows directly with the number of rows, so doubling the rows doubles the work.
Time Complexity: O(n)
This means the time to create the crosstab grows linearly with the number of rows.
[X] Wrong: "crosstab() checks every possible pair of categories and types, so it takes n squared time."
[OK] Correct: Actually, crosstab() just scans each row once and updates counts; it does not compare all pairs against each other.
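The single-pass idea can be sketched without pandas at all: a counter keyed by (Category, Type) is updated once per row, so there are exactly n updates and no pairwise comparisons.

```python
from collections import Counter

categories = ['A', 'B', 'A', 'C', 'B', 'A']
types = ['X', 'Y', 'X', 'Y', 'X', 'Y']

# One pass over the rows: n count updates, never n * n comparisons.
counts = Counter(zip(categories, types))
print(counts[('A', 'X')])  # the pair (A, X) appears twice
```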
Knowing how crosstab() scales helps you explain data aggregation efficiency in interviews.
It shows you understand how counting operations relate to data size, a key skill in data science.
What if we added a third column to group by? How would the time complexity change?
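As a starting point for that question, here is a sketch with a hypothetical third column, `Region`: passing a list of columns to crosstab() groups the rows by both. The data is still scanned once, so the time stays O(n); what grows is the number of cells in the output table.

```python
import pandas as pd

data = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Type': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Region': ['N', 'S', 'N', 'S', 'N', 'S'],  # hypothetical third column
})

# Rows grouped by (Category, Region), columns by Type: still one scan.
result = pd.crosstab([data['Category'], data['Region']], data['Type'])
print(result)
```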