Set operations on structured data in NumPy - Time & Space Complexity
We want to understand how the time needed for set operations on structured arrays changes as the data grows.
How does the work increase as our structured arrays gain more rows?
Analyze the time complexity of the following code snippet.
```python
import numpy as np

# Define two structured arrays with an integer 'id' field
# and a one-character string 'val' field
arr1 = np.array([(1, 'a'), (2, 'b'), (3, 'c')], dtype=[('id', int), ('val', 'U1')])
arr2 = np.array([(2, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int), ('val', 'U1')])

# Find the rows common to arr1 and arr2
common = np.intersect1d(arr1, arr2)
print(common)  # [(2, 'b') (3, 'c')]
```
This code finds the rows common to two structured arrays using NumPy's intersect1d function.
Identify the loops, recursion, and array traversals that repeat.
- Primary operation: comparing rows of one array to rows of the other to find matches.
- How many times: in a naive pairwise approach, each row in the first array is checked against every row in the second array.

As the number of rows n grows, the number of comparisons grows much faster, because each of the n rows in one array is checked against all n rows in the other, giving roughly n × n comparisons.
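The pairwise model described above can be sketched as a brute-force intersection. This is an illustration of the model, not how np.intersect1d is actually implemented:

```python
import numpy as np

def naive_intersect(a, b):
    # Brute-force model: every row of a is checked against the rows of b.
    # The membership test below scans rows_b linearly, so the total work
    # is about len(a) * len(b) row comparisons.
    rows_b = [tuple(r) for r in b]
    matches = [tuple(r) for r in a if tuple(r) in rows_b]
    return np.array(matches, dtype=a.dtype)

dtype = [('id', int), ('val', 'U1')]
arr1 = np.array([(1, 'a'), (2, 'b'), (3, 'c')], dtype=dtype)
arr2 = np.array([(2, 'b'), (3, 'c'), (4, 'd')], dtype=dtype)
common_naive = naive_intersect(arr1, arr2)
```

With two arrays of n rows each, this sketch performs about n × n row comparisons, which is exactly the growth pattern tabulated below.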
| Input Size (n) | Approx. Comparisons (n × n) |
|---|---|
| 10 | About 100 comparisons |
| 100 | About 10,000 comparisons |
| 1000 | About 1,000,000 comparisons |
Pattern observation: the work grows roughly with the square of the input size.
Time Complexity: O(n²) for this naive pairwise model.
This means that if you double the number of rows, the work to find common rows roughly quadruples. (Note that NumPy's actual intersect1d avoids this by sorting its inputs first, which brings the cost down to about O(n log n).)
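To make the quadratic pattern in the table concrete, we can count the comparisons the brute-force model performs; this is a small illustrative sketch of the model, not a measurement of NumPy itself:

```python
def count_pairwise_comparisons(n_rows_a, n_rows_b):
    # In the naive model, every row of the first array is compared
    # against every row of the second, with no early exit.
    comparisons = 0
    for _ in range(n_rows_a):
        for _ in range(n_rows_b):
            comparisons += 1
    return comparisons

# Each doubling of n quadruples the comparison count
for n in (10, 100, 1000):
    print(n, count_pairwise_comparisons(n, n))
```

Running this reproduces the table: 100 comparisons for n = 10, 10,000 for n = 100, and 1,000,000 for n = 1000.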
[X] Wrong: "Set operations on structured arrays run in linear time because they are like simple lists."
[OK] Correct: structured arrays require comparing multiple fields per row, and the intersection must match up many row pairs, so the time grows faster than just the number of rows.
Understanding how set operations scale helps you explain performance when working with complex data, a useful skill in many data science tasks.
"What if we sorted both structured arrays before finding their intersection? How would the time complexity change?"
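One way to explore that question: after sorting, the intersection can be found in a single merge pass, so the total cost is dominated by the O(n log n) sorts. The sketch below assumes the inputs contain no duplicate rows (np.intersect1d also deduplicates, which this simplified version skips):

```python
import numpy as np

def sorted_intersect(a, b):
    # Sort both arrays (O(n log n) each), then walk them together in a
    # single merge pass (O(n)), advancing whichever side holds the
    # smaller row. Structured arrays sort lexicographically by field.
    a_sorted, b_sorted = np.sort(a), np.sort(b)
    out, i, j = [], 0, 0
    while i < len(a_sorted) and j < len(b_sorted):
        ra, rb = tuple(a_sorted[i]), tuple(b_sorted[j])
        if ra == rb:
            out.append(ra)
            i += 1
            j += 1
        elif ra < rb:
            i += 1
        else:
            j += 1
    return np.array(out, dtype=a.dtype)

dtype = [('id', int), ('val', 'U1')]
arr1 = np.array([(1, 'a'), (2, 'b'), (3, 'c')], dtype=dtype)
arr2 = np.array([(2, 'b'), (3, 'c'), (4, 'd')], dtype=dtype)
common_sorted = sorted_intersect(arr1, arr2)
```

The merge pass performs at most len(a) + len(b) comparisons, so the overall complexity drops from the naive O(n²) to O(n log n), which is why np.intersect1d sorts internally.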