Structured arrays vs DataFrames in NumPy - Performance Comparison
How quickly do common operations run on NumPy structured arrays compared with pandas DataFrames, and how does the runtime grow as the data size increases?
Analyze the time complexity of the following code snippet.
```python
import numpy as np
import pandas as pd

# Create structured array
data_np = np.zeros(1000000, dtype=[('id', 'i4'), ('value', 'f4')])

# Create DataFrame
data_pd = pd.DataFrame({'id': np.arange(1000000), 'value': np.zeros(1000000)})

# Access 'value' column (a view in NumPy, a Series in pandas)
vals_np = data_np['value']
vals_pd = data_pd['value']

# Sum values
sum_np = np.sum(vals_np)
sum_pd = data_pd['value'].sum()
```
This code creates a large structured array and a DataFrame, then accesses and sums a column.
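To see the comparison in practice, here is a minimal timing sketch of the two sum operations. The helper name `time_it` and the repeat count are illustrative choices, not part of the original snippet; absolute times will vary by machine.

```python
import time
import numpy as np
import pandas as pd

n = 1000000
data_np = np.zeros(n, dtype=[('id', 'i4'), ('value', 'f4')])
data_pd = pd.DataFrame({'id': np.arange(n), 'value': np.zeros(n)})

def time_it(fn, repeats=5):
    # Return the best wall-clock time over a few repeats
    # to reduce noise from other processes.
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

t_np = time_it(lambda: np.sum(data_np['value']))
t_pd = time_it(lambda: data_pd['value'].sum())
print(f"structured array sum: {t_np:.6f}s, DataFrame sum: {t_pd:.6f}s")
```

Both calls dispatch to compiled code, so the measured times are typically the same order of magnitude.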
Identify the loops, recursion, or array traversals that repeat work.
- Primary operation: Summing all elements in the 'value' column.
- Number of repetitions: one pass over all elements (1,000,000 additions).
As the number of rows grows, the time to sum grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 sums |
| 100 | 100 sums |
| 1000 | 1000 sums |
Pattern observation: Doubling the data roughly doubles the work needed.
Time Complexity: O(n)
This means the time to sum grows linearly with the number of rows.
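The pattern in the table can be made explicit with a small operation-counting sketch. The manual loop below (a hypothetical stand-in for the single O(n) pass that `np.sum` performs in compiled code) counts one addition per element:

```python
import numpy as np

def summed_with_count(values):
    # Manual sum that counts one addition per element,
    # mirroring the O(n) work np.sum does internally.
    total, ops = 0.0, 0
    for v in values:
        total += v
        ops += 1
    return total, ops

for n in (10, 100, 1000):
    vals = np.ones(n, dtype='f4')
    total, ops = summed_with_count(vals)
    print(n, ops)  # ops grows in direct proportion to n
```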
[X] Wrong: "DataFrames are always slower than structured arrays because they are more complex."
[OK] Correct: Both use efficient compiled code for operations like sum, so their time complexity is the same, O(n). Observed speed differences come from constant factors and implementation details, not from asymptotic complexity.
Understanding how data structures affect operation speed helps you choose the right tool and explain your choices clearly in interviews.
"What if we replaced the sum operation with a group-by aggregation? How would the time complexity change?"
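As a starting point for that question, here is a sketch of a group-by sum in both libraries, assuming a hypothetical integer `group` column (not in the original snippet). Hash-based group-by in pandas is roughly O(n) on average, and `np.bincount` sums weights per integer label in a single O(n) pass, so the complexity stays linear; the constant factors differ.

```python
import numpy as np
import pandas as pd

n = 1000
rng = np.random.default_rng(0)
groups = rng.integers(0, 10, size=n)  # hypothetical group labels 0..9
values = np.ones(n)

# pandas: hash-based group-by, roughly O(n) on average
pd_result = pd.DataFrame({'g': groups, 'v': values}).groupby('g')['v'].sum()

# NumPy: np.bincount sums `weights` per integer label in one pass
np_result = np.bincount(groups, weights=values)

print(pd_result.values, np_result)
```

Both approaches produce one sum per group; the pandas version generalizes to non-integer keys, while `np.bincount` requires non-negative integer labels.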