Pandas and NumPy connection - Time & Space Complexity
Pandas stores its column data in NumPy arrays under the hood. We want to see how that affects speed: specifically, how the running time grows as the data gets bigger.
How does the time an operation takes change as the data size grows? Analyze the time complexity of the following code snippet.
```python
import pandas as pd
import numpy as np

arr = np.arange(1000)
df = pd.DataFrame({'numbers': arr})
df['squared'] = df['numbers'] ** 2
```
This code creates a pandas DataFrame from a NumPy array and adds a new column by squaring the numbers.
- Primary operation: Squaring each number in the 'numbers' column.
- How many times: Once for each element in the array (n times).
As the number of rows grows, the total time grows too, because each number must be processed once.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 squaring operations |
| 100 | 100 squaring operations |
| 1000 | 1000 squaring operations |
Pattern observation: The operations grow directly with the number of items; doubling the data doubles the work.
Time Complexity: O(n)
This means the running time grows linearly with the number of rows in the DataFrame.
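One way to see the linear pattern for yourself is a rough timing sketch. The sizes below are an illustrative choice, and exact numbers vary by machine; only the trend matters, not the absolute values.

```python
import time
import numpy as np
import pandas as pd

timings = {}
for n in (100_000, 1_000_000):
    df = pd.DataFrame({'numbers': np.arange(n)})
    start = time.perf_counter()
    df['squared'] = df['numbers'] ** 2  # one squaring per row: O(n)
    timings[n] = time.perf_counter() - start
    print(f"n={n:>9,}: {timings[n]:.4f} s")
```

On a typical machine the second run takes noticeably longer than the first, roughly in step with the tenfold increase in rows, which is exactly what O(n) predicts.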
[X] Wrong: "Using pandas with NumPy arrays makes operations instant, no matter the size."
[OK] Correct: Even though pandas uses fast NumPy arrays, it still needs to do work for each item, so time grows with data size.
Understanding how pandas and NumPy work together helps you explain data processing speed clearly. This skill shows you know what happens behind the scenes when working with data.
"What if we used a vectorized NumPy function directly on the array instead of pandas? How would the time complexity change?"