Why Pandas for data analysis - Performance Analysis
We want to understand how the time it takes to analyze data with pandas changes as the data grows.
How does pandas handle bigger data and what costs come with it?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

n = 10  # example size
data = pd.DataFrame({
    'A': range(n),
    'B': range(n, 0, -1)
})

# Element-wise addition of the two columns
result = data['A'] + data['B']
```
This code creates a table with two columns and adds them together element-wise.
Identify any loops, recursion, or array traversals that repeat.
- Primary operation: Adding each pair of numbers from columns 'A' and 'B'.
- How many times: Once for each row in the data, so n times.
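The vectorized addition does the same work as an explicit Python loop: one addition per row. A minimal sketch of that equivalence (the loop version is only for illustration; in practice you would keep the vectorized form):

```python
import pandas as pd

n = 10
data = pd.DataFrame({'A': range(n), 'B': range(n, 0, -1)})

# Vectorized: pandas performs one addition per row internally.
vectorized = data['A'] + data['B']

# Equivalent explicit loop: n iterations, one addition each.
looped = [data['A'].iloc[i] + data['B'].iloc[i] for i in range(n)]

# Both produce the same n sums (here every pair sums to n).
print(list(vectorized) == looped)  # True
```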
As the number of rows grows, the number of additions grows proportionally.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 additions |
| 100 | 100 additions |
| 1000 | 1000 additions |
Pattern observation: The work grows directly with the number of rows.
Time Complexity: O(n)
This means the time to add the columns grows linearly with the number of rows.
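One way to check the linear pattern empirically is to time the addition at a few sizes. A rough sketch (absolute timings vary by machine, so only the trend matters; expect each step to take roughly 10x the previous one):

```python
import time
import pandas as pd

for n in (10_000, 100_000, 1_000_000):
    data = pd.DataFrame({'A': range(n), 'B': range(n, 0, -1)})
    start = time.perf_counter()
    result = data['A'] + data['B']
    elapsed = time.perf_counter() - start
    # The result always has one sum per row, confirming n operations.
    print(f"n={n:>9}: {elapsed:.6f}s, rows in result: {len(result)}")
```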
[X] Wrong: "Adding two columns is instant no matter the size."
[OK] Correct: Each row must be processed, so bigger data takes more time.
Understanding how pandas handles data size helps you explain your choices clearly and shows you know what happens behind the scenes.
"What if we added a new column by combining three existing columns instead of two? How would the time complexity change?"