Series vs DataFrame relationship in Pandas - Performance Comparison
We want to understand how the time it takes to work with Series and DataFrames changes as their size grows.
Specifically, we want to know how the cost of common operations on Series and DataFrames scales with the number of rows.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

n = 10  # Define n before using it

# Create a DataFrame with n rows and 3 columns
df = pd.DataFrame({
    'A': range(n),
    'B': range(n, 2*n),
    'C': range(2*n, 3*n)
})

# Extract a single column as a Series
series = df['A']

# Sum values in the Series
result = series.sum()
```
This code creates a DataFrame, extracts one column as a Series, and sums its values.
Identify the repeated work: any loops, recursion, or array traversals.
- Primary operation: Summing all values in the Series.
- How many times: The sum operation visits each element once, so n times.
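To build intuition for why the count is n, here is a manual loop that does the same work as `series.sum()`. This is a sketch for counting operations only; pandas actually delegates the sum to optimized NumPy code, but the number of elements visited is the same.

```python
import pandas as pd

n = 10
series = pd.Series(range(n))

# Manual equivalent of series.sum(): one addition per element, n in total
total = 0
additions = 0
for value in series:
    total += value
    additions += 1

print(total)      # 45, same result as series.sum()
print(additions)  # 10, one addition per row
```

The loop body runs exactly once per row, which is the n-times pattern identified above.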
As the number of rows n grows, the sum operation takes longer because it adds more numbers.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 additions |
| 100 | 100 additions |
| 1000 | 1000 additions |
Pattern observation: The number of operations grows directly with n, so doubling n doubles the work.
Time Complexity: O(n)
This means the time to sum the Series grows linearly with the number of rows.
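An informal way to see the linear trend is to time the sum at a few sizes. Exact times vary by machine and by run, so treat the printed numbers as illustrative rather than exact; the point is that the time grows roughly in proportion to n.

```python
import time
import pandas as pd

for n in (10_000, 100_000, 1_000_000):
    s = pd.Series(range(n))
    start = time.perf_counter()
    total = s.sum()
    elapsed = time.perf_counter() - start
    # Expect elapsed to grow roughly 10x as n grows 10x
    print(f"n={n:>9,}  sum={total}  time={elapsed:.6f}s")
```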
[X] Wrong: "The sum operation takes constant time regardless of the size."
[OK] Correct: The sum must visit each of the n elements in the Series, performing an addition for each, so it scales linearly with n.
Understanding how Series and DataFrame operations scale helps you write efficient data code and explain your choices clearly in interviews.
What if we summed all columns in the DataFrame instead of just one Series? How would the time complexity change?
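As a starting point for exploring that question, here is a sketch. Summing every column touches all n rows in each of the m columns, so the total work grows with n times m; with a fixed number of columns (m = 3 here), that is still linear in the row count n.

```python
import pandas as pd

n = 10
df = pd.DataFrame({
    'A': range(n),
    'B': range(n, 2*n),
    'C': range(2*n, 3*n)
})

# df.sum() reduces each of the m columns over all n rows:
# roughly n * m additions in total
column_totals = df.sum()
print(column_totals)  # per-column sums: A=45, B=145, C=245
```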