Why built-in plotting matters in Pandas - Performance Analysis
We want to see how the speed of pandas' built-in plotting changes as the data grows.
How does the time to create a plot change when we add more data?
Analyze the time complexity of the following code snippet.
import pandas as pd
import numpy as np

# Create a DataFrame with n rows
n = 1000
data = pd.DataFrame({
    'x': np.arange(n),
    'y': np.random.randn(n)
})

# Plot y vs x
plot = data.plot(x='x', y='y')
This code creates a DataFrame with n rows and plots the y values against x.
Identify any loops, recursion, or array traversals that repeat.
- Primary operation: The plotting function processes each data point to draw it on the graph.
- How many times: Once for each row in the DataFrame (n times).
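A quick way to check the "once per row" claim is to look at what the plot actually holds. The sketch below assumes the default matplotlib backend for pandas plotting and uses the non-interactive `Agg` renderer so it runs without a display; the variable names `ax` and `line` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import numpy as np
import pandas as pd

n = 1000
data = pd.DataFrame({"x": np.arange(n), "y": np.random.randn(n)})

# DataFrame.plot returns a matplotlib Axes for a single line plot
ax = data.plot(x="x", y="y")

# The first line on the axes stores one (x, y) pair per DataFrame row
line = ax.lines[0]
print(len(line.get_xdata()), "points stored for", n, "rows")
```

The line object carries one point per row, so any step that walks the line's data touches n values.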
As the number of rows increases, the plotting work grows roughly in direct proportion, on top of a fixed cost for setting up the figure and axes.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 drawing steps |
| 100 | About 100 drawing steps |
| 1000 | About 1000 drawing steps |
Pattern observation: Doubling the data roughly doubles the work needed to plot.
Time Complexity: O(n)
This means the time to plot grows linearly with the number of data points.
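The linear trend can be checked empirically with a rough timing sketch like the one below. It is not a rigorous benchmark: the helper `time_plot` and the chosen sizes are illustrative, the `Agg` backend is an assumption made so the script runs without a display, and actual numbers depend on the machine and library versions. Rendering is forced with `canvas.draw()` because creating the plot alone does not necessarily rasterize it.

```python
import time

import matplotlib
matplotlib.use("Agg")  # assumption: off-screen rendering
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def time_plot(n: int) -> float:
    """Return the seconds taken to build and render a line plot of n rows."""
    data = pd.DataFrame({"x": np.arange(n), "y": np.random.randn(n)})
    start = time.perf_counter()
    ax = data.plot(x="x", y="y")
    ax.figure.canvas.draw()  # force the actual drawing work
    elapsed = time.perf_counter() - start
    plt.close(ax.figure)  # release the figure so memory does not accumulate
    return elapsed


sizes = [1_000, 10_000, 100_000]  # illustrative sizes
timings = {n: time_plot(n) for n in sizes}
for n, seconds in timings.items():
    print(f"n={n:>7}: {seconds:.4f} s")
```

For small n the fixed setup cost tends to dominate, so the linear growth usually becomes visible only at the larger sizes.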
[X] Wrong: "Plotting time stays the same no matter how much data there is."
[OK] Correct: Each data point needs to be drawn, so more points mean more work and more time.
Understanding how plotting time grows helps you explain performance in data visualization tasks clearly and confidently.
What if we changed the plot to show only a summary (like averages) instead of every point? How would the time complexity change?
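One way to explore this question is to aggregate before plotting. The sketch below is only an illustration: the bin count `bins` and the name `summary` are hypothetical choices, and the `Agg` backend is assumed so the script runs without a display. Computing the averages still reads every row once, but the plot itself then holds only one point per bin.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen rendering
import numpy as np
import pandas as pd

n = 1000
bins = 20  # hypothetical number of summary points
data = pd.DataFrame({"x": np.arange(n), "y": np.random.randn(n)})

# Group rows into equal-width bins of x and average y within each bin.
# This pass still visits all n rows.
bin_size = n // bins
summary = data.groupby(data["x"] // bin_size)["y"].mean()

# The plot now draws only `bins` points instead of n
ax = summary.plot()
print(len(ax.lines[0].get_xdata()), "points drawn for", n, "rows")
```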