Box plots in Pandas - Time & Space Complexity
We want to understand how the time to create a box plot changes as the data size grows.
How does the work needed scale when we have more data points?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'values': np.random.randn(1000)  # 1000 random numbers
})

boxplot = data.boxplot(column='values')
```
This code creates a box plot for a column of 1000 numbers in a pandas DataFrame.
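Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Scanning the data points to find the minimum, first quartile, median, third quartile, and maximum.
- How many times: The minimum and maximum each need one pass over the data. The quartiles need order information, which a selection-based routine obtains in O(n) time on average; a full sort would cost O(n log n) instead.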
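The operations listed above can be sketched directly with NumPy. This is a simplified illustration, not the plotting library's actual code path: the helper name `five_number_summary` is hypothetical, and it follows the text in using the raw minimum and maximum, whereas matplotlib's default whiskers extend to 1.5 times the interquartile range and treat points beyond them as outliers.

```python
import numpy as np

def five_number_summary(values):
    """Return (min, q1, median, q3, max) for a 1-D array-like (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    # One pass over the data for each extreme.
    lowest, highest = arr.min(), arr.max()
    # All three quartiles are computed in a single call.
    q1, median, q3 = np.percentile(arr, [25, 50, 75])
    return lowest, q1, median, q3, highest

summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

Every step here touches each element a bounded number of times, which is where the linear growth comes from.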
As the number of data points n increases, the time to compute the statistics grows roughly in direct proportion to n.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 operations to scan data |
| 100 | About 100 operations to scan data |
| 1000 | About 1000 operations to scan data |
Pattern observation: The work grows linearly as the data size grows.
Time Complexity: O(n)
This means the time to create a box plot grows roughly in direct proportion to the number of data points, assuming the quartiles are found by selection rather than by fully sorting the data.
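One way to check the linear trend empirically is to time the quartile computation at several input sizes. The sketch below is an assumption-laden experiment rather than a benchmark: the helper `time_quartiles` and the chosen sizes are hypothetical, it times only `np.percentile` (not the drawing step), and small inputs are dominated by fixed overhead, so the proportional growth only shows up for larger n.

```python
import time
import numpy as np

def time_quartiles(n, repeats=5):
    """Best-of-`repeats` wall-clock time to compute quartiles of n values."""
    rng = np.random.default_rng(0)
    data = rng.standard_normal(n)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.percentile(data, [25, 50, 75])
        best = min(best, time.perf_counter() - start)
    return best

sizes = [10_000, 100_000, 1_000_000]  # hypothetical sizes for illustration
timings = {n: time_quartiles(n) for n in sizes}
for n, seconds in timings.items():
    print(f"n={n:>9,}: {seconds * 1e3:.3f} ms")
```

If the work is linear, each tenfold increase in n should raise the measured time by roughly a factor of ten, though the exact numbers depend on the machine.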
[X] Wrong: "Creating a box plot takes the same time no matter how many data points there are."
[OK] Correct: The box plot needs to look at each data point to find key statistics, so more data means more work.
Understanding how data size affects plotting helps you explain performance in real projects and shows you think about efficiency.
"What if we grouped the data by a category and made box plots for each group? How would the time complexity change?"
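As a starting point for exploring this question, the sketch below computes per-group quartiles with pandas. The column name `category`, the sizes `n` and `k`, and the random data are all assumptions made for illustration. Note that each row belongs to exactly one group, so the group sizes always add up to n.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, k = 1000, 4  # hypothetical sizes: n rows split across k categories

df = pd.DataFrame({
    "values": rng.standard_normal(n),
    "category": rng.integers(0, k, size=n),
})

# Quartiles per group: one row per category, one column per quantile.
grouped_quartiles = (
    df.groupby("category")["values"]
      .quantile([0.25, 0.5, 0.75])
      .unstack()
)
print(grouped_quartiles)

# To draw the grouped plot itself (requires matplotlib):
# df.boxplot(column="values", by="category")
```

Try varying `n` and `k` separately and compare how each one affects the running time.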