KDE overlay concept in Matplotlib - Time & Space Complexity
We want to understand how the time to create a KDE overlay plot changes as we add more data points.
How does the plotting time grow when we increase the data size?
Analyze the time complexity of the following matplotlib code snippet.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

n = 1000  # number of data points per dataset
data1 = np.random.normal(0, 1, n)
data2 = np.random.normal(1, 1, n)

# Build one KDE per dataset, then evaluate both on a shared grid
kde1 = gaussian_kde(data1)
kde2 = gaussian_kde(data2)
x = np.linspace(-5, 5, 1000)

plt.plot(x, kde1(x))
plt.plot(x, kde2(x))
plt.show()
```
This code creates two KDE curves from two datasets and overlays them on one plot.
Look for loops or repeated calculations in the code.
- Primary operation: evaluating the KDE at each point in `x`, for both datasets.
- How many times: for each of the 1000 points in `x`, the KDE computes a sum over all `n` data points, and this happens twice (once per dataset).
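To make that per-point cost concrete, here is a minimal sketch of a Gaussian KDE evaluation (not scipy's actual implementation, which also selects the bandwidth automatically; the bandwidth of 0.5 here is an illustrative assumption). Each evaluation at a single grid point sums one kernel term per data point, which is exactly where the O(n) factor comes from:

```python
import numpy as np

def kde_at(x0, data, bandwidth=0.5):
    """Gaussian KDE at one point: one kernel term per data point -> O(n)."""
    z = (x0 - data) / bandwidth
    return np.mean(np.exp(-0.5 * z**2)) / (bandwidth * np.sqrt(2 * np.pi))

data = np.random.normal(0, 1, 1000)
xs = np.linspace(-5, 5, 1000)

# Evaluating the full grid performs 1000 * n kernel terms per dataset.
ys = np.array([kde_at(x0, data) for x0 in xs])
```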
The time to compute KDE grows as we increase n, the number of data points.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20,000 (2 datasets x 1000 points x 10 data points) |
| 100 | About 200,000 |
| 1000 | About 2,000,000 |
Pattern observation: the operation count grows in direct proportion to n, multiplied by the fixed factor of 2 datasets x 1000 grid points.
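The table's arithmetic can be checked directly; the counts below simply multiply the fixed factors by n:

```python
# operations ~= 2 datasets * 1000 grid points * n data points
for n in (10, 100, 1000):
    ops = 2 * 1000 * n
    print(f"n={n}: ~{ops:,} operations")
# n=10: ~20,000  |  n=100: ~200,000  |  n=1000: ~2,000,000
```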
Time Complexity: O(n)
This means that, for a fixed evaluation grid, the time to compute the KDE overlay grows linearly as the number of data points increases.
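As a rough empirical check (absolute timings vary by machine; this sketch assumes scipy is installed), evaluating a `gaussian_kde` on a fixed 1000-point grid should take roughly 10x longer when n grows 10x:

```python
import time
import numpy as np
from scipy.stats import gaussian_kde

x = np.linspace(-5, 5, 1000)  # fixed evaluation grid

for n in (1_000, 10_000):
    data = np.random.normal(0, 1, n)
    kde = gaussian_kde(data)
    start = time.perf_counter()
    y = kde(x)  # each grid point sums over all n data points
    elapsed = time.perf_counter() - start
    print(f"n={n:>6}: evaluated {x.size} grid points in {elapsed:.4f}s")
```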
[X] Wrong: "The time to plot KDE does not depend on the number of data points because the x-axis points are fixed."
[OK] Correct: Even though the x-axis points are fixed, KDE calculation sums over all data points for each x value, so more data means more work.
Understanding how plotting time grows with data size helps you explain performance in data visualization tasks clearly and confidently.
What if we increased the number of x points from 1000 to 10,000? How would the time complexity change?