Time Complexity Analysis in Python Data Analysis
When we analyze time complexity, we want to understand how the runtime grows as the data size grows.
We ask: How does the work increase as the input size grows?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

def calculate_mean(df):
    means = {}
    for col in df.columns:
        means[col] = df[col].mean()
    return means

# df is a DataFrame with n rows and m columns
```
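To make the snippet concrete, here is a small usage sketch (the column names `a` and `b` and their values are illustrative):

```python
import pandas as pd

def calculate_mean(df):
    means = {}
    for col in df.columns:
        means[col] = df[col].mean()
    return means

# A tiny DataFrame: n = 3 rows, m = 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(calculate_mean(df))  # {'a': 2.0, 'b': 5.0}
```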
This code calculates the average value for each column in a data table.
Identify the loops, recursion, and array traversals that repeat.
- Primary operation: Looping over each column and computing the mean of all rows in that column.
- How many times: The loop runs once per column (m times), and on each iteration the `.mean()` call processes all n rows of that column.
As the number of rows (n) or columns (m) grows, the work grows too.
| Input Size (n rows, m columns) | Approx. Operations |
|---|---|
| 10 rows, 5 columns | About 50 operations (10*5) |
| 100 rows, 5 columns | About 500 operations (100*5) |
| 1000 rows, 10 columns | About 10,000 operations (1000*10) |
Pattern observation: The total work grows by multiplying rows and columns, so doubling either doubles the work.
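One way to verify the n * m pattern is to count cell visits directly. The sketch below is a rough model that ignores pandas internals and simply tallies how many values the column-by-column mean must read:

```python
import pandas as pd

def count_element_visits(df):
    # Count how many cell values the column-by-column mean touches.
    visits = 0
    for col in df.columns:
        visits += len(df[col])  # the mean reads every row in the column
    return visits

small = pd.DataFrame({c: range(10) for c in "abcde"})  # 10 rows, 5 columns
large = pd.DataFrame({c: range(20) for c in "abcde"})  # 20 rows, 5 columns
print(count_element_visits(small))  # 50
print(count_element_visits(large))  # 100 -- doubling the rows doubles the work
```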
Time Complexity: O(n * m)
This means the time to calculate all means grows proportionally with the number of rows times the number of columns.
[X] Wrong: "Calculating the mean for each column is just O(m) because we loop over columns only."
[OK] Correct: Each mean calculation looks at all rows, so the work inside the loop depends on n, making total work depend on both n and m.
Understanding how data size affects processing time is key in data science. This skill helps you explain and improve data handling in real projects.
"What if we used a built-in function that calculates means for all columns at once? How would the time complexity change?"