Exploratory data analysis workflow in Pandas - Time & Space Complexity
When we explore data with pandas, we typically chain several summary operations to understand a dataset.
The key question is how the running time grows as the data gets larger.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.read_csv('data.csv')                  # load the dataset
summary = df.describe()                       # per-column summary statistics
missing = df.isnull().sum()                   # missing values per column
value_counts = df['category'].value_counts()  # frequency of each category
correlations = df.corr(numeric_only=True)     # pairwise correlations; numeric_only
                                              # avoids errors on non-numeric columns
                                              # in pandas >= 2.0
```
This code loads data, summarizes it, counts missing values, counts categories, and finds correlations.
First, identify the repeated work: loops, recursion, and array traversals.
- Primary operation: pandas scans every row of each column to compute the statistics.
- How many times: each method makes one full pass over the rows (or a small constant number of passes, depending on the method).
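To make the "one pass per operation" idea concrete, here is a hedged sketch, not pandas internals: two of the EDA steps rewritten as plain Python loops over a tiny hypothetical dataset, with a counter recording how many rows each pass touches.

```python
# Illustrative sketch only: a list of dicts stands in for the DataFrame,
# and `touches` counts every row visit across all passes.
rows = [{"category": "a", "x": 1.0},
        {"category": "b", "x": None},
        {"category": "a", "x": 3.0}]

touches = 0

def count_missing(rows, col):
    """One pass over the rows: mimics df[col].isnull().sum()."""
    global touches
    missing = 0
    for r in rows:                      # n iterations
        touches += 1
        if r[col] is None:
            missing += 1
    return missing

def value_counts(rows, col):
    """One pass over the rows: mimics df[col].value_counts()."""
    global touches
    counts = {}
    for r in rows:                      # n iterations
        touches += 1
        counts[r[col]] = counts.get(r[col], 0) + 1
    return counts

print(count_missing(rows, "x"))        # 1
print(value_counts(rows, "category"))  # {'a': 2, 'b': 1}
print(touches)                         # 6 = 2 passes * 3 rows
```

Two operations over three rows means six row visits in total: the row count, not the number of operations, is what scales with the data.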
As the number of rows grows, the time to compute summaries and counts grows roughly in direct proportion.
| Input Size (n) | Approx. Row Scans per Column |
|---|---|
| 10 | ~10 |
| 100 | ~100 |
| 1000 | ~1000 |
Pattern observation: The work grows linearly with the number of rows.
Time Complexity: O(n)
Treating the number of columns as a fixed constant, the time needed grows in direct proportion to the number of rows n.
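You can check this proportionality empirically. The sketch below times a single-pass sum over two lists, one ten times larger than the other; the lists and the loop are illustrative stand-ins for a pandas column scan, not the original code.

```python
import time

def scan_sum(data):
    """A single O(n) pass over the data, like a column-wise sum."""
    total = 0
    for x in data:
        total += x
    return total

small = list(range(100_000))
big = list(range(1_000_000))  # 10x the rows

t0 = time.perf_counter()
scan_sum(small)
t_small = time.perf_counter() - t0

t0 = time.perf_counter()
scan_sum(big)
t_big = time.perf_counter() - t0

# With linear growth, the larger input should take noticeably longer,
# roughly in proportion to the size ratio (timings vary by machine).
print(t_big > t_small)
```

Exact timings depend on the machine and interpreter, but the larger input reliably takes longer, and the ratio hovers near the 10x size ratio.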
[X] Wrong: "The time to get summaries stays the same no matter how big the data is."
[OK] Correct: Each summary needs to look at every row, so more rows mean more work and more time.
Understanding how data size affects analysis time helps you explain your approach clearly and shows you know how tools work under the hood.
"What if we added a step that compares every row to every other row? How would the time complexity change?"
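As a hedged sketch of that scenario: comparing every row to every other row requires a nested loop, which performs n*(n-1)/2 comparisons and therefore grows as O(n^2). The duplicate check and the sample values below are illustrative, not part of the original snippet.

```python
# Illustrative pairwise comparison: count duplicate values by
# comparing every row to every later row.
rows = [3, 1, 4, 1, 5]

comparisons = 0
duplicates = 0
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):  # nested loop: n*(n-1)/2 pairs
        comparisons += 1
        if rows[i] == rows[j]:
            duplicates += 1

print(comparisons)  # 10 = 5*4/2
print(duplicates)   # 1 (the two 1s)
```

Doubling the rows roughly quadruples the comparisons, so a step like this would dominate the otherwise linear workflow.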