# Data validation checks in Pandas: Time & Space Complexity
When we check data for errors or missing values using pandas, we want to know how long these checks take as data grows.
We ask: How does the time to validate data change when the data size increases?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# Small example frame with one missing value in each column
df = pd.DataFrame({
    'age': [25, 30, None, 22, 40],
    'salary': [50000, 60000, 55000, None, 70000]
})

# Count missing values in 'age' (one pass over the column)
missing_age = df['age'].isnull().sum()

# Check whether every 'salary' value is positive (one pass over the column)
valid_salary = (df['salary'] > 0).all()
```
This code counts the missing values in the 'age' column and verifies that every 'salary' value is positive. On this frame, `missing_age` is 1 and `valid_salary` is `False`, because `NaN > 0` evaluates to `False`, so the missing salary fails the positivity check.
Identify the loops, recursion, or array traversals that repeat:
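One pandas subtlety is worth demonstrating: comparisons against `NaN` return `False`, so a missing value fails a positivity check rather than being skipped. A minimal sketch using the same salary data:

```python
import pandas as pd

# Toy salary column with one missing value (mirrors the example above)
salary = pd.Series([50000, 60000, 55000, None, 70000])

# NaN > 0 evaluates to False, so the missing value fails the check
strict = (salary > 0).all()                # False

# To validate only the values that are present, drop NaN first
non_missing = (salary.dropna() > 0).all()  # True
```

Whether the strict or the `dropna` behavior is what you want depends on the validation rule you intend.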
- Primary operation: Scanning each value in the 'age' and 'salary' columns once.
- How many times: Once per column, so twice total, each over all rows.
As the number of rows grows, the time to check for missing or non-positive values grows in direct proportion.
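To make the single pass visible, the vectorized `isnull().sum()` can be written as a plain loop over the n rows (a sketch of the equivalent work, not how pandas implements it internally):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, None, 22, 40],
    'salary': [50000, 60000, 55000, None, 70000]
})

# Explicit single pass over the 'age' column: n iterations for n rows
missing = 0
for value in df['age']:
    if pd.isna(value):
        missing += 1

# missing == 1 here, matching df['age'].isnull().sum()
```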
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20 checks (2 columns x 10 rows) |
| 100 | About 200 checks |
| 1000 | About 2000 checks |
Pattern observation: The number of checks grows directly with the number of rows.
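The table's counts can be reproduced for any n: with 2 columns, a full validation touches 2 × n cells. A quick sketch (the random data is just filler so the frame has n rows):

```python
import pandas as pd
import numpy as np

for n in (10, 100, 1000):
    df = pd.DataFrame({
        'age': np.random.randint(18, 65, size=n),
        'salary': np.random.randint(30000, 90000, size=n),
    })
    checks = df.shape[0] * df.shape[1]  # one check per cell: 2 * n
    print(n, checks)                    # 10 -> 20, 100 -> 200, 1000 -> 2000
```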
Time Complexity: O(n)
This means validation time grows linearly with the data size: doubling the rows roughly doubles the time.
[X] Wrong: "Checking for missing values is instant no matter how big the data is."
[OK] Correct: Each value must be checked once, so more data means more time.
Understanding how data checks scale helps you write efficient code and explain your choices clearly in real projects.
"What if we checked multiple columns instead of two? How would the time complexity change?"