0
0
Pandasdata~5 mins

Forward fill and backward fill in Pandas - Time & Space Complexity

Choose your learning style9 modes available
Time Complexity: Forward fill and backward fill
O(n)
Understanding Time Complexity

We want to understand how the time it takes to fill missing data grows as the data size increases.

Specifically, how does forward fill or backward fill behave when applied to larger data?

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
n = 1000
data = pd.DataFrame({
    'A': np.random.choice([1, np.nan], size=n)
})

# Forward fill missing values
filled = data.ffill()

# Backward fill missing values
filled_back = data.bfill()

This code creates a column with some missing values and fills them forward and backward.

Identify Repeating Operations

Identify the loops, recursion, array traversals that repeat.

  • Primary operation: Scanning the column values once to fill missing data.
  • How many times: Each element is visited once in order for forward fill, and once for backward fill.
How Execution Grows With Input

As the number of rows increases, the time to fill missing values grows roughly in direct proportion.

Input Size (n)Approx. Operations
10About 10 steps to fill
100About 100 steps to fill
1000About 1000 steps to fill

Pattern observation: The work grows linearly as the data size grows.

Final Time Complexity

Time Complexity: O(n)

This means the time to fill missing values grows directly with the number of rows.

Common Mistake

[X] Wrong: "Forward fill checks every previous value multiple times, so it takes much longer than linear time."

[OK] Correct: The fill methods scan the data only once in order, so each element is visited once, making it linear time.

Interview Connect

Understanding how filling missing data scales helps you handle larger datasets efficiently and shows you know how pandas works under the hood.

Self-Check

What if we applied forward fill on multiple columns at once? How would the time complexity change?