Handling missing values (drop_na, replace_na) in R Programming - Time & Space Complexity
When working with data, we often need to handle missing values (NA). This topic examines how the time needed to remove or fill missing values grows as the data size increases.
Analyze the time complexity of the following code snippet.
library(dplyr)
library(tidyr)
data <- tibble(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, 2, 3, 4, NA)
)
# Remove rows with any NA
clean_data <- drop_na(data)
# Fill NA with zero
filled_data <- data %>% mutate(across(everything(), ~replace_na(.x, 0)))
This code removes rows with missing values and fills missing values with zero in a data frame.
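The same element-wise scan can be sketched in base R, which makes the per-element checks more explicit: both complete.cases() and is.na() visit every cell of the data frame once.

```r
# Sample data frame with missing values (same shape as the tibble above)
data <- data.frame(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, 2, 3, 4, NA)
)

# Remove rows with any NA: complete.cases() scans all rows x columns
clean_data <- data[complete.cases(data), ]

# Fill every NA with zero: is.na() also checks each cell once
filled_data <- data
filled_data[is.na(filled_data)] <- 0
```

Only row 3 has no missing values, so clean_data keeps a single row, while filled_data keeps all five rows with the NAs replaced by zero.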
- Primary operation: Checking each element in the data frame for missing values.
- How many times: Once for each element in the data (rows x columns).
As the number of rows grows, the program checks more elements to find missing values.
| Input Size (rows x columns) | Approx. Operations |
|---|---|
| 10 x 2 = 20 | About 20 checks |
| 100 x 2 = 200 | About 200 checks |
| 1000 x 2 = 2000 | About 2000 checks |
Pattern observation: The number of checks grows directly with the number of elements in the data.
Time Complexity: O(n), where n is the total number of elements (rows x columns). The time to handle missing values grows linearly with the size of the data.
Space Complexity: O(n). Both drop_na() and replace_na() return a new data frame rather than modifying the input in place, so the extra memory also grows with the data size.
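You can check the linear pattern from the table empirically with a small timing sketch (a hypothetical benchmark using base R, so it runs without extra packages; `time_clean` is a helper name introduced here, not part of any library). Doubling the row count should roughly double the elapsed time.

```r
# Hypothetical benchmark: time NA removal as the row count grows
time_clean <- function(n_rows) {
  # Build a data frame with random values, including some NAs
  data <- data.frame(
    x = sample(c(1:10, NA), n_rows, replace = TRUE),
    y = sample(c(1:10, NA), n_rows, replace = TRUE)
  )
  # Measure only the cleaning step, not the data generation
  system.time(data[complete.cases(data), ])["elapsed"]
}

# Elapsed times should grow roughly in proportion to the row count
for (n in c(100000, 200000, 400000)) {
  cat(format(n, scientific = FALSE), "rows:", time_clean(n), "seconds\n")
}
```

Exact timings depend on your machine, so look at the ratio between successive runs rather than the absolute numbers.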
[X] Wrong: "Handling missing values takes the same time no matter how big the data is."
[OK] Correct: The program must check every element, so bigger data means more work and more time.
Understanding how data cleaning time grows helps you write efficient code and explain your choices clearly in real projects.
"What if we only fill missing values in one column instead of all columns? How would the time complexity change?"
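One way to explore that question: targeting a single column with replace_na() means the scan touches only that column's values, so the work is roughly O(rows) for that column instead of O(rows x columns) for the whole data frame. A minimal sketch, reusing the sample data from above:

```r
library(dplyr)
library(tidyr)

data <- tibble(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, 2, 3, 4, NA)
)

# Fill NAs in column x only: about O(rows) checks instead of O(rows x columns)
filled_x <- data %>% mutate(x = replace_na(x, 0))
```

Column x has its NAs replaced with zero, while column y is left untouched, so the asymptotic class is still linear, just in the number of rows rather than total elements.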