Overview - Handling missing values

What is it?

Handling missing values means dealing with gaps or blanks in data where information was never recorded or was lost. These gaps arise for many reasons, such as data-collection errors or people skipping questions. Because many machine learning algorithms cannot train on incomplete rows, we must find ways to fill in or otherwise manage these gaps. Doing so helps keep our models accurate and reliable.

Why it matters

Without handling missing values, machine learning models can make wrong guesses or fail to learn patterns properly. Imagine trying to solve a puzzle with missing pieces; the picture won't be clear. In real life, this could mean bad decisions in healthcare, finance, or any field relying on data. Handling missing values ensures we use all available information wisely and avoid misleading results.

Where it fits

Before learning this, you should understand basic data types and how machine learning models work with data. After mastering missing value handling, you can explore feature engineering and advanced data cleaning techniques. This topic is a key step in preparing data for any machine learning project.

Mental Model

Core Idea

Handling missing values means smartly filling or managing gaps in data so models can learn without confusion or bias.

Think of it like...

It's like fixing holes in a road before driving a car; if you don't fix them, the ride will be bumpy and the car may even be damaged.

Data with missing values:
┌─────────────┬─────────────┬─────────────┐
│ Feature A   │ Feature B   │ Feature C   │
├─────────────┼─────────────┼─────────────┤
│ 5           │ 3           │ missing     │
│ missing     │ 7           │ 2           │
│ 4           │ missing     │ 6           │
└─────────────┴─────────────┴─────────────┘

Handling methods:
[Remove rows]  [Fill with mean]  [Fill with constant]  [Predict missing]

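Resulting data (each gap filled with its column mean):
┌─────────────┬─────────────┬─────────────┐
│ Feature A   │ Feature B   │ Feature C   │
├─────────────┼─────────────┼─────────────┤
│ 5           │ 3           │ 4 (mean)    │
│ 4.5 (mean)  │ 7           │ 2           │
│ 4           │ 5 (mean)    │ 6           │
└─────────────┴─────────────┴─────────────┘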
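The mean-fill result above can be reproduced in a few lines of pandas. This is a minimal sketch; the column names `A`, `B`, and `C` are stand-ins for Feature A, B, and C.

```python
import numpy as np
import pandas as pd

# The toy table from above; np.nan marks a missing entry.
df = pd.DataFrame({
    "A": [5, np.nan, 4],
    "B": [3, 7, np.nan],
    "C": [np.nan, 2, 6],
})

# Fill each column's gaps with that column's mean.
# df.mean() skips NaN, so A -> 4.5, B -> 5.0, C -> 4.0.
filled = df.fillna(df.mean())
print(filled)
```

Each gap receives the mean of the observed values in its own column, which matches the resulting table.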

Build-Up - 7 Steps

1

Foundation: What Are Missing Values?

Concept: Introduce what missing values are and why they appear in data.

Missing values are spots in data where no information is recorded. They can happen because of mistakes, skipped questions, broken sensors, or lost files. For example, in a survey, some people might not answer all questions, leaving blanks.

Result

You understand that missing values are common and natural in real-world data.

Knowing what missing values are helps you recognize why data cleaning is necessary before analysis or modeling.
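Before cleaning anything, it helps to measure how much is missing. The sketch below counts gaps with pandas; the `survey` table and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; None and np.nan both count as missing.
survey = pd.DataFrame({
    "age": [34, np.nan, 51, 29],
    "income": [52000, 61000, np.nan, np.nan],
    "city": ["Oslo", None, "Lima", "Pune"],
})

missing_per_column = survey.isna().sum()           # number of gaps in each column
missing_share = survey.isna().mean()               # fraction of gaps in each column
rows_with_gaps = survey.isna().any(axis=1).sum()   # rows with at least one gap

print(missing_per_column)
print(missing_share)
print(rows_with_gaps)
```

Here `income` has the most gaps (2 of 4 values), and 3 of the 4 rows contain at least one gap.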

2

Foundation: Types of Missing Data
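Missing data is commonly grouped into three mechanisms: missing completely at random (MCAR), missing at random (MAR, where missingness depends on other observed columns), and missing not at random (MNAR, where it depends on the unobserved value itself). The simulation below is a rough sketch on synthetic data; the `age` and `income` variables and all probabilities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(18, 80, size=n)
income = rng.normal(50_000, 15_000, size=n)

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed column (older respondents skip more often).
mar_mask = rng.random(n) < np.where(age > 60, 0.4, 0.1)

# MNAR: missingness depends on the hidden value itself (high earners skip more often).
mnar_mask = rng.random(n) < np.where(income > 65_000, 0.5, 0.1)

print(f"true mean income: {income.mean():,.0f}")
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    print(f"{name}: mean of observed incomes = {income[~mask].mean():,.0f}")
```

In this setup the observed mean under MNAR falls below the true mean, because high incomes are the ones most likely to be hidden. That kind of bias is why the mechanism matters when choosing a handling method.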

3

Intermediate: Simple Handling: Removing Missing Data
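Removal comes in several strengths. The sketch below shows three pandas `dropna` variants on a toy table; the data is made up, and which variant suits a project depends on how much data can be spared.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [5, np.nan, 4, np.nan],
    "B": [3, 7, np.nan, np.nan],
    "C": [1, 2, 6, np.nan],
})

complete_rows = df.dropna()              # keep only rows with no gaps at all
non_empty_rows = df.dropna(how="all")    # drop only rows where every value is missing
rows_with_a = df.dropna(subset=["A"])    # drop rows only when column A is missing

print(len(df), len(complete_rows), len(non_empty_rows), len(rows_with_a))
# prints: 4 1 3 2
```

Dropping every incomplete row keeps just 1 of 4 rows here, while the gentler variants keep 3 and 2.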

4

Intermediate: Filling Missing Values with Simple Imputation
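Simple imputation replaces each gap with a single summary value. The sketch below compares the mean, median, mode, and a constant on a small invented table in which one large `age` value skews the column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22, 25, np.nan, 24, 90],                  # skewed by one large value
    "city": ["Oslo", None, "Oslo", "Lima", "Oslo"],
})

age_mean = df["age"].fillna(df["age"].mean())         # pulled up by the outlier
age_median = df["age"].fillna(df["age"].median())     # robust to the outlier
city_mode = df["city"].fillna(df["city"].mode()[0])   # most frequent category
city_const = df["city"].fillna("unknown")             # explicit placeholder

print(age_mean[2], age_median[2], city_mode[1], city_const[1])
# prints: 40.25 24.5 Oslo unknown
```

The mean fill (40.25) lands far from the typical ages, while the median fill (24.5) stays close to them. scikit-learn's `SimpleImputer` offers the same strategies for use inside a pipeline.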

5

Intermediate: Advanced Imputation: Predicting Missing Values
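Predictive imputation estimates a missing value from the other features. The sketch below uses an ordinary least-squares line fitted with NumPy as a stand-in for whatever model a project would actually use; the `height` and `weight` data are synthetic, and the comparison with the true values is only possible because the gaps were created artificially.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
height = rng.normal(170, 10, size=n)
weight = 0.9 * height - 85 + rng.normal(0, 3, size=n)   # weight depends on height

weight_obs = weight.copy()
missing = rng.random(n) < 0.25          # hide about a quarter of the weights
weight_obs[missing] = np.nan

# Fit weight ~ intercept + slope * height on rows where weight is known.
X_known = np.column_stack([np.ones((~missing).sum()), height[~missing]])
coef, *_ = np.linalg.lstsq(X_known, weight_obs[~missing], rcond=None)

# Predict the hidden weights from height and fill them in.
X_missing = np.column_stack([np.ones(missing.sum()), height[missing]])
weight_filled = weight_obs.copy()
weight_filled[missing] = X_missing @ coef

# Compare both fills against the true values (unknown in real data).
mean_fill_error = np.abs(np.nanmean(weight_obs) - weight[missing]).mean()
model_fill_error = np.abs(weight_filled[missing] - weight[missing]).mean()
print(f"mean fill MAE: {mean_fill_error:.2f}, regression fill MAE: {model_fill_error:.2f}")
```

Because weight is strongly related to height in this synthetic data, the regression fill has a smaller error than the mean fill. The advantage shrinks when the other features carry little information about the missing one.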

6

Advanced: Using Missingness as a Feature
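One way to let a model use missingness is to add an indicator column before filling the gap. This is a minimal sketch; the `income` column and the `income_missing` name are illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan, 48000]})

# Record where the value was missing BEFORE filling it in.
df["income_missing"] = df["income"].isna().astype(int)

# Then fill the gaps; the indicator preserves the fact that they were gaps.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```

The order matters: once the gaps are filled, the information about where they were is gone unless the indicator was created first.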

7

Expert: Multiple Imputation and Uncertainty Handling
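Multiple imputation fills the gaps several times, analyzes each completed dataset, and pools the results. The sketch below is one simple way to do this on synthetic data, assuming a linear relationship between `x` and `y`: each round bootstraps the observed rows, imputes with regression plus noise, and the estimates are pooled with Rubin's rules. The quantity of interest (the mean of `y`) and the choice of 20 imputations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 300, 20                          # sample size, number of imputed datasets
x = rng.normal(0, 1, size=n)
y_true = 2.0 * x + rng.normal(0, 1, size=n)
missing = rng.random(n) < 0.3           # about 30% of y is hidden
y_obs = np.where(missing, np.nan, y_true)
obs_idx = np.flatnonzero(~missing)

estimates, variances = [], []
for _ in range(m):
    # Bootstrap the observed rows so each round uses slightly different coefficients.
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    slope, intercept = np.polyfit(x[boot], y_obs[boot], deg=1)
    resid_sd = np.std(y_obs[boot] - (slope * x[boot] + intercept), ddof=2)

    # Impute with prediction plus random noise, so imputed values vary like real ones.
    y_completed = y_obs.copy()
    y_completed[missing] = (slope * x[missing] + intercept
                            + rng.normal(0, resid_sd, size=missing.sum()))

    estimates.append(y_completed.mean())            # quantity of interest: mean of y
    variances.append(y_completed.var(ddof=1) / n)   # its sampling variance

# Rubin's rules: pool the m analyses.
pooled_estimate = np.mean(estimates)
within_var = np.mean(variances)                 # average variance inside each dataset
between_var = np.var(estimates, ddof=1)         # spread of estimates across datasets
total_var = within_var + (1 + 1 / m) * between_var

print(f"pooled mean: {pooled_estimate:.3f}, standard error: {np.sqrt(total_var):.3f}")
```

The between-imputation term is what a single imputation leaves out: it widens the standard error to reflect that the filled-in values are guesses.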

Under the Hood

When data has missing values, many machine learning algorithms cannot process them directly because they expect complete input. Internally, missing values can cause errors or biased calculations. Handling missing values modifies the dataset so algorithms receive complete data. Simple methods replace missing spots with fixed values, while advanced methods use statistical models or machine learning to estimate missing entries. Multiple imputation creates several complete datasets to reflect uncertainty, combining results later. This process ensures models learn from the best possible data representation.
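A small NumPy example shows why a single missing value can derail a calculation. It uses `np.nan`, the usual floating-point marker for a missing number.

```python
import numpy as np

values = np.array([5.0, np.nan, 4.0])

plain_mean = np.mean(values)                   # NaN propagates: the result is nan
safe_mean = np.nanmean(values)                 # ignores NaN: (5 + 4) / 2 = 4.5
is_equal_to_itself = values[1] == values[1]    # False: NaN never equals anything

print(plain_mean, safe_mean, is_equal_to_itself)
```

Any arithmetic that touches a NaN returns NaN, so one gap can contaminate sums, means, and distance computations unless it is handled first.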

Why designed this way?

Handling missing values was designed to solve the problem that real-world data is often incomplete, but models need full data to work. Early approaches simply removed missing data, but this wasted information. Filling missing values with averages was a quick fix but ignored uncertainty. More advanced methods like predictive imputation and multiple imputation were developed to use all data wisely and reflect the unknowns. These methods balance simplicity, accuracy, and computational cost.

┌────────────────┐      ┌────────────────┐      ┌────────────────┐
│ Raw Data       │─────▶│ Detect Missing │─────▶│ Handle Missing │
│ (with gaps)    │      │ Values         │      │ Values         │
└────────────────┘      └────────────────┘      └────────────────┘
                                                        │
                                                        ▼
                                              ┌───────────────────┐
                                              │ Methods:          │
                                              │ - Remove rows     │
                                              │ - Fill mean       │
                                              │ - Predict         │
                                              │ - Multiple impute │
                                              └───────────────────┘
                                                        │
                                                        ▼
                        ┌────────────────┐      ┌────────────────┐
                        │ Model Training │◀─────│ Cleaned Data   │
                        │ & Prediction   │      │ (complete)     │
                        └────────────────┘      └────────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Does removing all rows with missing values always improve model accuracy? Commit to yes or no.

Common Belief: Removing all rows with missing values is the safest and best way to handle missing data.

Reality: Dropping rows discards information and shrinks the dataset, and it can introduce bias.

Quick: Is filling missing values with the mean always the best choice? Commit to yes or no.

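Common Belief: Filling missing values with the mean is always a good and neutral choice.

Reality: Mean imputation is a quick fix that ignores uncertainty and is distorted by skewed data, where the median is often the better choice.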

Quick: Does predicting missing values with other features guarantee perfect guesses? Commit to yes or no.

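Common Belief: Predictive imputation always gives accurate missing value estimates.

Reality: Predictive imputation produces estimates, not the true values, so the filled-in entries still carry uncertainty.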

Quick: Can missingness itself be useful information for models? Commit to yes or no.

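Common Belief: Missing values are just errors and should always be fixed or removed.

Reality: The fact that a value is missing can itself carry useful information and can be given to the model as a feature.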

Expert Zone

1

Imputation methods can interact with feature scaling and encoding, affecting model behavior subtly.

2

Multiple imputation requires careful combination of results to avoid underestimating uncertainty.

3

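Computing imputation values (such as a mean or median) on the full dataset, or handling gaps differently in the training and test sets, can cause data leakage and biased evaluation; fit the imputation on the training set and apply it unchanged to the test set.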
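The train/test point can be made concrete with a small pandas sketch. The `train` and `test` tables are invented; the idea is that the fill value is learned from the training data only and then reused.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"age": [20, 30, np.nan, 40]})
test = pd.DataFrame({"age": [np.nan, 90]})

# Learn the fill value from the training set only...
train_median = train["age"].median()

# ...then apply that same value to both sets.
train_filled = train["age"].fillna(train_median)
test_filled = test["age"].fillna(train_median)

# Leaky alternative (avoid): the test value 90 shifts the fill value.
leaky_median = pd.concat([train, test])["age"].median()

print(train_median, leaky_median)
# prints: 30.0 35.0
```

The leaky median differs from the training median because it has seen a test value, which is exactly the information an honest evaluation must not use.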

When NOT to use

Simple imputation is a poor fit when values are missing not at random (MNAR), that is, when the chance of a gap depends on the unobserved value itself. Mean imputation in particular is unreliable on highly skewed data. In such cases, consider model-based methods or algorithms that handle missing data natively, such as some tree-based models.

Production Patterns

In production, pipelines often include automated missing value detection and imputation steps, sometimes combined with monitoring missingness patterns over time to detect data quality issues or concept drift.
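Monitoring missingness can be as simple as comparing each batch against rates recorded at training time. The sketch below is one possible shape for such a check; the function name `missingness_alerts`, the baseline rates, and the 10-point tolerance are all hypothetical.

```python
import numpy as np
import pandas as pd

def missingness_alerts(batch, baseline_rates, tolerance=0.10):
    """Return columns whose missing share exceeds the baseline by more than `tolerance`."""
    current = batch.isna().mean()                    # missing share per column in this batch
    drift = current - pd.Series(baseline_rates)      # aligned by column name
    return drift[drift > tolerance].round(3).to_dict()

baseline_rates = {"age": 0.20, "income": 0.10}       # hypothetical rates seen at training time
batch = pd.DataFrame({
    "age": [31, 45, np.nan, 52],
    "income": [np.nan, np.nan, 58000, np.nan],
})

alerts = missingness_alerts(batch, baseline_rates)
print(alerts)
```

Here only `income` is flagged: 75% of its values are missing against a 10% baseline, while `age` stays within tolerance.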

Connections

Data Cleaning

Builds-on

Handling missing values is a core part of data cleaning, ensuring data quality before any analysis or modeling.

Uncertainty Quantification

Builds-on

Multiple imputation connects to uncertainty quantification by reflecting the unknowns in missing data, improving trust in model results.

Medical Diagnosis

Builds-on

In medicine, missing patient data is common; handling missing values properly can mean the difference between correct and wrong diagnoses.

Common Pitfalls

#1 Removing too many rows with missing values, shrinking dataset size.

Wrong approach: data = data.dropna()

Correct approach: data = data.dropna(thresh=int(len(data.columns)*0.7)) # keep rows with at least 70% data

Root cause: Assuming all missing data is bad and must be removed without considering data loss.

#2 Filling missing values with mean without checking data distribution.

Wrong approach: data['age'] = data['age'].fillna(data['age'].mean())

Correct approach: data['age'] = data['age'].fillna(data['age'].median()) # better for skewed data

Root cause: Not considering data skewness and distribution before choosing imputation method.

#3 Predicting missing values with features that include the target variable.

Wrong approach: model.fit(data.drop('age', axis=1), data['age']) # 'target' is still among the features

Correct approach: model.fit(data.drop(['target', 'age'], axis=1), data['age']) # exclude target from features

Root cause: Data leakage by using target information to predict missing features.

Key Takeaways

Missing values are common in real-world data and must be handled carefully to build reliable machine learning models.

There are different types of missing data, and understanding them guides the choice of handling methods.

Simple methods like removing or filling missing values are easy but can introduce bias or lose information.

Advanced methods like predictive and multiple imputation better capture data patterns and uncertainty.

Sometimes, the fact that data is missing carries useful information and should be used as a feature.