
Handling missing values in ML Python - Deep Dive

Overview - Handling missing values
What is it?
Handling missing values means dealing with gaps or blanks in data where information was never recorded or was lost. These gaps can appear for many reasons, such as errors in data collection or people skipping questions. Since most machine learning models need complete inputs to learn well, we must find ways to fill in or manage these gaps. This process helps keep our models accurate and reliable.
Why it matters
Without handling missing values, machine learning models can make wrong guesses or fail to learn patterns properly. Imagine trying to solve a puzzle with missing pieces; the picture won't be clear. In real life, this could mean bad decisions in healthcare, finance, or any field relying on data. Handling missing values ensures we use all available information wisely and avoid misleading results.
Where it fits
Before learning this, you should understand basic data types and how machine learning models work with data. After mastering missing value handling, you can explore feature engineering and advanced data cleaning techniques. This topic is a key step in preparing data for any machine learning project.
Mental Model
Core Idea
Handling missing values means smartly filling or managing gaps in data so models can learn without confusion or bias.
Think of it like...
It's like fixing holes in a road before driving a car; if you don't fix them, the ride will be bumpy or even break the car.
Data with missing values:
┌─────────────┬─────────────┬─────────────┐
│ Feature A   │ Feature B   │ Feature C   │
├─────────────┼─────────────┼─────────────┤
│ 5           │ 3           │ missing     │
│ missing     │ 7           │ 2           │
│ 4           │ missing     │ 6           │
└─────────────┴─────────────┴─────────────┘

Handling methods:
[Remove rows]  [Fill with mean]  [Fill with constant]  [Predict missing]

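Resulting data (after filling with the column mean):
┌─────────────┬─────────────┬─────────────┐
│ Feature A   │ Feature B   │ Feature C   │
├─────────────┼─────────────┼─────────────┤
│ 5           │ 3           │ 4 (mean)    │
│ 4.5 (mean)  │ 7           │ 2           │
│ 4           │ 5 (mean)    │ 6           │
└─────────────┴─────────────┴─────────────┘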
Build-Up - 7 Steps
1
Foundation: What Are Missing Values?
Concept: Introduce what missing values are and why they appear in data.
Missing values are spots in data where no information is recorded. They can happen because of mistakes, skipped questions, broken sensors, or lost files. For example, in a survey, some people might not answer all questions, leaving blanks.
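Before choosing any handling method, it helps to see where the gaps are. The following is a minimal sketch using pandas; the `survey` table and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; np.nan and None mark unanswered questions.
survey = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [52000, 48000, np.nan, np.nan],
    "city": ["Oslo", None, "Lima", "Pune"],
})

# Number of missing entries in each column.
missing_per_column = survey.isna().sum()
print(missing_per_column)

# Fraction of rows that contain at least one gap.
rows_with_gaps = survey.isna().any(axis=1).mean()
print(rows_with_gaps)
```

Counting gaps per column and per row is usually the first check, because it shows whether the problem is concentrated in a few columns or spread across the whole table.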
Result
You understand that missing values are common and natural in real-world data.
Knowing what missing values are helps you recognize why data cleaning is necessary before analysis or modeling.
2
Foundation: Types of Missing Data
Concept: Learn the three main types of missing data: Missing Completely at Random, Missing at Random, and Missing Not at Random.
1. Missing Completely at Random (MCAR): Missing values happen by pure chance, unrelated to any data.
2. Missing at Random (MAR): Missingness depends on other observed data but not on the missing value itself.
3. Missing Not at Random (MNAR): Missingness depends on the missing value itself, like people hiding sensitive info.
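The three mechanisms are easier to tell apart in a small simulation. The sketch below uses synthetic data with an assumed relationship between age and income, and all probabilities and thresholds are arbitrary choices for the demo. It compares the mean of the observed incomes with the true mean under each mechanism.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(18, 80, size=n)
# Synthetic rule (an assumption for this demo): income rises with age.
income = 30_000 + 500 * age + rng.normal(0, 10_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# Each mask marks which income values would be missing.
masks = {
    # MCAR: every value has the same 20% chance of being missing.
    "MCAR": rng.random(n) < 0.2,
    # MAR: the chance depends on an observed column (age), not on income.
    "MAR": rng.random(n) < np.where(df["age"] > 60, 0.5, 0.1),
    # MNAR: the chance depends on the income value that goes missing.
    "MNAR": rng.random(n) < np.where(df["income"] > 70_000, 0.6, 0.05),
}

true_mean = df["income"].mean()
observed_means = {name: df.loc[~mask, "income"].mean() for name, mask in masks.items()}
for name, value in observed_means.items():
    print(f"{name}: observed mean minus true mean = {value - true_mean:,.0f}")
```

In this setup the MCAR gaps leave the observed mean close to the true mean, while the MAR and MNAR gaps pull it down. The MAR shift can be explained by the observed age column; the MNAR shift cannot be detected from the observed data alone.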
Result
You can classify missing data types, which guides how to handle them.
Understanding missing data types helps choose the right method to fill or manage missing values.
3
Intermediate: Simple Handling: Removing Missing Data
🤔 Before reading on: Do you think removing rows with missing values always improves model accuracy? Commit to yes or no.
Concept: Learn how to remove rows or columns with missing values and when this is appropriate.
One way to handle missing values is to delete any row or column that contains them. This is easy but can waste data if many values are missing. For example, if only a few rows have missing values, removing them might be fine. But if many rows are missing data, this can reduce your dataset too much.
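A small sketch of both deletion strategies with pandas follows. The table is invented, and the 50% cut-off for dropping a column is an arbitrary illustrative choice, not a recommended default.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "a": [5.0, np.nan, 4.0, 1.0, 8.0],
    "b": [3.0, 7.0, np.nan, 2.0, 6.0],
    "c": [np.nan, 2.0, 6.0, 3.0, 1.0],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 9.0],
})

# Row deletion: drop every row that has at least one gap.
complete_rows = data.dropna()

# Column deletion first: drop columns that are more than 50% missing,
# then drop the rows that are still incomplete.
dense_columns = data.loc[:, data.isna().mean() <= 0.5]
fewer_losses = dense_columns.dropna()

print(len(data), len(complete_rows), len(fewer_losses))  # 5 1 2
```

Dropping the nearly empty column before dropping rows keeps two rows instead of one here, which shows how the order of deletions affects how much data survives.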
Result
You can clean data by removing incomplete parts but risk losing valuable information.
Knowing when to remove missing data prevents losing too much useful information and hurting model performance.
4
Intermediate: Filling Missing Values with Simple Imputation
🤔 Before reading on: Is filling missing values with the mean always the best choice? Commit to yes or no.
Concept: Introduce simple imputation methods like filling missing values with mean, median, or a constant.
Instead of removing data, you can fill missing spots with a value. Common choices are the mean (average) or median of the column, or a fixed number like zero. This keeps all rows but can introduce bias if the missing data is not random.
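The three fill choices can be compared on one column. This is a sketch with invented numbers; the single very large income is there to show how a skewed column affects the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column: one very large income pulls the mean upward.
income = pd.Series([30_000, 32_000, 35_000, np.nan, 400_000], name="income")

mean_filled = income.fillna(income.mean())      # gap becomes 124250.0
median_filled = income.fillna(income.median())  # gap becomes 33500.0
zero_filled = income.fillna(0)                  # gap becomes 0.0

print(mean_filled[3], median_filled[3], zero_filled[3])
```

The mean-filled value sits far from the typical incomes because of the outlier, while the median-filled value stays close to them. Filling with zero is only sensible when zero is a meaningful value for the column.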
Result
You can keep all data by filling gaps, but must choose filling values carefully.
Understanding simple imputation helps maintain dataset size while managing missing values, but also warns about potential bias.
5
Intermediate: Advanced Imputation: Predicting Missing Values
🤔 Before reading on: Do you think predicting missing values using other features always gives perfect results? Commit to yes or no.
Concept: Learn how to use machine learning models to predict missing values based on other data.
Instead of simple filling, you can train a model to guess missing values using other features. For example, if someone's age is missing, you might predict it from their job and income. This method can be more accurate but requires extra work and care to avoid overfitting.
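One way to sketch this idea is to fit a regression on the rows where the value is known and predict it for the rows where it is missing. The example below assumes scikit-learn is available and uses synthetic data; the column names and the rule linking age to years employed are invented for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
years_employed = rng.integers(0, 30, size=n)
people = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, size=n),
    "years_employed": years_employed,
    # Synthetic rule: age is about 22 plus years employed, plus noise.
    "age": 22 + years_employed + rng.normal(0, 2, size=n),
})

# Keep the true ages for checking, then hide 20% of them.
true_age = people["age"].copy()
hidden = people.sample(frac=0.2, random_state=0).index
people.loc[hidden, "age"] = np.nan

features = ["income", "years_employed"]
known = people[people["age"].notna()]
unknown = people[people["age"].isna()]

# Fit on rows with a known age, then predict the hidden ones.
model = LinearRegression().fit(known[features], known["age"])
people.loc[unknown.index, "age"] = model.predict(unknown[features])

# Mean absolute error against the ages we hid on purpose.
mae = (people.loc[hidden, "age"] - true_age[hidden]).abs().mean()
print(round(mae, 2))
```

The error check is only possible because the ages were hidden on purpose. With real gaps the true values are unknown, so the quality of the predictions has to be judged indirectly, for example by hiding some known values and checking how well they are recovered.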
Result
You can fill missing values more smartly by predicting them, improving data quality.
Knowing predictive imputation allows better handling of missing data, especially when simple methods fail.
6
Advanced: Using Missingness as a Feature
🤔 Before reading on: Can the fact that a value is missing itself carry useful information? Commit to yes or no.
Concept: Learn that sometimes missing values themselves tell a story and can be used as a feature in models.
Instead of just filling or removing missing values, you can create a new feature that marks whether a value was missing. For example, if missing income often means low income, this flag helps the model learn that pattern. This approach can improve model accuracy.
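A minimal sketch of the indicator idea with pandas is shown here. The `customers` table and the `income_missing` column name are hypothetical. The flag must be created before the gaps are filled, otherwise the information is lost.

```python
import numpy as np
import pandas as pd

customers = pd.DataFrame({"income": [42_000, np.nan, 58_000, np.nan, 61_000]})

# Record which entries were missing BEFORE filling them.
customers["income_missing"] = customers["income"].isna().astype(int)

# Then fill the gaps, here with the median of the observed incomes.
customers["income"] = customers["income"].fillna(customers["income"].median())

print(customers)
```

scikit-learn offers a similar option through `SimpleImputer(add_indicator=True)`, which appends indicator columns to the imputed output.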
Result
You can use missingness as a signal, not just a problem to fix.
Understanding missingness as information helps build smarter models that capture hidden patterns.
7
Expert: Multiple Imputation and Uncertainty Handling
🤔 Before reading on: Do you think filling missing values once captures all uncertainty about them? Commit to yes or no.
Concept: Explore multiple imputation, where missing values are filled several times to reflect uncertainty, improving statistical validity.
Instead of filling missing values once, multiple imputation fills them several times with different plausible values. Then, models are trained on each filled dataset, and results are combined. This approach captures uncertainty about missing data and leads to more reliable conclusions, especially in sensitive fields like medicine.
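The loop below is a rough sketch of the fill-several-times idea, assuming scikit-learn is installed. It uses `IterativeImputer`, which scikit-learn marks as experimental and which must be enabled with an explicit import. The data is synthetic, and the pooling step only reports the average coefficient and the spread between imputations; a full analysis would also combine the within-imputation variance.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
y = 2.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2])
X[rng.random(n) < 0.3, 1] = np.nan  # hide about 30% of x2 at random

n_imputations = 5
coefs = []
for seed in range(n_imputations):
    # sample_posterior=True draws imputed values from a predictive
    # distribution, so each seed gives a different plausible dataset.
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    X_filled = imputer.fit_transform(X)
    coefs.append(LinearRegression().fit(X_filled, y).coef_)

coefs = np.array(coefs)
pooled_coef = coefs.mean(axis=0)         # combined estimate
between_var = coefs.var(axis=0, ddof=1)  # disagreement between imputations
print(pooled_coef, between_var)
```

A large spread between the imputed datasets is a sign that the conclusions depend heavily on values that were never observed.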
Result
You can handle missing data with a richer approach that respects uncertainty and variability.
Knowing multiple imputation helps avoid overconfidence in filled data and improves trustworthiness of results.
Under the Hood
When data has missing values, many machine learning algorithms cannot process them directly because they expect complete input. Internally, missing values can cause errors or biased calculations. Handling missing values modifies the dataset so algorithms receive complete data. Simple methods replace missing spots with fixed values, while advanced methods use statistical models or machine learning to estimate missing entries. Multiple imputation creates several complete datasets to reflect uncertainty, combining results later. This process ensures models learn from the best possible data representation.
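The biased or broken calculations mentioned above can be seen directly in NumPy, where a single missing entry turns an ordinary average into NaN. This small sketch uses the Feature A column from the earlier table.

```python
import numpy as np

values = np.array([5.0, np.nan, 4.0])

plain_mean = np.mean(values)         # NaN propagates through the arithmetic
nan_aware_mean = np.nanmean(values)  # skips the gap: (5 + 4) / 2 = 4.5

print(plain_mean, nan_aware_mean)
```

Filling or removing gaps before training does explicitly what `np.nanmean` does for this one calculation: it decides how the missing entries should be treated instead of letting NaN spread through the computation.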
Why designed this way?
Missing-value handling exists because real-world data is often incomplete, while most models need complete inputs to work. Early approaches simply removed incomplete records, but this wasted information. Filling missing values with averages was a quick fix but ignored uncertainty. More advanced methods like predictive imputation and multiple imputation were developed to use all data wisely and reflect the unknowns. Each method trades off simplicity, accuracy, and computational cost.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Raw Data      │──────▶│ Detect Missing│──────▶│ Handle Missing│
│ (with gaps)   │       │ Values        │       │ Values        │
└───────────────┘       └───────────────┘       └───────────────┘
                                                        │
                                                        ▼
                                              ┌───────────────────┐
                                              │ Methods:          │
                                              │ - Remove rows     │
                                              │ - Fill mean       │
                                              │ - Predict         │
                                              │ - Multiple impute │
                                              └───────────────────┘
                                                        │
                                                        ▼
                        ┌───────────────┐       ┌───────────────┐
                        │ Model Training│◀──────│ Cleaned Data  │
                        │ & Prediction  │       │ (complete)    │
                        └───────────────┘       └───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does removing all rows with missing values always improve model accuracy? Commit to yes or no.
Common Belief: Removing all rows with missing values is the safest and best way to handle missing data.
Reality: Removing rows can cause loss of valuable data and bias if missingness is not random.
Why it matters: Blindly removing data can reduce dataset size drastically and lead to models that don't generalize well.
Quick: Is filling missing values with the mean always the best choice? Commit to yes or no.
Common Belief: Filling missing values with the mean is always a good and neutral choice.
Reality: Mean imputation can distort data distribution and hide important patterns, especially if data is skewed.
Why it matters: Using mean imputation blindly can reduce model accuracy and mislead analysis.
Quick: Does predicting missing values with other features guarantee perfect guesses? Commit to yes or no.
Common Belief: Predictive imputation always gives accurate missing value estimates.
Reality: Predictions can be wrong and introduce bias or overfitting if not done carefully.
Why it matters: Overconfidence in predicted values can cause models to learn incorrect patterns.
Quick: Can missingness itself be useful information for models? Commit to yes or no.
Common Belief: Missing values are just errors and should always be fixed or removed.
Reality: Sometimes missingness signals important information and can improve model predictions.
Why it matters: Ignoring missingness as a feature can miss hidden patterns and reduce model power.
Expert Zone
1
Imputation methods can interact with feature scaling and encoding, affecting model behavior subtly.
2
Multiple imputation requires careful combination of results to avoid underestimating uncertainty.
3
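Fitting imputation statistics on the full dataset, or imputing training and test sets with different rules, can cause data leakage and biased evaluation; fit the imputer on the training set and reuse it unchanged on the test set.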
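A common way to keep imputation, scaling, and evaluation consistent is to put them in one pipeline that is fitted on the training split only. The sketch below assumes scikit-learn and uses synthetic data; the choice of median imputation and logistic regression is arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # hide about 10% of all entries

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The imputer's medians and the scaler's statistics are learned from
# X_train only, then reused unchanged when scoring X_test.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)

learned_medians = pipeline.named_steps["simpleimputer"].statistics_
print(test_accuracy, learned_medians)
```

Because the imputer sits before the scaler inside the pipeline, the order of the steps is fixed and the test split never influences the fill values.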
When NOT to use
Handling missing values by simple imputation is not suitable when missingness is MNAR or when data is highly skewed; in such cases, consider model-based methods or specialized algorithms that handle missing data natively, like some tree-based models.
Production Patterns
In production, pipelines often include automated missing value detection and imputation steps, sometimes combined with monitoring missingness patterns over time to detect data quality issues or concept drift.
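Monitoring can be as simple as comparing the missing rate of each new batch with a baseline. The helper below is a hypothetical sketch: the function name `missing_rate_alerts` and the 10-percentage-point tolerance are illustrative choices, not an established API.

```python
import numpy as np
import pandas as pd


def missing_rate_alerts(
    baseline: pd.DataFrame, batch: pd.DataFrame, tolerance: float = 0.10
) -> list[str]:
    """Return columns whose missing rate rose more than `tolerance` above baseline."""
    increase = batch.isna().mean() - baseline.isna().mean()
    return increase[increase > tolerance].index.tolist()


# Baseline from training time: 25% of ages missing, no incomes missing.
baseline = pd.DataFrame({
    "age": [30, 41, np.nan, 25],
    "income": [50, 60, 55, 52],
})
# New batch: age unchanged, but 75% of incomes are now missing.
batch = pd.DataFrame({
    "age": [33, np.nan, 28, 47],
    "income": [np.nan, np.nan, 58, np.nan],
})

alerts = missing_rate_alerts(baseline, batch)
print(alerts)  # ['income']
```

A sudden jump like this often points to an upstream problem, such as a broken sensor or a changed form, and is worth investigating before the imputation step hides it.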
Connections
Data Cleaning
Builds-on
Handling missing values is a core part of data cleaning, ensuring data quality before any analysis or modeling.
Uncertainty Quantification
Builds-on
Multiple imputation connects to uncertainty quantification by reflecting the unknowns in missing data, improving trust in model results.
Medical Diagnosis
Builds-on
In medicine, missing patient data is common; handling missing values properly can mean the difference between correct and wrong diagnoses.
Common Pitfalls
#1 Removing too many rows with missing values, shrinking dataset size.
Wrong approach: data = data.dropna()
Correct approach: data = data.dropna(thresh=int(len(data.columns)*0.7))  # keep rows with roughly 70% or more non-missing values
Root cause: Assuming all missing data is bad and must be removed without considering data loss.
#2 Filling missing values with mean without checking data distribution.
Wrong approach: data['age'] = data['age'].fillna(data['age'].mean())
Correct approach: data['age'] = data['age'].fillna(data['age'].median())  # better for skewed data
Root cause: Not considering data skewness and distribution before choosing imputation method.
#3 Predicting missing values using the whole dataset including the target variable.
Wrong approach: known = data.dropna(subset=['age']); model.fit(known.drop('age', axis=1), known['age'])  # 'target' is still among the features
Correct approach: known = data.dropna(subset=['age']); model.fit(known.drop(['target', 'age'], axis=1), known['age'])  # exclude target from features
Root cause: Data leakage by using target information to predict missing features.
Key Takeaways
Missing values are common in real-world data and must be handled carefully to build reliable machine learning models.
There are different types of missing data, and understanding them guides the choice of handling methods.
Simple methods like removing or filling missing values are easy but can introduce bias or lose information.
Advanced methods like predictive and multiple imputation better capture data patterns and uncertainty.
Sometimes, the fact that data is missing carries useful information and should be used as a feature.