Why data preparation consumes most of the time in ML projects (Python)

Introduction

Data preparation takes most of the time because real-world data is messy and must be cleaned before it can be used. This step ensures that the machine learning model learns from accurate, useful information. It is typically needed in the following situations:

When you get data from different sources and need to combine it.

When your data has missing or wrong values that need fixing.

When you want to change data into a format the model can understand.

When you need to remove noise or errors from the data.

When you want to select only the important parts of the data for training.
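Several of the situations above can be sketched in a few lines of pandas. The tables, column names, and values below are hypothetical and only illustrate combining two sources, filling a gap, converting a format, and selecting columns; a real project would adapt each step to its own data.

```python
import pandas as pd

# Hypothetical data from two sources that share a customer_id key
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "signup_date": ["2023-01-15", "2023-03-02", "2022-11-30", "2024-05-20"],
})
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [20.0, 35.5, 12.0, 50.0],
})

# Combine sources: total order amount per customer, joined on the shared key
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()
merged = customers.merge(totals, on="customer_id", how="left")

# Fix missing values: a customer with no orders has spent 0
merged["amount"] = merged["amount"].fillna(0.0)

# Change format: turn the date string into a numeric year a model can use
merged["signup_year"] = pd.to_datetime(merged["signup_date"]).dt.year

# Select only the columns intended for training
features = merged[["signup_year", "amount"]]
print(features)
```

A left join is used here so that customers without orders are kept; whether that is appropriate depends on the question the model is meant to answer.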

Syntax

There is no fixed code syntax, because data preparation involves many steps such as cleaning, transforming, and selecting data.

Data preparation is not a single command but a set of tasks done before training.

Common tools include pandas for cleaning and scikit-learn (sklearn) for transforming data.
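As a rough illustration of how these tasks chain together, the sketch below cleans a small table with pandas and then transforms it with scikit-learn. The column names and values are made up for the example, and `MinMaxScaler` is just one possible transform.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw table: one duplicated row and numbers stored as text
raw = pd.DataFrame({
    "height_cm": ["170", "165", "165", "n/a", "180"],
    "weight_kg": [65.0, 58.0, 58.0, 72.0, 80.0],
})

# pandas: cleaning
cleaned = raw.drop_duplicates().copy()
cleaned["height_cm"] = pd.to_numeric(cleaned["height_cm"], errors="coerce")  # "n/a" becomes NaN
cleaned = cleaned.dropna()

# scikit-learn: transforming (rescale every column to the range 0..1)
prepared = MinMaxScaler().fit_transform(cleaned)
print(prepared)
```

Each line is a separate decision (what counts as a duplicate, how to treat unparseable values, which scaling to use), which is part of why preparation takes so long in practice.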

Examples

Example of removing missing data using pandas.

import pandas as pd

df = pd.read_csv('data.csv')
df = df.dropna()  # Remove rows with missing values
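Dropping rows is the simplest option, but it discards data. A common alternative is to fill the gaps instead. The sketch below uses a small hypothetical table in place of `data.csv` and compares both options; filling with the median is an assumption chosen for illustration, not the only reasonable strategy.

```python
import pandas as pd

# Hypothetical in-memory data standing in for data.csv
df = pd.DataFrame({
    "age": [25, 30, None, 22, 40],
    "income": [50000, 60000, 55000, None, 65000],
})

# Count missing values per column before deciding how to handle them
print(df.isna().sum())

# Option 1: drop incomplete rows (keeps 3 of the 5 rows here)
dropped = df.dropna()

# Option 2: keep every row and fill gaps with each column's median
filled = df.fillna(df.median(numeric_only=True))

print(len(dropped), len(filled))
```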

Example of scaling data to have mean 0 and variance 1.

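import numpy as np
from sklearn.preprocessing import StandardScaler

# Example feature matrix (rows are samples, columns are features)
data = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)  # Each column now has mean 0 and variance 1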
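When the data is later split into training and test sets, a common practice is to fit the scaler on the training part only and reuse it on the test part, so that test statistics do not leak into training. The following is a minimal sketch with a made-up feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 6 samples, 2 features on very different scales
data = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0],
                 [4.0, 400.0], [5.0, 500.0], [6.0, 600.0]])

train, test = train_test_split(data, test_size=2, random_state=0)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from training data only
test_scaled = scaler.transform(test)        # reuse the same statistics on test data

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
```

The test set will generally not have mean 0 and variance 1 after this transform, which is expected.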

Example of converting text categories into numbers.

# Each distinct category gets an integer code; missing values become -1
df['category'] = df['category'].astype('category').cat.codes
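Integer codes are compact, but they suggest an order between categories that may not exist. For unordered categories, one-hot encoding is a frequently used alternative. The sketch below compares the two on a hypothetical `category` column.

```python
import pandas as pd

# Hypothetical column of unordered categories
df = pd.DataFrame({"category": ["red", "green", "blue", "green"]})

# Integer codes: categories are sorted, so blue=0, green=1, red=2
codes = df["category"].astype("category").cat.codes
print(codes.tolist())

# One-hot encoding: one 0/1 column per category, with no implied order
one_hot = pd.get_dummies(df["category"], prefix="category", dtype=int)
print(one_hot)
```

One-hot encoding adds one column per distinct value, so it suits columns with few categories better than columns with thousands.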

Sample Program

This example shows cleaning missing data, converting categories, training a simple model, and checking accuracy.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data with missing values and categories
data = {'age': [25, 30, None, 22, 40, 35, 28, 45, 33, 50],
        'income': [50000, 60000, 55000, None, 65000, 58000, 52000, 70000, 48000, 72000],
        'gender': ['M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'F'],
        'purchased': [0, 1, 0, 1, 1, 1, 0, 1, 0, 1]}

df = pd.DataFrame(data)

# Data preparation
# 1. Remove rows with missing values (copy so later edits do not affect df)
clean_df = df.dropna().copy()

# 2. Convert gender to numeric
clean_df['gender'] = clean_df['gender'].astype('category').cat.codes

# 3. Split features and target
X = clean_df[['age', 'income', 'gender']]
y = clean_df['purchased']

# 4. Split into train and test (stratify keeps both classes in each split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

# 5. Train model (max_iter is raised because the features are not scaled)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 6. Predict and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Predictions: {predictions}")
print(f"Accuracy: {accuracy:.2f}")
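The program above drops every row that has a missing value, which is costly on a small dataset. As one possible variation, the preparation steps can be bundled with the model in a scikit-learn `Pipeline` that fills gaps instead of dropping rows. This is a sketch under assumptions: median imputation, standard scaling, and one-hot encoding are illustrative choices, and the five rows are toy data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 30, None, 22, 40],
    "income": [50000, 60000, 55000, None, 65000],
    "gender": ["M", "F", "F", "M", "F"],
    "purchased": [0, 1, 0, 1, 1],
})
X = df[["age", "income", "gender"]]
y = df["purchased"]

# Numeric columns: fill gaps with the median, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Apply numeric steps to age and income, one-hot encode gender
prepare = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
])
model = Pipeline([("prepare", prepare), ("classify", LogisticRegression())])

# stratify=y keeps both classes in the training split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
model.fit(X_train, y_train)  # preparation is fitted on training data only
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
```

Because the preparation lives inside the pipeline, the same fitted steps are applied automatically whenever `model.predict` is called on new data. With so few rows the accuracy figure itself carries little meaning.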

Important Notes

Data preparation is often cited as taking 70-80% of total project time, although the exact share varies from project to project.

Good data preparation improves model accuracy and reliability.

Skipping data cleaning can cause wrong or poor model results.
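A tiny illustration of how uncleaned data distorts results: the sketch below assumes a hypothetical convention where -999 marks an unknown age. Left in place, the placeholder drags the average far from any real value.

```python
import pandas as pd

# -999 is a hypothetical placeholder meaning "age unknown"
ages = pd.Series([25, 30, -999, 22, 40])

raw_mean = ages.mean()                 # distorted by the placeholder
cleaned = ages.where(ages >= 0)        # replace impossible ages with NaN
clean_mean = cleaned.mean()            # NaN is skipped, so only valid ages count

print(raw_mean, clean_mean)
```

A model trained on the raw column would see the same distortion in every statistic it learns from that feature.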

Summary

Data preparation cleans and organizes raw data for machine learning.

This step takes the most time because real data is often messy and incomplete.

Proper preparation leads to better and more trustworthy models.