Encoding categorical variables in Data Analysis Python - Step-by-Step Execution

Concept Flow - Encoding categorical variables
Start with categorical data → Choose encoding method → Label Encoding → Map categories → Replace original or add encoded column → Encoded data ready → Use in model
This flow shows how categorical data is transformed by choosing an encoding method, applying it, and preparing data for modeling.
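The "map categories" and "replace original" steps of the flow can be sketched by hand with plain pandas. This is only an illustration under assumptions: the `mapping` dictionary and the choice to overwrite the `color` column are not part of the original example, which uses LabelEncoder and keeps the original column.

```python
import pandas as pd

# Same toy data as the tutorial's example
data = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# Map categories: build a category -> integer lookup from the sorted unique values
categories = sorted(data['color'].unique())
mapping = {category: code for code, category in enumerate(categories)}

# Replace original: overwrite the text column with its integer codes
data['color'] = data['color'].map(mapping)

print(mapping)
print(data['color'].tolist())
```

Overwriting the column loses the readable labels, so keeping `mapping` around is what makes the codes interpretable later.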
Execution Sample
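import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small example DataFrame with one categorical column
data = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
# Learn the categories and convert each one to its integer code
le = LabelEncoder()
data['color_encoded'] = le.fit_transform(data['color'])
# Show the original column next to the encoded one
print(data)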
This code converts color names into numbers using label encoding.
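To check which number each color received, the fitted encoder can be inspected. The sketch below assumes the same toy data; the variable names `codes`, `mapping`, and `restored` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
le = LabelEncoder()
codes = le.fit_transform(data['color'])

# classes_ holds the sorted unique categories; the position of each is its code
mapping = {str(category): int(code) for code, category in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels
restored = le.inverse_transform(codes)
print(list(restored))
```

Because the encoder stores its categories, the same `le` object can later decode model outputs back into readable labels.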
Execution Table
Step | Action | Input Data | Encoding Result | Output Data
1 | Start with data | {'color': ['red', 'blue', 'green', 'blue']} | None | {'color': ['red', 'blue', 'green', 'blue']}
2 | Initialize LabelEncoder | None | Ready to encode | None
3 | Fit and transform 'color' | ['red', 'blue', 'green', 'blue'] | Map: {'blue':0, 'green':1, 'red':2} | [2, 0, 1, 0]
4 | Add encoded column | Original + encoded | Encoded column added | {'color': ['red', 'blue', 'green', 'blue'], 'color_encoded': [2, 0, 1, 0]}
5 | Print final data | DataFrame with encoded | Shows encoded numbers | {'color': ['red', 'blue', 'green', 'blue'], 'color_encoded': [2, 0, 1, 0]}
6 | End | Encoding complete | Data ready for model | Same as step 5
💡 Encoding finished after adding the encoded column to the data.
Variable Tracker
Variable | Start | After Step 3 | After Step 4 | Final
data['color'] | ['red', 'blue', 'green', 'blue'] | Same | Same | Same
le | Uninitialized | LabelEncoder fitted | Same | Same
data['color_encoded'] | Not present | [2, 0, 1, 0] | [2, 0, 1, 0] | [2, 0, 1, 0]
Key Moments - 3 Insights
Why does 'blue' get encoded as 0 and not 1 or 2?
LabelEncoder sorts the unique categories and numbers them from 0. For strings this is alphabetical order, so 'blue' comes before 'green' and 'red' and gets 0 (see execution_table step 3).
Is the original 'color' column changed after encoding?
No, the original column stays the same; encoding adds a new column (see execution_table step 4).
Can we use these encoded numbers directly in all models?
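Not always. The integers imply an order (blue < green < red) that colors do not have, and models such as linear regression treat them as magnitudes. For unordered categories, one-hot encoding is usually the safer choice.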
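One-hot encoding for the same color data can be sketched with pandas. This is a minimal illustration, assuming `pd.get_dummies` as the one-hot tool; the `color_` prefix and the `one_hot` and `result` names are choices made here, not part of the original example.

```python
import pandas as pd

data = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# One 0/1 column per category; dtype=int avoids boolean output in newer pandas versions
one_hot = pd.get_dummies(data['color'], prefix='color', dtype=int)

# Keep the original column and place the indicator columns beside it
result = pd.concat([data, one_hot], axis=1)
print(result)
```

Each row has exactly one 1 across the indicator columns, so no category appears larger or smaller than another.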
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table at step 3. What number is assigned to 'green'?
A. 1
B. 0
C. 2
D. 3
💡 Hint
Check the 'Encoding Result' column at step 3 for the mapping.
At which step is the encoded column added to the data?
A. Step 2
B. Step 3
C. Step 4
D. Step 5
💡 Hint
Look for the action 'Add encoded column' in the execution_table.
If we replaced LabelEncoder with OneHotEncoder, what would change in the output data?
A. The 'color_encoded' column would have numbers 0,1,2
B. Multiple new columns with 0/1 values would be added
C. The original 'color' column would be removed
D. No change in data
💡 Hint
One-hot encoding creates separate binary columns for each category.
Concept Snapshot
Encoding categorical variables:
- Convert text categories to numbers for models
- Label Encoding: assigns integer labels alphabetically
- One-Hot Encoding: creates binary columns per category
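- Use OrdinalEncoder with an explicit category order for ordinal features; LabelEncoder is intended for target labels and orders categories alphabetically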
- Use OneHotEncoder for nominal data
- Keep original data unless replacing
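For ordinal data the integer codes should follow the real ranking, which alphabetical assignment does not guarantee. The sketch below assumes a hypothetical `size` column (not in the original example) and uses scikit-learn's `OrdinalEncoder` with an explicitly supplied category order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature (not part of the original example)
data = pd.DataFrame({'size': ['medium', 'small', 'large', 'small']})

# Explicit order: small < medium < large.
# Alphabetical order would instead give large < medium < small.
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

# OrdinalEncoder expects 2D input, hence the double brackets; it returns a float array
data['size_encoded'] = encoder.fit_transform(data[['size']]).ravel().astype(int)
print(data)
```

Passing the order explicitly keeps the codes meaningful for models that compare magnitudes.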
Full Transcript
Encoding categorical variables means changing text labels into numbers so computers can understand them. We start with data that has categories like colors. We pick a method: label encoding or one-hot encoding. Label encoding changes each category to a number based on alphabetical order. For example, 'blue' becomes 0, 'green' 1, and 'red' 2. We add these numbers as a new column next to the original. This helps models use the data. One-hot encoding makes new columns for each category with 0 or 1 to show presence. This is better when categories have no order. The code example shows label encoding step by step, adding a new column with numbers. Remember, the original data stays the same unless you replace it. This process prepares categorical data for machine learning.