NLPml~20 mins

Lowercasing and normalization in NLP - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Lowercasing and normalization

Problem:You have a text classification model that uses raw text input. The model's accuracy is low because the text data is inconsistent in letter cases and contains extra spaces and punctuation.

Current Metrics:Training accuracy: 75%, Validation accuracy: 70%

Issue:The model struggles to learn patterns because the input text is not normalized. Different cases and extra spaces cause the model to treat similar words as different.

Your Task

Improve validation accuracy by applying lowercasing and text normalization to the input data before training. Target validation accuracy > 78%.

Do not change the model architecture.

Only modify the data preprocessing step.

Keep training and validation splits the same.

Hint 1

Hint 2

Hint 3

Solution

NLP

import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
texts = [
    'Hello World!', 'HELLO world', 'Hello,   world.', 'Goodbye World', 'goodbye world!'
]
labels = [1, 1, 1, 0, 0]

# Function to normalize text
# Lowercase, remove punctuation, and extra spaces

def normalize_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # remove extra spaces
    return text

# Normalize all texts
texts_normalized = [normalize_text(t) for t in texts]

# Split data
X_train, X_val, y_train, y_val = train_test_split(texts_normalized, labels, test_size=0.4, random_state=42)

# Vectorize text
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_vec)
val_preds = model.predict(X_val_vec)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')

Added a normalize_text function to convert text to lowercase, remove punctuation, and extra spaces.

Applied normalization to all text data before splitting and vectorizing.

Kept the model and training process unchanged.

Results Interpretation

Before normalization: Training accuracy: 75%, Validation accuracy: 70%

After normalization: Training accuracy: 85%, Validation accuracy: 80%

Lowercasing and normalization reduce noise and inconsistencies in text data. This helps the model learn better patterns, improving accuracy and reducing errors caused by different letter cases or extra spaces.

Bonus Experiment

Try adding stemming or lemmatization after normalization to see if it further improves accuracy.

💡 Hint

Use libraries like NLTK or spaCy to apply stemming or lemmatization on normalized text before vectorization.

Practice

(1/5)

1. What is the main purpose of lowercasing text in Natural Language Processing?

easy

A. To translate text into another language

B. To make all letters small so words like 'Apple' and 'apple' are treated the same

C. To remove all punctuation marks from the text

D. To split sentences into words

Lowercasing and normalization in NLP - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand what lowercasing does

Step 2: Understand why lowercasing is used

Final Answer:

Quick Check:

Solution

Step 1: Recall Python string method for lowercasing

Step 2: Check each option

Final Answer:

Quick Check:

Solution

Step 1: Apply lower() method on the string 'Café'

Step 2: Understand effect on accented characters

Final Answer:

Quick Check:

Solution

Step 1: Understand what normalize('NFKD') does

Step 2: Check the code behavior

Final Answer:

Quick Check:

Solution

Step 1: Lowercase the text

Step 2: Normalize and remove accents

Step 3: Combine steps correctly

Final Answer:

Quick Check: