0
0
NLPml~20 mins

Lowercasing and normalization in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - Lowercasing and normalization
Problem:You have a text classification model that uses raw text input. The model's accuracy is low because the text data is inconsistent in letter cases and contains extra spaces and punctuation.
Current Metrics:Training accuracy: 75%, Validation accuracy: 70%
Issue:The model struggles to learn patterns because the input text is not normalized. Different cases and extra spaces cause the model to treat similar words as different.
Your Task
Improve validation accuracy by applying lowercasing and text normalization to the input data before training. Target validation accuracy > 78%.
Do not change the model architecture.
Only modify the data preprocessing step.
Keep training and validation splits the same.
Hint 1
Hint 2
Hint 3
Solution
NLP
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
texts = [
    'Hello World!', 'HELLO world', 'Hello,   world.', 'Goodbye World', 'goodbye world!'
]
labels = [1, 1, 1, 0, 0]

# Function to normalize text
# Lowercase, remove punctuation, and extra spaces

def normalize_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # remove extra spaces
    return text

# Normalize all texts
texts_normalized = [normalize_text(t) for t in texts]

# Split data
X_train, X_val, y_train, y_val = train_test_split(texts_normalized, labels, test_size=0.4, random_state=42)

# Vectorize text
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_vec)
val_preds = model.predict(X_val_vec)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Added a normalize_text function to convert text to lowercase, remove punctuation, and extra spaces.
Applied normalization to all text data before splitting and vectorizing.
Kept the model and training process unchanged.
Results Interpretation

Before normalization: Training accuracy: 75%, Validation accuracy: 70%

After normalization: Training accuracy: 85%, Validation accuracy: 80%

Lowercasing and normalization reduce noise and inconsistencies in text data. This helps the model learn better patterns, improving accuracy and reducing errors caused by different letter cases or extra spaces.
Bonus Experiment
Try adding stemming or lemmatization after normalization to see if it further improves accuracy.
💡 Hint
Use libraries like NLTK or spaCy to apply stemming or lemmatization on normalized text before vectorization.