Lowercasing and normalization in NLP - ML Experiment: Train & Evaluate

Experiment - Lowercasing and normalization
Problem: You have a text classification model that uses raw text input. The model's accuracy is low because the text data is inconsistent in letter cases and contains extra spaces and punctuation.
Current Metrics: Training accuracy: 75%, Validation accuracy: 70%
Issue: The model struggles to learn patterns because the input text is not normalized. Different cases and extra spaces cause the model to treat similar words as different.
Your Task
Improve validation accuracy by applying lowercasing and text normalization to the input data before training. Target validation accuracy > 78%.
Do not change the model architecture.
Only modify the data preprocessing step.
Keep training and validation splits the same.
Solution
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
texts = [
    'Hello World!', 'HELLO world', 'Hello,   world.', 'Goodbye World', 'goodbye world!'
]
labels = [1, 1, 1, 0, 0]

# Normalize text: lowercase, remove punctuation, collapse extra spaces
def normalize_text(text):
    text = text.lower()
    # Keep only a-z, 0-9, and whitespace. This removes punctuation, and it
    # also drops accented letters such as 'é'.
    text = re.sub(r'[^a-z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text

# Normalize all texts
texts_normalized = [normalize_text(t) for t in texts]

# Split data
X_train, X_val, y_train, y_val = train_test_split(texts_normalized, labels, test_size=0.4, random_state=42)

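# Vectorize text
# Note: CountVectorizer lowercases and ignores punctuation by default,
# so it already applies part of this normalization on its own.
vectorizer = CountVectorizer()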
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_vec)
val_preds = model.predict(X_val_vec)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Added a normalize_text function that converts text to lowercase, removes punctuation, and collapses extra spaces.
Applied normalization to all text data before splitting and vectorizing.
Kept the model and training process unchanged.
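To see what the normalization step does on its own, the sketch below reruns a function with the same rules as normalize_text on the five sample sentences. The extra 'Café au lait!' input is an illustrative assumption, added only to show a side effect of the ASCII-only pattern.

```python
import re

def normalize_text(text):
    """Lowercase, keep only a-z / 0-9 / whitespace, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

samples = ["Hello World!", "HELLO world", "Hello,   world.",
           "Goodbye World", "goodbye world!"]
normalized = [normalize_text(s) for s in samples]

print(normalized)
# ['hello world', 'hello world', 'hello world', 'goodbye world', 'goodbye world']
print(len(set(samples)), "->", len(set(normalized)))  # 5 -> 2

# Caveat: the ASCII-only character class also deletes accented letters.
print(normalize_text("Caf\u00e9 au lait!"))  # caf au lait
```

Five different raw strings become two distinct strings, one per class. The last line shows why accented text needs a separate accent-handling step before this pattern is applied.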
Results Interpretation

Before normalization: Training accuracy: 75%, Validation accuracy: 70%

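After normalization: Training accuracy: 85%, Validation accuracy: 80%. These are the figures reported for the exercise; the five-sentence sample in the solution code is too small to reproduce them.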

Lowercasing and normalization reduce noise and inconsistencies in text data. Variants such as 'Hello', 'HELLO', and 'Hello,' collapse into a single token, so the model sees more examples of each word and makes fewer errors caused by letter case, punctuation, or extra spaces.
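The effect on vocabulary size can be checked directly. This is a minimal sketch that assumes plain whitespace splitting for the raw text and a simple regex tokenizer for the normalized text; neither is the tokenizer used by the model above.

```python
import re

texts = ["Hello World!", "HELLO world", "Hello,   world.",
         "Goodbye World", "goodbye world!"]

# Raw vocabulary: split on whitespace only, keep case and punctuation.
raw_vocab = {tok for t in texts for tok in t.split()}

# Normalized vocabulary: lowercase, then keep runs of letters and digits.
clean_vocab = {tok for t in texts for tok in re.findall(r"[a-z0-9]+", t.lower())}

print(len(raw_vocab))                         # 10
print(len(clean_vocab), sorted(clean_vocab))  # 3 ['goodbye', 'hello', 'world']
```

Ten surface forms shrink to three tokens, so every training sentence contributes to the same small set of features.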
Bonus Experiment
Try adding stemming or lemmatization after normalization to see if it further improves accuracy.
💡 Hint
Use libraries like NLTK or spaCy to apply stemming or lemmatization on normalized text before vectorization.
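The sketch below shows only where such a step would sit in the pipeline. crude_stem and its three-suffix rule list are hypothetical stand-ins written for this illustration, not a real stemming algorithm; in practice a stemmer from NLTK or a lemmatizer from spaCy would take their place.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set

def crude_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_text(text):
    """Apply crude_stem to every whitespace-separated token."""
    return " ".join(crude_stem(w) for w in text.split())

# Run this on already-normalized text, before vectorization.
print(stem_text("walking walked walks walk"))  # walk walk walk walk
print(stem_text("sing bus"))                   # sing bus
```

Compare validation accuracy with and without this step; stemming can merge words that should stay distinct, so an improvement is not guaranteed.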

Practice

1. What is the main purpose of lowercasing text in Natural Language Processing?
easy
A. To translate text into another language
B. To make all letters small so words like 'Apple' and 'apple' are treated the same
C. To remove all punctuation marks from the text
D. To split sentences into words

Solution

  1. Step 1: Understand what lowercasing does

    Lowercasing changes all letters in text to small letters.
  2. Step 2: Understand why lowercasing is used

    This helps treat words like 'Apple' and 'apple' as the same word, improving consistency.
  3. Final Answer:

    To make all letters small so words like 'Apple' and 'apple' are treated the same -> Option B
  4. Quick Check:

    Lowercasing = uniform word form [OK]
Hint: Lowercase to treat same words equally [OK]
Common Mistakes:
  • Confusing lowercasing with removing punctuation
  • Thinking lowercasing translates text
  • Believing lowercasing splits sentences
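A short sketch of this idea, using a made-up sentence and simple whitespace splitting as assumptions:

```python
from collections import Counter

tokens = "Apple pie and apple juice from APPLE orchards".split()

raw_counts = Counter(tokens)                       # case-sensitive counts
lower_counts = Counter(t.lower() for t in tokens)  # counts after lowercasing

print(raw_counts["apple"])    # 1
print(lower_counts["apple"])  # 3
print(len(raw_counts), "->", len(lower_counts))  # 8 -> 6
```

Without lowercasing, 'Apple', 'apple', and 'APPLE' are counted as three unrelated words.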
2. Which of the following Python code snippets correctly converts a string variable named text to lowercase?
easy
A. text.lowercase()
B. lower(text)
C. text.toLowerCase()
D. text.lower()

Solution

  1. Step 1: Recall Python string method for lowercasing

    Python strings have a method called lower() to convert text to lowercase.
  2. Step 2: Check each option

    Option D, text.lower(), calls the built-in string method and is correct. lower(text) fails because Python has no built-in lower() function. text.toLowerCase() is JavaScript style. text.lowercase() is not a valid string method.
  3. Final Answer:

    text.lower() -> Option D
  4. Quick Check:

    Python lowercase method = lower() [OK]
Hint: Python lowercase method is .lower() [OK]
Common Mistakes:
  • Using JavaScript syntax in Python
  • Calling non-existent methods like lowercase()
  • Trying to call a standalone lower() function instead of the string method
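The four options can be tried in an interpreter. In this sketch the sample string is an assumption, and eval is used only so that each wrong option can be run and its exception recorded without stopping the script.

```python
text = "Hello World"
print(text.lower())  # hello world

# Evaluate the three wrong options and record which exception each raises.
errors = {}
for snippet in ("text.lowercase()", "text.toLowerCase()", "lower(text)"):
    try:
        eval(snippet, {"text": text})
    except (AttributeError, NameError) as exc:
        errors[snippet] = type(exc).__name__

print(errors)
# {'text.lowercase()': 'AttributeError', 'text.toLowerCase()': 'AttributeError', 'lower(text)': 'NameError'}
```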
3. What will be the output of this Python code?
text = 'Café'
normalized = text.lower()
print(normalized)
medium
A. 'café'
B. 'cafe'
C. 'CAFÉ'
D. 'Cafe'

Solution

  1. Step 1: Apply lower() method on the string 'Café'

    The lower() method converts all uppercase letters to lowercase but does not remove accents.
  2. Step 2: Understand effect on accented characters

    The accented 'é' remains unchanged because lower() does not normalize accents.
  3. Final Answer:

    'café' -> Option A
  4. Quick Check:

    lower() keeps accents, just lowers letters [OK]
Hint: lower() changes case but keeps accents [OK]
Common Mistakes:
  • Assuming accents are removed by lower()
  • Expecting uppercase output
  • Confusing normalization with lowercasing
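A quick check of this behavior. The string is written with the escape \u00e9 so that the 'é' is unambiguously a single precomposed character.

```python
text = "Caf\u00e9"  # 'Café'
lowered = text.lower()

print(lowered)                # café
print(lowered == "caf\u00e9") # True: only the case changed
print(lowered == "cafe")      # False: the accent is still there
```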
4. The following code aims to lowercase text and remove accents, but the accent is still present in the output:
import unicodedata
text = 'Café'
normalized = unicodedata.normalize('NFKD', text).lower()
print(normalized)

What is the error, and how do you fix it?
medium
A. normalize returns a string with accents separated; fix by removing accents after normalization
B. Calling lower() before normalize; fix by swapping the calls
C. lower() returns a string; normalize expects bytes, fix by encoding first
D. No error; code works correctly

Solution

  1. Step 1: Understand what normalize('NFKD') does

    It decomposes accented characters into base character plus accent marks.
  2. Step 2: Check the code behavior

    After normalization, 'é' becomes 'e' followed by a separate combining acute accent (U+0301). lower() still works, but the accent remains in the string. To remove it, filter out combining marks after normalization.
  3. Final Answer:

    normalize returns a string with accents separated; fix by removing accents after normalization -> Option A
  4. Quick Check:

    Normalization decomposes accents; remove them explicitly [OK]
Hint: Normalize then remove accents explicitly [OK]
Common Mistakes:
  • Thinking lower() removes accents
  • Swapping normalize and lower() calls incorrectly
  • Assuming no extra step needed to remove accents
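Printing the code points makes the decomposition visible. This is a small sketch using the same 'Café' string, written with an escape so the input is known to be precomposed.

```python
import unicodedata

text = "Caf\u00e9"  # 'Café' with a precomposed é (U+00E9)
decomposed = unicodedata.normalize("NFKD", text).lower()

print(len(text), len(decomposed))  # 4 5
print([f"U+{ord(c):04X}" for c in decomposed])
# ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']

# The fix: drop every combining mark left behind by the decomposition.
stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
print(stripped)  # cafe
```

The decomposed string looks the same when printed, which is why the leftover accent is easy to miss.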
5. You want to preprocess text data by lowercasing and removing accents for a machine learning model. Which Python code snippet correctly does this?
hard
A.
    import unicodedata
    text = 'Café'
    text = unicodedata.normalize('NFKD', text)
    print(text)
B.
    text = 'Café'
    text = text.lower()
    print(text)
C.
    import unicodedata
    text = 'Café'
    text = text.lower()
    text = ''.join(c for c in unicodedata.normalize('NFKD', text) if not unicodedata.combining(c))
    print(text)
D.
    text = 'Café'
    text = text.upper()
    print(text)

Solution

  1. Step 1: Lowercase the text

    Use text.lower() to convert all letters to lowercase.
  2. Step 2: Normalize and remove accents

    Use unicodedata.normalize('NFKD', text) to decompose accents, then remove combining characters to strip accents.
  3. Step 3: Combine steps correctly

    Option C does both steps properly: it lowercases first, then normalizes with NFKD and removes the combining characters.
  4. Final Answer:

    Lowercase, then normalize with NFKD and drop combining characters -> Option C
  5. Quick Check:

    Lowercase + normalize + remove accents = clean text [OK]
Hint: Lowercase first, then normalize and remove accents [OK]
Common Mistakes:
  • Skipping accent removal after normalization
  • Using upper() instead of lower()
  • Normalizing without removing combining characters
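For a preprocessing pipeline, the same steps can be wrapped in one reusable function. The name lower_and_strip_accents and the sample words are assumptions made for this sketch.

```python
import unicodedata

def lower_and_strip_accents(text):
    """Lowercase, decompose with NFKD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

words = ["Caf\u00e9", "NA\u00cfVE", "R\u00e9sum\u00e9", "S\u00f8ren"]
print([lower_and_strip_accents(w) for w in words])
# ['cafe', 'naive', 'resume', 'søren']
```

Letters such as 'ø' have no decomposition into a base letter plus a combining mark, so they pass through unchanged and need an explicit mapping if plain ASCII output is required.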