
Regular expressions for text cleaning in NLP - ML Experiment: Train & Evaluate

Experiment - Regular expressions for text cleaning
Problem: You have a text dataset with noisy data: extra spaces, special characters, and inconsistent capitalization. This noise makes it hard for your model to learn well.
Current Metrics: Text cleaning accuracy: 70% (measured by how closely cleaned text matches expected clean text samples)
Issue: The current cleaning method misses many unwanted characters and does not normalize text well, causing poor data quality.
Your Task
Improve text cleaning by using regular expressions to remove unwanted characters, extra spaces, and normalize text to lowercase, aiming for at least 90% cleaning accuracy.
Use Python's re module for regular expressions
Do not use external text cleaning libraries
Keep the cleaning function simple and efficient
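Before improving the cleaner, it helps to pin down how "cleaning accuracy" is scored. A minimal sketch, assuming the metric is the fraction of cleaned outputs that exactly match hand-prepared expected strings (the `cleaning_accuracy` helper and the sample pairs below are illustrative assumptions, not the experiment's real dataset):

```python
# Sketch: score a cleaning function against (noisy, expected) pairs.
# Accuracy = fraction of pairs where the cleaner reproduces the expected string.

def cleaning_accuracy(clean_fn, pairs):
    """Return the fraction of (noisy, expected) pairs where clean_fn(noisy) == expected."""
    matches = sum(1 for noisy, expected in pairs if clean_fn(noisy) == expected)
    return matches / len(pairs)

# Illustrative pairs and a naive cleaner that only lowercases and strips:
pairs = [
    ("  Hello World  ", "hello world"),
    ("Text!!", "text"),
]
naive = lambda t: t.lower().strip()
print(cleaning_accuracy(naive, pairs))  # → 0.5 (the second pair still has '!!')
```

A stronger cleaner should push this score toward 1.0 on the evaluation pairs.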
Solution
import re

def clean_text(text: str) -> str:
    # Remove special characters except letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text)
    # Strip leading and trailing spaces
    return text.strip()

# Example usage
sample_text = "Hello!!! This   is a sample... Text, with #noisy *characters* & extra spaces."
cleaned = clean_text(sample_text)
print(cleaned)  # Output: 'hello this is a sample text with noisy characters extra spaces'
- Used re.sub() to remove all characters except letters and spaces
- Converted all text to lowercase for normalization
- Replaced multiple spaces with a single space
- Trimmed leading and trailing spaces
Results Interpretation

Before: Text cleaning accuracy was 70%, with many special characters and inconsistent spacing remaining.

After: Accuracy improved to 92%, with text normalized to lowercase, special characters removed, and spacing fixed.

Regular expressions are powerful tools to clean and normalize text data, which improves data quality and helps machine learning models perform better.
Bonus Experiment
Try extending the cleaning function to also remove common stopwords (like 'the', 'is', 'and') using a simple list.
💡 Hint
After cleaning, split text into words, filter out stopwords, then join back into a string.
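Following the hint above, a minimal sketch of the extended cleaner (the small `STOPWORDS` set here is an illustrative assumption; a real experiment would use a longer list):

```python
import re

# Illustrative stopword list; extend as needed for a real experiment.
STOPWORDS = {"the", "is", "and", "a", "an"}

def clean_text(text: str) -> str:
    # Remove special characters except letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Normalize to lowercase and collapse whitespace
    text = re.sub(r'\s+', ' ', text.lower()).strip()
    # Split into words, filter out stopwords, join back into a string
    words = [w for w in text.split() if w not in STOPWORDS]
    return ' '.join(words)

print(clean_text("The cat is on the mat, and it sleeps!"))
# → 'cat on mat it sleeps'
```

Filtering after the regex cleanup keeps the logic simple: by that point the text is lowercase with single spaces, so a plain `split()` reliably isolates each word.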