
Regular expressions for text cleaning in NLP

Introduction

Regular expressions describe patterns of characters, which lets you find and fix messy parts of text. Cleaning text this way is a common first step before passing it to an NLP model. Typical uses include:

Removing extra spaces or tabs from user comments.
Deleting special characters from product reviews.
Changing all letters to lowercase for fair comparison (usually done with str.lower() alongside the regex steps).
Extracting phone numbers or emails from messages.
Fixing inconsistent date formats in text data.
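Two of the uses listed above, extracting emails and fixing date formats, can be sketched as follows. The sample message, the simplified email pattern, and the choice of MM/DD/YYYY as the input date format are assumptions made for illustration; real email addresses and dates can be more varied than these patterns allow.

```python
import re

# Hypothetical sample message
message = "Contact ana@example.com or bob.smith@mail.example.org by 03/15/2024."

# Simplified email pattern (an assumption, not a full address validator)
EMAIL_PATTERN = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'
emails = re.findall(EMAIL_PATTERN, message)

# Rewrite MM/DD/YYYY as YYYY-MM-DD by reordering the three capture groups
iso_dates = re.sub(r'\b(\d{2})/(\d{2})/(\d{4})\b', r'\3-\1-\2', message)

print(emails)     # ['ana@example.com', 'bob.smith@mail.example.org']
print(iso_dates)  # Contact ana@example.com or bob.smith@mail.example.org by 2024-03-15.
```

The group in the email pattern is non-capturing, (?:...), so re.findall returns whole addresses rather than only the last domain part.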
Syntax
NLP
import re

# Basic pattern matching
pattern = r'your_pattern_here'
text = 'your text here'

# Find all matches
matches = re.findall(pattern, text)

# Replace matches with new text
clean_text = re.sub(pattern, 'replacement', text)

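The r prefix makes a raw string, so Python passes backslashes such as \d to the regex engine unchanged instead of treating them as string escapes.

re.findall returns a list of all non-overlapping matches of the pattern. re.sub returns a new string with every match replaced; the original string is not modified.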
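A small sketch of these points, using a made-up sentence: it shows what the raw-string prefix changes, how re.findall behaves with and without a capture group, and that re.sub leaves the original string untouched.

```python
import re

# In a raw string the backslash stays a literal character
print(len('\n'), len(r'\n'))  # 1 2

# Hypothetical sample text
text = 'Order 66 shipped in 3 days, order 7 pending.'

# No capture group: findall returns the full matches
numbers = re.findall(r'\d+', text)

# One capture group: findall returns only the captured part
order_ids = re.findall(r'[Oo]rder (\d+)', text)

# re.sub builds a new string; text itself is unchanged
masked = re.sub(r'\d+', '#', text)

print(numbers)    # ['66', '3', '7']
print(order_ids)  # ['66', '7']
print(masked)     # Order # shipped in # days, order # pending.
```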

Examples
This removes all exclamation marks, question marks, and periods from the text.
NLP
import re
text = 'Hello!!! How are you???'
clean_text = re.sub(r'[!?.]', '', text)
print(clean_text)
This finds phone numbers in the format 123-456-7890.
NLP
import re
text = 'Call me at 123-456-7890.'
phone = re.findall(r'\d{3}-\d{3}-\d{4}', text)
print(phone)
This replaces each run of whitespace (spaces, tabs, or newlines) with one space and removes whitespace at both ends.
NLP
import re
text = '  Lots   of   spaces  '
clean_text = re.sub(r'\s+', ' ', text).strip()
print(clean_text)
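The same two calls can be combined to delete special characters from a review and tidy the leftover whitespace. This is a minimal sketch on an invented review string. The character class keeps only ASCII letters, digits, and whitespace, which is an assumption that would also strip accented letters and emoji.

```python
import re

# Hypothetical product review with tabs, symbols, and repeated spaces
review = "Great   product!!!\t\tWorth every $$$ :) #happy"

# Keep ASCII letters, digits, and whitespace; drop everything else
letters_only = re.sub(r'[^A-Za-z0-9\s]', '', review)

# Collapse runs of spaces and tabs into one space, then trim the ends
tidy = re.sub(r'\s+', ' ', letters_only).strip()

print(tidy)  # Great product Worth every happy
```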
Sample Program

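This program cleans a sample string by removing the characters !, ?, and ., finding phone numbers, collapsing whitespace, and making all letters lowercase. Because every period is removed, the URL www.example.com also loses its dots.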

NLP
import re

# Sample messy text
text = "Hello!!! This is a sample text... Visit us at www.example.com or call 555-123-4567."

# Step 1: Find phone numbers first, while the text is still unmodified
phones = re.findall(r'\d{3}-\d{3}-\d{4}', text)

# Step 2: Remove !, ?, and . (this also strips the dots in www.example.com)
text_no_punct = re.sub(r'[!?.]', '', text)

# Step 3: Replace multiple spaces with one
clean_text = re.sub(r'\s+', ' ', text_no_punct).strip()

# Step 4: Lowercase all text
final_text = clean_text.lower()

print('Cleaned Text:', final_text)
print('Phone Numbers Found:', phones)
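Output
Cleaned Text: hello this is a sample text visit us at wwwexamplecom or call 555-123-4567
Phone Numbers Found: ['555-123-4567']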
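To apply the same steps to many strings, they can be wrapped in a function. The sketch below is one possible arrangement, not the only one: the name clean_text, the precompiled pattern constants, and the sample input are all hypothetical. It extracts phone numbers from the raw text before anything is removed.

```python
import re

# Patterns compiled once so they can be reused across many calls
PHONE_RE = re.compile(r'\d{3}-\d{3}-\d{4}')
PUNCT_RE = re.compile(r'[!?.]')
SPACE_RE = re.compile(r'\s+')

def clean_text(raw):
    """Return (cleaned_text, phone_numbers) for one input string."""
    phones = PHONE_RE.findall(raw)                   # extract before cleaning
    no_punct = PUNCT_RE.sub('', raw)                 # drop !, ?, and .
    collapsed = SPACE_RE.sub(' ', no_punct).strip()  # tidy whitespace
    return collapsed.lower(), phones

cleaned, phones = clean_text("Call 555-123-4567 NOW!!!   Thanks.")
print(cleaned)  # call 555-123-4567 now thanks
print(phones)   # ['555-123-4567']
```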
Important Notes

Regular expressions can be tricky at first; test patterns on small text samples.
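One way to test a pattern on small samples is to pair each sample with the matches you expect and assert on the result. The cases below are invented for illustration and reuse the phone pattern from the examples.

```python
import re

PHONE_PATTERN = r'\d{3}-\d{3}-\d{4}'

# Hypothetical test cases: (input text, expected matches)
cases = [
    ('Call 123-456-7890 today', ['123-456-7890']),
    ('No number here', []),
    ('Two: 111-222-3333 and 444-555-6666', ['111-222-3333', '444-555-6666']),
    ('Too short: 12-345-6789', []),
]

for sample, expected in cases:
    found = re.findall(PHONE_PATTERN, sample)
    assert found == expected, f'{sample!r}: got {found}, expected {expected}'

print('All pattern checks passed')
```

Including samples that should not match, such as the last case, helps catch patterns that are too permissive.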

Use raw strings (r'pattern') to avoid errors with backslashes.

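Consistent, clean text means a machine learning model sees fewer spurious variants of the same word (for example "Hello!!!" and "hello"), which makes the data easier to learn from.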

Summary

Regular expressions find and replace patterns in text with a few lines of code.

They help remove unwanted characters and extract useful info.

Using them improves text quality for machine learning tasks.