How to Clean Text Data in Python for NLP Tasks

To clean text data in Python for NLP, convert the text to lowercase with .lower(), remove unwanted characters such as punctuation and digits with re.sub(), and tokenize with split() or nltk.word_tokenize(). Removing this noise gives analysis and model training a more consistent input.

📐

Syntax

Cleaning text usually involves these steps:

Lowercasing: Convert all text to lowercase using text.lower().
Removing punctuation: Use re.sub() to delete punctuation marks.
Removing numbers: Use regular expressions to remove digits.
Tokenization: Split text into words using text.split() or nltk.word_tokenize().
Removing stopwords: Filter out common words like 'the', 'and' using a stopword list.

python

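import re
# NLTK must be installed and its 'punkt' and 'stopwords' data downloaded
# (the Example section shows the nltk.download() calls).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize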

text = "Hello World! This is a sample text, with numbers 123 and punctuation."

# Lowercase
text = text.lower()

# Remove punctuation
text = re.sub(r'[^\w\s]', '', text)

# Remove numbers
text = re.sub(r'\d+', '', text)

# Tokenize
tokens = word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]

print(tokens)

Output

['hello', 'world', 'sample', 'text', 'numbers', 'punctuation']

💻

Example

This example shows how to clean a sentence by lowercasing, removing punctuation and numbers, tokenizing, and removing stopwords using Python and NLTK.

python

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

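# Download required NLTK data (only needed once per environment).
# Newer NLTK releases may look for 'punkt_tab' instead of 'punkt'; if
# word_tokenize raises a LookupError, download the resource it names.
nltk.download('punkt')
nltk.download('stopwords')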

text = "Hello World! This is a sample text, with numbers 123 and punctuation."

# Step 1: Lowercase
text = text.lower()

# Step 2: Remove punctuation
text = re.sub(r'[^\w\s]', '', text)

# Step 3: Remove numbers
text = re.sub(r'\d+', '', text)

# Step 4: Tokenize
tokens = word_tokenize(text)

# Step 5: Remove stopwords
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]

print(tokens)

Output

['hello', 'world', 'sample', 'text', 'numbers', 'punctuation']
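The same steps can also be sketched without NLTK, using text.split() for tokenization as mentioned at the start of this article. The function name clean_text and the SMALL_STOPWORDS set are hypothetical and exist only for this illustration; a real stopword list such as NLTK's is much longer.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
SMALL_STOPWORDS = {"this", "is", "a", "with", "and", "the"}

def clean_text(text, stop_words=SMALL_STOPWORDS):
    """Lowercase, strip punctuation and digits, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\d+", "", text)      # remove digits
    tokens = text.split()                # whitespace tokenization
    return [t for t in tokens if t not in stop_words]

cleaned = clean_text("Hello World! This is a sample text, with numbers 123 and punctuation.")
print(cleaned)
```

With this small stopword set the result matches the NLTK example above: ['hello', 'world', 'sample', 'text', 'numbers', 'punctuation']. Wrapping the steps in one function makes it easy to apply the same cleaning to every document in a dataset.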

⚠️

Common Pitfalls

Common mistakes when cleaning text data include:

Not lowercasing text, causing duplicates like 'Hello' and 'hello'.
Removing punctuation without considering contractions (e.g., "don't" becomes "dont").
Removing stopwords blindly, which might remove important words depending on context.
Not handling special characters or emojis that may affect tokenization.

python

import re

text = "Don't remove contractions carelessly!"

# Wrong: removing all punctuation removes apostrophes
wrong_clean = re.sub(r'[^\w\s]', '', text.lower())

# Right: keep apostrophes for contractions
right_clean = re.sub(r"[^\w\s']", '', text.lower())

print('Wrong:', wrong_clean)
print('Right:', right_clean)

Output

Wrong: dont remove contractions carelessly
Right: don't remove contractions carelessly
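The stopword pitfall can be illustrated in a similar way. This is a minimal sketch with a hypothetical stopword set and a hypothetical keep set; many general-purpose stopword lists contain negations such as "not", which can matter for tasks like sentiment analysis.

```python
# Hypothetical stopword set for illustration; it includes the negation "not".
stop_words = {"this", "is", "the", "not", "a"}

# Words we choose to keep for this task (an assumption made for this sketch).
keep = {"not", "no", "never"}

tokens = "this movie is not good".split()

# Blind removal drops the negation and flips the apparent meaning
blind = [t for t in tokens if t not in stop_words]

# Careful removal filters with the stopword set minus the words we want to keep
careful = [t for t in tokens if t not in (stop_words - keep)]

print("Blind:", blind)
print("Careful:", careful)
```

Blind removal gives ['movie', 'good'], while the careful version gives ['movie', 'not', 'good'], so the negation survives.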

📊

Quick Reference

Summary tips for cleaning text data in Python:

Always lowercase text to unify words.
Use re.sub() to remove unwanted characters.
Tokenize text to work with words individually.
Use stopword lists carefully based on your task.
Test cleaning steps on sample data to avoid losing important info.
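One simple way to follow the last tip is to run the cleaning steps on a few tricky samples and read the results before processing a full dataset. The helper name clean and the sample strings below are made up for this sketch.

```python
import re

def clean(text):
    # Same regex steps used in this article: lowercase, drop punctuation, drop digits
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\d+", "", text).split()

# Hypothetical samples chosen to expose information loss
samples = ["Don't stop!", "Version 2.0 released", "C++ is fast"]

results = {s: clean(s) for s in samples}
for original, tokens in results.items():
    print(repr(original), "->", tokens)
```

Here "Don't" becomes "dont", the version number disappears entirely, and "C++" is reduced to "c". Whether those losses are acceptable depends on the task.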

✅

Key Takeaways

Lowercase all text to avoid case mismatches.

Remove punctuation and numbers using regular expressions.

Tokenize text to split it into meaningful words.

Remove stopwords only if they do not carry important meaning.

Test your cleaning steps to keep useful information.