Regular expressions for text cleaning in NLP
Introduction
Regular expressions help find and fix messy parts in text. They make text ready for computers to understand.
Syntax
    import re

    # Basic pattern matching
    pattern = r'your_pattern_here'
    text = 'your text here'

    # Find all matches
    matches = re.findall(pattern, text)

    # Replace matches with new text
    clean_text = re.sub(pattern, 'replacement', text)
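The r before the quotes makes a raw string, so backslashes reach the regex engine unchanged instead of being read as Python escape sequences.
re.findall returns a list of all non-overlapping matches of the pattern.
re.sub returns a new string in which every match is replaced.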
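A minimal sketch of how the two calls behave on a concrete input. The sample text and the digit pattern are made up for illustration.

```python
import re

text = "Order 66 shipped on day 12"   # illustrative sample text
pattern = r"\d+"                      # one or more digits

# re.findall returns a list of all non-overlapping matches, left to right
matches = re.findall(pattern, text)

# re.sub returns a new string; the original text is left unchanged
masked = re.sub(pattern, "#", text)

print(matches)  # ['66', '12']
print(masked)   # Order # shipped on day #
```

Because strings are immutable in Python, the result of re.sub has to be assigned to a variable to be kept.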
Examples
Remove exclamation marks, question marks, and periods:

    import re

    text = 'Hello!!! How are you???'
    clean_text = re.sub(r'[!?.]', '', text)
    print(clean_text)  # Hello How are you
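Find phone numbers written as 3 digits, 3 digits, and 4 digits separated by hyphens:

    import re

    text = 'Call me at 123-456-7890.'
    phone = re.findall(r'\d{3}-\d{3}-\d{4}', text)
    print(phone)  # ['123-456-7890']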
Collapse runs of whitespace into a single space and trim both ends:

    import re

    text = '  Lots   of    spaces  '
    clean_text = re.sub(r'\s+', ' ', text).strip()
    print(clean_text)  # Lots of spaces
Sample Model
This program cleans text in four steps: it removes the punctuation marks !, ?, and ., finds phone numbers, collapses repeated spaces, and lowercases the result.
    import re

    # Sample messy text
    text = "Hello!!! This is a sample text... Visit us at www.example.com or call 555-123-4567."

    # Step 1: Remove punctuation (!, ?, and .)
    # Note: this also strips the dots from www.example.com
    text_no_punct = re.sub(r'[!?.]', '', text)

    # Step 2: Find phone numbers (the hyphens survive Step 1)
    phones = re.findall(r'\d{3}-\d{3}-\d{4}', text_no_punct)

    # Step 3: Replace multiple spaces with one
    clean_text = re.sub(r'\s+', ' ', text_no_punct).strip()

    # Step 4: Lowercase all text
    final_text = clean_text.lower()

    print('Cleaned Text:', final_text)
    print('Phone Numbers Found:', phones)
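In the sample program, punctuation removal runs before phone extraction. That works because the phone pattern relies on hyphens, which are not removed, but it would fail for a format such as 555.123.4567. The sketch below shows one possible reordering that extracts from the untouched text first. The function name clean and the constants PHONE, PUNCT, and SPACES are hypothetical names chosen for this illustration.

```python
import re

# Hypothetical names; the patterns are the ones used in the sample program
PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")
PUNCT = re.compile(r"[!?.]")
SPACES = re.compile(r"\s+")

def clean(text):
    """Return (cleaned_text, phone_numbers) for one string."""
    phones = PHONE.findall(text)            # extract before anything is removed
    no_punct = PUNCT.sub("", text)          # drop !, ?, and .
    collapsed = SPACES.sub(" ", no_punct).strip()
    return collapsed.lower(), phones

cleaned, phones = clean("Hello!!!   Call 555-123-4567 today.")
print(cleaned)  # hello call 555-123-4567 today
print(phones)   # ['555-123-4567']
```

Compiling the patterns once with re.compile is a convenience when the same function runs over many strings; calling re.sub and re.findall directly gives the same result.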
Important Notes
Regular expressions can be tricky at first; test patterns on small text samples.
Use raw strings (r'pattern') to avoid errors with backslashes.
Cleaning text consistently removes noise such as stray punctuation and irregular spacing, which usually makes the data easier for machine learning models to learn from.
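The first two notes can be illustrated with a short sketch. The word-boundary pattern and the sample strings are assumptions picked for the example, not part of the lesson.

```python
import re

# In a normal string, "\b" is the backspace character, not a regex word boundary
plain = "\b"
raw = r"\b"
print(len(plain))  # 1
print(len(raw))    # 2 (a backslash followed by the letter b)

# Check a pattern on small, hand-written samples before using it on real data
word = re.compile(r"\bcat\b")   # "cat" as a whole word
samples = {
    "cat": True,
    "concat": False,
    "a cat sat": True,
}
for sample, expected in samples.items():
    found = word.search(sample) is not None
    assert found == expected, sample
```

If a sample fails, the assert names the offending string, which makes it easier to adjust the pattern before running it on a full dataset.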
Summary
Regular expressions find and fix patterns in text easily.
They help remove unwanted characters and extract useful info.
Using them improves text quality for machine learning tasks.
Practice
1. What is the main purpose of using regular expressions in text cleaning for NLP?
easy
Solution
Step 1: Understand the role of regular expressions
Regular expressions are used to identify patterns in text, such as unwanted characters or specific sequences.
Step 2: Connect to text cleaning
Text cleaning involves removing or replacing unwanted parts of text to prepare it for analysis or modeling.
Final Answer:
To find and remove unwanted patterns or characters in text -> Option A
Quick Check:
Regular expressions clean text by pattern matching
Hint: Regular expressions = pattern search and replace in text
Common Mistakes:
- Confusing regex with model training
- Thinking regex stores data
- Assuming regex creates visualizations
2. Which of the following is the correct Python syntax to import the regular expression module?
easy
Solution
Step 1: Recall Python's regex module name
Python's built-in module for regular expressions is named 're'.
Step 2: Check syntax correctness
The correct import statement is 'import re' to use regex functions.
Final Answer:
import re -> Option C
Quick Check:
Python regex module = re
Hint: Remember: Python regex module is 're' not 'regex'
Common Mistakes:
- Using 'import regex' which is not standard
- Trying to import non-existent modules
- Confusing module names with function names
3. What will be the output of this Python code snippet?
    import re

    text = "Hello, World! 123"
    cleaned = re.sub(r'[^a-zA-Z ]', '', text)
    print(cleaned)
medium
Solution
Step 1: Understand the regex pattern used
The pattern '[^a-zA-Z ]' matches any character that is NOT a letter (a-z or A-Z) or a space.
Step 2: Apply re.sub to remove unwanted characters
All characters except letters and spaces are removed, so the comma, the exclamation mark, and the digits are deleted. The space before the digits is kept, so the result ends with a trailing space.
Final Answer:
Hello World -> Option A
Quick Check:
Regex removes non-letters/spaces = 'Hello World '
Hint: [^...] means NOT those characters, so it removes digits and punctuation
Common Mistakes:
- Thinking digits remain after substitution
- Confusing character classes with ranges
- Ignoring spaces in the pattern
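A quick way to confirm the result of question 3, including the trailing space that print alone does not make visible, is to inspect it with repr. The variable name stripped is an addition for this sketch.

```python
import re

text = "Hello, World! 123"
cleaned = re.sub(r"[^a-zA-Z ]", "", text)

# repr shows the quotes, so the trailing space becomes visible
print(repr(cleaned))   # 'Hello World '

# strip() removes the leftover space if it is not wanted
stripped = cleaned.strip()
print(repr(stripped))  # 'Hello World'
```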
4. Identify the error in this regex code snippet for removing digits from text:
    import re

    text = "Price: 100 dollars"
    cleaned = re.sub(r'\d', '', text)
    print(cleaned)
medium
Solution
Step 1: Check regex pattern correctness
The pattern r'\d' correctly matches digits (0-9).
Step 2: Verify code syntax and function usage
The code uses the raw string r'\d', which keeps the backslash literal so the regex engine receives \d, and the digits are removed as intended. The output is 'Price:  dollars', with two spaces where the number was.
Final Answer:
The code will run correctly and remove digits -> Option D
Quick Check:
r'\d' matches digits; re.sub removes them correctly
Hint: In raw strings, r'\d' matches digits; no extra escaping needed
Common Mistakes:
- Thinking r'\d' needs double escaping (only a non-raw string should be written as '\\d')
- Confusing '\d' with '\D' (non-digit)
- Assuming re.sub syntax is wrong
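The difference between \d and \D, and the double space left behind when the digits are removed, can be checked with a short sketch. The variable names no_digits, only_digits, and tidy are additions for this illustration.

```python
import re

text = "Price: 100 dollars"

no_digits = re.sub(r"\d", "", text)     # \d matches each digit
only_digits = re.sub(r"\D", "", text)   # \D matches every non-digit character

print(repr(no_digits))    # 'Price:  dollars'
print(repr(only_digits))  # '100'

# Collapsing whitespace afterwards tidies the gap left by the removed number
tidy = re.sub(r"\s+", " ", no_digits).strip()
print(repr(tidy))         # 'Price: dollars'
```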
5. You want to clean a text dataset by removing all URLs and extra spaces. Which regex pattern and code snippet correctly achieves this in Python?
    import re

    text = "Visit https://example.com now! Enjoy!"
    cleaned = re.sub(_____, ' ', text)
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    print(cleaned)
hard
Solution
Step 1: Identify a regex pattern that matches URLs
The pattern 'https?://' matches 'http://' or 'https://', and '\S+' matches the non-space characters that follow, so the whole URL is matched up to the next whitespace.
Step 2: Understand the code's cleaning steps
First, URLs are replaced by a space, then multiple spaces are reduced to one, and leading/trailing spaces are removed.
Final Answer:
r'https?://\S+' -> Option B
Quick Check:
Use 'https?://\S+' to remove URLs effectively
Hint: Use 'https?://' plus '\S+' to match full URLs
Common Mistakes:
- Using too narrow patterns missing https or full URL
- Not removing extra spaces after substitution
- Using patterns that match only partial URLs
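With the blank filled in, the snippet from question 5 can be run as below. The second example, with a URL directly followed by a period, is an added illustration of one limitation of this simple pattern.

```python
import re

URL = r"https?://\S+"   # http:// or https:// followed by non-space characters

text = "Visit https://example.com now!  Enjoy!"
no_urls = re.sub(URL, " ", text)
cleaned = re.sub(r"\s+", " ", no_urls).strip()
print(cleaned)  # Visit now! Enjoy!

# Limitation: \S+ also consumes punctuation attached to the end of the URL
trailing = re.sub(URL, "", "See https://example.com.")
print(repr(trailing))  # 'See '
```

For cleaning tasks where sentence punctuation after a URL matters, the pattern would need to be tightened; for simply stripping links from text, this form is often enough.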
