Regular expressions for text cleaning in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
What is the output of this regex substitution?
Given the text "Hello!!! Are you #1?", what is the result after applying re.sub(r'[^a-zA-Z0-9 ]', '', text)?
import re
text = "Hello!!! Are you #1?"
result = re.sub(r'[^a-zA-Z0-9 ]', '', text)
print(result)
A. Hello Are you 1?
B. Hello!!! Are you 1
C. Hello Are you #1
D. Hello Are you 1
💡 Hint
The regex removes all characters except letters, digits, and spaces.
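To see how a negated character class behaves, here is a minimal sketch on a different, hypothetical sample string (not the one from the problem). It assumes only Python's standard `re` module.

```python
import re

# Hypothetical sample text, chosen for illustration only.
sample = "Wow!! Room 42 @ 5pm..."

# [^...] is a negated class: it matches any single character NOT listed.
# Here that is anything other than ASCII letters, digits, and the space.
kept = re.sub(r'[^a-zA-Z0-9 ]', '', sample)

print(repr(kept))  # 'Wow Room 42  5pm'
```

Note the double space left behind where "@" was removed: the substitution deletes characters but does not collapse the spaces around them.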
Model Choice
intermediate
Which regex pattern removes all digits from a string?
You want to remove all digits from a text string using re.sub. Which pattern should you use?
A. r'\d+'
B. r'\D+'
C. r'\w+'
D. r'\s+'
💡 Hint
Digits are represented by \d in regex.
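As a small illustration of the shorthand classes, the sketch below contrasts `\d` (a digit) with its complement `\D` (any non-digit) on a hypothetical sample string.

```python
import re

# Hypothetical sample text, chosen for illustration only.
sample = "Order 66 shipped in 3 days"

# \d matches one digit; \D matches one character that is not a digit.
no_digits = re.sub(r'\d', '', sample)
only_digits = re.sub(r'\D', '', sample)

print(repr(no_digits))    # 'Order  shipped in  days'
print(repr(only_digits))  # '663'
```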
Metrics
advanced
How many tokens remain after cleaning?
Given the text "Data science 101: Clean, analyze, & visualize!", after applying re.sub(r'[^a-zA-Z ]', '', text).lower().split(), how many tokens are in the resulting list?
import re
text = "Data science 101: Clean, analyze, & visualize!"
cleaned = re.sub(r'[^a-zA-Z ]', '', text).lower().split()
print(len(cleaned))
A. 7
B. 5
C. 6
D. 4
💡 Hint
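Digits and punctuation are removed first; the string is then lowercased and split on whitespace. Called with no arguments, split() treats a run of spaces as a single separator, so it never produces empty tokens.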
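The clean, lowercase, and split steps can be sketched on a different, hypothetical sample string. The comparison with `split(' ')` is added here only to show why the argument-free form is the usual choice for tokenizing cleaned text.

```python
import re

# Hypothetical sample text, chosen for illustration only.
sample = "Top 10 Tips: Read & test!"

# Keep only ASCII letters and spaces; removed characters leave gaps behind.
letters_only = re.sub(r'[^a-zA-Z ]', '', sample)

# split() with no argument collapses runs of whitespace.
tokens = letters_only.lower().split()

# split(' ') splits on every single space and yields empty strings.
naive_tokens = letters_only.lower().split(' ')

print(tokens)        # ['top', 'tips', 'read', 'test']
print(naive_tokens)  # ['top', '', 'tips', 'read', '', 'test']
```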
🔧 Debug
advanced
Why does this regex fail to remove punctuation?
This code is meant to remove punctuation but does not work as expected. Why?
import re
text = "Hello, world!"
cleaned = re.sub(r'[\w]', '', text)
print(cleaned)
A. The pattern '[\w]' matches letters, digits, and underscores, so it removes them instead of punctuation.
B. The pattern '[\w]' matches punctuation only, so letters remain.
C. The pattern is missing a quantifier like '+' to match multiple characters.
D. The pattern should be '[^\w]' to remove punctuation.
💡 Hint
Check what \w matches in regex.
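One way to check is to list what `\w` and its complement `\W` match in a short string. This is a minimal sketch on a hypothetical sample; the last pattern, `[^\w\s]`, is one common way to target punctuation while leaving whitespace alone, offered as an assumption rather than as the intended answer.

```python
import re

# Hypothetical sample text, chosen for illustration only.
sample = "snake_case, v2!"

# \w matches letters, digits, and the underscore.
word_chars = re.findall(r'\w', sample)

# \W matches everything else, including spaces.
non_word = re.findall(r'\W', sample)

# [^\w\s] matches characters that are neither word characters nor whitespace.
no_punct = re.sub(r'[^\w\s]', '', sample)

print(''.join(word_chars))  # snake_casev2
print(non_word)             # [',', ' ', '!']
print(no_punct)             # snake_case v2
```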
🧠 Conceptual
expert
Which regex pattern best cleans URLs from text?
You want to remove URLs from text data using regex. Which pattern is most effective?
A. r'http://'
B. r'www\.\w+'
C. r'https?://\S+'
D. r'\S+\.com'
💡 Hint
URLs often start with http or https and continue until the next whitespace character.
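The idea in the hint can be sketched as follows, using a hypothetical sample string with placeholder domains. The pattern is a simple heuristic, not a full URL parser: because `\S+` runs to the next whitespace, it also swallows trailing punctuation such as a comma.

```python
import re

# Hypothetical sample text with placeholder domains.
sample = "Docs at https://example.com/a?b=1 and http://example.org, thanks"

# "s?" makes the "s" optional; \S+ consumes everything up to the next whitespace.
url_pattern = re.compile(r'https?://\S+')

urls = url_pattern.findall(sample)
stripped = url_pattern.sub('', sample)

# Removing URLs leaves double spaces, so collapse whitespace afterwards.
tidy = re.sub(r'\s+', ' ', stripped).strip()

print(urls)  # ['https://example.com/a?b=1', 'http://example.org,']
print(tidy)  # Docs at and thanks
```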