Challenge - 5 Problems
Text Cleaning Master
❓ Predict Output
intermediate
Output of basic text cleaning with lowercasing and punctuation removal
What is the output of this Python code that cleans a text by making it lowercase and removing punctuation?
Data Analysis Python
import string

text = "Hello, World! Welcome to Data Science."
cleaned = ''.join(ch for ch in text.lower() if ch not in string.punctuation)
print(cleaned)
💡 Hint
Think about what happens when you convert text to lowercase and remove punctuation characters.
Explanation
The code converts the text to lowercase and drops every character found in string.punctuation (here the comma, the exclamation mark, and the period). Only lowercase letters and spaces remain, so it prints: hello world welcome to data science
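As a cross-check, the same cleaning can be sketched with str.translate, which deletes every punctuation character in one pass. This is an alternative to the generator expression in the question, not the quiz's own solution; the name `table` is chosen here for illustration.

```python
import string

text = "Hello, World! Welcome to Data Science."

# Build a translation table that maps each punctuation character to None.
table = str.maketrans("", "", string.punctuation)

# Lowercase first, then delete the punctuation characters.
cleaned = text.lower().translate(table)
print(cleaned)  # hello world welcome to data science
```

Both approaches rely on string.punctuation, which covers ASCII punctuation only; characters such as curly quotes would pass through unchanged.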
❓ data_output
intermediate
Result of tokenizing and removing stopwords from text
Given the following code that tokenizes text and removes common stopwords, what is the resulting list?
Data Analysis Python
text = "Data science is fun and exciting"
stopwords = {'is', 'and'}
tokens = text.lower().split()
filtered = [word for word in tokens if word not in stopwords]
print(filtered)
💡 Hint
Stopwords are removed after converting all words to lowercase.
Explanation
The code lowercases the text and splits it on whitespace, then drops the stopwords 'is' and 'and'. The remaining words keep their original order, so it prints: ['data', 'science', 'fun', 'exciting']
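The filtering step can be wrapped in a small helper to make the order of operations explicit. The function name `remove_stopwords` is hypothetical and used only for this sketch. The second call illustrates a limitation of whitespace splitting: a stopword with punctuation attached is not matched.

```python
def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop tokens found in stopwords."""
    tokens = text.lower().split()
    return [word for word in tokens if word not in stopwords]

stopwords = {"is", "and"}

filtered = remove_stopwords("Data science is fun and exciting", stopwords)
print(filtered)  # ['data', 'science', 'fun', 'exciting']

# 'and,' is not equal to 'and', so it survives the filter.
with_punct = remove_stopwords("Fun and, exciting", stopwords)
print(with_punct)  # ['fun', 'and,', 'exciting']
```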
🔧 Debug
advanced
Identify the error in this text cleaning function
What error does this code raise when trying to clean text by removing digits and extra spaces?
Data Analysis Python
import re

def clean_text(text):
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

print(clean_text('Data 123 science 456'))
💡 Hint
Check how the strip method is used in the code.
Explanation
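As shown, the code calls strip() with parentheses, so it raises no error and prints: Data science. The bug this challenge targets is referencing strip without parentheses. In that case the function returns a bound method object instead of a string. Printing that object shows its representation rather than raising an error; a TypeError appears only when the method object is later used where a string is required, for example when it is passed to re.sub.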
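The difference between calling strip() and merely referencing strip can be checked directly. In this sketch, `clean_text_no_call` is a hypothetical variant written to reproduce the missing-parentheses mistake; it is not part of the original challenge code.

```python
import re

def clean_text(text):
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()   # method is called

def clean_text_no_call(text):
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip     # method is only referenced

cleaned = clean_text("Data 123 science 456")
method_obj = clean_text_no_call("Data 123 science 456")

print(cleaned)               # Data science
print(callable(method_obj))  # True: a bound method, not a string

# Using the method object where a string is expected raises TypeError.
try:
    re.sub(r"\s+", " ", method_obj)
    raised_type_error = False
except TypeError:
    raised_type_error = True
print(raised_type_error)     # True
```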
❓ visualization
advanced
Visualizing word frequency after cleaning text
Which option shows the correct bar chart output for the word frequencies after cleaning and tokenizing the text?
Data Analysis Python
import matplotlib.pyplot as plt
from collections import Counter

text = "Data science is fun. Data science is exciting."
words = [w.lower().strip('.!') for w in text.split()]
counter = Counter(words)
plt.bar(counter.keys(), counter.values())
plt.show()
💡 Hint
Check how words are lowercased and punctuation is stripped before counting.
Explanation
The code lowercases each word and strips leading and trailing '.' and '!' characters, then counts each word's frequency. The chart therefore shows a bar of height 2 for 'data', 'science', and 'is', and a bar of height 1 for 'fun' and 'exciting'.
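The bar heights can be verified without drawing anything by inspecting the Counter directly. This sketch reuses the cleaning step from the question and leaves out matplotlib; the name `counts` is introduced here for illustration.

```python
from collections import Counter

text = "Data science is fun. Data science is exciting."

# Lowercase each token and strip leading/trailing '.' and '!'.
words = [w.lower().strip(".!") for w in text.split()]
counter = Counter(words)

# Counter keeps first-seen order, which is also the left-to-right bar order.
counts = dict(counter)
print(counts)  # {'data': 2, 'science': 2, 'is': 2, 'fun': 1, 'exciting': 1}
```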
🚀 Application
expert
Predict the output of a complex text cleaning pipeline
Given this pipeline that removes URLs, lowercases, removes stopwords, and lemmatizes, what is the final list of words?
Data Analysis Python
import re
from nltk.stem import WordNetLemmatizer

text = "Visit https://example.com for data science tutorials! Data scientists love data."
text = re.sub(r'https?://\S+', '', text)
text = text.lower()
stopwords = {'for', 'is', 'the', 'and', 'a', 'of'}
words = text.split()
words = [w for w in words if w not in stopwords]
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in words]
print(words)
💡 Hint
Remember that the lemmatizer converts plurals to singular but does not remove punctuation. Punctuation disappears only if you strip it before lemmatization.
Explanation
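The URL is removed, the text is lowercased, and the stopword 'for' is dropped. split() separates on whitespace only, so punctuation stays attached and the tokens include 'tutorials!' and 'data.'. The lemmatizer reduces 'scientists' to 'scientist', but it returns any token it cannot find in WordNet unchanged, so 'tutorials!' and 'data.' pass through as they are. Neither split() nor the lemmatizer removes punctuation, so the correct option is the list that still contains 'tutorials!' and 'data.'. To obtain 'tutorial', strip punctuation before lemmatizing.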
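The tokens that reach the lemmatizer can be inspected with the standard library alone. This sketch deliberately stops before lemmatization, because WordNetLemmatizer needs the WordNet corpus to be downloaded; the names `raw_tokens` and `stripped_tokens` are introduced here for illustration.

```python
import re
import string

text = "Visit https://example.com for data science tutorials! Data scientists love data."
text = re.sub(r"https?://\S+", "", text).lower()
stopwords = {"for", "is", "the", "and", "a", "of"}

# Whitespace splitting keeps punctuation attached to the tokens.
raw_tokens = [w for w in text.split() if w not in stopwords]
print(raw_tokens)
# ['visit', 'data', 'science', 'tutorials!', 'data', 'scientists', 'love', 'data.']

# Stripping punctuation first gives the lemmatizer clean words to look up.
stripped_tokens = [w.strip(string.punctuation) for w in raw_tokens]
print(stripped_tokens)
# ['visit', 'data', 'science', 'tutorials', 'data', 'scientists', 'love', 'data']
```

Passing `stripped_tokens` rather than `raw_tokens` to the lemmatizer is what would allow 'tutorials' to be reduced to its singular form.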