
Text cleaning pipeline in Data Analysis Python - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of basic text cleaning with lowercasing and punctuation removal
What is the output of this Python code that cleans a text by making it lowercase and removing punctuation?
Data Analysis Python
import string
text = "Hello, World! Welcome to Data Science."
cleaned = ''.join(ch for ch in text.lower() if ch not in string.punctuation)
print(cleaned)
A. hello world welcome to data science
B. Hello World Welcome to Data Science
C. hello, world! welcome to data science.
D. HELLO WORLD WELCOME TO DATA SCIENCE
💡 Hint
Think about what happens when you convert text to lowercase and remove punctuation characters.
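The same lowercase-then-strip idea can also be written with a translation table instead of a character-by-character join. This is a minimal sketch on a different sample sentence; the helper name `normalize` and the sample text are assumptions made for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase the text, then delete ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(_PUNCT_TABLE)


sample = "It's 9 a.m., right?"
result = normalize(sample)
print(result)  # its 9 am right
```

Spaces and digits are not in `string.punctuation`, so they pass through unchanged; only case and punctuation are affected.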
Data Output
intermediate
Result of tokenizing and removing stopwords from text
Given the following code that tokenizes text and removes common stopwords, what is the resulting list?
Data Analysis Python
text = "Data science is fun and exciting"
stopwords = {'is', 'and'}
tokens = text.lower().split()
filtered = [word for word in tokens if word not in stopwords]
print(filtered)
A. ['data', 'science', 'is', 'fun', 'and', 'exciting']
B. ['Data', 'science', 'fun', 'exciting']
C. ['data', 'science', 'fun', 'exciting']
D. ['data', 'science', 'fun', 'and', 'exciting']
💡 Hint
Stopwords are removed after converting all words to lowercase.
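To see why the order of lowercasing and filtering matters, here is a small sketch on a different sentence. The function `remove_stopwords`, its `lowercase` flag, and the sample data are hypothetical and exist only for this illustration.

```python
def remove_stopwords(text, stopwords, lowercase=True):
    """Split on whitespace and drop tokens found in `stopwords`.

    When `lowercase` is False, tokens keep their original case, so a
    capitalized stopword such as 'The' does not match the set entry 'the'.
    """
    tokens = text.lower().split() if lowercase else text.split()
    return [t for t in tokens if t not in stopwords]


stop = {"the", "and"}
sample = "The cat and The dog"

with_lower = remove_stopwords(sample, stop)
without_lower = remove_stopwords(sample, stop, lowercase=False)

print(with_lower)     # ['cat', 'dog']
print(without_lower)  # ['The', 'cat', 'The', 'dog']
```

Set membership is case-sensitive, which is why the comparison is usually done on lowercased tokens.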
🔧 Debug
advanced
Identify the error in this text cleaning function
What happens when this code runs to clean text by removing digits and extra spaces? Does it raise an error, and if so, which one?
Data Analysis Python
def clean_text(text):
    import re
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

print(clean_text('Data 123 science 456'))
A. No error, output: 'Data science'
B. SyntaxError: invalid syntax
C. AttributeError: 'str' object has no attribute 'sub'
D. TypeError: 'builtin_function_or_method' object is not callable
💡 Hint
Trace each `re.sub` call and the final `strip()` call step by step, and check whether any of them is actually invalid.
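One way to reason about a regex-based cleaner is to look at the intermediate string after each step. The sketch below does that on a different input; the function name `strip_digits_stepwise` and the sample sentence are assumptions made for illustration.

```python
import re


def strip_digits_stepwise(text):
    """Return the string after each cleaning step (hypothetical helper)."""
    # Step 1: delete runs of digits; the surrounding spaces stay behind.
    no_digits = re.sub(r"\d+", "", text)
    # Step 2: collapse every run of whitespace into a single space.
    collapsed = re.sub(r"\s+", " ", no_digits)
    # Step 3: trim the space left over at the start or end.
    trimmed = collapsed.strip()
    return no_digits, collapsed, trimmed


no_digits, collapsed, trimmed = strip_digits_stepwise("7 apples and 12 pears")
print(repr(no_digits))  # ' apples and  pears'
print(repr(collapsed))  # ' apples and pears'
print(repr(trimmed))    # 'apples and pears'
```

Collapsing whitespace alone still leaves one space at the edge, so the final `strip()` is what produces a tidy result.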
Visualization
advanced
Visualizing word frequency after cleaning text
Which option shows the correct bar chart output for the word frequencies after cleaning and tokenizing the text?
Data Analysis Python
import matplotlib.pyplot as plt
from collections import Counter
text = "Data science is fun. Data science is exciting."
words = [w.lower().strip('.!') for w in text.split()]
counter = Counter(words)
plt.bar(counter.keys(), counter.values())
plt.show()
A. Bar chart with words ['data', 'science', 'is', 'fun', 'exciting'] and counts [1, 1, 1, 1, 1]
B. Bar chart with words ['data', 'science', 'is', 'fun', 'exciting'] and counts [2, 2, 2, 1, 1]
C. Bar chart with words ['Data', 'science', 'is', 'fun', 'exciting'] and counts [2, 2, 2, 1, 1]
D. Bar chart with words ['data', 'science', 'fun', 'exciting'] and counts [2, 2, 1, 1]
💡 Hint
Check how words are lowercased and punctuation is stripped before counting.
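The bar labels and heights can be inspected without drawing anything by looking at the `Counter` directly. This is a minimal sketch on a different sentence; the sample text and the variable names `labels` and `heights` are assumptions made for illustration.

```python
from collections import Counter

sample = "Red fish. Blue fish!"

# Lowercase each whitespace-separated token and strip '.' or '!' from its edges.
words = [w.lower().strip(".!") for w in sample.split()]
counts = Counter(words)

# These are the sequences a call like plt.bar(labels, heights) would receive.
labels = list(counts.keys())
heights = list(counts.values())

print(labels)   # ['red', 'fish', 'blue']
print(heights)  # [1, 2, 1]
```

A `Counter` keeps keys in first-seen order, so the bars appear in the order the words first occur in the text.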
🚀 Application
expert
Predict the output of a complex text cleaning pipeline
Given this pipeline that removes URLs, lowercases, removes stopwords, and lemmatizes, what is the final list of words?
Data Analysis Python
import re
from nltk.stem import WordNetLemmatizer
text = "Visit https://example.com for data science tutorials! Data scientists love data."
text = re.sub(r'https?://\S+', '', text)
text = text.lower()
stopwords = {'for', 'is', 'the', 'and', 'a', 'of'}
words = text.split()
words = [w for w in words if w not in stopwords]
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in words]
print(words)
A. ['visit', 'data', 'science', 'tutorial', 'data', 'scientist', 'love', 'data']
B. ['visit', 'data', 'science', 'tutorials', 'data', 'scientists', 'love', 'data']
C. ['visit', 'data', 'science', 'tutorials!', 'data', 'scientist', 'love', 'data.']
D. ['visit', 'data', 'science', 'tutorial', 'data', 'scientists', 'love', 'data']
💡 Hint
Remember that the lemmatizer converts plurals to singular but does not remove punctuation. A token keeps its trailing punctuation unless it is stripped before lemmatization.
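Whitespace splitting leaves punctuation attached to tokens, so a pipeline that wants clean inputs for a later step would typically strip it first. The sketch below shows only that stripping step on a different sentence; the lemmatizer call is deliberately left out so the code runs without NLTK data, and the function name `strip_token_punctuation` is hypothetical.

```python
import string


def strip_token_punctuation(tokens):
    """Strip ASCII punctuation from both ends of each token (hypothetical helper).

    Tokens that consist only of punctuation become empty and are dropped.
    """
    stripped = (t.strip(string.punctuation) for t in tokens)
    return [t for t in stripped if t]


sample = "Great tutorials! Really."
raw_tokens = sample.lower().split()
clean_tokens = strip_token_punctuation(raw_tokens)

print(raw_tokens)    # ['great', 'tutorials!', 'really.']
print(clean_tokens)  # ['great', 'tutorials', 'really']
```

Passing `clean_tokens` rather than `raw_tokens` to a later normalization step means that step sees plain words instead of strings such as 'tutorials!'.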