Challenge - 5 Problems

Text Cleaning Master

❓ Predict Output

intermediate

Output of basic text cleaning with lowercasing and punctuation removal

What is the output of this Python code that cleans a text by making it lowercase and removing punctuation?

Data Analysis Python

import string
text = "Hello, World! Welcome to Data Science."
cleaned = ''.join(ch for ch in text.lower() if ch not in string.punctuation)
print(cleaned)

A. hello world welcome to data science

B. Hello World Welcome to Data Science

C. hello, world! welcome to data science.

D. HELLO WORLD WELCOME TO DATA SCIENCE
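
The character-by-character filter in the question can also be written with a translation table. The following is a minimal sketch of that alternative, not part of the original challenge; the names `PUNCT_TABLE` and `strip_punctuation` and the sample sentence are assumptions made for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# (PUNCT_TABLE is a hypothetical name chosen for this sketch.)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Lowercase the text, then delete ASCII punctuation in one pass."""
    return text.lower().translate(PUNCT_TABLE)


# Hypothetical sample sentence, unrelated to the quiz input.
sample = "Wait... what?! (Really)"
result = strip_punctuation(sample)
print(result)  # wait what really
```

Both approaches only handle the characters listed in `string.punctuation`, so typographic quotes and other non-ASCII marks would pass through unchanged.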

❓ Data Output

intermediate

Result of tokenizing and removing stopwords from text

Given the following code that tokenizes text and removes common stopwords, what is the resulting list?

Data Analysis Python

text = "Data science is fun and exciting"
stopwords = {'is', 'and'}
tokens = text.lower().split()
filtered = [word for word in tokens if word not in stopwords]
print(filtered)

A. ['data', 'science', 'is', 'fun', 'and', 'exciting']

B. ['Data', 'science', 'fun', 'exciting']

C. ['data', 'science', 'fun', 'exciting']

D. ['data', 'science', 'fun', 'and', 'exciting']

🔧 Debug

advanced

Identify the error in this text cleaning function

What error, if any, does this code raise when it cleans text by removing digits and collapsing extra spaces?

Data Analysis Python

def clean_text(text):
    import re
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

print(clean_text('Data 123 science 456'))

A. No error, output: 'Data science'

B. SyntaxError: invalid syntax

C. AttributeError: 'str' object has no attribute 'sub'

D. TypeError: 'builtin_function_or_method' object is not callable

❓ Visualization

advanced

Visualizing word frequency after cleaning text

Which option shows the correct bar chart output for the word frequencies after cleaning and tokenizing the text?

Data Analysis Python

import matplotlib.pyplot as plt
from collections import Counter
text = "Data science is fun. Data science is exciting."
words = [w.lower().strip('.!') for w in text.split()]
counter = Counter(words)
plt.bar(counter.keys(), counter.values())
plt.show()

A. Bar chart with words ['data', 'science', 'is', 'fun', 'exciting'] and counts [1, 1, 1, 1, 1]

B. Bar chart with words ['data', 'science', 'is', 'fun', 'exciting'] and counts [2, 2, 2, 1, 1]

C. Bar chart with words ['Data', 'science', 'is', 'fun', 'exciting'] and counts [2, 2, 2, 1, 1]

D. Bar chart with words ['data', 'science', 'fun', 'exciting'] and counts [2, 2, 1, 1]
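
When reasoning about a chart like this, it can help to inspect the counts as plain data before plotting them. The sketch below does that on a different, hypothetical sentence; the variable names `ranked`, `labels`, and `heights` are assumptions for illustration and do not come from the challenge.

```python
from collections import Counter

# Hypothetical sample text, unrelated to the quiz input.
text = "Red fish, blue fish. One fish!"

# Same cleaning idea as the question: lowercase, then strip edge punctuation.
words = [w.lower().strip(".,!") for w in text.split()]
counts = Counter(words)

# most_common() sorts by count, highest first; ties keep first-seen order.
ranked = counts.most_common()
print(ranked)  # [('fish', 3), ('red', 1), ('blue', 1), ('one', 1)]

# Split into parallel sequences, which is the shape a bar chart call expects.
labels, heights = zip(*ranked)
```

Here `labels` and `heights` could be passed to `plt.bar(labels, heights)` to draw the bars in ranked order rather than in first-seen order.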

🚀 Application

expert

Predict the output of a complex text cleaning pipeline

Given this pipeline that removes URLs, lowercases, removes stopwords, and lemmatizes, what is the final list of words?

Data Analysis Python

import re
from nltk.stem import WordNetLemmatizer
text = "Visit https://example.com for data science tutorials! Data scientists love data."
text = re.sub(r'https?://\S+', '', text)
text = text.lower()
stopwords = {'for', 'is', 'the', 'and', 'a', 'of'}
words = text.split()
words = [w for w in words if w not in stopwords]
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in words]
print(words)

A. ['visit', 'data', 'science', 'tutorial', 'data', 'scientist', 'love', 'data']

B. ['visit', 'data', 'science', 'tutorials', 'data', 'scientists', 'love', 'data']

C. ['visit', 'data', 'science', 'tutorials!', 'data', 'scientist', 'love', 'data.']

D. ['visit', 'data', 'science', 'tutorial', 'data', 'scientists', 'love', 'data']
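
The tokenization step is one of the design choices in a pipeline like this. As a point of comparison, here is a minimal sketch of a regex-based tokenizer, shown on a hypothetical sentence. The pattern, the `TOKEN_RE` name, and the `tokenize` helper are assumptions for illustration and are not part of the challenge code.

```python
import re

# Assumption: a token is a run of lowercase letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return every run matching TOKEN_RE."""
    return TOKEN_RE.findall(text.lower())


# Hypothetical sample sentence, unrelated to the quiz input.
tokens = tokenize("Don't panic: it's only 3 tests!")
print(tokens)  # ["don't", 'panic', "it's", 'only', '3', 'tests']
```

The token list could then be handed to a stopword filter or a lemmatizer in place of the result of `text.split()`.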
