Challenge - 5 Problems

Tokenization Master

❓ Predict Output

intermediate

Output of simple whitespace tokenization

What is the output of this Python code that splits a sentence into tokens using whitespace?

sentence = "Data science is fun and exciting"
tokens = sentence.split()
print(tokens)

A. ['Data', 'science', 'is', 'fun', 'and', 'exciting']

B. ['Data,', 'science,', 'is,', 'fun,', 'and,', 'exciting']

C. ['Data science is fun and exciting']

D. ['Data', 'science', 'is', 'fun', 'and']
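
A related detail worth knowing: calling `split()` with no argument and calling `split(" ")` behave differently when the input has repeated spaces. The sketch below uses a made-up sample string, not the one from the question.

```python
# Sample string with a double space between the first two words (illustrative only).
text = "Data  science is fun"

# With no argument, split() treats any run of whitespace as one separator.
default_split = text.split()
print(default_split)  # → ['Data', 'science', 'is', 'fun']

# With an explicit " ", every single space is a separator,
# so the double space yields an empty string token.
space_split = text.split(" ")
print(space_split)  # → ['Data', '', 'science', 'is', 'fun']
```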

❓ Data Output

intermediate

Tokenizing with punctuation removal

Given the code below, what is the resulting list of tokens after removing punctuation and splitting?

import string
text = "Hello, world! Let's learn tokenization."
tokens = text.translate(str.maketrans('', '', string.punctuation)).split()
print(tokens)

A. ['Hello', 'world', 'Lets', 'learn', 'tokenization']

B. ['Hello,', 'world!', "Let's", 'learn', 'tokenization.']

C. ['Hello', 'world!', "Let's", 'learn', 'tokenization']

D. ['Hello', 'world', "Let's", 'learn', 'tokenization']

❓ Visualization

advanced

Visualizing token frequency distribution

Which option shows the correct bar chart of token frequencies from the given text?

from collections import Counter
import matplotlib.pyplot as plt
text = "apple banana apple orange banana apple"
tokens = text.split()
counter = Counter(tokens)
plt.bar(counter.keys(), counter.values())
plt.show()

A. Bar chart with all tokens at equal height

B. Bar chart with 'banana' highest, then 'apple', then 'orange'

C. Bar chart with 'orange' highest, then 'banana', then 'apple'

D. Bar chart with 'apple' highest, then 'banana', then 'orange'
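
When reasoning about a frequency chart like this, it can help to print the counts before plotting. The sketch below is one way to do that with `Counter.most_common()`; the helper name `token_counts` and the sample sentence are assumptions for illustration, not part of the question.

```python
from collections import Counter

def token_counts(text):
    """Count whitespace-separated tokens, most frequent first (hypothetical helper)."""
    return Counter(text.split()).most_common()

# Illustrative sample, deliberately different from the question's text.
print(token_counts("to be or not to be"))
# → [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```

Tokens with equal counts come back in the order they were first seen, which is also the order `plt.bar(counter.keys(), counter.values())` would draw them.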

🧠 Conceptual

advanced

Understanding tokenization challenges

Which option best describes a common challenge in tokenization for natural language processing?

A. Removing all stopwords before tokenization

B. Handling contractions and punctuation correctly to avoid losing meaning

C. Converting all tokens to uppercase before analysis

D. Sorting tokens alphabetically after splitting
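
To see how tokenizer design affects what survives, the sketch below compares two simple approaches on a made-up sentence. Both function names and the regular expression are assumptions chosen for illustration; real NLP tokenizers apply more elaborate rules.

```python
import re
import string

def strip_punct_tokens(text):
    """Delete every ASCII punctuation character, then split on whitespace."""
    return text.translate(str.maketrans("", "", string.punctuation)).split()

def regex_tokens(text):
    """Match alphanumeric runs, allowing apostrophes inside a word."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)

sample = "We can't stop; it's well-known."
print(strip_punct_tokens(sample))  # → ['We', 'cant', 'stop', 'its', 'wellknown']
print(regex_tokens(sample))        # → ['We', "can't", 'stop', "it's", 'well', 'known']
```

Neither result is right in every setting: the first merges "well-known" into one token and turns "it's" into "its", while the second keeps contractions but splits the hyphenated word.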

🔧 Debug

expert

Identify the error in tokenization code

What error does this code produce when it tries to tokenize text by splitting on whitespace?

text = None
tokens = text.split()
print(tokens)

A. SyntaxError: invalid syntax

B. TypeError: split() missing 1 required positional argument

C. AttributeError: 'NoneType' object has no attribute 'split'

D. ValueError: empty string cannot be split
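
Real datasets often contain missing text values, so tokenization code usually needs a guard before calling string methods. The sketch below shows one possible guard; the name `safe_tokenize` and the choice to return an empty list for non-string input are assumptions, and other policies (raising an error, skipping the row) may suit a given pipeline better.

```python
def safe_tokenize(text):
    """Return whitespace tokens, or an empty list when text is not a string.

    Hypothetical helper: returning [] for missing input is one policy among several.
    """
    if not isinstance(text, str):
        return []
    return text.split()

print(safe_tokenize("Data science is fun"))  # → ['Data', 'science', 'is', 'fun']
print(safe_tokenize(None))                   # → []
```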
