Challenge - 5 Problems

❓ Predict Output

intermediate

What is the output of this word tokenization code?

Given the following Python code using NLTK for word tokenization, what is the output list?

NLP

from nltk.tokenize import word_tokenize
text = "Hello world! Let's test tokenization."
tokens = word_tokenize(text)
print(tokens)

A. ['Hello', 'world', '!', "Let's", 'test', 'tokenization', '.']

B. ['Hello', 'world!', "Let's", 'test', 'tokenization.']

C. ['Hello', 'world', '!', 'Let', "'s", 'test', 'tokenization', '.']

D. ['Hello world!', "Let's test tokenization."]

🧠 Conceptual

intermediate

Which option correctly describes sentence tokenization?

What does sentence tokenization do in Natural Language Processing?

A. Converts words into numerical vectors for machine learning.

B. Splits text into individual words, separating punctuation.

C. Removes stopwords from the text.

D. Splits text into sentences based on punctuation and capitalization.

❓ Metrics

advanced

How many sentences are produced by this sentence tokenizer?

Using NLTK's sent_tokenize on the text below, how many sentences are produced? "Dr. Smith loves AI. He works at OpenAI! Do you know him?"

NLP

from nltk.tokenize import sent_tokenize
text = "Dr. Smith loves AI. He works at OpenAI! Do you know him?"
sentences = sent_tokenize(text)
print(len(sentences))

🔧 Debug

advanced

What error does this tokenization code raise?

Consider this code snippet:

from nltk.tokenize import word_tokenize
text = None
tokens = word_tokenize(text)
print(tokens)

What error will this code raise?

A. TypeError: expected string or bytes-like object

B. NameError: name 'word_tokenize' is not defined

C. AttributeError: 'NoneType' object has no attribute 'split'

D. ValueError: empty string passed to tokenizer
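Whichever exception the library raises here, pipelines often guard against missing text before it reaches the tokenizer. The sketch below is one way to do that, under stated assumptions: `tokenize_or_empty` is a hypothetical helper, and the regex inside it is a simplified stand-in for a real word tokenizer, not NLTK's algorithm.

```python
import re

def tokenize_or_empty(text):
    """Hypothetical guard: treat a missing document as having no tokens."""
    if text is None:
        return []
    # Simplified stand-in tokenizer: runs of word characters,
    # or any single non-space punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

missing = tokenize_or_empty(None)
present = tokenize_or_empty("Hello world!")
print(missing)
print(present)
```

Returning an empty list is a design choice; raising a descriptive error early is an equally reasonable alternative when missing text indicates a bug upstream.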

❓ Model Choice

expert

Which tokenizer is best for splitting text into subword units for transformer models?

You want to prepare text input for a transformer-based language model that uses subword tokenization. Which tokenizer type should you choose?

A. Character tokenizer that splits text into individual characters

B. Byte-Pair Encoding (BPE) tokenizer that splits words into subword units

C. Whitespace tokenizer that splits text only on spaces

D. Sentence tokenizer that splits text into sentences
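To make the idea of subword units concrete, here is a toy sketch of how Byte-Pair Encoding learns merges: it repeatedly joins the most frequent adjacent pair of symbols. The function names and the tiny word-frequency table are assumptions made for illustration; production tokenizers add byte-level handling, special tokens, and other details omitted here.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a word-frequency table."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

# Hypothetical toy corpus: word -> frequency.
merges, vocab = learn_merges({"hug": 5, "hugs": 2, "bug": 3}, num_merges=2)
print(merges)
print(vocab)
```

After two merges, the frequent word "hug" has become a single symbol, while the rarer "bug" is still represented as smaller pieces.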

Practice

1. What is the main purpose of tokenization in natural language processing?

easy

A. To remove stop words from text

B. To translate text into another language

C. To split text into smaller units like words or sentences

D. To generate new sentences from text

Tokenization (word and sentence) in NLP - Practice Problems & Coding Challenges

Solution

Step 1: Understand tokenization

Step 2: Identify the main goal

Final Answer: C. To split text into smaller units like words or sentences

Quick Check:
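As a rough illustration of what "splitting text into smaller units" means, the sketch below uses only Python's `re` module. Both helper functions are hypothetical and far simpler than NLTK's `word_tokenize` and `sent_tokenize`; they only show the shape of the output.

```python
import re

def split_words(text):
    # Words (keeping internal apostrophes) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def split_sentences(text):
    # Break wherever ., ! or ? is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Tokenization is simple. It splits text!"
words = split_words(text)
sentences = split_sentences(text)
print(words)
print(sentences)
```

The same text yields eight word-level tokens but only two sentence-level ones, which is the difference between the two granularities.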

Solution

Step 1: Check correct import and function

Step 2: Verify code correctness

Final Answer:

Quick Check:

Solution

Step 1: Understand sent_tokenize function

Step 2: Apply sent_tokenize to the text

Final Answer:

Quick Check:
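Sentence splitting is harder than breaking on every period, because abbreviations such as "Mr." also end in one. The sketch below contrasts a naive splitter with one that consults a small abbreviation list. The list and both functions are assumptions for illustration only; NLTK's `sent_tokenize` relies on a trained Punkt model rather than a hand-written list.

```python
import re

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof."}

def naive_split(text):
    # Treats every ., ! or ? followed by whitespace as a boundary.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def abbreviation_aware_split(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?"
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Mr. Lee teaches NLP. He likes parsing! Any questions?"
naive = naive_split(text)
aware = abbreviation_aware_split(text)
print(len(naive), naive)
print(len(aware), aware)
```

The naive version breaks after "Mr." and reports one sentence too many; the abbreviation-aware version keeps the title attached to the name.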

Solution

Step 1: Check how word_tokenize is imported

Step 2: Identify correct import

Final Answer:

Quick Check:

Solution

Step 1: Understand the need to preserve sentence boundaries

Step 2: Apply sent_tokenize then word_tokenize

Final Answer:

Quick Check:
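The two steps above (sentences first, then words within each sentence) can be sketched as a nested list. The regex helpers here are simplified stand-ins so the example runs without downloading NLTK data; with NLTK installed, `sent_tokenize` and `word_tokenize` would take their place.

```python
import re

def split_sentences(text):
    # Stand-in for a sentence tokenizer.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_words(sentence):
    # Stand-in for a word tokenizer.
    return re.findall(r"\w+|[^\w\s]", sentence)

def tokenize_by_sentence(text):
    """Return one list of word tokens per sentence."""
    return [split_words(sentence) for sentence in split_sentences(text)]

nested = tokenize_by_sentence("I like NLP. Do you?")
print(nested)
```

Keeping one inner list per sentence preserves the boundaries that a single flat list of words would lose.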