Recall & Review
beginner
What is tokenization in Natural Language Processing?
Tokenization is the process of breaking down text into smaller pieces called tokens, which can be words, sentences, or subwords. It helps computers understand and analyze text.
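The definition above can be sketched with a tiny regex-based word tokenizer (a minimal illustration, not how production NLP libraries work internally):

```python
import re

def word_tokenize(text):
    # Match either a run of word characters or a single
    # non-space, non-word character (e.g. punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Tokenization breaks text into tokens."))
# → ['Tokenization', 'breaks', 'text', 'into', 'tokens', '.']
```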
beginner
What is the difference between word tokenization and sentence tokenization?
Word tokenization splits text into individual words, while sentence tokenization splits text into sentences. Both help organize text for easier processing.
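A naive sentence tokenizer can be sketched by splitting on sentence-ending punctuation followed by whitespace (a simplification; real sentence splitters also handle abbreviations like "Dr." correctly):

```python
import re

def sentence_tokenize(text):
    # Split after ., !, or ? when followed by whitespace,
    # keeping the punctuation attached to its sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(sentence_tokenize("Hello world! How are you? I am fine."))
# → ['Hello world!', 'How are you?', 'I am fine.']
```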
intermediate
Why is tokenization important before training an NLP model?
Tokenization converts raw text into manageable pieces so models can learn patterns. Without tokenization, models can't understand the structure of language.
beginner
Example: Tokenize the sentence 'Hello world! How are you?' into words.
The word tokens are: ['Hello', 'world', '!', 'How', 'are', 'you', '?']
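The token list above can be reproduced with a simple regex that separates words from punctuation (a minimal sketch; libraries such as NLTK offer more robust tokenizers):

```python
import re

# Words become one token each; punctuation marks become their own tokens.
tokens = re.findall(r"\w+|[^\w\s]", "Hello world! How are you?")
print(tokens)
# → ['Hello', 'world', '!', 'How', 'are', 'you', '?']
```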
intermediate
What challenges can arise during tokenization?
Challenges include handling punctuation, contractions (like "don't"), abbreviations, and languages without spaces. Good tokenization handles these well.
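The contraction challenge can be demonstrated directly: a naive word-character pattern splits "don't" into three tokens, while a pattern that permits an internal apostrophe keeps it whole (both patterns are illustrative assumptions, not a standard tokenizer):

```python
import re

NAIVE = r"\w+|[^\w\s]"              # treats the apostrophe as punctuation
CONTRACTION_AWARE = r"\w+(?:'\w+)?|[^\w\s]"  # allows one internal 'xx suffix

print(re.findall(NAIVE, "don't"))              # → ['don', "'", 't']
print(re.findall(CONTRACTION_AWARE, "don't"))  # → ["don't"]
```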
What does sentence tokenization do?
Sentence tokenization breaks text into sentences, helping analyze text at the sentence level.
Which of these is a word token from the sentence 'I can't go'?
Many tokenization methods keep the contraction "can't" intact as a single word token.
Why do we tokenize text before feeding it to an NLP model?
Tokenization breaks text into tokens so models can process and learn from the text.
Which punctuation is usually treated as a separate token in word tokenization?
Punctuation marks such as commas, periods, and question marks are often treated as separate tokens.
Which is NOT a common challenge in tokenization?
English uses spaces to separate words, so handling spaces is straightforward compared to other challenges.
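That last point can be seen with Python's built-in `str.split`, which is often enough for space-separated English text (though it leaves punctuation attached to words):

```python
# Whitespace splitting alone handles space-separated English words.
words = "English uses spaces to separate words".split()
print(words)
# → ['English', 'uses', 'spaces', 'to', 'separate', 'words']
```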
Explain what tokenization is and why it is important in NLP.
Think about how computers read text and why smaller pieces help.
Describe the difference between word tokenization and sentence tokenization with examples.
Consider how you would split 'Hello world! How are you?'