Recall & Review
beginner
What is tokenization in text preprocessing?
Tokenization is the process of breaking down text into smaller pieces called tokens, usually words or sentences, to make it easier for a computer to understand and analyze.
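As a rough illustration, word and sentence tokenization can be sketched with regular expressions. The function names `word_tokenize` and `sentence_tokenize` below are made up for this example, and the rules are deliberately simple; library tokenizers handle many more cases (abbreviations, hyphens, emoji, and so on).

```python
import re

def word_tokenize(text):
    # Treat runs of letters, digits, or apostrophes as words; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", text)

def sentence_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace (a naive rule).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

words = word_tokenize("The cats are running happily.")
sentences = sentence_tokenize("Cats run. Dogs bark!")

print(words)      # ['The', 'cats', 'are', 'running', 'happily']
print(sentences)  # ['Cats run.', 'Dogs bark!']
```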
beginner
Explain stemming in simple terms.
Stemming strips endings from a word using fixed rules to reduce it to a stem. For example, 'running' becomes 'run'. This groups related word forms together, but the stem is not always a real word: 'happily' becomes 'happili'.
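The idea can be sketched with a toy suffix stripper. This is not the Porter algorithm or any library's stemmer; `toy_stem` and its three rules are assumptions chosen only to show how rule-based stripping works and why the output may not be a dictionary word.

```python
def toy_stem(word):
    # Hypothetical, heavily simplified rules; real stemmers apply many more.
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo consonant doubling ('runn' -> 'run'), but keep 'll', 'ss', 'zz'.
        if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        # Drop a plural 's'.
        word = word[:-1]
    elif word.endswith("y") and len(word) > 3:
        # Replace a final 'y' with 'i', which can yield a non-word.
        word = word[:-1] + "i"
    return word

stems = [toy_stem(w) for w in ["running", "cats", "happily", "are"]]
print(stems)  # ['run', 'cat', 'happili', 'are']
```

Note that the rules never consult a dictionary, which is why 'happili' comes out even though it is not a word.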
intermediate
What does lemmatization do differently from stemming?
Lemmatization reduces a word to its base or dictionary form, called the lemma, taking the word's meaning and context (such as its part of speech) into account. For example, 'better' becomes 'good'. Unlike stemming, it returns a real word.
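A minimal sketch of the lookup idea follows. The `LEMMAS` table and `toy_lemmatize` function are hypothetical stand-ins for the large lexicon a real lemmatizer relies on; the point is only that the result depends on the part of speech as well as the word.

```python
# Tiny hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ("better", "adj"): "good",
    ("running", "verb"): "run",
    ("are", "verb"): "be",
    ("cats", "noun"): "cat",
}

def toy_lemmatize(word, pos):
    # Look up the (word, pos) pair; fall back to the lowercased word if unknown.
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(toy_lemmatize("better", "adj"))    # good
print(toy_lemmatize("running", "verb"))  # run
print(toy_lemmatize("running", "noun"))  # running
```

The last call shows the role of context: as a noun, 'running' is already its own dictionary form in this toy table.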
beginner
Why is text preprocessing important before training a machine learning model?
Text preprocessing cleans and simplifies text data, making it easier for models to learn patterns. It reduces noise, maps variations of the same word to a common form, and often improves model accuracy.
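One way to see the effect is to count distinct tokens before and after a simple cleanup. The sample string below is invented for illustration, and the cleanup (lowercasing plus dropping punctuation) is only one possible choice.

```python
import re

raw = "Cats run. cats run! CATS RUN?"

# Naive whitespace split: case and punctuation make every token look different.
raw_vocab = set(raw.split())

# Lowercase and keep only runs of letters.
clean_vocab = set(re.findall(r"[a-z]+", raw.lower()))

print(len(raw_vocab))       # 6
print(sorted(clean_vocab))  # ['cats', 'run']
```

Six surface forms collapse to two, so a model sees the same word consistently instead of treating 'Cats', 'cats' and 'CATS' as unrelated.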
intermediate
Give an example of tokenization, stemming, and lemmatization for the sentence: 'The cats are running happily'.
Tokenization: ['The', 'cats', 'are', 'running', 'happily']
Stemming: ['The', 'cat', 'are', 'run', 'happili']
Lemmatization: ['The', 'cat', 'be', 'run', 'happily']
What is the main goal of tokenization?
Tokenization breaks text into tokens, such as words or sentences, to prepare it for analysis.
Which technique removes word endings without considering meaning?
Stemming cuts word endings to get root forms but may not produce real words.
Which method uses word meaning and context to find the base form?
Lemmatization considers meaning and context to return dictionary forms of words.
Why do we preprocess text before machine learning?
Preprocessing cleans and simplifies text so models can learn patterns more effectively.
Which of these is NOT a typical step in text preprocessing?
Image resizing is unrelated to text preprocessing.
Describe the differences between tokenization, stemming, and lemmatization in text preprocessing.
Think about how each step changes the text and why.
Explain why text preprocessing is important before feeding text data into a machine learning model.
Consider what happens if raw text is used directly.