Text preprocessing (tokenization, stemming, lemmatization) in ML Python - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is tokenization in text preprocessing?
Tokenization is the process of breaking down text into smaller pieces called tokens, usually words or sentences, to make it easier for a computer to understand and analyze.
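A minimal sketch of word and sentence tokenization using only Python's standard `re` module. The helper names `simple_sentence_tokenize` and `simple_word_tokenize` are illustrative assumptions, not library functions, and the rules are deliberately naive (abbreviations such as "Dr." would be split incorrectly). Libraries such as NLTK or spaCy handle many more cases.

```python
import re

def simple_sentence_tokenize(text):
    """Naively split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(text):
    """Return words (keeping contractions like "don't" whole) and standalone punctuation."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

text = "The cats are running happily. They don't stop!"
sentences = simple_sentence_tokenize(text)
words = simple_word_tokenize(sentences[1])

print(sentences)  # two sentence strings
print(words)      # word and punctuation tokens of the second sentence
```

Treating punctuation as separate tokens is one possible design choice; some pipelines drop punctuation entirely instead.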
beginner
Explain stemming in simple terms.
Stemming trims words to a common stem by stripping suffixes with fixed rules, without checking what the word means. For example, 'running' becomes 'run'. This groups related word forms together, but the result is not always a real word: 'happily' becomes 'happili'.
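To illustrate the idea of rule-based suffix stripping, here is a toy stemmer. It is a sketch under loud assumptions: the suffix list and the `toy_stem` function are invented for this example and are not the Porter algorithm, so its output will not match a real stemmer such as NLTK's `PorterStemmer` in general.

```python
# Hypothetical suffix list, checked in this order (not the Porter rule set).
SUFFIXES = ("ing", "ly", "ed", "es", "s")

def toy_stem(word):
    """Lowercase the word and strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling such as "runn" -> "run" (but keep "fall", "pass").
            if (
                suffix in ("ing", "ed")
                and len(stem) >= 2
                and stem[-1] == stem[-2]
                and stem[-1] not in "aeiouls"
            ):
                stem = stem[:-1]
            return stem
    return word

tokens = ["The", "cats", "are", "running", "happily"]
print([toy_stem(t) for t in tokens])
```

Note that 'happily' is cut to a non-word here as well, which shows why stems are treated as grouping keys rather than readable text.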
intermediate
What does lemmatization do differently from stemming?
Lemmatization reduces a word to its base or dictionary form, called the lemma, using a vocabulary and the word's part of speech rather than suffix rules alone. For example, 'better' (as an adjective) becomes 'good'. Unlike stemming, it yields real words.
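The dictionary-lookup idea behind lemmatization can be sketched with a tiny hand-written table. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names made up for this example, and the part-of-speech labels are assumed to be supplied by the caller. A real lemmatizer (for instance NLTK's `WordNetLemmatizer`) relies on a full lexicon instead.

```python
# Hypothetical miniature lexicon: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("better", "adv"): "well",
    ("running", "verb"): "run",
    ("are", "verb"): "be",
    ("cats", "noun"): "cat",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos); unknown pairs are returned unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("better", "adj"))    # looked up as an adjective
print(toy_lemmatize("running", "verb"))  # looked up as a verb
print(toy_lemmatize("running", "noun"))  # not in the table for nouns, so unchanged
```

The same surface word maps to different lemmas depending on its part of speech, which is the key difference from suffix stripping.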
beginner
Why is text preprocessing important before training a machine learning model?
Text preprocessing cleans and simplifies text data so that models can learn patterns more easily. It reduces noise and collapses word variants into a shared form, which shrinks the vocabulary and often improves model accuracy.
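One way to see the noise-reduction effect is to count distinct tokens before and after a simple normalization step. This is only a sketch on a made-up three-sentence corpus; the function names are assumptions for the example, and the normalization (lowercasing plus keeping alphabetic runs) is just one of many possible choices.

```python
import re
from collections import Counter

def raw_tokens(text):
    """Split on whitespace only, so case and punctuation variants stay distinct."""
    return text.split()

def normalized_tokens(text):
    """Lowercase the text and keep only runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())

corpus = "The cat sat. the Cat sat! THE CAT SAT?"

raw_counts = Counter(raw_tokens(corpus))
clean_counts = Counter(normalized_tokens(corpus))

print(len(raw_counts), "distinct raw tokens")
print(len(clean_counts), "distinct normalized tokens")
```

With raw text, every case and punctuation variant counts as a separate feature; after normalization the same words are counted together.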
intermediate
Give an example of tokenization, stemming, and lemmatization for the sentence: 'The cats are running happily'.
Tokenization: ['The', 'cats', 'are', 'running', 'happily']
Stemming: ['The', 'cat', 'are', 'run', 'happili']
Lemmatization: ['The', 'cat', 'be', 'run', 'happily']
What is the main goal of tokenization?
A. Calculate word frequency
B. Split text into smaller pieces called tokens
C. Convert words to their base form
D. Remove stop words from text
Which technique removes word endings without considering meaning?
A. Stemming
B. Lemmatization
C. Tokenization
D. Parsing
Which method uses word meaning and context to find the base form?
A. Normalization
B. Stemming
C. Tokenization
D. Lemmatization
Why do we preprocess text before machine learning?
A. To make text longer
B. To translate text to another language
C. To clean and simplify text for better learning
D. To encrypt the text
Which of these is NOT a typical step in text preprocessing?
A. Image resizing
B. Stemming
C. Lemmatization
D. Tokenization
Describe the differences between tokenization, stemming, and lemmatization in text preprocessing.
Think about how each step changes the text and why.
Explain why text preprocessing is important before feeding text data into a machine learning model.
Consider what happens if raw text is used directly.