beginner

What is tokenization in text preprocessing?

Tokenization is the process of breaking down text into smaller pieces called tokens, usually words or sentences, to make it easier for a computer to understand and analyze.
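
A minimal sketch of word and sentence tokenization using only Python's standard library. The function names `tokenize_words` and `tokenize_sentences` and the regular expressions are illustrative assumptions; real tokenizers handle abbreviations, hyphens, and punctuation far more carefully.

```python
import re

def tokenize_words(text):
    """Return word tokens: runs of letters, digits, or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def tokenize_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "The cats are running happily. They like it!"
print(tokenize_words(text))      # word-level tokens, punctuation dropped
print(tokenize_sentences(text))  # sentence-level tokens
```

This naive sentence splitter would wrongly break on abbreviations such as "Dr. Smith", which is one reason library tokenizers are usually preferred in practice.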

beginner

Explain stemming in simple terms.

Stemming trims words toward a root form by stripping endings with simple rules, without looking at meaning. For example, 'running' becomes 'run'. It helps group related words together, but the result is not always a real word (for example, 'happily' becomes 'happili').
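
The idea of rule-based suffix stripping can be sketched as follows. This is a toy rule set written for illustration, not the Porter algorithm or any library's stemmer; the function name `toy_stem` and its three suffix rules are assumptions.

```python
def toy_stem(word):
    """Strip one common suffix using fixed rules; no dictionary is consulted."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if suffix != "s" and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    # Turn a final consonant + "y" into "i", e.g. "happily" -> "happili".
    if len(word) > 2 and word.endswith("y") and word[-2] not in "aeiou":
        return word[:-1] + "i"
    return word

print([toy_stem(w) for w in ["The", "cats", "are", "running", "happily"]])
```

Because the rules never check meaning, the same function also turns 'this' into 'thi', which shows how a stemmer can produce strings that are not real words.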

intermediate

What does lemmatization do differently from stemming?

Lemmatization reduces a word to its base or dictionary form, called a lemma, taking the word's meaning and part of speech into account. For example, 'better' becomes 'good'. Unlike stemming, it returns real dictionary words.
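
A minimal sketch of dictionary-based lemmatization. The `LEMMAS` table and the `toy_lemmatize` function are hypothetical stand-ins for the large lexicon and part-of-speech tagger that a real lemmatizer relies on.

```python
# Tiny hand-written lookup table standing in for a real lexicon.
LEMMAS = {
    ("better", "ADJ"): "good",
    ("cats", "NOUN"): "cat",
    ("are", "VERB"): "be",
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
}

def toy_lemmatize(word, pos):
    """Look up (word, part of speech); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(toy_lemmatize("better", "ADJ"))    # irregular form resolved by lookup
print(toy_lemmatize("running", "VERB"))  # verb reading
print(toy_lemmatize("running", "NOUN"))  # noun reading keeps the word as-is
```

The two entries for 'running' show why context matters: the same surface form maps to different lemmas depending on its part of speech, something a suffix-stripping stemmer cannot distinguish.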

beginner

Why is text preprocessing important before training a machine learning model?

Text preprocessing cleans and simplifies text data so that models can learn patterns more easily. It reduces noise, merges variations of the same word, and often improves model accuracy.
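
One way to see the effect of "merging variations" is to count distinct tokens before and after a simple normalization step. The sketch below uses lowercasing only; the `vocabulary` helper and the sample text are made up for illustration.

```python
import re
from collections import Counter

def vocabulary(text, normalize):
    """Count word tokens, optionally lowercasing them first."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if normalize:
        tokens = [t.lower() for t in tokens]
    return Counter(tokens)

text = "Cats run. cats RUN! The cat runs."
raw = vocabulary(text, normalize=False)
clean = vocabulary(text, normalize=True)

# Fewer distinct tokens means each one is seen more often during training.
print(len(raw), len(clean))
print(clean["cats"], clean["run"])
```

Adding stemming or lemmatization on top of lowercasing would shrink the vocabulary further by merging 'cats' with 'cat' and 'runs' with 'run'.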

intermediate

Give an example of tokenization, stemming, and lemmatization for the sentence: 'The cats are running happily'.

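Tokenization: ['The', 'cats', 'are', 'running', 'happily']
Stemming: ['The', 'cat', 'are', 'run', 'happili']
Lemmatization: ['The', 'cat', 'be', 'run', 'happily']
Exact output depends on the tool: some stemmers and lemmatizers also lowercase tokens, so 'The' may come back as 'the'.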

What is the main goal of tokenization?

A. Calculate word frequency

B. Split text into smaller pieces called tokens

C. Convert words to their base form

D. Remove stop words from text

Which technique removes word endings without considering meaning?

A. Stemming

B. Lemmatization

C. Tokenization

D. Parsing

Which method uses word meaning and context to find the base form?

A. Normalization

B. Stemming

C. Tokenization

D. Lemmatization

Why do we preprocess text before machine learning?

A. To make text longer

B. To translate text to another language

C. To clean and simplify text for better learning

D. To encrypt the text

Which of these is NOT a typical step in text preprocessing?

A. Image resizing

B. Stemming

C. Lemmatization

D. Tokenization

Describe the differences between tokenization, stemming, and lemmatization in text preprocessing.

Explain why text preprocessing is important before feeding text data into a machine learning model.