Text preprocessing (tokenization, stemming, lemmatization) in ML Python - Model Metrics & Evaluation

For text preprocessing steps such as tokenization, stemming, and lemmatization, the key metrics measure the accuracy of text normalization: how well the processed text preserves the intended meaning and structure for the model. Typical metrics are tokenization accuracy (how often text is split correctly) and normalization accuracy (how often words are reduced to the correct base form). These matter because poor preprocessing feeds noisy or misleading input to the model, hurting its performance.
While confusion matrices are common for classification, here we use a token-level accuracy table to show preprocessing quality.
Tokenization Results:
+----------------+----------------+-------+
| Original Token | Correct Split? | Count |
+----------------+----------------+-------+
| "running"      | Yes            |   100 |
| "runing"       | No             |     5 |
+----------------+----------------+-------+
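Token-level accuracy like the table above can be computed by comparing a tokenizer's output against hand-labelled gold splits. A minimal sketch (the sample sentences and gold tokenizations are illustrative, not from a real corpus):

```python
# Sketch: sentence-level tokenization accuracy against a gold standard.
def tokenization_accuracy(predicted, gold):
    """Fraction of sentences whose token split exactly matches the gold split."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

gold = [["running", "fast"], ["don't", "stop"]]
predicted = [["running", "fast"], ["don", "'t", "stop"]]  # second split disagrees
print(tokenization_accuracy(predicted, gold))  # 0.5
```

A stricter variant could score partial credit per token (e.g. with an alignment), but exact-match per sentence is the simplest starting point.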
Stemming/Lemmatization Results:
+---------------+---------------+-------+
| Original Word | Correct Base? | Count |
+---------------+---------------+-------+
| "running"     | "run" (Yes)   |   100 |
| "ran"         | "run" (Yes)   |    50 |
| "runs"        | "run" (Yes)   |    80 |
| "runned"      | Incorrect     |     3 |
+---------------+---------------+-------+
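Stemming/lemmatization accuracy can be measured the same way: run the normalizer over words with known gold base forms and count matches. A minimal sketch using a deliberately naive suffix-stripping stemmer (the gold pairs mirror the table above; a real evaluation would use a larger annotated lexicon, and a real system would use a library stemmer or lemmatizer):

```python
# Sketch: how often a normalizer maps words to the expected base form.
def normalization_accuracy(normalize, gold_pairs):
    correct = sum(normalize(word) == base for word, base in gold_pairs)
    return correct / len(gold_pairs)

# Naive suffix-stripping stemmer, for illustration only.
def naive_stem(word):
    for suffix in ("ning", "ing", "ned", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

gold = [("running", "run"), ("runs", "run"), ("ran", "run"), ("runned", "run")]
print(normalization_accuracy(naive_stem, gold))  # 0.75: "ran" is left unstemmed
```

Note how the irregular form "ran" defeats suffix rules entirely; that is exactly the kind of error a dictionary-based lemmatizer avoids.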
In text preprocessing, the tradeoff is between over-normalization and under-normalization.
- Under-normalization (High Precision, Low Recall): A conservative stemmer or lemmatizer changes only words it is sure about. The base forms it produces are almost always correct (high precision), but many inflected variants are left unprocessed (low recall).
- Over-normalization (Low Precision, High Recall): An aggressive stemmer changes many words, catching most variants (high recall) but also producing wrong base forms (lower precision).
Example: For a search engine, high recall is better to find all relevant documents, even if some words are wrongly normalized. For a grammar checker, high precision is better to avoid false corrections.
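This tradeoff can be made concrete by treating each normalization as a decision: precision is the fraction of changed words that got the right base form, and recall is the fraction of words needing normalization that were changed correctly. A sketch with a hypothetical, deliberately conservative normalizer (the word list and gold forms are illustrative):

```python
# Sketch: precision/recall for a normalizer over gold (word, base) pairs.
def normalization_pr(normalize, gold_pairs):
    changed = [(w, b) for w, b in gold_pairs if normalize(w) != w]
    correct = [(w, b) for w, b in changed if normalize(w) == b]
    needs_change = [(w, b) for w, b in gold_pairs if w != b]
    precision = len(correct) / len(changed) if changed else 1.0
    recall = len(correct) / len(needs_change) if needs_change else 1.0
    return precision, recall

# Conservative normalizer: only touches the one word it knows.
normalize = lambda w: {"running": "run"}.get(w, w)
gold = [("running", "run"), ("ran", "run"), ("runs", "run"), ("cat", "cat")]
p, r = normalization_pr(normalize, gold)
print(p, r)  # precision 1.0, recall ~0.33: every change correct, most variants missed
```

Swapping in an aggressive stemmer would push recall up and precision down, which is the other side of the tradeoff described above.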
Good preprocessing:
- Tokenization accuracy > 98% (almost all words split correctly)
- Stemming/Lemmatization accuracy > 90% (most words reduced correctly)
- Low error rate in base form assignment
Bad preprocessing:
- Tokenization accuracy < 90% (many words incorrectly split or joined)
- Stemming/Lemmatization accuracy < 70% (many words wrongly changed)
- High noise in processed text causing model confusion
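The rules of thumb above can be turned into a simple quality gate in an evaluation script. A sketch (the threshold values come straight from the lists above; the function name is illustrative):

```python
# Sketch: gate a preprocessing pipeline on the quality thresholds above.
def preprocessing_ok(tokenization_acc, normalization_acc):
    """True if both accuracies clear the 'good preprocessing' bars."""
    return tokenization_acc > 0.98 and normalization_acc > 0.90

print(preprocessing_ok(0.99, 0.60))  # False: normalization is too weak
print(preprocessing_ok(0.99, 0.95))  # True
```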
- Ignoring context: Stemmers operate on isolated words, so they can truncate words incorrectly (e.g. "universal" to "univers") and distort meaning.
- Overfitting preprocessing: Tailoring stemming rules too much to training data can fail on new text.
- Data leakage: Using test data to tune preprocessing can inflate accuracy falsely.
- Accuracy paradox: High tokenization accuracy may hide poor lemmatization quality.
- Ignoring downstream impact: Good preprocessing metrics alone don't guarantee better model results.
Your text preprocessing pipeline has 99% tokenization accuracy but only 60% lemmatization accuracy. Is this good enough for a sentiment analysis model? Why or why not?
Answer: Not necessarily. While tokenization is excellent, 60% lemmatization accuracy means many words are not properly normalized. The model may then treat variants of the same word (e.g. "love", "loved", "loving") as unrelated features, weakening sentiment detection. Improving the lemmatizer, or switching to a more accurate normalization method, would likely help.