
Why preprocessing cleans raw text in NLP - Why Metrics Matter

Metrics & Evaluation - Why preprocessing cleans raw text
Which metric matters for this concept and WHY

When working with raw text, preprocessing helps improve model accuracy and F1 score. These metrics capture how well the model handles cleaned text versus messy input. Preprocessing removes noise such as typos, extra whitespace, and irrelevant symbols, making the text clearer. Clearer input helps the model make better predictions, so accuracy and F1 score rise.
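As a minimal sketch of the kind of cleaning described above, here is a hypothetical rule-based cleaner (`clean_text` is an illustrative helper, not a standard API). It lowercases, strips irrelevant symbols, and collapses extra whitespace; fixing actual typos would need an additional step such as a spell checker, which is not shown.

```python
import re

def clean_text(text):
    """Minimal rule-based cleaner: lowercase, strip symbols, collapse spaces."""
    text = text.lower()                       # normalize case
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop irrelevant symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
    return text

print(clean_text("  FREE!!!   W1nner??  Claim   now... "))
# -> "free w1nner claim now"
```

Real pipelines often add steps like tokenization, stop-word removal, or lemmatization on top of this, but the idea is the same: reduce noise before the model sees the text.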

Confusion matrix or equivalent visualization (ASCII)
      Confusion Matrix Example Before and After Preprocessing:

      Before Preprocessing:
      -----------------
      | TP=70 | FP=30 |
      | FN=40 | TN=60 |
      -----------------

      After Preprocessing:
      -----------------
      | TP=85 | FP=15 |
      | FN=20 | TN=80 |
      -----------------

      Total samples = 200

      Explanation:
      - TP: True Positives (correct positive predictions)
      - FP: False Positives (wrong positive predictions)
      - FN: False Negatives (missed positive cases)
      - TN: True Negatives (correct negative predictions)

      Preprocessing reduces errors (FP and FN), improving model results.
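The improvement can be checked directly from the matrices above. This sketch computes accuracy and F1 from the given counts (the helper functions are illustrative, not a library API):

```python
def accuracy(tp, fp, fn, tn):
    # share of all predictions that were correct
    return (tp + tn) / (tp + fp + fn + tn)

def f1(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall
    return 2 * tp / (2 * tp + fp + fn)

before = dict(tp=70, fp=30, fn=40, tn=60)
after  = dict(tp=85, fp=15, fn=20, tn=80)

print(f"accuracy: {accuracy(**before):.3f} -> {accuracy(**after):.3f}")  # 0.650 -> 0.825
print(f"F1:       {f1(70, 30, 40):.3f} -> {f1(85, 15, 20):.3f}")         # 0.667 -> 0.829
```

Both metrics rise because preprocessing shrinks the error cells (FP and FN) while the total sample count stays at 200.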
    
Precision vs Recall tradeoff with concrete examples

Preprocessing affects precision and recall in text models:

  • Precision is the fraction of predicted-positive texts that are actually positive. Cleaning text reduces false alarms (FP), so precision improves.
  • Recall is the fraction of actual positive texts the model finds. Removing noise helps the model catch more true positives, raising recall.

Example: In spam detection, preprocessing removes weird characters and fixes typos. This helps the model avoid marking good emails as spam (higher precision) and catch more spam emails (higher recall).
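Plugging the confusion-matrix counts from earlier into the two definitions makes the effect concrete (a small sketch; the helper functions are illustrative):

```python
def precision(tp, fp):
    # share of flagged texts that are truly positive
    return tp / (tp + fp)

def recall(tp, fn):
    # share of true positives the model actually finds
    return tp / (tp + fn)

# Before: TP=70, FP=30, FN=40; After: TP=85, FP=15, FN=20
print(f"precision: {precision(70, 30):.2f} -> {precision(85, 15):.2f}")  # 0.70 -> 0.85
print(f"recall:    {recall(70, 40):.2f} -> {recall(85, 20):.2f}")        # 0.64 -> 0.81
```

In this example both metrics improve at once, since preprocessing reduces FP and FN simultaneously; in general, tuning a model often trades one against the other.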

What "good" vs "bad" metric values look like for this use case

Good metrics after preprocessing:

  • Accuracy > 85%
  • Precision > 80%
  • Recall > 75%
  • F1 score > 77%

Bad metrics without preprocessing:

  • Accuracy < 70%
  • Precision < 60%
  • Recall < 50%
  • F1 score < 55%

Low scores mean the model struggles with messy text and makes many mistakes.
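The F1 thresholds above follow from the precision and recall thresholds, since F1 is their harmonic mean. A quick check:

```python
def f1_from_pr(p, r):
    # F1 is the harmonic mean of precision (p) and recall (r)
    return 2 * p * r / (p + r)

# At the "good" thresholds (precision 0.80, recall 0.75):
print(round(f1_from_pr(0.80, 0.75), 3))  # 0.774 -- consistent with "F1 > 77%"

# At the "bad" thresholds (precision 0.60, recall 0.50):
print(round(f1_from_pr(0.60, 0.50), 3))  # 0.545 -- consistent with "F1 < 55%"
```

Because the harmonic mean is dragged toward the lower of the two values, a model cannot hide a weak recall behind a strong precision in its F1 score.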

Metrics pitfalls (accuracy paradox, data leakage, overfitting indicators)
  • Accuracy paradox: High accuracy can be misleading if the dataset is unbalanced. For example, if most texts are negative, a model guessing "negative" always looks accurate but fails to find positives.
  • Data leakage: If preprocessing uses information from test data, metrics look better but the model won't work well on new data.
  • Overfitting: Over-cleaning text (like removing too many words) can cause the model to memorize training data and perform poorly on new text, lowering recall and F1 score.
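The accuracy paradox from the first bullet can be reproduced in a few lines. This sketch uses a made-up imbalanced dataset (95% negative) and a trivial model that always predicts "negative":

```python
# Imbalanced dataset: 95 negatives (0), 5 positives (1)
labels = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a "model" that guesses negative every time

accuracy = sum(p == y for p, y in zip(always_negative, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(always_negative, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(always_negative, labels))
recall = tp / (tp + fn)

print(accuracy, recall)  # 0.95 0.0 -- high accuracy, yet zero positives found
```

This is why accuracy alone is never enough on skewed data: recall (or F1) exposes that the model finds nothing.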
Self-check: Your model has 98% accuracy but 12% recall on spam. Is it good?

No, this model is not good for spam detection. The very low recall (12%) means it misses most spam emails, even though accuracy is high. This typically happens when most emails are not spam, so a model that usually guesses "not spam" still scores high accuracy. Preprocessing can help improve recall by cleaning text so the model detects spam more reliably.
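One set of hypothetical counts that reproduces exactly these self-check numbers (the counts are invented for illustration, not taken from a real dataset):

```python
# Hypothetical: 10,000 emails, 100 of them spam.
# The model catches 12 spam emails and falsely flags 112 genuine ones.
tp, fn = 12, 88              # spam found vs spam missed
fp = 112                     # genuine emails wrongly flagged
tn = 10_000 - tp - fn - fp   # 9788 genuine emails correctly passed

accuracy = (tp + tn) / 10_000
recall = tp / (tp + fn)

print(accuracy, recall)  # 0.98 0.12 -- looks accurate, yet misses 88% of spam
```

The 9,900 non-spam emails dominate the accuracy number, which is exactly the accuracy paradox from the pitfalls list above.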

Key Result
Preprocessing raw text improves model accuracy and F1 score by reducing noise, which helps the model make better predictions.