Text preprocessing for RNNs in PyTorch - Model Metrics & Evaluation
When preparing text for RNNs, the key things to watch are sequence-length consistency and vocabulary coverage. These ensure the model receives clean, uniform input sequences and recognizes the words it sees. For model evaluation, accuracy and loss during training indicate whether preprocessing helped the RNN learn well.
Example confusion matrix for text classification after preprocessing:

                 Predicted
                 Pos    Neg
Actual   Pos      85     15
         Neg      10     90

TP = 85, FP = 10, TN = 90, FN = 15
Total samples = 85 + 10 + 90 + 15 = 200
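These counts can be turned into the standard metrics with a few lines of plain Python (the values are taken directly from the matrix above):

```python
# Confusion-matrix counts from the example above
TP, FP, TN, FN = 85, 10, 90, 15

total = TP + FP + TN + FN                           # 200 samples
accuracy = (TP + TN) / total                        # correct / all
precision = TP / (TP + FP)                          # flagged-as-spam that truly are
recall = TP / (TP + FN)                             # actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.875 precision=0.895 recall=0.850 f1=0.872
```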
In text tasks such as spam detection, precision measures how many flagged messages are truly spam; high precision avoids marking good emails as spam.
Recall measures how many actual spam messages are caught; high recall avoids missing spam.
Preprocessing affects this tradeoff: poor tokenization or missing vocabulary words can lower recall by hiding spam clues, while overly aggressive cleaning may strip informative words, hurting precision.
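A minimal sketch of how limited vocabulary coverage hides spam clues: a vocabulary fitted on training text maps unseen words to an unknown token, so their signal never reaches the model. The corpus, frequency cutoff, and token ids here are made up for illustration.

```python
from collections import Counter

# Toy training corpus; in practice the vocabulary is built from the training split only
train_texts = ["win a free prize now", "meeting at noon", "free prize inside"]
counts = Counter(tok for text in train_texts for tok in text.split())

# Keep only words seen at least twice; everything rarer becomes <unk>
vocab = {"<pad>": 0, "<unk>": 1}
for word, n in counts.items():
    if n >= 2:
        vocab[word] = len(vocab)

def encode(text):
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

# "lottery" and "jackpot" were never seen in training, so their spam
# signal collapses into <unk> -- this is how poor coverage lowers recall
print(encode("free lottery jackpot prize"))  # [2, 1, 1, 3]
```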
Good preprocessing leads to:
- High accuracy (e.g., >85%) on validation data
- Balanced precision and recall (both >80%)
- Stable loss decreasing over epochs
Bad preprocessing causes:
- Low accuracy (<60%) or unstable training
- Very low recall or precision (e.g., <50%)
- Overfitting or underfitting signs
- Accuracy paradox: High accuracy can be misleading if classes are imbalanced (e.g., mostly non-spam emails).
- Data leakage: Using test data during preprocessing (like fitting the tokenizer on all data) artificially inflates metrics.
- Overfitting: Very low training loss but high validation loss means the preprocessing or model is too tailored to the training data.
- Ignoring sequence length: Failing to pad or truncate sequences consistently produces ragged inputs and poor model performance.
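The last pitfall above can be avoided with a small padding/truncation step so every batch has a uniform shape. This is a plain-Python sketch (the helper name, pad id, and token ids are assumptions; in PyTorch, `torch.nn.utils.rnn.pad_sequence` does the padding part):

```python
# Hypothetical helper: pad or truncate token-id lists to a fixed length
# so every sequence in a batch has the same shape (pad id 0 assumed)
def pad_or_truncate(ids, max_len, pad_id=0):
    ids = ids[:max_len]                            # truncate long sequences
    return ids + [pad_id] * (max_len - len(ids))   # pad short ones

batch = [[5, 9, 2], [7], [3, 4, 8, 6, 1, 2]]
padded = [pad_or_truncate(seq, max_len=4) for seq in batch]
print(padded)  # [[5, 9, 2, 0], [7, 0, 0, 0], [3, 4, 8, 6]]
```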
Your RNN text classifier has 98% accuracy but only 12% recall on spam messages. Is it good for production? Why or why not?
Answer: No. The model misses 88% of spam messages (12% recall), which defeats the purpose of a spam filter. The 98% accuracy is misleading because most emails are not spam: the model scores well simply by predicting non-spam, while failing at the actual task of catching spam.
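The scenario's numbers can be reproduced with simple arithmetic. The counts below describe a hypothetical test set (10,000 emails, 2% spam) chosen to match the stated 98% accuracy and 12% recall:

```python
# Hypothetical imbalanced test set: 10,000 emails, 200 spam (2%)
TP, FN = 24, 176     # only 24 of 200 spam caught -> recall = 12%
FP, TN = 24, 9776    # nearly all non-spam predicted correctly

total = TP + FN + FP + TN
accuracy = (TP + TN) / total   # 0.98 -- looks great on paper
recall = TP / (TP + FN)        # 0.12 -- 88% of spam slips through
print(accuracy, recall)        # 0.98 0.12
```

A majority-class baseline that predicts "non-spam" for everything would score 98% accuracy on this set too, which is why recall (not accuracy) is the metric to watch here.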