When preparing text for RNNs, the key preprocessing metrics to watch are sequence length consistency and vocabulary coverage. These ensure the model receives clean, uniform input sequences and recognizes the words it sees. For model evaluation, validation accuracy and loss show whether preprocessing helped the RNN learn well.
Text preprocessing for RNNs in PyTorch - Model Metrics & Evaluation
Example confusion matrix for text classification after preprocessing:
             Predicted Pos   Predicted Neg
Actual Pos        85              15
Actual Neg        10              90
TP=85, FP=10, TN=90, FN=15
Total samples = 85+10+90+15 = 200
In text tasks, like spam detection, precision means how many flagged messages are truly spam. High precision avoids marking good emails as spam.
Recall means how many actual spam messages are caught. High recall avoids missing spam.
Preprocessing affects this tradeoff: poor tokenization or missing words can lower recall by hiding spam clues. Overly aggressive cleaning might remove important words, hurting precision.
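As a quick worked example, the definitions above can be applied to the counts from the example confusion matrix (TP=85, FP=10, TN=90, FN=15). This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Counts taken from the example confusion matrix (positive class = "Pos").
tp, fn = 85, 15   # actual positives: caught vs. missed
fp, tn = 10, 90   # actual negatives: wrongly flagged vs. correctly passed

precision = tp / (tp + fp)                   # of everything flagged, how much was right
recall = tp / (tp + fn)                      # of all real positives, how much was caught
accuracy = (tp + tn) / (tp + fp + tn + fn)   # overall fraction of correct predictions

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
# precision=0.895 recall=0.850 accuracy=0.875
```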
Good preprocessing leads to:
- High accuracy (e.g., >85%) on validation data
- Balanced precision and recall (both >80%)
- Stable loss decreasing over epochs
Bad preprocessing causes:
- Low accuracy (<60%) or unstable training
- Very low recall or precision (e.g., <50%)
- Overfitting or underfitting signs
Common pitfalls:
- Accuracy paradox: High accuracy can be misleading if classes are imbalanced (e.g., many non-spam emails).
- Data leakage: Using test data during preprocessing (like fitting the tokenizer on all data) falsely inflates metrics.
- Overfitting: Very low training loss but high validation loss means the preprocessing or the model is too tailored to the training data.
- Ignoring sequence length: Not padding/truncating sequences properly can cause inconsistent input and poor model performance.
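One way to avoid the data leakage pitfall is to build the vocabulary from the training split only and then measure how much of the validation text it covers. The sketch below assumes the data is already tokenized; the toy documents, the `<pad>`/`<unk>` tokens, and the helper names `build_vocab` and `coverage` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical, already-tokenized splits used only for illustration.
train_docs = [["free", "prize", "now"],
              ["meeting", "at", "noon"],
              ["free", "lunch", "at", "noon"]]
val_docs = [["free", "meeting", "tomorrow"],
            ["prize", "draw", "now"]]

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens

def build_vocab(docs, min_freq=1):
    """Map each token seen at least min_freq times to an integer id."""
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in counts.items():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def coverage(docs, vocab):
    """Fraction of tokens in docs that have an entry in vocab."""
    tokens = [tok for doc in docs for tok in doc]
    return sum(tok in vocab for tok in tokens) / len(tokens)

vocab = build_vocab(train_docs)           # fit on the training split only
val_coverage = coverage(val_docs, vocab)  # validation data is only measured, never fitted
print(f"validation coverage: {val_coverage:.2f}")  # validation coverage: 0.67
```

Tokens that fall outside the vocabulary ("tomorrow" and "draw" here) would be mapped to `<unk>` at encoding time, which is what the model will also face on genuinely new text.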
Your RNN text classifier has 98% accuracy but only 12% recall on spam messages. Is it good for production? Why or why not?
Answer: No, it is not good. The model misses most spam messages (low recall), which is critical for spam detection. High accuracy is misleading here because most emails are not spam, so the model just predicts non-spam well but fails to catch spam.
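To make the answer concrete, here is a small sketch with hypothetical counts that reproduce the scenario (10,000 emails, 200 of them spam). The numbers are invented for illustration and are not taken from a real dataset.

```python
# Hypothetical counts chosen to give 98% accuracy and 12% recall.
tp, fn = 24, 176      # spam caught vs. spam missed
fp, tn = 24, 9776     # good mail wrongly flagged vs. correctly passed

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

# Baseline: a "model" that never predicts spam is right on every non-spam email.
baseline_accuracy = (fp + tn) / total

print(f"accuracy={accuracy:.2%} recall={recall:.2%} "
      f"never-spam baseline={baseline_accuracy:.2%}")
# accuracy=98.00% recall=12.00% never-spam baseline=98.00%
```

With these counts, a classifier that never flags anything reaches the same accuracy, which is why accuracy alone says little on imbalanced data.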
Practice
Solution
Step 1: Understand RNN input requirements
RNNs work with sequences of numbers, not raw text strings.
Step 2: Role of tokenization
Splitting text into tokens converts sentences into smaller units that can be mapped to numbers.
Final Answer:
Because RNNs process sequences of numbers, not raw text -> Option A
Quick Check:
Tokenization = Convert text to numbers [OK]
Common mistakes:
- Thinking tokens are for making text prettier
- Believing tokenization reduces dataset size
- Confusing tokens with characters
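The two steps in this solution can be sketched as follows. The tokenization rule (lowercase, then keep runs of letters, digits, and apostrophes), the toy vocabulary, and the function names are assumptions made for illustration; real projects often use a library tokenizer instead.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Hypothetical toy vocabulary; ids 0 and 1 are reserved here by convention.
vocab = {"<pad>": 0, "<unk>": 1, "win": 2, "a": 3, "free": 4, "prize": 5}

def encode(text, vocab):
    """Convert raw text to a list of integer ids, using <unk> for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

tokens = tokenize("Win a FREE prize today!")
ids = encode("Win a FREE prize today!", vocab)
print(tokens)  # ['win', 'a', 'free', 'prize', 'today']
print(ids)     # [2, 3, 4, 5, 1]
```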
Solution
Step 1: Identify PyTorch padding utilities
PyTorch provides pad_sequence in torch.nn.utils.rnn to pad variable-length sequences.
Step 2: Check other options
Functions like torch.tensor.pad or torch.nn.pad do not exist, and torch.pad_sequences is not a PyTorch function. (torch.nn.functional.pad does exist, but it pads a single tensor by fixed amounts rather than batching sequences of different lengths.)
Final Answer:
torch.nn.utils.rnn.pad_sequence -> Option A
Quick Check:
Use pad_sequence to pad RNN inputs [OK]
Common mistakes:
- Using non-existent torch.pad_sequences
- Confusing tensor.pad with pad_sequence
- Trying to pad manually without this function
import torch
from torch.nn.utils.rnn import pad_sequence

seq1 = torch.tensor([1, 2, 3])
seq2 = torch.tensor([4, 5])
seq3 = torch.tensor([6])
batch = pad_sequence([seq1, seq2, seq3], batch_first=True, padding_value=0)
print(batch.shape)
Solution
Step 1: Understand input sequences
Sequences have lengths 3, 2, and 1 respectively.
Step 2: pad_sequence with batch_first=True
All sequences are padded to length 3 (the max length), and the batch dimension comes first, so the shape is (3 sequences, 3 elements each).
Final Answer:
(3, 3) -> Option C
Quick Check:
Batch size = 3, max seq length = 3 [OK]
Common mistakes:
- Confusing batch_first=True with batch_first=False
- Assuming padding adds length beyond max sequence
- Mixing up batch and sequence dimensions
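Beyond the shape, it can help to look at what the padded batch actually contains. The sketch below reuses the same three sequences and assumes id 0 is reserved for padding, so a simple comparison against 0 recovers which positions hold real tokens; the names `lengths` and `mask` are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
batch = pad_sequence(seqs, batch_first=True, padding_value=0)

lengths = torch.tensor([len(s) for s in seqs])  # true lengths before padding
mask = batch != 0                               # True where a real token sits (0 = padding)

print(batch)
# tensor([[1, 2, 3],
#         [4, 5, 0],
#         [6, 0, 0]])
print(lengths)          # tensor([3, 2, 1])
print(mask.sum(dim=1))  # tensor([3, 2, 1])
```

Keeping the lengths or the mask around makes it possible to ignore padded positions later, for example when averaging per-token outputs.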
import torch
from torch.nn.utils.rnn import pad_sequence

sentences = [[1, 2, 3, 4], [5, 6], [7]]
tensors = [torch.tensor(s) for s in sentences]
padded = pad_sequence(tensors)
print(padded.shape)
Solution
Step 1: Check pad_sequence default behavior
By default, pad_sequence returns a tensor with shape (max_seq_len, batch_size), not batch first.
Step 2: Effect on output shape
Without batch_first=True, the printed shape will be (4, 3) instead of the expected batch-first shape (3, 4).
Final Answer:
pad_sequence is missing batch_first=True, so shape is unexpected -> Option B
Quick Check:
Use batch_first=True for (batch, seq_len) shape [OK]
Common mistakes:
- Assuming pad_sequence returns a batch-first tensor by default
- Thinking torch.tensor can't convert lists
- Believing padding_value is mandatory
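A short side-by-side check of the two layouts, using the same `sentences` as in the question, illustrates the fix. The variable names `default_layout` and `batch_layout` are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sentences = [[1, 2, 3, 4], [5, 6], [7]]
tensors = [torch.tensor(s) for s in sentences]

default_layout = pad_sequence(tensors)                  # (max_seq_len, batch_size)
batch_layout = pad_sequence(tensors, batch_first=True)  # (batch_size, max_seq_len)

print(default_layout.shape)  # torch.Size([4, 3])
print(batch_layout.shape)    # torch.Size([3, 4])

# The two results hold the same values, just transposed.
print(torch.equal(default_layout.T, batch_layout))  # True
```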
Solution
Step 1: Tokenize text and convert tokens to integers
First, split text into tokens, then map tokens to integers using a vocabulary.
Step 2: Pad sequences and prepare batch tensor
Pad the integer sequences to equal length using pad_sequence with batch_first=True, then feed the tensor batch to the RNN.
Final Answer:
Tokenize text -> Convert tokens to integers -> Pad sequences with pad_sequence(batch_first=True) -> Convert to tensor batch -> Option D
Quick Check:
Tokenize -> Integer map -> Pad -> Batch tensor [OK]
Common mistakes:
- Padding raw text instead of token integers
- Converting raw text directly to tensor
- Padding before converting tokens to integers
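The whole pipeline from this solution can be sketched end to end as below. Everything specific here is an assumption for illustration: the toy texts, whitespace tokenization, the `MAX_LEN` truncation limit, and the embedding and hidden sizes. In a real project the vocabulary would be built from the training split only.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy data and vocabulary, for illustration only.
texts = ["win a free prize now", "meeting at noon", "free lunch"]
vocab = {"<pad>": 0, "<unk>": 1}
for text in texts:
    for tok in text.split():          # step 1: tokenize (simple whitespace split)
        vocab.setdefault(tok, len(vocab))

MAX_LEN = 4  # assumed truncation limit

def encode(text):
    """Step 2: map tokens to integer ids, truncating to MAX_LEN."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]
    return torch.tensor(ids[:MAX_LEN], dtype=torch.long)

# Step 3: pad to a (batch, seq_len) tensor.
batch = pad_sequence([encode(t) for t in texts],
                     batch_first=True, padding_value=vocab["<pad>"])

# Step 4: feed the batch through an embedding layer and an RNN.
embedding = nn.Embedding(len(vocab), 8, padding_idx=vocab["<pad>"])
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
output, hidden = rnn(embedding(batch))

print(batch.shape)   # torch.Size([3, 4])
print(output.shape)  # torch.Size([3, 4, 16])
print(hidden.shape)  # torch.Size([1, 3, 16])
```

Note that `batch_first=True` is set on both `pad_sequence` and `nn.RNN`, so the two agree on the (batch, seq_len) layout.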
