
Text preprocessing for RNNs in PyTorch - ML Experiment: Train & Evaluate

Experiment - Text preprocessing for RNNs
Problem: You want to prepare text data so an RNN model can learn from it. Currently, the text is raw and not ready for the model.
Current Metrics: N/A - preprocessing stage before training
Issue: The text data is not converted into numerical sequences, and sequences are not padded to the same length. This causes errors or poor training results.
Your Task
Convert raw text sentences into padded numerical sequences suitable for RNN input.
Use PyTorch and standard Python libraries only.
Do not change the model architecture or training code.
Focus only on preprocessing steps.
Solution
PyTorch
import torch
from torch.nn.utils.rnn import pad_sequence

# Sample raw text data
sentences = [
    "hello how are you",
    "I am fine thank you",
    "how about you",
    "I am doing well"
]

# Step 1: Tokenize sentences into words
tokenized_sentences = [sentence.lower().split() for sentence in sentences]

# Step 2: Build vocabulary
vocab = {"<PAD>": 0, "<UNK>": 1}  # Padding and unknown tokens
for sentence in tokenized_sentences:
    for word in sentence:
        if word not in vocab:
            vocab[word] = len(vocab)

# Step 3: Convert sentences to sequences of integers
sequences = []
for sentence in tokenized_sentences:
    seq = [vocab.get(word, vocab["<UNK>"]) for word in sentence]
    sequences.append(torch.tensor(seq, dtype=torch.long))

# Step 4: Pad sequences to the same length
padded_sequences = pad_sequence(sequences, batch_first=True, padding_value=vocab["<PAD>"])

# Output the padded sequences tensor
print("Vocabulary:", vocab)
print("Padded sequences tensor shape:", padded_sequences.shape)
print(padded_sequences)
Tokenized raw text into word lists.
Created a vocabulary dictionary mapping words to unique integers.
Converted each sentence into a sequence of integers using the vocabulary.
Padded all sequences to the same length with a padding token.
Converted sequences into PyTorch tensors for model input.
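The same vocabulary can also encode sentences it has not seen before. The following is a minimal sketch, assuming a small helper named `encode` (a hypothetical name, not part of the solution above) and a hand-written subset of the vocabulary:

```python
import torch

# Subset of the vocabulary built in the solution (assumed here for brevity).
vocab = {"<PAD>": 0, "<UNK>": 1, "hello": 2, "how": 3, "are": 4, "you": 5}

def encode(sentence, vocab):
    """Hypothetical helper: lowercase, split on whitespace, map words to ids."""
    tokens = sentence.lower().split()
    # Words missing from the vocabulary fall back to the <UNK> index.
    ids = [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]
    return torch.tensor(ids, dtype=torch.long)

encoded = encode("Hello how is you", vocab)
print(encoded)  # tensor([2, 3, 1, 5]); "is" is unknown, so it maps to 1
```

Reserving an `<UNK>` index is what lets the lookup succeed on words that were absent when the vocabulary was built.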
Results Interpretation

Before preprocessing, the model cannot understand raw text strings.
After preprocessing, text is converted into padded integer sequences, ready for RNN input.

Example:

Vocabulary: {'<PAD>': 0, '<UNK>': 1, 'hello': 2, 'how': 3, 'are': 4, 'you': 5, 'i': 6, 'am': 7, 'fine': 8, 'thank': 9, 'about': 10, 'doing': 11, 'well': 12}

Padded sequences tensor shape: torch.Size([4, 5])

Tensor example:
[[ 2 3 4 5 0]
[ 6 7 8 9 5]
[ 3 10 5 0 0]
[ 6 7 11 12 0]]

Text must be converted into numerical sequences of equal length before feeding into RNNs. This allows the model to process batches efficiently and learn from the data.
Bonus Experiment
Try adding a step to convert the padded sequences into embeddings using PyTorch's nn.Embedding layer before feeding them to the RNN.
💡 Hint
Create an nn.Embedding with vocabulary size and embedding dimension, then pass the padded sequences tensor through it to get embedded inputs.
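One way the bonus step might look is sketched below. The embedding dimension of 8 is an assumed value chosen only for illustration, and the padded batch is copied from the example output above.

```python
import torch
import torch.nn as nn

# Padded batch from the example output: 4 sentences, 5 token ids each.
padded_sequences = torch.tensor([
    [2, 3, 4, 5, 0],
    [6, 7, 8, 9, 5],
    [3, 10, 5, 0, 0],
    [6, 7, 11, 12, 0],
])

vocab_size = 13    # 11 words plus <PAD> and <UNK>
embedding_dim = 8  # assumed value, for illustration only

# padding_idx=0 keeps the <PAD> vector at zeros and excludes it from gradient updates.
embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
embedded = embedding(padded_sequences)

print(embedded.shape)  # torch.Size([4, 5, 8])
```

Each integer id is replaced by a learnable vector, so the batch gains a third dimension of size `embedding_dim`.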

Practice

1. Why do we split text into tokens before feeding it to an RNN?
easy
A. Because RNNs process sequences of numbers, not raw text
B. To reduce the size of the dataset
C. To make the text look nicer
D. Because tokens are easier to print

Solution

  1. Step 1: Understand RNN input requirements

    RNNs work with sequences of numbers, not raw text strings.
  2. Step 2: Role of tokenization

    Splitting text into tokens converts sentences into smaller units that can be mapped to numbers.
  3. Final Answer:

    Because RNNs process sequences of numbers, not raw text -> Option A
  4. Quick Check:

    Tokenization = first step in converting text to numbers [OK]
Hint: RNNs need numbers, so split text into tokens first [OK]
Common Mistakes:
  • Thinking tokens are for making text prettier
  • Believing tokenization reduces dataset size
  • Confusing tokens with characters
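To make the difference between word tokens and character tokens concrete, here is a small sketch in plain Python. The sorted vocabulary used for the ids is an assumption made for this example, not the scheme from the solution above.

```python
sentence = "How are you"

# Word-level tokens: split on whitespace after lowercasing.
word_tokens = sentence.lower().split()

# Character-level tokens: every character, including spaces, is a token.
char_tokens = list(sentence.lower())

# Either kind of token is mapped to an integer id through a vocabulary.
word_vocab = {tok: i for i, tok in enumerate(sorted(set(word_tokens)))}
word_ids = [word_vocab[tok] for tok in word_tokens]

print(word_tokens)       # ['how', 'are', 'you']
print(len(char_tokens))  # 11
print(word_ids)          # [1, 0, 2]
```

Both schemes produce a sequence an RNN can consume once the tokens are mapped to ids; they differ in sequence length and vocabulary size.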
2. Which PyTorch function is commonly used to pad sequences to the same length for batch processing?
easy
A. torch.nn.utils.rnn.pad_sequence
B. torch.tensor.pad
C. torch.pad_sequences
D. torch.nn.pad

Solution

  1. Step 1: Identify PyTorch padding utilities

    PyTorch provides pad_sequence in torch.nn.utils.rnn to pad variable-length sequences.
  2. Step 2: Check other options

    torch.tensor.pad, torch.pad_sequences, and torch.nn.pad do not exist in PyTorch. (torch.nn.functional.pad does exist, but it pads a single tensor rather than a list of variable-length sequences.)
  3. Final Answer:

    torch.nn.utils.rnn.pad_sequence -> Option A
  4. Quick Check:

    Use pad_sequence to pad RNN inputs [OK]
Hint: Remember: pad_sequence is in torch.nn.utils.rnn [OK]
Common Mistakes:
  • Using non-existent torch.pad_sequences
  • Confusing tensor.pad with pad_sequence
  • Trying to pad manually without this function
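For comparison, the sketch below pads a batch by hand and checks that the result matches pad_sequence. The sample values are made up for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [torch.tensor([7, 8]), torch.tensor([9]), torch.tensor([10, 11, 12, 13])]

# Manual padding: allocate a tensor filled with the padding value (0),
# then copy each sequence into the start of its row.
max_len = max(len(seq) for seq in sequences)
manual = torch.zeros(len(sequences), max_len, dtype=torch.long)
for row, seq in enumerate(sequences):
    manual[row, :len(seq)] = seq

# pad_sequence produces the same tensor in a single call.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

print(torch.equal(manual, padded))  # True
```

The manual loop works, but pad_sequence is shorter and less error-prone.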
3. Given the following code, what is the shape of the padded batch tensor?
import torch
from torch.nn.utils.rnn import pad_sequence

seq1 = torch.tensor([1, 2, 3])
seq2 = torch.tensor([4, 5])
seq3 = torch.tensor([6])
batch = pad_sequence([seq1, seq2, seq3], batch_first=True, padding_value=0)
print(batch.shape)
medium
A. (1, 3)
B. (3, 1)
C. (3, 3)
D. (3, 6)

Solution

  1. Step 1: Understand input sequences

    Sequences have lengths 3, 2, and 1 respectively.
  2. Step 2: pad_sequence with batch_first=True

    All sequences are padded to length 3 (max length), batch dimension is first, so shape is (3 sequences, 3 elements each).
  3. Final Answer:

    (3, 3) -> Option C
  4. Quick Check:

    Batch size = 3, max seq length = 3 [OK]
Hint: Batch shape = (number sequences, max sequence length) [OK]
Common Mistakes:
  • Confusing batch_first=True with batch_first=False
  • Assuming padding adds length beyond max sequence
  • Mixing up batch and sequence dimensions
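Padding positions carry no information, so it can help to know where they are. The sketch below assumes index 0 is reserved for padding and derives a mask and the original lengths from the padded batch in this question.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seq1 = torch.tensor([1, 2, 3])
seq2 = torch.tensor([4, 5])
seq3 = torch.tensor([6])
batch = pad_sequence([seq1, seq2, seq3], batch_first=True, padding_value=0)

mask = batch != 0          # True where a real token is present
lengths = mask.sum(dim=1)  # number of real tokens in each row

print(batch.tolist())    # [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
print(lengths.tolist())  # [3, 2, 1]
```

This only works if no real token shares the padding index, which is why the vocabulary reserves 0 for `<PAD>`.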
4. What is wrong with this code snippet for preparing text sequences for an RNN?
import torch
from torch.nn.utils.rnn import pad_sequence

sentences = [[1, 2, 3, 4], [5, 6], [7]]
tensors = [torch.tensor(s) for s in sentences]
padded = pad_sequence(tensors)
print(padded.shape)
medium
A. torch.tensor cannot convert lists to tensors
B. pad_sequence is missing batch_first=True, so shape is unexpected
C. pad_sequence requires padding_value argument
D. The input lists must be numpy arrays, not lists

Solution

  1. Step 1: Check pad_sequence default behavior

    By default, pad_sequence returns a tensor with shape (max_seq_len, batch_size), not batch first.
  2. Step 2: Effect on output shape

    Without batch_first=True, the printed shape is (4, 3) instead of the expected batch-first shape (3, 4).
  3. Final Answer:

    pad_sequence is missing batch_first=True, so shape is unexpected -> Option B
  4. Quick Check:

    Use batch_first=True for (batch, seq_len) shape [OK]
Hint: Add batch_first=True when you want the batch as the first dimension [OK]
Common Mistakes:
  • Assuming pad_sequence returns a batch-first tensor by default
  • Thinking torch.tensor can't convert lists
  • Believing padding_value is mandatory
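The two layouts can be compared directly. This sketch reuses the snippet from the question and adds a batch-first call next to the default one.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sentences = [[1, 2, 3, 4], [5, 6], [7]]
tensors = [torch.tensor(s) for s in sentences]

seq_first = pad_sequence(tensors)                      # default: batch_first=False
batch_first = pad_sequence(tensors, batch_first=True)

print(seq_first.shape)    # torch.Size([4, 3])
print(batch_first.shape)  # torch.Size([3, 4])

# Both layouts hold the same values; one is the transpose of the other.
print(torch.equal(seq_first.T, batch_first))  # True
```

Neither layout is wrong in itself. What matters is that it matches the batch_first setting of the layer that consumes the tensor.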
5. You have a batch of sentences tokenized as integer lists of different lengths. You want to feed them into an RNN in PyTorch. Which sequence of steps is correct for preprocessing?
hard
A. Tokenize text -> Pad sequences -> Convert tokens to integers -> Feed to RNN
B. Pad raw text strings -> Tokenize padded strings -> Convert tokens to integers -> Feed to RNN
C. Convert raw text to tensor -> Tokenize tensor -> Pad sequences -> Feed to RNN
D. Tokenize text -> Convert tokens to integers -> Pad sequences with pad_sequence(batch_first=True) -> Convert to tensor batch

Solution

  1. Step 1: Tokenize text and convert tokens to integers

    First, split text into tokens, then map tokens to integers using a vocabulary.
  2. Step 2: Pad sequences and prepare batch tensor

    Pad integer sequences to equal length using pad_sequence with batch_first=True, then feed the tensor batch to the RNN.
  3. Final Answer:

    Tokenize text -> Convert tokens to integers -> Pad sequences with pad_sequence(batch_first=True) -> Convert to tensor batch -> Option D
  4. Quick Check:

    Tokenize -> Integer map -> Pad -> Batch tensor [OK]
Hint: Tokenize first, then integer map, then pad sequences [OK]
Common Mistakes:
  • Padding raw text instead of token integers
  • Converting raw text directly to tensor
  • Padding before converting tokens to integers
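The full order of steps can be sketched end to end as follows. The embedding size (4), hidden size (6), and the use of nn.RNN are assumptions made for this illustration. They are not the model from the experiment.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

sentences = ["hello how are you", "how about you"]

# 1. Tokenize text.
tokenized = [s.lower().split() for s in sentences]

# 2. Convert tokens to integers through a vocabulary.
vocab = {"<PAD>": 0, "<UNK>": 1}
for tokens in tokenized:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
sequences = [
    torch.tensor([vocab.get(tok, vocab["<UNK>"]) for tok in tokens], dtype=torch.long)
    for tokens in tokenized
]

# 3. Pad to a common length; the result is already a batch tensor.
batch = pad_sequence(sequences, batch_first=True, padding_value=vocab["<PAD>"])

# 4. Embed and feed to an RNN (assumed sizes, illustration only).
embedding = nn.Embedding(len(vocab), 4, padding_idx=vocab["<PAD>"])
rnn = nn.RNN(input_size=4, hidden_size=6, batch_first=True)
output, hidden = rnn(embedding(batch))

print(batch.shape)   # torch.Size([2, 4])
print(output.shape)  # torch.Size([2, 4, 6])
print(hidden.shape)  # torch.Size([1, 2, 6])
```

The RNN is created with batch_first=True so that it expects the same (batch, seq_len, features) layout that pad_sequence produced.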