Text preprocessing for RNNs in PyTorch - ML Experiment: Train & Evaluate

Experiment - Text preprocessing for RNNs

Problem: You want to prepare text data so an RNN model can learn from it. Currently, the text is raw and not ready for the model.

Current Metrics: N/A (preprocessing stage, before training)

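Issue: The text has not been converted into numerical sequences, and the sequences have not been padded to a common length. Sequences of different lengths cannot be stacked into one batch tensor, so training either fails with an error or gives poor results.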
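To make the issue concrete, here is a minimal sketch of what happens when integer sequences of different lengths are batched without padding. The token IDs are illustrative values, not output from a real vocabulary.

```python
import torch

# Two integer-encoded sentences of different lengths (IDs are illustrative).
seq_a = torch.tensor([2, 3, 4, 5], dtype=torch.long)
seq_b = torch.tensor([3, 10, 5], dtype=torch.long)

try:
    # torch.stack requires every tensor to have the same shape.
    torch.stack([seq_a, seq_b])
    stacked_ok = True
except RuntimeError as err:
    stacked_ok = False
    print("Cannot batch unequal-length sequences:", err)
```

Padding the shorter sequence to the length of the longest one removes this error.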

Your Task

Convert raw text sentences into padded numerical sequences suitable for RNN input.

Use PyTorch and standard Python libraries only.

Do not change the model architecture or training code.

Focus only on preprocessing steps.

Solution

PyTorch

import torch
from torch.nn.utils.rnn import pad_sequence

# Sample raw text data
sentences = [
    "hello how are you",
    "I am fine thank you",
    "how about you",
    "I am doing well"
]

# Step 1: Tokenize sentences into words
tokenized_sentences = [sentence.lower().split() for sentence in sentences]

# Step 2: Build vocabulary
vocab = {"<PAD>": 0, "<UNK>": 1}  # Padding and unknown tokens
for sentence in tokenized_sentences:
    for word in sentence:
        if word not in vocab:
            vocab[word] = len(vocab)

# Step 3: Convert sentences to sequences of integers
sequences = []
for sentence in tokenized_sentences:
    seq = [vocab.get(word, vocab["<UNK>"]) for word in sentence]
    sequences.append(torch.tensor(seq, dtype=torch.long))

# Step 4: Pad sequences to the same length
padded_sequences = pad_sequence(sequences, batch_first=True, padding_value=vocab["<PAD>"])

# Output the padded sequences tensor
print("Vocabulary:", vocab)
print("Padded sequences tensor shape:", padded_sequences.shape)
print(padded_sequences)

Tokenized raw text into word lists.

Created a vocabulary dictionary mapping words to unique integers.

Converted each sentence into a sequence of integers using the vocabulary.

Padded all sequences to the same length with a padding token.

Stored each integer sequence as a torch.long tensor, so the padded batch is ready for model input.
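The same steps can be wrapped in small helpers so that sentences seen after the vocabulary is built (validation or test data, for example) are encoded the same way. This is only a sketch: the function names build_vocab, encode, and encode_batch are hypothetical and are not part of PyTorch or the solution above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def build_vocab(sentences):
    """Map each lowercase word to a unique integer; 0 and 1 are reserved."""
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Turn one sentence into a 1-D LongTensor; unseen words become <UNK>."""
    ids = [vocab.get(word, vocab["<UNK>"]) for word in sentence.lower().split()]
    return torch.tensor(ids, dtype=torch.long)

def encode_batch(sentences, vocab):
    """Encode several sentences and pad them to the longest one."""
    encoded = [encode(sentence, vocab) for sentence in sentences]
    return pad_sequence(encoded, batch_first=True, padding_value=vocab["<PAD>"])

# Build the vocabulary from "training" sentences only.
vocab = build_vocab(["hello how are you", "how about you"])

# "stranger" was never seen, so it maps to the <UNK> index (1).
batch = encode_batch(["hello stranger", "how are you"], vocab)
print(batch)
```

Building the vocabulary from training data only, and mapping everything else to <UNK>, keeps the index range fixed once the model has been created.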

Results Interpretation

Before preprocessing, the model cannot understand raw text strings.
After preprocessing, text is converted into padded integer sequences, ready for RNN input.

Example:

Vocabulary: {'<PAD>': 0, '<UNK>': 1, 'hello': 2, 'how': 3, 'are': 4, 'you': 5, 'i': 6, 'am': 7, 'fine': 8, 'thank': 9, 'about': 10, 'doing': 11, 'well': 12}

Padded sequences tensor shape: (4, 5)

Tensor example:
[[ 2,  3,  4,  5,  0],
 [ 6,  7,  8,  9,  5],
 [ 3, 10,  5,  0,  0],
 [ 6,  7, 11, 12,  0]]

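Text must be converted into numerical sequences, and the sequences in a batch padded to the same length, before it is fed into an RNN. Equal lengths let the sequences be stacked into a single tensor, so the model can process whole batches efficiently and learn from the data.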
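Padding adds positions that carry no information, so it is often useful to record which positions are real tokens. The sketch below assumes the padding index is 0, as in the vocabulary above, and recovers a boolean mask and the original lengths from the padded tensor shown in the example output.

```python
import torch

PAD_ID = 0  # assumed to match vocab["<PAD>"] from the solution

# The padded batch from the example output above.
padded_sequences = torch.tensor([
    [2, 3, 4, 5, 0],
    [6, 7, 8, 9, 5],
    [3, 10, 5, 0, 0],
    [6, 7, 11, 12, 0],
])

# True where a real token is present, False at padded positions.
mask = padded_sequences != PAD_ID

# Number of real tokens in each sentence.
lengths = mask.sum(dim=1)
print(lengths)  # tensor([4, 5, 3, 4])
```

Lengths like these are what utilities such as torch.nn.utils.rnn.pack_padded_sequence expect, should the training code choose to skip padded positions.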

Bonus Experiment

Try adding a step to convert the padded sequences into embeddings using PyTorch's nn.Embedding layer before feeding them to the RNN.

💡 Hint

Create an nn.Embedding with vocabulary size and embedding dimension, then pass the padded sequences tensor through it to get embedded inputs.
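One possible way to approach the bonus experiment is sketched below. The embedding dimension (8) and hidden size (16) are arbitrary choices made for illustration, and nn.RNN stands in for whatever recurrent model the training code actually uses.

```python
import torch
import torch.nn as nn

# The padded batch from the example output above (batch of 4, length 5).
padded_sequences = torch.tensor([
    [2, 3, 4, 5, 0],
    [6, 7, 8, 9, 5],
    [3, 10, 5, 0, 0],
    [6, 7, 11, 12, 0],
])

vocab_size = 13     # number of entries in the example vocabulary
embedding_dim = 8   # arbitrary, chosen for illustration
hidden_size = 16    # arbitrary, chosen for illustration

# padding_idx=0 keeps the <PAD> vector at zeros and excludes it from gradient updates.
embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
embedded = embedding(padded_sequences)  # shape: (4, 5, 8)

# batch_first=True matches the (batch, seq_len, features) layout of the input.
rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
output, hidden = rnn(embedded)

print(embedded.shape)  # torch.Size([4, 5, 8])
print(output.shape)    # torch.Size([4, 5, 16])
print(hidden.shape)    # torch.Size([1, 4, 16])
```

The embedding layer needs integer (torch.long) indices as input, which is why the preprocessing step stores the sequences with dtype=torch.long.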

Practice

(1/5)

1. Why do we split text into tokens before feeding it to an RNN?

easy

A. Because RNNs process sequences of numbers, not raw text

B. To reduce the size of the dataset

C. To make the text look nicer

D. Because tokens are easier to print
