PyTorch · ~20 mins

Text preprocessing for RNNs in PyTorch - ML Experiment: Train & Evaluate

Experiment - Text preprocessing for RNNs
Problem: You want to prepare text data so an RNN model can learn from it. Currently, the text is raw and not ready for the model.
Current Metrics: N/A - preprocessing stage before training
Issue: The text data is not converted into numerical sequences, and the sequences are not padded to a common length. This causes errors or poor training results.
Your Task
Convert raw text sentences into padded numerical sequences suitable for RNN input.
Use PyTorch and standard Python libraries only.
Do not change the model architecture or training code.
Focus only on preprocessing steps.
Solution
PyTorch
import torch
from torch.nn.utils.rnn import pad_sequence

# Sample raw text data
sentences = [
    "hello how are you",
    "I am fine thank you",
    "how about you",
    "I am doing well"
]

# Step 1: Tokenize sentences into words
tokenized_sentences = [sentence.lower().split() for sentence in sentences]

# Step 2: Build vocabulary
vocab = {"<PAD>": 0, "<UNK>": 1}  # Padding and unknown tokens
for sentence in tokenized_sentences:
    for word in sentence:
        if word not in vocab:
            vocab[word] = len(vocab)

# Step 3: Convert sentences to sequences of integers
sequences = []
for sentence in tokenized_sentences:
    seq = [vocab.get(word, vocab["<UNK>"]) for word in sentence]
    sequences.append(torch.tensor(seq, dtype=torch.long))

# Step 4: Pad sequences to the same length
padded_sequences = pad_sequence(sequences, batch_first=True, padding_value=vocab["<PAD>"])

# Output the padded sequences tensor
print("Vocabulary:", vocab)
print("Padded sequences tensor shape:", padded_sequences.shape)
print(padded_sequences)
Tokenized raw text into word lists.
Created a vocabulary dictionary mapping words to unique integers.
Converted each sentence into a sequence of integers using the vocabulary.
Padded all sequences to the same length with a padding token.
Converted sequences into PyTorch tensors for model input.
Results Interpretation

Before preprocessing, the model cannot understand raw text strings.
After preprocessing, text is converted into padded integer sequences, ready for RNN input.

Example:

Vocabulary: {'<PAD>': 0, '<UNK>': 1, 'hello': 2, 'how': 3, 'are': 4, 'you': 5, 'i': 6, 'am': 7, 'fine': 8, 'thank': 9, 'about': 10, 'doing': 11, 'well': 12}

Padded sequences tensor shape: (4, 5)

Tensor example:
[[ 2  3  4  5  0]
 [ 6  7  8  9  5]
 [ 3 10  5  0  0]
 [ 6  7 11 12  0]]

Text must be converted into numerical sequences of equal length before feeding into RNNs. This allows the model to process batches efficiently and learn from the data.
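As a further sketch of the batching point above (using the same sequences and `<PAD>` index 0 as the solution): padded batches can be combined with PyTorch's pack_padded_sequence so an RNN skips the padded positions entirely. The lengths tensor here is an assumption added for illustration; it is not part of the original solution.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Integer sequences from the solution above (variable lengths)
sequences = [
    torch.tensor([2, 3, 4, 5]),
    torch.tensor([6, 7, 8, 9, 5]),
    torch.tensor([3, 10, 5]),
    torch.tensor([6, 7, 11, 12]),
]

# Record the true length of each sequence before padding
lengths = torch.tensor([len(s) for s in sequences])

# Pad to a common length, as in Step 4 of the solution
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

# Pack the batch so downstream RNN layers ignore padding steps
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

# packed.data holds only the real tokens: 4 + 5 + 3 + 4 = 16
print(packed.data.shape)
```

Packing is optional, but it avoids wasting computation on `<PAD>` positions in longer batches.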
Bonus Experiment
Try adding a step to convert the padded sequences into embeddings using PyTorch's nn.Embedding layer before feeding them to the RNN.
💡 Hint
Create an nn.Embedding with vocabulary size and embedding dimension, then pass the padded sequences tensor through it to get embedded inputs.