
Text preprocessing for RNNs in PyTorch

Introduction
Text preprocessing turns words into numbers so RNNs can understand and learn from text. It is the first step whenever you work with language data, for example:
When teaching a computer to understand sentences.
When building chatbots that reply to messages.
When analyzing customer reviews for sentiment.
When translating languages with a neural network.
When predicting the next word in a sentence.
Syntax
PyTorch
import torch
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence

def tokenize(text):
    return text.lower().split()

# Build vocabulary from tokenized texts
vocab = build_vocab_from_iterator((tokenize(text) for text in dataset), specials=["<pad>"])

# Convert text to list of token ids
def text_to_ids(text):
    return [vocab[token] for token in tokenize(text)]

# Pad sequences to same length
padded_batch = pad_sequence([torch.tensor(text_to_ids(t)) for t in batch], batch_first=True)
Tokenization splits text into words or pieces.
Padding makes all sequences the same length for batch processing.
Examples
Splits the sentence into words: ['hello', 'world']
PyTorch
text = "Hello world"
tokens = tokenize(text)
print(tokens)
Converts words to numbers using the vocabulary.
PyTorch
ids = text_to_ids("Hello world")
print(ids)
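Note that a vocabulary built this way only knows the words it was trained on; looking up an unseen word raises an error unless you add an `<unk>` special and set it as the default index (in torchtext, via `vocab.set_default_index(vocab["<unk>"])`). The idea can be sketched with a plain Python dict (a hypothetical stand-in, not torchtext's actual implementation):

```python
def tokenize(text):
    return text.lower().split()

# Build a toy vocabulary from a small corpus, reserving ids for specials
corpus = ["hello world", "hello there"]
tokens = {tok for text in corpus for tok in tokenize(text)}
vocab = {"<pad>": 0, "<unk>": 1}
for tok in sorted(tokens):
    vocab[tok] = len(vocab)

def text_to_ids(text):
    # .get falls back to the <unk> id for words never seen during building
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

print(text_to_ids("Hello zebra"))  # "zebra" maps to the <unk> id: [2, 1]
```

Without this fallback, one unseen word in a user's input would crash the pipeline.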
Pads shorter sequences with zeros to match the longest.
PyTorch
batch = ["Hello world", "Hi"]
padded = pad_sequence([torch.tensor(text_to_ids(t)) for t in batch], batch_first=True)
print(padded)
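Under the hood, `pad_sequence` with `batch_first=True` is doing nothing more exotic than right-padding each sequence to the length of the longest one before stacking them into a tensor. A plain-Python sketch of that step (a simplified illustration, not PyTorch's actual implementation):

```python
def pad_batch(seqs, pad_id=0):
    # Right-pad every sequence with pad_id up to the longest length
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

print(pad_batch([[2, 3, 4], [5]]))  # [[2, 3, 4], [5, 0, 0]]
```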
Sample Model
This code shows how to turn sentences into numbers and pad them so RNNs can process batches.
PyTorch
import torch
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence

def tokenize(text):
    return text.lower().split()

# Sample dataset
dataset = ["I love machine learning", "Deep learning is fun", "RNNs handle sequences"]

# Build vocabulary
vocab = build_vocab_from_iterator((tokenize(text) for text in dataset), specials=["<pad>"])

# Convert text to token ids
def text_to_ids(text):
    return [vocab[token] for token in tokenize(text)]

# Prepare batch
batch = ["I love machine learning", "RNNs handle sequences"]
token_ids = [torch.tensor(text_to_ids(text)) for text in batch]

# Pad sequences
padded_batch = pad_sequence(token_ids, batch_first=True, padding_value=vocab['<pad>'])

print("Vocabulary tokens:", vocab.get_itos())
print("Token IDs for batch:", token_ids)
print("Padded batch tensor:", padded_batch)
Important Notes
Lowercasing text reduces vocabulary size, though it discards case information.
Padding uses zeros by default; make sure your model ignores padding tokens.
Vocabulary maps words to unique numbers for the model.
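One concrete way to ignore padding tokens is to build a boolean mask marking which positions hold real tokens; in PyTorch itself, `nn.Embedding(padding_idx=...)` and `nn.utils.rnn.pack_padded_sequence` serve the same purpose. A minimal plain-Python sketch of the masking idea:

```python
def padding_mask(batch, pad_id=0):
    # True where the token is real, False where it is padding
    return [[tok != pad_id for tok in seq] for seq in batch]

padded = [[5, 3, 7], [4, 0, 0]]  # second sequence has two pad tokens
print(padding_mask(padded))  # [[True, True, True], [True, False, False]]
```

Such a mask can then be used to zero out loss contributions from padded positions.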
Summary
Text must be split into tokens before feeding to RNNs.
Tokens are converted to numbers using a vocabulary.
Sequences are padded to the same length for batch processing.