How to Use TorchText for NLP with PyTorch
Use torchtext to load and preprocess text data for NLP by creating datasets and iterators with built-in tokenizers and vocabularies. It simplifies text handling by providing tools to tokenize, numericalize, and batch text data for PyTorch models.
Syntax
The basic syntax for using torchtext involves these steps:
- Import modules: Import dataset, transforms, and data utilities.
- Load data: Use built-in datasets or create custom datasets.
- Tokenize: Apply tokenization to split text into words or tokens.
- Build vocabulary: Create a vocabulary mapping tokens to integers.
- Numericalize: Convert tokens to integer indices.
- Create dataloaders: Batch and pad sequences for training.
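The steps above can be sketched in plain Python before introducing torchtext itself. The whitespace tokenizer and dict vocabulary below are illustrative stand-ins, not torchtext APIs, but they mirror the same tokenize → build vocabulary → numericalize flow:

```python
from collections import Counter

# Toy corpus standing in for a dataset of (label, text) pairs
corpus = [(1, "the market rose today"), (2, "the team won the match")]

# Tokenize: split text into tokens (torchtext's tokenizer is more robust)
tokenizer = lambda text: text.lower().split()

# Build vocabulary: count tokens across the corpus
counter = Counter(tok for _, text in corpus for tok in tokenizer(text))

# Reserve index 0 for '<unk>', then assign indices by frequency
vocab = {'<unk>': 0}
for tok, _ in counter.most_common():
    vocab[tok] = len(vocab)

# Numericalize: map tokens to indices, sending unseen words to '<unk>'
numericalize = lambda text: [vocab.get(tok, 0) for tok in tokenizer(text)]
print(numericalize("the market fell"))  # 'fell' is unseen -> index 0
```

In torchtext, `build_vocab_from_iterator` and `Vocab.set_default_index` take over the counting and the unknown-token fallback shown here by hand.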
```python
import torch
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader

# Step 1: Load dataset
train_iter = AG_NEWS(split='train')

# Step 2: Tokenizer
tokenizer = get_tokenizer('basic_english')

# Step 3: Build vocabulary
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

# Step 4: Numericalize and batch
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

# Example batch function
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list, text_list, offsets
```
Example
This example shows how to load the AG_NEWS dataset, tokenize text, build a vocabulary, and prepare batches for training a text classification model.
```python
import torch
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader

# Load training data
train_iter = AG_NEWS(split='train')

# Tokenizer
tokenizer = get_tokenizer('basic_english')

# Build vocabulary
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

# Pipelines
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

# Collate function for DataLoader
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list, text_list, offsets

# Create DataLoader
train_iter = AG_NEWS(split='train')  # reload iterator (iterators are single-use)
train_dataloader = DataLoader(list(train_iter), batch_size=8,
                              shuffle=True, collate_fn=collate_batch)

# Fetch one batch
labels, texts, offsets = next(iter(train_dataloader))
print('Labels:', labels)
print('Texts:', texts)
print('Offsets:', offsets)
```
Output
Labels: tensor([1, 1, 2, 0, 2, 1, 0, 1])
Texts: tensor([ 9, 16, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77])
Offsets: tensor([ 0, 7, 14, 21, 28, 35, 42, 49])
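The offsets in the output above come from how collate_batch packs variable-length sequences: all texts are concatenated into one flat tensor, and offsets records the start position of each sequence (a cumulative sum of lengths, shifted by one). The same bookkeeping can be checked with plain Python lists; the batch below is made-up data, not real AG_NEWS indices:

```python
# Token-index lists of different lengths, standing in for numericalized texts
batch = [[9, 16, 10], [11, 12], [13, 14, 15, 16]]

# Concatenate into one flat list, recording where each sequence starts
flat, offsets = [], []
for seq in batch:
    offsets.append(len(flat))  # start position of this sequence
    flat.extend(seq)

print(flat)     # [9, 16, 10, 11, 12, 13, 14, 15, 16]
print(offsets)  # [0, 3, 5] -- each entry marks where a sequence begins
```

This flat-plus-offsets layout is exactly what PyTorch's EmbeddingBag consumes, which is why the collate function returns it instead of a padded matrix.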
Common Pitfalls
1. Reusing iterators: Torchtext dataset iterators can only be used once. Always reload or convert to list before multiple passes.
2. Missing collate_fn in DataLoader: Without a proper collate function, batching variable-length text sequences will fail.
3. Vocabulary unknown tokens: Always set a default index for unknown tokens to avoid errors during numericalization.
4. Tokenizer mismatch: Use the same tokenizer for building vocabulary and processing data to keep consistency.
```python
from torchtext.datasets import AG_NEWS

# Wrong: Using iterator twice
train_iter = AG_NEWS(split='train')
list1 = list(train_iter)
list2 = list(train_iter)  # This will be empty
print('Length first list:', len(list1))
print('Length second list:', len(list2))  # 0

# Right: Reload iterator
train_iter = AG_NEWS(split='train')
list2 = list(train_iter)
print('Length reloaded list:', len(list2))
```
Output
Length first list: 120000
Length second list: 0
Length reloaded list: 120000
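Pitfall 3 can be demonstrated the same way with a plain dict: the `.get` fallback below plays the role that `set_default_index` plays in torchtext. The toy vocabulary and tokens are illustrative:

```python
# A toy vocabulary with no default behavior for unknown tokens
vocab = {'<unk>': 0, 'the': 1, 'market': 2}

tokens = ['the', 'market', 'crashed']  # 'crashed' was never seen

# Wrong: direct lookup raises KeyError on the unseen token
try:
    ids = [vocab[tok] for tok in tokens]
except KeyError as e:
    print('KeyError:', e)

# Right: fall back to the '<unk>' index, as set_default_index arranges
ids = [vocab.get(tok, vocab['<unk>']) for tok in tokens]
print(ids)  # [1, 2, 0]
```

Without the fallback, a single out-of-vocabulary word crashes numericalization mid-training, which is why the main example calls `vocab.set_default_index(vocab['<unk>'])` right after building the vocabulary.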
Quick Reference
Key torchtext components for NLP:
| Component | Purpose | Example Usage |
|---|---|---|
| Datasets | Load standard NLP datasets | AG_NEWS(split='train') |
| Tokenizer | Split text into tokens | get_tokenizer('basic_english') |
| Vocabulary | Map tokens to integers | build_vocab_from_iterator(...) |
| Transforms | Apply tokenization and numericalization | text_pipeline = lambda x: vocab(tokenizer(x)) |
| DataLoader | Batch and pad sequences | DataLoader(dataset, batch_size=8, collate_fn=collate_batch) |
Key Takeaways
- Use torchtext datasets and tokenizers to easily load and preprocess text data.
- Build a vocabulary from your dataset tokens to convert text into numbers for models.
- Always use a collate function in DataLoader to batch variable-length text sequences.
- Reload dataset iterators before multiple passes to avoid empty data.
- Set a default index for unknown tokens to handle unseen words gracefully.