How to Use TorchText for NLP with PyTorch
Use torchtext to load and preprocess text data for NLP by creating datasets and iterators with built-in tokenizers and vocabularies. It simplifies text handling by providing tools to tokenize, numericalize, and batch text data for PyTorch models.
Syntax
The basic syntax for using torchtext involves these steps:
- Import modules: Import dataset, transforms, and data utilities.
- Load data: Use built-in datasets or create custom datasets.
- Tokenize: Apply tokenization to split text into words or tokens.
- Build vocabulary: Create a vocabulary mapping tokens to integers.
- Numericalize: Convert tokens to integer indices.
- Create dataloaders: Batch and pad sequences for training.
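The steps above can be sketched in plain Python before introducing torchtext itself. The whitespace tokenizer and dict vocabulary below are illustrative stand-ins, not torchtext APIs, but they mirror the same tokenize → build vocabulary → numericalize flow:

```python
from collections import Counter

# Toy corpus standing in for a dataset of (label, text) pairs
corpus = [(1, "the market rose today"), (2, "the team won the match")]

# Tokenize: split text into tokens (torchtext's tokenizer is more robust)
tokenizer = lambda text: text.lower().split()

# Build vocabulary: count tokens across the corpus
counter = Counter(tok for _, text in corpus for tok in tokenizer(text))

# Reserve index 0 for '<unk>', then assign indices by frequency
vocab = {'<unk>': 0}
for tok, _ in counter.most_common():
    vocab[tok] = len(vocab)

# Numericalize: map tokens to indices, sending unseen words to '<unk>'
numericalize = lambda text: [vocab.get(tok, 0) for tok in tokenizer(text)]
print(numericalize("the market fell"))  # 'fell' is unseen -> index 0
```

In torchtext, `build_vocab_from_iterator` and `Vocab.set_default_index` take over the counting and the unknown-token fallback shown here by hand.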
```python
import torch
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader

# Step 1: Load dataset
train_iter = AG_NEWS(split='train')

# Step 2: Tokenizer
tokenizer = get_tokenizer('basic_english')

# Step 3: Build vocabulary
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

# Step 4: Numericalize and batch
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

# Example batch function
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list, text_list, offsets
```
Example
This example shows how to load the AG_NEWS dataset, tokenize text, build a vocabulary, and prepare batches for training a text classification model.
```python
import torch
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader

# Load training data
train_iter = AG_NEWS(split='train')

# Tokenizer
tokenizer = get_tokenizer('basic_english')

# Build vocabulary
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

# Pipelines
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

# Collate function for DataLoader
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list, text_list, offsets

# Create DataLoader
train_iter = AG_NEWS(split='train')  # reload iterator (iterators are single-use)
train_dataloader = DataLoader(list(train_iter), batch_size=8,
                              shuffle=True, collate_fn=collate_batch)

# Fetch one batch
labels, texts, offsets = next(iter(train_dataloader))
print('Labels:', labels)
print('Texts:', texts)
print('Offsets:', offsets)
```
Output
Labels: tensor([1, 1, 2, 0, 2, 1, 0, 1])
Texts: tensor([ 9, 16, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77])
Offsets: tensor([ 0, 7, 14, 21, 28, 35, 42, 49])
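The offsets in the output above come from how collate_batch packs variable-length sequences: all texts are concatenated into one flat tensor, and offsets records the start position of each sequence (a cumulative sum of lengths, shifted by one). The same bookkeeping can be checked with plain Python lists; the batch below is made-up data, not real AG_NEWS indices:

```python
# Token-index lists of different lengths, standing in for numericalized texts
batch = [[9, 16, 10], [11, 12], [13, 14, 15, 16]]

# Concatenate into one flat list, recording where each sequence starts
flat, offsets = [], []
for seq in batch:
    offsets.append(len(flat))  # start position of this sequence
    flat.extend(seq)

print(flat)     # [9, 16, 10, 11, 12, 13, 14, 15, 16]
print(offsets)  # [0, 3, 5] -- each entry marks where a sequence begins
```

This flat-plus-offsets layout is exactly what PyTorch's EmbeddingBag consumes, which is why the collate function returns it instead of a padded matrix.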
Common Pitfalls
1. Reusing iterators: Torchtext dataset iterators can only be used once. Always reload or convert to list before multiple passes.
2. Missing collate_fn in DataLoader: Without a proper collate function, batching variable-length text sequences will fail.
3. Vocabulary unknown tokens: Always set a default index for unknown tokens to avoid errors during numericalization.
4. Tokenizer mismatch: Use the same tokenizer for building vocabulary and processing data to keep consistency.
```python
from torchtext.datasets import AG_NEWS

# Wrong: Using iterator twice
train_iter = AG_NEWS(split='train')
list1 = list(train_iter)
list2 = list(train_iter)  # This will be empty
print('Length first list:', len(list1))
print('Length second list:', len(list2))  # 0

# Right: Reload iterator
train_iter = AG_NEWS(split='train')
list2 = list(train_iter)
print('Length reloaded list:', len(list2))
```
Output
Length first list: 120000
Length second list: 0
Length reloaded list: 120000
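Pitfall 3 can be demonstrated the same way with a plain dict: the `.get` fallback below plays the role that `set_default_index` plays in torchtext. The toy vocabulary and tokens are illustrative:

```python
# A toy vocabulary with no default behavior for unknown tokens
vocab = {'<unk>': 0, 'the': 1, 'market': 2}

tokens = ['the', 'market', 'crashed']  # 'crashed' was never seen

# Wrong: direct lookup raises KeyError on the unseen token
try:
    ids = [vocab[tok] for tok in tokens]
except KeyError as e:
    print('KeyError:', e)

# Right: fall back to the '<unk>' index, as set_default_index arranges
ids = [vocab.get(tok, vocab['<unk>']) for tok in tokens]
print(ids)  # [1, 2, 0]
```

Without the fallback, a single out-of-vocabulary word crashes numericalization mid-training, which is why the main example calls `vocab.set_default_index(vocab['<unk>'])` right after building the vocabulary.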
Quick Reference
Key torchtext components for NLP:
| Component | Purpose | Example Usage |
|---|---|---|
| Datasets | Load standard NLP datasets | AG_NEWS(split='train') |
| Tokenizer | Split text into tokens | get_tokenizer('basic_english') |
| Vocabulary | Map tokens to integers | build_vocab_from_iterator(...) |
| Transforms | Apply tokenization and numericalization | text_pipeline = lambda x: vocab(tokenizer(x)) |
| DataLoader | Batch and pad sequences | DataLoader(dataset, batch_size=8, collate_fn=collate_batch) |
Key Takeaways
- Use torchtext datasets and tokenizers to easily load and preprocess text data.
- Build a vocabulary from your dataset tokens to convert text into numbers for models.
- Always use a collate function in DataLoader to batch variable-length text sequences.
- Reload dataset iterators before multiple passes to avoid empty data.
- Set a default index for unknown tokens to handle unseen words gracefully.