PyTorch · How-To · Beginner · 4 min read

How to Build a Text Classifier with PyTorch: Simple Guide

To build a text classifier in PyTorch, you first prepare your text data as tensors, then define a simple neural network model with embedding and linear layers, and finally train it using a loss function like CrossEntropyLoss and an optimizer such as Adam. After training, use the model to predict classes for new text inputs.
📐 Syntax

Here is the basic syntax pattern to build a text classifier in PyTorch:

  • Dataset and DataLoader: Prepare and load text data as tensors.
  • Model: Define a neural network with an embedding layer and linear layers.
  • Loss function: Use torch.nn.CrossEntropyLoss() for classification.
  • Optimizer: Use torch.optim.Adam or similar to update model weights.
  • Training loop: Iterate over data, compute loss, backpropagate, and update weights.
  • Evaluation: Use the trained model to predict and measure accuracy.
python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Define a simple text classification model
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)  # Shape: (batch_size, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)  # Average over sequence length
        out = self.fc(pooled)  # Shape: (batch_size, num_classes)
        return out

# Instantiate the model
model = TextClassifier(vocab_size=1000, embed_dim=50, num_classes=2)  # Example parameters

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Training loop (simplified; assumes a `dataloader` yielding (inputs, labels) batches)
num_epochs = 5  # Example number of epochs
for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
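The syntax list above mentions evaluation, but the snippet stops at the training loop. A minimal evaluation pass might look like the sketch below; the `evaluate` helper and the `eval_dataloader` it consumes are illustrative names, assuming batches of (inputs, labels) like those used in training:

```python
import torch

def evaluate(model, eval_dataloader):
    """Return classification accuracy over a dataloader of (inputs, labels) batches."""
    model.eval()  # switch layers like dropout/batchnorm to inference mode
    correct, total = 0, 0
    with torch.no_grad():  # no gradient tracking needed for inference
        for inputs, labels in eval_dataloader:
            outputs = model(inputs)                     # (batch, num_classes)
            preds = outputs.argmax(dim=1)               # predicted class per example
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total
```

Calling `model.eval()` and wrapping inference in `torch.no_grad()` go together: the first changes layer behavior, the second skips building the autograd graph.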
💻 Example

This example shows a complete runnable PyTorch script that trains a simple text classifier on a tiny dataset of sentences labeled as positive or negative sentiment.

python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Sample data: sentences and labels (0=negative, 1=positive)
data = [
    ("I love this movie", 1),
    ("This film is terrible", 0),
    ("Amazing story and great cast", 1),
    ("I did not like the plot", 0)
]

# Simple vocabulary and tokenizer
vocab = {"<pad>":0, "i":1, "love":2, "this":3, "movie":4, "film":5, "is":6, "terrible":7, "amazing":8, "story":9, "and":10, "great":11, "cast":12, "did":13, "not":14, "like":15, "the":16, "plot":17}
vocab_size = len(vocab)

def tokenize(text):
    # Unknown words fall back to index 0, which this tiny example shares with <pad>
    return [vocab.get(word.lower(), 0) for word in text.split()]

# Dataset class
class TextDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, label = self.data[idx]
        tokens = tokenize(text)
        return torch.tensor(tokens), torch.tensor(label)

# Collate function to pad sequences
def collate_batch(batch):
    texts, labels = zip(*batch)
    lengths = [len(t) for t in texts]
    max_len = max(lengths)
    padded_texts = [torch.cat([t, torch.zeros(max_len - len(t), dtype=torch.long)]) for t in texts]
    return torch.stack(padded_texts), torch.tensor(labels)

# Create DataLoader
dataset = TextDataset(data)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_batch)

# Model
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)  # average pooling
        return self.fc(pooled)

model = TextClassifier(vocab_size, embed_dim=10, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(10):
    total_loss = 0
    for inputs, labels in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(dataloader):.4f}")

# Test prediction
test_sentence = "I love the story"
tokens = torch.tensor(tokenize(test_sentence)).unsqueeze(0)  # batch size 1
with torch.no_grad():
    output = model(tokens)
    predicted = torch.argmax(output, dim=1).item()
print(f"Prediction for '{test_sentence}':", "Positive" if predicted == 1 else "Negative")
Output
Epoch 1, Loss: 0.6931
Epoch 2, Loss: 0.6863
Epoch 3, Loss: 0.6784
Epoch 4, Loss: 0.6694
Epoch 5, Loss: 0.6590
Epoch 6, Loss: 0.6468
Epoch 7, Loss: 0.6323
Epoch 8, Loss: 0.6151
Epoch 9, Loss: 0.5947
Epoch 10, Loss: 0.5707
Prediction for 'I love the story': Positive
⚠️ Common Pitfalls

  • Not padding sequences: Text inputs must be padded to the same length in a batch, or the model will error.
  • Ignoring tokenization: Always convert text to numeric tokens before feeding to the model.
  • Wrong loss function: Use CrossEntropyLoss for classification, not MSELoss.
  • Forgetting to zero gradients: Call optimizer.zero_grad() before loss.backward() to avoid gradient accumulation.
  • Skipping torch.no_grad() during evaluation: Wrap inference in torch.no_grad() to avoid building the autograd graph, which saves memory and speeds up prediction.
python
## Wrong: No padding, will cause error if batch has different lengths
# inputs = [torch.tensor([1,2,3]), torch.tensor([4,5])]
# outputs = model(torch.stack(inputs))  # Error

## Right: Pad sequences before batching
# Use collate_fn in DataLoader to pad sequences to same length

## Wrong: Using MSELoss for classification
# criterion = nn.MSELoss()  # Not suitable

## Right: Use CrossEntropyLoss
# criterion = nn.CrossEntropyLoss()
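Padding does not have to be done by hand as in the example's collate_batch; PyTorch ships a helper for exactly this. A sketch of an equivalent collate function using torch.nn.utils.rnn.pad_sequence:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    """Pad a batch of (token_tensor, label) pairs to the longest sequence."""
    texts, labels = zip(*batch)
    # pad_sequence pads shorter sequences with padding_value (0 = <pad> here)
    padded = pad_sequence(texts, batch_first=True, padding_value=0)
    return padded, torch.tensor(labels)
```

This is a drop-in replacement for the manual torch.cat/torch.zeros version in the example above.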
📊 Quick Reference

Tips for building text classifiers in PyTorch:

  • Always preprocess text: tokenize and convert to integer indices.
  • Use nn.Embedding to convert tokens to vectors.
  • Average or pool embeddings before classification layer.
  • Use CrossEntropyLoss for multi-class classification.
  • Pad sequences in batches for consistent input size.
  • Use torch.no_grad() during evaluation to save memory.
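One subtlety behind "average or pool embeddings": plain mean pooling, as used in the model above, averages the pad-token embeddings in with the real ones. A masked mean that ignores padding (assuming pad index 0; `masked_mean` is an illustrative helper, not a PyTorch built-in) might look like:

```python
import torch

def masked_mean(embedded, token_ids, pad_idx=0):
    """Mean-pool embeddings over the sequence, ignoring pad positions.

    embedded:  (batch, seq_len, embed_dim)
    token_ids: (batch, seq_len)
    """
    mask = (token_ids != pad_idx).unsqueeze(-1).float()  # 1.0 at real tokens, 0.0 at pads
    summed = (embedded * mask).sum(dim=1)                # sum only real-token embeddings
    counts = mask.sum(dim=1).clamp(min=1)                # guard against all-pad rows
    return summed / counts
```

In the forward pass, `embedded.mean(dim=1)` would be replaced with `masked_mean(embedded, x)`; for short, similarly sized sequences the difference is small, but it grows with heavy padding.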

Key Takeaways

  • Prepare text data by tokenizing and padding sequences before feeding them into the model.
  • Define a simple model with an embedding layer followed by a linear layer for classification.
  • Use CrossEntropyLoss and an optimizer like Adam to train the model.
  • Always zero gradients before backpropagation and use torch.no_grad() during evaluation.
  • Test your model on new sentences by tokenizing them and passing them through the trained network.