How to Build a Text Classifier with PyTorch: Simple Guide
To build a text classifier in PyTorch, first prepare your text data as tensors, then define a simple neural network with embedding and linear layers, and finally train it with a loss function such as CrossEntropyLoss and an optimizer such as Adam. After training, use the model to predict classes for new text inputs.

Syntax
Here is the basic syntax pattern to build a text classifier in PyTorch:
- Dataset and DataLoader: Prepare and load text data as tensors.
- Model: Define a neural network with an embedding layer and linear layers.
- Loss function: Use torch.nn.CrossEntropyLoss() for classification.
- Optimizer: Use torch.optim.Adam or similar to update model weights.
- Training loop: Iterate over data, compute loss, backpropagate, and update weights.
- Evaluation: Use the trained model to predict and measure accuracy.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Define a simple text classification model
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)   # Shape: (batch_size, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)  # Average over sequence length
        out = self.fc(pooled)          # Shape: (batch_size, num_classes)
        return out

# Instantiate the model
model = TextClassifier(vocab_size=1000, embed_dim=50, num_classes=2)  # Example parameters

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Training loop (simplified; assumes a DataLoader named `dataloader`)
num_epochs = 5  # Example number of epochs
for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
Example
This example shows a complete runnable PyTorch script that trains a simple text classifier on a tiny dataset of sentences labeled as positive or negative sentiment.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Sample data: sentences and labels (0=negative, 1=positive)
data = [
    ("I love this movie", 1),
    ("This film is terrible", 0),
    ("Amazing story and great cast", 1),
    ("I did not like the plot", 0),
]

# Simple vocabulary and tokenizer
vocab = {"<pad>": 0, "i": 1, "love": 2, "this": 3, "movie": 4, "film": 5,
         "is": 6, "terrible": 7, "amazing": 8, "story": 9, "and": 10,
         "great": 11, "cast": 12, "did": 13, "not": 14, "like": 15,
         "the": 16, "plot": 17}
vocab_size = len(vocab)

def tokenize(text):
    return [vocab.get(word.lower(), 0) for word in text.split()]

# Dataset class
class TextDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, label = self.data[idx]
        tokens = tokenize(text)
        return torch.tensor(tokens), torch.tensor(label)

# Collate function to pad sequences
def collate_batch(batch):
    texts, labels = zip(*batch)
    max_len = max(len(t) for t in texts)
    padded_texts = [torch.cat([t, torch.zeros(max_len - len(t), dtype=torch.long)])
                    for t in texts]
    return torch.stack(padded_texts), torch.tensor(labels)

# Create DataLoader
dataset = TextDataset(data)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_batch)

# Model
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)   # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)  # average pooling
        return self.fc(pooled)

model = TextClassifier(vocab_size, embed_dim=10, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(10):
    total_loss = 0
    for inputs, labels in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(dataloader):.4f}")

# Test prediction
test_sentence = "I love the story"
tokens = torch.tensor(tokenize(test_sentence)).unsqueeze(0)  # batch size 1
with torch.no_grad():
    output = model(tokens)
predicted = torch.argmax(output, dim=1).item()
print(f"Prediction for '{test_sentence}':", "Positive" if predicted == 1 else "Negative")
```
Output
```
Epoch 1, Loss: 0.6931
Epoch 2, Loss: 0.6863
Epoch 3, Loss: 0.6784
Epoch 4, Loss: 0.6694
Epoch 5, Loss: 0.6590
Epoch 6, Loss: 0.6468
Epoch 7, Loss: 0.6323
Epoch 8, Loss: 0.6151
Epoch 9, Loss: 0.5947
Epoch 10, Loss: 0.5707
Prediction for 'I love the story': Positive
```
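The example prints one prediction, but the Evaluation step from the syntax list usually means measuring accuracy over a batch. A minimal sketch of that computation (the logits and labels here are made-up illustration values, not output from the trained model above):

```python
import torch

# Suppose the model produced these logits for a batch of 4 examples
logits = torch.tensor([[2.0, -1.0],
                       [0.5,  1.5],
                       [3.0,  0.0],
                       [-0.2, 0.8]])
labels = torch.tensor([0, 1, 0, 1])  # ground-truth classes

predictions = torch.argmax(logits, dim=1)          # class with the highest score per row
accuracy = (predictions == labels).float().mean()  # fraction of correct predictions
print(f"Accuracy: {accuracy.item():.2f}")
```

The same two lines work unchanged on the output of `model(inputs)` inside a `torch.no_grad()` block.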
Common Pitfalls
- Not padding sequences: Text inputs must be padded to the same length in a batch, or the model will error.
- Ignoring tokenization: Always convert text to numeric tokens before feeding to the model.
- Wrong loss function: Use CrossEntropyLoss for classification, not MSELoss.
- Forgetting to zero gradients: Call optimizer.zero_grad() before loss.backward() to avoid gradient accumulation.
- Not using torch.no_grad() during evaluation: This saves memory and speeds up inference.
```python
## Wrong: No padding, will cause an error if the batch has different lengths
# inputs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
# outputs = model(torch.stack(inputs))  # Error: tensor sizes don't match

## Right: Pad sequences before batching
# Use collate_fn in DataLoader to pad sequences to the same length

## Wrong: Using MSELoss for classification
# criterion = nn.MSELoss()  # Not suitable

## Right: Use CrossEntropyLoss
# criterion = nn.CrossEntropyLoss()
```
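The "forgetting to zero gradients" pitfall is easy to demonstrate directly: PyTorch accumulates gradients into `.grad` across `backward()` calls rather than replacing them. A small sketch with a single toy parameter:

```python
import torch

w = torch.ones(1, requires_grad=True)

# First backward pass: d(2w)/dw = 2
loss = (w * 2).sum()
loss.backward()
print(w.grad)  # tensor([2.])

# Second backward pass WITHOUT zeroing: gradients add up
loss = (w * 2).sum()
loss.backward()
print(w.grad)  # tensor([4.]) -- accumulated, not replaced

# Correct: reset the gradient before the next backward pass
w.grad.zero_()
loss = (w * 2).sum()
loss.backward()
print(w.grad)  # tensor([2.])
```

In a training loop, `optimizer.zero_grad()` does this reset for every parameter the optimizer manages.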
Quick Reference
Tips for building text classifiers in PyTorch:
- Always preprocess text: tokenize and convert to integer indices.
- Use nn.Embedding to convert tokens to vectors.
- Average or pool embeddings before the classification layer.
- Use CrossEntropyLoss for multi-class classification.
- Pad sequences in batches for consistent input size.
- Use torch.no_grad() during evaluation to save memory.
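For the padding tip, PyTorch also ships a ready-made helper, torch.nn.utils.rnn.pad_sequence, which can replace a hand-written collate function like the one in the example above. A quick sketch:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Variable-length token sequences, as a tokenizer would produce them
seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]

# pad_sequence stacks them into one (batch, max_len) tensor,
# filling shorter sequences with padding_value
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch)
# tensor([[1, 2, 3],
#         [4, 5, 0]])
```

Here padding_value=0 matches the `<pad>` index used in the example vocabulary.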
Key Takeaways
- Prepare text data by tokenizing and padding sequences before feeding them into the model.
- Define a simple model with an embedding layer followed by a linear layer for classification.
- Use CrossEntropyLoss and an optimizer like Adam to train the model.
- Always zero gradients before backpropagation, and use torch.no_grad() during evaluation.
- Test your model on new sentences by tokenizing them and passing them through the trained network.