How to Use BERT for NLP Tasks: Simple Guide and Example
To use BERT for NLP, load a pre-trained BERT model and its tokenizer, then convert text into tokens that the model understands. Pass these tokens to the model to get meaningful outputs like embeddings or predictions for tasks such as classification or question answering.
Syntax
Using BERT involves these main steps:
- Load tokenizer: Converts text into tokens BERT understands.
- Tokenize input: Prepare text as input IDs and attention masks.
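- Load model: Load a pre-trained BERT model for your NLP task. `BertModel` returns hidden states (embeddings); task-specific classes such as `BertForSequenceClassification` or `BertForQuestionAnswering` add a prediction head on top.
- Run model: Pass the input IDs and attention mask to the model to get outputs like embeddings or predictions.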
```python
from transformers import BertTokenizer, BertModel

# Load pre-trained tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example text
text = "Hello, how are you?"

# Tokenize text
inputs = tokenizer(text, return_tensors='pt')

# Get model outputs
outputs = model(**inputs)

# Extract last hidden states (embeddings)
embeddings = outputs.last_hidden_state
```
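To make the tokenization step less opaque, here is a rough pure-Python sketch of the greedy longest-match-first subword splitting that WordPiece-style tokenizers such as BERT's are based on. The vocabulary `TOY_VOCAB` and the helpers `split_word` and `encode` are hypothetical names made up for this illustration; the real tokenizer uses a vocabulary of roughly 30,000 entries and extra text normalization.

```python
import re

# Toy vocabulary (hypothetical). "##" marks a piece that continues a word.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "hello": 4, "how": 5, "are": 6, "you": 7,
    ",": 8, "?": 9, "token": 10, "##ize": 11, "##r": 12,
}

def split_word(word, vocab):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Lowercase, split off punctuation, add special tokens, map to IDs."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(split_word(word, vocab))
    tokens.append("[SEP]")
    input_ids = [vocab[t] for t in tokens]
    attention_mask = [1] * len(input_ids)  # no padding, so all ones
    return tokens, input_ids, attention_mask

tokens, input_ids, attention_mask = encode("Hello, how are you?", TOY_VOCAB)
print(tokens)     # ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
print(input_ids)  # [2, 4, 8, 5, 6, 7, 9, 3]
print(split_word("tokenizer", TOY_VOCAB))  # ['token', '##ize', '##r']
```

The sketch returns the same two pieces the real tokenizer produces for the model: a list of IDs and an attention mask of the same length.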
Example
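This example shows how to use BERT to get contextual token embeddings from a sentence. These embeddings can be used for many NLP tasks like classification or similarity. The model returns one vector per token, so the result has shape (batch size, number of tokens, hidden size). The token count includes the special `[CLS]` and `[SEP]` tokens and depends on how the tokenizer splits the sentence; the hidden size is 768 for `bert-base-uncased`.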
```python
from transformers import BertTokenizer, BertModel
import torch

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Input sentence
sentence = "BERT helps computers understand language."

# Tokenize and get input tensors
inputs = tokenizer(sentence, return_tensors='pt')

# Run model to get outputs (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Get embeddings for each token
embeddings = outputs.last_hidden_state

# Print shape of embeddings tensor
print(f"Embeddings shape: {embeddings.shape}")
```
Output
Embeddings shape: torch.Size([1, 9, 768])
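As a hedged illustration of how per-token embeddings might feed a similarity task, the sketch below averages the vectors of the real tokens into one sentence vector and compares two sentences by cosine similarity. Mean pooling is only one common choice and is an assumption here, not something the example above prescribes. The helpers `mean_pool` and `cosine_similarity` and the 3-dimensional toy vectors are hypothetical stand-ins for the real 768-dimensional rows of `last_hidden_state`.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
            count += 1
    return [t / count for t in total]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy token vectors; the last row of sentence_a is padding and is masked out.
sentence_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]]
mask_a = [1, 1, 0]
sentence_b = [[2.0, 2.0, 0.0]]
mask_b = [1]

vec_a = mean_pool(sentence_a, mask_a)  # [0.5, 0.5, 0.0]
vec_b = mean_pool(sentence_b, mask_b)  # [2.0, 2.0, 0.0]
similarity = cosine_similarity(vec_a, vec_b)
print(round(similarity, 6))  # 1.0, the two vectors point the same way
```

Note that the masked padding row does not affect `vec_a`, which is why the attention mask matters when pooling.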
Common Pitfalls
Common mistakes when using BERT include:
- Not using the correct tokenizer matching the model.
- Feeding raw text directly to the model without tokenization.
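- Ignoring attention masks, which tell the model which positions hold real tokens and which are padding.
- Leaving the model in training mode during inference. Dropout stays active in training mode, so outputs vary from run to run. `from_pretrained` returns the model in evaluation mode, so this matters mainly after you have called `model.train()`.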
Always use the tokenizer from the same model checkpoint and pass attention masks to the model.
```python
from transformers import BertTokenizer, BertModel

# Wrong way: feeding raw text directly
# model('Hello world')  # This will cause an error

# Right way:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

text = "Hello world"
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
```
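Attention masks matter most when several sentences of different lengths are batched together. The following pure-Python sketch shows, under simplifying assumptions, what padding a batch and building the matching masks looks like. The helper `pad_batch` is hypothetical and the token IDs are illustrative values, not output copied from a real tokenizer run.

```python
PAD_ID = 0  # assumed padding ID for this illustration

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two already-tokenized sentences of different lengths (illustrative IDs)
batch = pad_batch([[101, 7592, 2088, 102], [101, 7592, 102]])
print(batch["input_ids"])       # [[101, 7592, 2088, 102], [101, 7592, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In practice the tokenizer does this for you: calling it on a list of strings with `padding=True` returns both `input_ids` and `attention_mask`, and `model(**inputs)` passes both along.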
Quick Reference
Key points to remember when using BERT:
- Always load tokenizer and model from the same pre-trained checkpoint.
- Use `tokenizer(text, return_tensors='pt')` to prepare inputs.
- Pass `input_ids` and `attention_mask` to the model.
- Use `outputs.last_hidden_state` for embeddings.
- Set model to `eval()` mode during inference.
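The advice about `eval()` mode can be checked without downloading any weights. The sketch below builds a tiny, randomly initialized BERT from a `BertConfig` (the layer sizes are arbitrary assumptions chosen to keep it fast) and runs the same input twice in each mode. A model constructed directly from a config starts in training mode, and `torch.no_grad()` alone does not switch dropout off.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)

# Tiny random model (hypothetical sizes) so nothing is downloaded.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)

# Illustrative token IDs within the toy vocabulary
inputs = {
    "input_ids": torch.tensor([[1, 5, 6, 2]]),
    "attention_mask": torch.tensor([[1, 1, 1, 1]]),
}

def outputs_match(m):
    """Run the same input twice and report whether the results are identical."""
    with torch.no_grad():
        first = m(**inputs).last_hidden_state
        second = m(**inputs).last_hidden_state
    return torch.equal(first, second)

model.train()                        # dropout active
same_in_train = outputs_match(model)
model.eval()                         # dropout disabled
same_in_eval = outputs_match(model)

print(f"identical in train mode: {same_in_train}")
print(f"identical in eval mode:  {same_in_eval}")
```

With dropout active the two passes should differ, while in evaluation mode they should match exactly.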
Key Takeaways
- Load the BERT tokenizer and model from the same pre-trained checkpoint for compatibility.
- Always tokenize text with the tokenizer before passing it to the BERT model.
- Use attention masks so BERT attends to real tokens and ignores padding.
- Extract embeddings from the model's last hidden state for downstream NLP tasks.
- Keep the model in evaluation mode during inference to get consistent results.
