How to Use Pretrained Models from Hugging Face in NLP
To use a pretrained model from Hugging Face in NLP, install the transformers library, then load a model and tokenizer with AutoModelForSequenceClassification and AutoTokenizer. Use the tokenizer to prepare the text input and the model to get predictions.
Syntax
Here is the basic syntax to load and use a pretrained Hugging Face model for NLP tasks:
- from transformers import AutoTokenizer, AutoModelForSequenceClassification: Import the classes that load the tokenizer and model.
- tokenizer = AutoTokenizer.from_pretrained('model-name'): Load the tokenizer for the model.
- model = AutoModelForSequenceClassification.from_pretrained('model-name'): Load the pretrained model.
- inputs = tokenizer(text, return_tensors='pt'): Convert text to model input tensors.
- outputs = model(**inputs): Get model predictions.
python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

text = "I love using pretrained models!"
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
Example
This example shows how to load a pretrained sentiment analysis model from Hugging Face, tokenize a sentence, and get the predicted sentiment scores.
python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Input text
text = "I love using pretrained models!"

# Tokenize input
inputs = tokenizer(text, return_tensors='pt')

# Get model outputs
outputs = model(**inputs)

# Convert logits to probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Print probabilities for each class
print(f"Negative: {probs[0][0].item():.4f}")
print(f"Positive: {probs[0][1].item():.4f}")
Output
Negative: 0.0001
Positive: 0.9999
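To make the "convert logits to probabilities" step concrete, here is a minimal pure-Python sketch of what softmax does to a pair of logits. The logit values are made up for illustration (they are not real model output), and the label order is assumed to follow the example above: index 0 is negative, index 1 is positive.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence (illustrative values only).
logits = [-4.3, 4.7]
# Assumed label order, matching the example above: index 0 negative, index 1 positive.
labels = ["Negative", "Positive"]

probs = softmax(logits)
# Pick the index of the highest probability and look up its label.
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], f"{probs[best]:.4f}")
```

The larger the gap between the two logits, the closer the winning probability gets to 1, which is why a clearly positive sentence produces scores like those shown in the output above.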
Common Pitfalls
- Not installing the transformers library before use causes import errors.
- Using a model name that does not exist or is misspelled leads to loading errors.
- Passing raw text to the model instead of tokenizer output results in model errors.
- Omitting return_tensors='pt' (or 'tf') makes the tokenizer return plain Python lists, which the model cannot consume directly.
- Confusing model types: use AutoModelForSequenceClassification for classification tasks, not AutoModel.
python
from transformers import AutoTokenizer, AutoModel

# Wrong: Using AutoModel instead of AutoModelForSequenceClassification
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # This will not output classification logits

text = "Test"
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # Outputs embeddings, not classification scores

# Right way:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name)
outputs = model(**inputs)
print(outputs.logits.shape)  # Outputs classification logits
Output
torch.Size([1, 3, 768])
torch.Size([1, 2])
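For the misspelled-model-name pitfall, one possible safeguard is to wrap the loading call and report the failure clearly. The sketch below relies on from_pretrained raising OSError when a model identifier cannot be resolved; the helper name try_load is hypothetical and not part of the transformers library. It accepts any from_pretrained-style callable, so it works for both the tokenizer and the model.

```python
def try_load(loader, model_name):
    """Call a from_pretrained-style loader and return its result.

    Returns None (after printing a readable message) if the name
    cannot be resolved. `try_load` is a hypothetical helper, not a
    transformers API.
    """
    try:
        return loader(model_name)
    except OSError as err:
        # Unknown or misspelled model identifiers surface as OSError.
        print(f"Could not load {model_name!r}: {err}")
        return None


# Intended usage (requires the transformers library and network access):
#   from transformers import AutoTokenizer
#   tokenizer = try_load(AutoTokenizer.from_pretrained, "distilbert-base-uncased")
```

Returning None keeps the sketch short; in a real script it may be preferable to let the exception propagate after logging it.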
Quick Reference
Summary tips for using pretrained Hugging Face models in NLP:
- Always install transformers with pip install transformers.
- Choose the right model for your task (e.g., classification, generation).
- Load both tokenizer and model with the same model name.
- Tokenize input text with return_tensors='pt' for PyTorch or 'tf' for TensorFlow.
- Use model outputs correctly: outputs.logits for classification scores.
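The tips above can be combined into one small helper. This is a sketch rather than a definitive implementation: the function name classify is hypothetical, and padding=True, truncation=True, torch.no_grad(), and the label lookup through model.config.id2label go beyond what this article covers. They are added here on the assumption that you want to score several sentences at once with PyTorch. The imports sit inside the function so the heavy libraries load only when it is called.

```python
def classify(texts, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Return a (label, probability) pair for each input text.

    Hypothetical helper: downloads the model on first use, so it needs
    the transformers and torch packages plus network access.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Load tokenizer and model with the same model name.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # padding/truncation (an addition to the article) let sentences of
    # different lengths share one batch tensor.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    # Gradients are not needed for inference.
    with torch.no_grad():
        logits = model(**inputs).logits

    probs = torch.softmax(logits, dim=-1)
    results = []
    for row, idx in enumerate(probs.argmax(dim=-1).tolist()):
        # id2label maps a class index to the label name stored in the model config.
        results.append((model.config.id2label[idx], probs[row, idx].item()))
    return results


# Intended usage:
#   for label, score in classify(["I love this!", "This is terrible."]):
#       print(label, f"{score:.4f}")
```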
Key Takeaways
Install the transformers library and import AutoTokenizer and AutoModelForSequenceClassification.
Load the pretrained model and tokenizer by specifying the model name from Hugging Face.
Tokenize your input text with return_tensors set to 'pt' or 'tf' before passing to the model.
Use outputs.logits to get prediction scores for classification tasks.
Avoid using the wrong model class or skipping tokenization to prevent errors.
