How to Use AutoTokenizer from Hugging Face in NLP
Use AutoTokenizer.from_pretrained() to load a tokenizer for any Hugging Face model by specifying its name. Then call tokenizer() on your text to convert it into token IDs ready for model input.
Syntax
The basic syntax to use AutoTokenizer is:
- AutoTokenizer.from_pretrained(model_name): Loads the tokenizer for the specified model.
- tokenizer(text): Tokenizes the input text and returns its token IDs along with an attention mask.
This lets you easily switch between models without changing tokenization code.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer('Hello world!')
```
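Because the call pattern is the same for every checkpoint, switching models comes down to changing the name string. The sketch below illustrates that idea under a few assumptions: `tokenize_with` is a hypothetical helper introduced here for illustration, `gpt2` is only an example of a second checkpoint, and both checkpoints are assumed to be downloadable from the Hugging Face Hub.

```python
from transformers import AutoTokenizer


def tokenize_with(model_name, text):
    """Hypothetical helper: load the tokenizer for model_name and return readable tokens."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoded = tokenizer(text)
    return tokenizer.convert_ids_to_tokens(encoded["input_ids"])


# Only the checkpoint name changes; the tokenization code stays the same.
results = {}
for name in ("bert-base-uncased", "gpt2"):
    results[name] = tokenize_with(name, "Hello world!")
    print(name, "->", results[name])
```

The two checkpoints split the text differently because each one ships its own vocabulary and tokenization rules, which is why the tokenizer should always be loaded with the same name as the model it feeds.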
Example
This example shows how to load the BERT tokenizer and tokenize a sentence. It prints the token IDs and tokens.
```python
from transformers import AutoTokenizer

# Load tokenizer for BERT base uncased
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize a sample sentence
text = 'Hello, Hugging Face!'
tokens = tokenizer(text)

print('Token IDs:', tokens['input_ids'])
print('Tokens:', tokenizer.convert_ids_to_tokens(tokens['input_ids']))
```
Output
Token IDs: [101, 7592, 1010, 17662, 2227, 999, 102]
Tokens: ['[CLS]', 'hello', ',', 'hugging', 'face', '!', '[SEP]']
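The [CLS] and [SEP] entries in this output come from the tokenizer, not from the input text. As a minimal sketch, the comparison below encodes the same sentence with and without them by passing `add_special_tokens=False`. It assumes the `bert-base-uncased` checkpoint is available; the variable names are illustrative only.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, Hugging Face!"

# Default call: [CLS] and [SEP] are added around the sentence.
with_special = tokenizer(text)["input_ids"]

# add_special_tokens=False keeps only the tokens that come from the text itself.
without_special = tokenizer(text, add_special_tokens=False)["input_ids"]

print(tokenizer.convert_ids_to_tokens(with_special))
print(tokenizer.convert_ids_to_tokens(without_special))

# The tokenizer also exposes its special tokens as attributes.
print("Special tokens:", tokenizer.cls_token, tokenizer.sep_token)
```

Models such as BERT are generally trained with these markers in place, so the default behavior is usually the right choice when the IDs go to the model.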
Common Pitfalls
Common mistakes include:
- Not calling from_pretrained() and trying to instantiate AutoTokenizer() directly.
- Passing raw text to the model instead of tokenized inputs.
- Ignoring special tokens like [CLS] and [SEP], which are added automatically.
- Not using return_tensors='pt' or 'tf' when feeding tokens to PyTorch or TensorFlow models.
```python
from transformers import AutoTokenizer

# Wrong: Instantiating without from_pretrained
# tokenizer = AutoTokenizer()  # This will raise an error

# Right way:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
```
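The return_tensors pitfall is easier to see side by side. The sketch below contrasts the default output (plain Python lists) with `return_tensors="pt"`. It assumes PyTorch is installed alongside transformers and that the `bert-base-uncased` checkpoint is available; no model is loaded here, so the final comment only describes the usual calling pattern.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, Hugging Face!"

# Without return_tensors, the values are plain Python lists.
as_lists = tokenizer(text)
print(type(as_lists["input_ids"]))

# With return_tensors="pt", each value is a torch.Tensor with a batch dimension.
as_tensors = tokenizer(text, return_tensors="pt")
print(type(as_tensors["input_ids"]), tuple(as_tensors["input_ids"].shape))

# A PyTorch model is usually called with these fields unpacked as keyword
# arguments, e.g. model(**as_tensors), rather than with the raw text.
print(sorted(as_tensors.keys()))
```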
Quick Reference
| Method | Description |
|---|---|
| from_pretrained(model_name) | Load tokenizer for a specific model |
| tokenizer(text) | Tokenize input text into token IDs and an attention mask |
| convert_ids_to_tokens(ids) | Convert token IDs back to readable tokens |
| tokenizer(text, return_tensors='pt') | Return PyTorch tensors for model input |
| tokenizer(text, return_tensors='tf') | Return TensorFlow tensors for model input |
Key Takeaways
Always load the tokenizer with AutoTokenizer.from_pretrained(model_name) before use.
Tokenize text by calling tokenizer(text) to get token IDs and attention masks.
Use return_tensors='pt' or 'tf' to get tensors compatible with your model framework.
Special tokens like [CLS] and [SEP] are added automatically by the tokenizer.
Avoid instantiating AutoTokenizer directly without from_pretrained to prevent errors.
