How to Use Hugging Face Tokenizer in NLP: Simple Guide
To use a tokenizer from Hugging Face in NLP, first load it with AutoTokenizer.from_pretrained() using a model name. Then call tokenizer() on your text to convert it into tokens and input IDs ready for a model.
Syntax
The basic syntax involves importing AutoTokenizer from transformers, loading a tokenizer with from_pretrained(), and then calling the tokenizer on your text.
- AutoTokenizer.from_pretrained(model_name): Loads the tokenizer for the specified model.
- tokenizer(text): Encodes the input text and returns a dictionary-like object with keys such as input_ids and attention_mask.
- return_tensors='pt': Optional argument to get output as PyTorch tensors.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encoded_input = tokenizer('Hello, Hugging Face!', return_tensors='pt')
```
Example
This example shows how to load the BERT tokenizer, tokenize a sentence, and print the tokens and their corresponding input IDs.
```python
from transformers import AutoTokenizer

# Load tokenizer for BERT
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Text to tokenize
text = "Hello, Hugging Face!"

# Tokenize text
encoded = tokenizer(text)

# Print tokens and input IDs
print('Tokens:', tokenizer.convert_ids_to_tokens(encoded['input_ids']))
print('Input IDs:', encoded['input_ids'])
```
Output
Tokens: ['[CLS]', 'hello', ',', 'hugging', 'face', '!', '[SEP]']
Input IDs: [101, 7592, 1010, 17662, 2227, 999, 102]
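To make the relationship between tokens and input IDs in the output more concrete, here is a toy sketch in plain Python. It is not how the library is implemented: `TOY_VOCAB`, `toy_tokenize`, `toy_encode`, and `toy_ids_to_tokens` are hypothetical names, the IDs are made up, and the real BERT tokenizer uses WordPiece subword splitting rather than the simple regular expression assumed here. The sketch only shows the general idea: lowercase the text, split it, wrap it in `[CLS]` and `[SEP]`, and look each token up in a vocabulary.

```python
import re

# Toy vocabulary with made-up IDs (an assumption for illustration, not BERT's real vocab).
TOY_VOCAB = {
    "[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
    "hello": 3, ",": 4, "hugging": 5, "face": 6, "!": 7,
}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}


def toy_tokenize(text):
    """Lowercase the text and split it into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Add [CLS]/[SEP] around the tokens and map each token to its ID."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]


def toy_ids_to_tokens(ids):
    """Reverse lookup, analogous in spirit to convert_ids_to_tokens()."""
    return [ID_TO_TOKEN[i] for i in ids]


ids = toy_encode("Hello, Hugging Face!")
print(ids)                     # [1, 3, 4, 5, 6, 7, 2]
print(toy_ids_to_tokens(ids))  # ['[CLS]', 'hello', ',', 'hugging', 'face', '!', '[SEP]']
```

The first and last IDs always correspond to the special tokens, which is why the real output also starts and ends with fixed values.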
Common Pitfalls
Common mistakes include:
- Not using from_pretrained() to load the tokenizer, which causes errors.
- Forgetting to add return_tensors='pt' or 'tf' when passing inputs to models expecting tensors.
- Confusing tokenized output formats; tokenizer() returns a dictionary with keys like input_ids and attention_mask.
Always check the tokenizer documentation for your specific model.
```python
from transformers import AutoTokenizer

# Wrong: calling the class directly instead of from_pretrained()
# tokenizer = AutoTokenizer('bert-base-uncased')  # This will raise an error

# Right way:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Wrong: omitting return_tensors when the model expects tensors
# inputs = tokenizer('Hello')  # outputs dict with lists, not tensors

# Right way:
inputs = tokenizer('Hello', return_tensors='pt')
```
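Since the pitfalls mention the input_ids and attention_mask keys, a small plain-Python sketch may help show what the mask means when texts of different lengths are padded to a common length. This is an illustration under stated assumptions, not the library's code: `toy_pad_batch` is a hypothetical helper, and the padding ID of 0 is assumed here (a real tokenizer exposes its own value as `tokenizer.pad_token_id`). The sample IDs are taken from the example output earlier in this guide.

```python
PAD_ID = 0  # assumed padding ID for this sketch; check tokenizer.pad_token_id in practice


def toy_pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every ID list to the longest length and build a matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_pad_batch([[101, 7592, 102], [101, 7592, 1010, 999, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 1010, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Both keys are lists of the same shape, so they should be passed to the model together rather than input_ids alone.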
Quick Reference
| Function/Method | Description |
|---|---|
| AutoTokenizer.from_pretrained(model_name) | Load tokenizer for a specific pretrained model |
| tokenizer(text) | Tokenize input text into tokens and IDs |
| tokenizer(text, return_tensors='pt') | Tokenize and return PyTorch tensors |
| tokenizer.convert_ids_to_tokens(encoded['input_ids']) | Get list of tokens from encoded output |
| encoded['input_ids'] | Get token IDs from encoded output |
Key Takeaways
Always load the tokenizer with AutoTokenizer.from_pretrained() using the model name.
Use tokenizer(text) to convert text into tokens and input IDs for models.
Add return_tensors='pt' or 'tf' when preparing inputs for model inference.
Check tokenizer output keys like input_ids and attention_mask for correct usage.
Avoid common mistakes like skipping from_pretrained() or passing raw strings to models.
