
How to Use AutoTokenizer from Hugging Face in NLP

Use AutoTokenizer.from_pretrained() to load the tokenizer that matches a Hugging Face model by passing the model's name. Then call tokenizer() on your text to convert it into token IDs (plus an attention mask) ready for model input.

Syntax

The basic syntax to use AutoTokenizer is:

  • AutoTokenizer.from_pretrained(model_name): Loads the tokenizer for the specified model.
  • tokenizer(text): Tokenizes the input text and returns a dictionary-like object containing input_ids (the token IDs) and attention_mask.

This lets you easily switch between models without changing tokenization code.

python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer('Hello world!')
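
As a rough sketch of the point about switching models, the loop below runs the same tokenization code against two checkpoints. The choice of 'gpt2' as the second model is an assumption made for illustration; it is not part of the original example. Each checkpoint is downloaded from the Hugging Face Hub on first use.

```python
from transformers import AutoTokenizer

# Model names are illustrative; 'gpt2' is an assumed second checkpoint.
model_names = ['bert-base-uncased', 'gpt2']

text = 'Hello world!'
encodings = {}

for name in model_names:
    # AutoTokenizer picks the tokenizer class that matches the checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(name)
    encodings[name] = tokenizer(text)
    # Print the concrete tokenizer class and the IDs it produced.
    print(name, type(tokenizer).__name__, encodings[name]['input_ids'])
```

Only the model name changes between iterations. The token IDs differ because each model has its own vocabulary, which is why a tokenizer should always be loaded from the same checkpoint as the model it feeds.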

Example

This example shows how to load the BERT tokenizer and tokenize a sentence. It prints the token IDs and tokens.

python
from transformers import AutoTokenizer

# Load tokenizer for BERT base uncased
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize a sample sentence
text = 'Hello, Hugging Face!'
tokens = tokenizer(text)

print('Token IDs:', tokens['input_ids'])
print('Tokens:', tokenizer.convert_ids_to_tokens(tokens['input_ids']))
Output
Token IDs: [101, 7592, 1010, 17662, 2224, 999, 102]
Tokens: ['[CLS]', 'hello', ',', 'hugging', 'face', '!', '[SEP]']
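
The [CLS] and [SEP] entries come from the tokenizer, not from the input text. A minimal sketch, assuming the same bert-base-uncased checkpoint, compares the default call with one that passes add_special_tokens=False:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'Hello, Hugging Face!'

# Default behaviour: special tokens are added around the sentence.
with_special = tokenizer(text)['input_ids']

# add_special_tokens=False keeps only the tokens from the text itself.
without_special = tokenizer(text, add_special_tokens=False)['input_ids']

print(tokenizer.convert_ids_to_tokens(with_special))
print(tokenizer.convert_ids_to_tokens(without_special))

# decode() turns a list of IDs back into a string.
print(tokenizer.decode(without_special))
```

For BERT-style models the default is usually what you want, since the model was trained with these markers in place.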

Common Pitfalls

Common mistakes include:

  • Not calling from_pretrained() and trying to instantiate AutoTokenizer() directly.
  • Passing raw text to the model instead of tokenized inputs.
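  • Forgetting that special tokens such as [CLS] and [SEP] are added automatically, so the ID list is longer than the number of words in the text.
  • Omitting return_tensors='pt' (PyTorch) or return_tensors='tf' (TensorFlow) when feeding the output to a model; without it, the tokenizer returns plain Python lists.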
python
from transformers import AutoTokenizer

# Wrong: Instantiating without from_pretrained
# tokenizer = AutoTokenizer()  # This will raise an error

# Right way:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
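
The return_tensors pitfall can be sketched as follows. This assumes PyTorch is installed; the sentences are made up for illustration, and the final model call is left as a comment because no model is loaded here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Two illustrative sentences of different lengths.
sentences = ['Hello world!', 'Tokenizers turn text into model inputs.']

# padding=True pads the shorter sentence so both rows have the same length;
# return_tensors='pt' returns PyTorch tensors instead of Python lists.
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

print(inputs['input_ids'].shape)   # (batch size, padded sequence length)
print(inputs['attention_mask'])    # 1 for real tokens, 0 for padding

# A PyTorch model would take these tensors as keyword arguments, e.g.
# outputs = model(**inputs)  # `model` is hypothetical and not defined in this sketch
```

Padding matters for batches: tensors must be rectangular, so sentences of different lengths cannot be stacked without it.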

Quick Reference

Method | Description
from_pretrained(model_name) | Load the tokenizer for a specific model
tokenizer(text) | Tokenize input text into token IDs and an attention mask
convert_ids_to_tokens(ids) | Convert token IDs back to readable tokens
tokenizer(text, return_tensors='pt') | Return PyTorch tensors for model input
tokenizer(text, return_tensors='tf') | Return TensorFlow tensors for model input

Key Takeaways

  • Always load the tokenizer with AutoTokenizer.from_pretrained(model_name) before use.
  • Call tokenizer(text) to get token IDs and an attention mask.
  • Pass return_tensors='pt' or return_tensors='tf' to get tensors compatible with your model framework.
  • Special tokens like [CLS] and [SEP] are added automatically by the tokenizer.
  • Do not instantiate AutoTokenizer() directly; calling it without from_pretrained raises an error.