How to Use AutoTokenizer from Hugging Face in NLP

Use AutoTokenizer.from_pretrained() to load the tokenizer that matches any Hugging Face model by passing the model's name. Then call tokenizer() on your text to convert it into token IDs ready for model input.

📐

Syntax

The basic syntax to use AutoTokenizer is:

AutoTokenizer.from_pretrained(model_name): Loads the tokenizer for the specified model.
tokenizer(text): Encodes the input text and returns a dictionary-like object containing input_ids and attention_mask (plus token_type_ids for BERT-style models).

This lets you easily switch between models without changing tokenization code.

python

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer('Hello world!')
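
Because the model name appears only in the loading call, swapping checkpoints is a one-line change. The following is a minimal sketch of that idea. It assumes the `gpt2` checkpoint (not mentioned above, chosen only for contrast) can be downloaded from the Hugging Face Hub, and the variable names are illustrative.

```python
from transformers import AutoTokenizer

# 'gpt2' is an illustrative assumption; any Hub checkpoint name works here.
model_names = ['bert-base-uncased', 'gpt2']

text = 'Hello world!'
results = {}
for name in model_names:
    # AutoTokenizer selects the tokenizer class that matches the checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(name)
    encoding = tokenizer(text)
    # Store the readable tokens so the two tokenizations can be compared.
    results[name] = tokenizer.convert_ids_to_tokens(encoding['input_ids'])
    print(name, '->', results[name])
```

The two checkpoints split the text differently (WordPiece with [CLS] and [SEP] for BERT, byte-level BPE for GPT-2), yet the calling code is identical.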

💻

Example

This example shows how to load the BERT tokenizer and tokenize a sentence. It prints the token IDs and tokens.

python

from transformers import AutoTokenizer

# Load tokenizer for BERT base uncased
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize a sample sentence
text = 'Hello, Hugging Face!'
tokens = tokenizer(text)

print('Token IDs:', tokens['input_ids'])
print('Tokens:', tokenizer.convert_ids_to_tokens(tokens['input_ids']))

Output

Token IDs: [101, 7592, 1010, 17662, 2227, 999, 102]
Tokens: ['[CLS]', 'hello', ',', 'hugging', 'face', '!', '[SEP]']
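
To see which tokens come from the text and which are added by the tokenizer, it can help to encode the same sentence with and without special tokens and then decode the IDs back to text. The following is a small sketch using the `add_special_tokens` and `skip_special_tokens` parameters of the Transformers tokenizer API; neither is covered above, and the variable names are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'Hello, Hugging Face!'

# Default call: [CLS] and [SEP] are added around the sentence.
with_special = tokenizer(text)['input_ids']

# add_special_tokens=False keeps only the tokens that come from the text.
without_special = tokenizer(text, add_special_tokens=False)['input_ids']

print(tokenizer.convert_ids_to_tokens(with_special))
print(tokenizer.convert_ids_to_tokens(without_special))

# decode() maps IDs back to a string; skip_special_tokens drops [CLS]/[SEP].
decoded = tokenizer.decode(with_special, skip_special_tokens=True)
print(decoded)
```

Keeping the two ID lists side by side makes it easy to check how many positions the special tokens occupy, which matters when counting tokens against a model's length limit.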

⚠️

Common Pitfalls

Common mistakes include:

Not calling from_pretrained() and trying to instantiate AutoTokenizer() directly.
Passing raw text to the model instead of tokenized inputs.
Ignoring special tokens such as [CLS] and [SEP], which the tokenizer adds automatically.
Omitting return_tensors='pt' (PyTorch) or return_tensors='tf' (TensorFlow) when the output is fed to a model; without it, the tokenizer returns plain Python lists.

python

from transformers import AutoTokenizer

# Wrong: Instantiating without from_pretrained
# tokenizer = AutoTokenizer()  # This will raise an error

# Right way:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
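
The last two pitfalls can be illustrated together. The sketch below tokenizes a small batch and requests PyTorch tensors. It assumes PyTorch is installed, and the `padding` and `truncation` arguments (not covered above) are included because sentences of different lengths cannot be stacked into one tensor without them. The second sentence is made up for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

sentences = ['Hello, Hugging Face!', 'Tokenizers turn text into numbers.']

# padding=True pads every sentence to the longest one in the batch;
# truncation=True cuts inputs that exceed the model's maximum length.
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Both tensors have shape (batch_size, sequence_length).
print(batch['input_ids'].shape)
# The attention mask holds 1 for real tokens and 0 for padding.
print(batch['attention_mask'])
```

The returned object can be unpacked directly into a PyTorch model call, for example `model(**batch)`, so the model receives tensors instead of raw text.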

📊

Quick Reference

Method	Description
from_pretrained(model_name)	Load tokenizer for a specific model
tokenizer(text)	Encode input text into token IDs and an attention mask
convert_ids_to_tokens(ids)	Convert token IDs back to readable tokens
tokenizer(text, return_tensors='pt')	Return PyTorch tensors for model input
tokenizer(text, return_tensors='tf')	Return TensorFlow tensors for model input

✅

Key Takeaways

Always load the tokenizer with AutoTokenizer.from_pretrained(model_name) before use.

Tokenize text by calling tokenizer(text) to get token IDs and attention masks.

Use return_tensors='pt' or 'tf' to get tensors compatible with your model framework.

Special tokens like [CLS] and [SEP] are added automatically by the tokenizer.

Do not call AutoTokenizer() directly; the class is meant to be used only through from_pretrained(), and instantiating it raises an error.