
How to Use Hugging Face Tokenizer in NLP: Simple Guide

To use a Hugging Face tokenizer in NLP, first load it with AutoTokenizer.from_pretrained() and a model name. Then call tokenizer() on your text to convert it into input IDs that a model can consume.

Syntax

The basic syntax involves importing AutoTokenizer from transformers, loading a tokenizer with from_pretrained(), and then calling the tokenizer on your text.

  • AutoTokenizer.from_pretrained(model_name): Loads the tokenizer for the specified model.
  • tokenizer(text): Encodes the input text and returns a dictionary-like object with keys such as input_ids and attention_mask.
  • return_tensors='pt': Optional argument that returns the output as PyTorch tensors instead of Python lists.
python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encoded_input = tokenizer('Hello, Hugging Face!', return_tensors='pt')

Example

This example shows how to load the BERT tokenizer, tokenize a sentence, and print the tokens and their corresponding input IDs.

python
from transformers import AutoTokenizer

# Load tokenizer for BERT
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Text to tokenize
text = "Hello, Hugging Face!"

# Tokenize text
encoded = tokenizer(text)

# Print tokens and input IDs
print('Tokens:', tokenizer.convert_ids_to_tokens(encoded['input_ids']))
print('Input IDs:', encoded['input_ids'])
Output
Tokens: ['[CLS]', 'hello', ',', 'hugging', 'face', '!', '[SEP]']
Input IDs: [101, 7592, 1010, 17662, 2224, 999, 102]
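
To see why the output above starts with [CLS], ends with [SEP], and is all lowercase, the steps can be approximated in plain Python. This is a rough sketch, not the library's implementation: TOY_VOCAB, toy_tokenize, and toy_encode are hypothetical names, the IDs are made up, and the subword splitting that the real BERT tokenizer applies to unknown words is left out.

```python
import re

# Hypothetical toy vocabulary. A real checkpoint such as bert-base-uncased
# ships its own, much larger vocabulary with different IDs.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "hello": 4, ",": 5, "hugging": 6, "face": 7, "!": 8}


def toy_tokenize(text):
    """Lowercase the text, then split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Wrap the tokens in [CLS]/[SEP] and map each one to its vocabulary ID."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return tokens, ids


tokens, ids = toy_encode("Hello, Hugging Face!")
print("Tokens:", tokens)
print("Input IDs:", ids)
```

The token list has the same shape as the real output; only the ID values differ because the vocabulary is invented.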

Common Pitfalls

Common mistakes include:

  • Calling AutoTokenizer(...) directly instead of AutoTokenizer.from_pretrained(...), which raises an error.
  • Forgetting to add return_tensors='pt' (PyTorch) or 'tf' (TensorFlow) when passing inputs to models that expect tensors.
  • Misreading the output format: tokenizer() returns a dictionary-like object with keys such as input_ids and attention_mask, not a bare list of tokens.

Always check the tokenizer documentation for your specific model.

python
from transformers import AutoTokenizer

# Wrong: Not loading tokenizer properly
# tokenizer = AutoTokenizer('bert-base-uncased')  # This will raise an error

# Right way:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Wrong: Tokenizing without return_tensors when the model expects tensors
# inputs = tokenizer('Hello')  # returns a dict of Python lists, not tensors

# Right way:
inputs = tokenizer('Hello', return_tensors='pt')
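
The difference between list and tensor output, and the role of attention_mask, can be checked without downloading a checkpoint. The sketch below assumes the transformers package (with NumPy) is installed. It builds a BERT-style tokenizer from a tiny hypothetical vocabulary file, so its IDs will not match bert-base-uncased, and it uses return_tensors='np' in place of 'pt' only to avoid requiring PyTorch.

```python
import os
import tempfile

from transformers import BertTokenizer

# Hypothetical toy vocabulary: one token per line, line number = token ID.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
             "hello", ",", "hugging", "face", "!"]

with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "vocab.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(toy_vocab) + "\n")
    # Load from the local directory; no network access is needed.
    tokenizer = BertTokenizer.from_pretrained(tmp_dir, do_lower_case=True)

# Without return_tensors, the values are plain Python lists.
as_lists = tokenizer("Hello, Hugging Face!")
print(sorted(as_lists.keys()))
print(type(as_lists["input_ids"]))

# With return_tensors, each value is an array with a leading batch dimension.
as_arrays = tokenizer("Hello, Hugging Face!", return_tensors="np")
print(as_arrays["input_ids"].shape)

# Padding a batch: attention_mask is 1 for real tokens and 0 for padding.
batch = tokenizer(["Hello", "Hello, Hugging Face!"], padding=True)
print(batch["attention_mask"])
```

With a real checkpoint, the same calls apply after loading through AutoTokenizer.from_pretrained() as shown earlier.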

Quick Reference

Function/Method                                          Description
AutoTokenizer.from_pretrained(model_name)                Load tokenizer for a specific pretrained model
tokenizer(text)                                          Tokenize input text into tokens and IDs
tokenizer(text, return_tensors='pt')                     Tokenize and return PyTorch tensors
tokenizer.convert_ids_to_tokens(encoded['input_ids'])    Get list of tokens from encoded output
encoded['input_ids']                                     Get token IDs from encoded output

Key Takeaways

  • Always load the tokenizer with AutoTokenizer.from_pretrained() using the model name.
  • Use tokenizer(text) to convert text into tokens and input IDs for models.
  • Add return_tensors='pt' or 'tf' when preparing inputs for model inference.
  • Check tokenizer output keys like input_ids and attention_mask for correct usage.
  • Avoid common mistakes like skipping from_pretrained() or passing raw strings to models.