How to Use Hugging Face Tokenizer in NLP: Simple Guide

To use a Hugging Face tokenizer in NLP, first load it with AutoTokenizer.from_pretrained() and a model name. Then call tokenizer() on your text to convert it into input IDs (plus related fields such as the attention mask) that a model can consume.

📐

Syntax

The basic syntax involves importing AutoTokenizer from transformers, loading a tokenizer with from_pretrained(), and then calling the tokenizer on your text.

AutoTokenizer.from_pretrained(model_name): Loads the tokenizer for the specified model.
tokenizer(text): Splits the text into tokens and returns a dictionary-like object with keys such as input_ids and attention_mask.
return_tensors='pt': Optional argument to get output as PyTorch tensors.

python

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encoded_input = tokenizer('Hello, Hugging Face!', return_tensors='pt')
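
To make the effect of return_tensors concrete, here is a minimal sketch that runs the same call with and without it and compares the results. It assumes PyTorch is installed alongside transformers and that the bert-base-uncased files can be downloaded; the variable names plain and as_tensors are only illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Without return_tensors the values are plain Python lists.
plain = tokenizer("Hello, Hugging Face!")

# With return_tensors="pt" the values are PyTorch tensors with a batch dimension.
# (Assumption: torch is installed.)
as_tensors = tokenizer("Hello, Hugging Face!", return_tensors="pt")

print("Keys:", sorted(plain.keys()))
print("Plain type:", type(plain["input_ids"]).__name__)
print("Tensor shape:", tuple(as_tensors["input_ids"].shape))
```

The tensor version has one extra leading dimension (the batch), so a single sentence becomes a tensor with one row.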

💻

Example

This example shows how to load the BERT tokenizer, tokenize a sentence, and print the tokens and their corresponding input IDs.

python

from transformers import AutoTokenizer

# Load tokenizer for BERT
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Text to tokenize
text = "Hello, Hugging Face!"

# Tokenize text
encoded = tokenizer(text)

# Print tokens and input IDs
print('Tokens:', tokenizer.convert_ids_to_tokens(encoded['input_ids']))
print('Input IDs:', encoded['input_ids'])

Output

Tokens: ['[CLS]', 'hello', ',', 'hugging', 'face', '!', '[SEP]']
Input IDs: [101, 7592, 1010, 17662, 2227, 999, 102]
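
As a complement to the example above, the following sketch goes the other way: it lists the tokens without the special [CLS] and [SEP] markers and then decodes the input IDs back into text. It assumes the same bert-base-uncased tokenizer; tokenize() and decode() are standard tokenizer methods, and the exact spacing of the decoded string may vary between library versions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, Hugging Face!"

# tokenize() returns only the word pieces, without [CLS] and [SEP].
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# decode() maps input IDs back to text; skip_special_tokens drops [CLS]/[SEP].
ids = tokenizer(text)["input_ids"]
decoded = tokenizer.decode(ids, skip_special_tokens=True)
print("Decoded:", decoded)
```

Because this model is uncased, the decoded text comes back in lowercase rather than matching the original string exactly.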

⚠️

Common Pitfalls

Common mistakes include:

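Calling AutoTokenizer(...) directly instead of AutoTokenizer.from_pretrained(...), which raises an error because the class is meant to be created only through from_pretrained().
Forgetting return_tensors='pt' (PyTorch) or 'tf' (TensorFlow) when the model expects tensors; without it, tokenizer() returns plain Python lists.
Misreading the output format: tokenizer() returns a dictionary-like object with keys such as input_ids and attention_mask, not a bare list of tokens.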

Always check the tokenizer documentation for your specific model.
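
The first pitfall can be checked safely with a small sketch like the one below. The exact exception type is an assumption and may differ between transformers versions, so the code catches both TypeError and OSError and simply reports what it got.

```python
from transformers import AutoTokenizer

error = None
try:
    # Calling the class directly is not supported; use from_pretrained() instead.
    AutoTokenizer("bert-base-uncased")
except (TypeError, OSError) as err:  # assumption: one of these is raised
    error = err

if error is not None:
    print(f"{type(error).__name__}: {error}")
else:
    print("No error raised")
```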

python

from transformers import AutoTokenizer

# Wrong: Not loading tokenizer properly
# tokenizer = AutoTokenizer('bert-base-uncased')  # This will raise an error

# Right way:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Wrong: Tokenizing without return_tensors when the model expects tensors
# inputs = tokenizer('Hello')  # returns a dict of Python lists, not tensors

# Right way:
inputs = tokenizer('Hello', return_tensors='pt')
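
The attention_mask key mentioned in the pitfalls is easiest to understand with more than one sentence. The sketch below is an illustration only: it assumes PyTorch is installed and uses the padding=True argument so that sentences of different lengths fit into one tensor.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = ["Hello", "Hello, Hugging Face!"]

# padding=True pads the shorter sentence to the length of the longest one.
batch = tokenizer(sentences, padding=True, return_tensors="pt")

print("Shape:", tuple(batch["input_ids"].shape))
# 1 marks a real token, 0 marks padding the model should ignore.
print("Attention mask:")
print(batch["attention_mask"])
```

Passing attention_mask to the model together with input_ids tells it which positions are padding.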

📊

Quick Reference

Function/Method	Description
AutoTokenizer.from_pretrained(model_name)	Load tokenizer for a specific pretrained model
tokenizer(text)	Tokenize text and return a dictionary with input_ids and attention_mask
tokenizer(text, return_tensors='pt')	Tokenize and return PyTorch tensors
tokenizer.convert_ids_to_tokens(encoded['input_ids'])	Get list of tokens from encoded output
encoded['input_ids']	Get token IDs from encoded output

✅

Key Takeaways

Always load the tokenizer with AutoTokenizer.from_pretrained() using the model name.

Use tokenizer(text) to convert text into tokens and input IDs for models.

Add return_tensors='pt' or 'tf' when preparing inputs for model inference.

Check tokenizer output keys like input_ids and attention_mask for correct usage.

Avoid common mistakes like skipping from_pretrained() or omitting return_tensors when the model expects tensors.