NLP Program to Translate Text Using Hugging Face Transformers
Use the Hugging Face transformers library with a pretrained translation model such as Helsinki-NLP/opus-mt-en-fr, and call model.generate() on the tokenized input to translate text, e.g., translator('Hello, how are you?').
Examples
How to Think About It
Algorithm
Code
    from transformers import MarianMTModel, MarianTokenizer

    # Pretrained English-to-French Marian model from the Hugging Face Hub
    model_name = 'Helsinki-NLP/opus-mt-en-fr'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translator(text):
        if not text:
            return ''
        # Convert the text into PyTorch tensors of token IDs
        tokens = tokenizer(text, return_tensors='pt', padding=True)
        # Generate the token IDs of the translation
        translated_tokens = model.generate(**tokens)
        # Decode the first (and only) generated sequence back into text
        translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
        return translated_text

    print(translator('Hello, how are you?'))
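The translator function above handles one string per call and decodes only the first generated sequence. As a minimal sketch of translating several sentences in one generate() call, the helper below takes the tokenizer and model as arguments; the names load_en_fr and translate_batch are illustrative and not part of the transformers library.

```python
from typing import Any, List


def load_en_fr(model_name: str = "Helsinki-NLP/opus-mt-en-fr"):
    """Load the tokenizer and model (downloads the weights on first use)."""
    # Imported lazily so translate_batch can be used without loading the
    # library until a real model is actually needed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return tokenizer, model


def translate_batch(texts: List[str], tokenizer: Any, model: Any) -> List[str]:
    """Translate a list of strings with a single generate() call."""
    if not texts:
        return []
    # padding=True pads shorter sentences so the batch forms one tensor;
    # truncation=True clips inputs longer than the model's maximum length.
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    # batch_decode returns one decoded string per row of generated IDs.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

A typical call would be tokenizer, model = load_en_fr() followed by translate_batch(['Hello, how are you?', 'Good morning.'], tokenizer, model). Batching usually amortizes per-call overhead, at the cost of padding shorter sentences to the longest one in the batch.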
Dry Run
Let's trace the translation of 'Hello, how are you?' through the code.
Input Text
'Hello, how are you?'
Tokenization
tokens = tokenizer('Hello, how are you?', return_tensors='pt', padding=True) produces a tensor of input IDs, together with an attention mask
Model Generation
translated_tokens = model.generate(**tokens) produces a tensor of output token IDs
Decoding
translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True) results in 'Bonjour, comment ça va ?'
| Step | Operation | Value |
|---|---|---|
| 1 | Input Text | Hello, how are you? |
| 2 | Token IDs (illustrative) | [[ 71, 232, 389, 345, 30]] |
| 3 | Generated Token IDs (illustrative) | [[ 56, 12, 345, 45, 67, 2]] |
| 4 | Output Text | Bonjour, comment ça va ? |
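To reproduce the dry run with real values, the intermediate results can be collected in a dictionary. This is a sketch that assumes the tokenizer and model objects from the Code section are passed in; trace_translation is a hypothetical helper name, and the actual IDs depend on the model's vocabulary.

```python
from typing import Any, Dict


def trace_translation(text: str, tokenizer: Any, model: Any) -> Dict[str, Any]:
    """Return the value produced at each step of the dry run."""
    encoded = tokenizer(text, return_tensors="pt", padding=True)
    generated = model.generate(**encoded)
    return {
        "input_text": text,
        # .tolist() turns a tensor into nested Python lists for printing.
        "input_ids": encoded["input_ids"].tolist(),
        "generated_ids": generated.tolist(),
        "output_text": tokenizer.decode(generated[0], skip_special_tokens=True),
    }
```

Printing the result of trace_translation('Hello, how are you?', tokenizer, model) shows the four rows of the table with the IDs the model actually uses. tokenizer.convert_ids_to_tokens() can additionally map each ID back to its subword piece.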
Why This Works
Step 1: Load Pretrained Model
We use a pretrained model that already knows how to translate English to French, so we don't need to train from scratch.
Step 2: Tokenize Input
The tokenizer splits the input text into subword tokens and maps each one to a numeric ID that the model can process.
Step 3: Generate Translation
The model generates the token IDs of the translation one at a time, each conditioned on the encoded input and on the tokens generated so far.
Step 4: Decode Output
We convert the output tokens back into readable text to get the final translation.
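The generation step can be pictured as a loop that keeps asking for the next token until an end-of-sequence token appears. The sketch below is a simplified greedy version of that loop, not what model.generate() does internally: a real model also conditions on the encoded input and may use beam search. The next_token callback and scripted_model are hypothetical stand-ins for a neural model, and the IDs are made up.

```python
from typing import Callable, List


def greedy_decode(
    next_token: Callable[[List[int]], int],
    start_id: int,
    eos_id: int,
    max_new_tokens: int = 20,
) -> List[int]:
    """Generate token IDs one at a time until EOS or the length limit."""
    output = [start_id]
    for _ in range(max_new_tokens):
        # A real model would score the whole vocabulary here and pick the best ID.
        token = next_token(output)
        output.append(token)
        if token == eos_id:
            break
    return output


def scripted_model(prefix: List[int]) -> int:
    """Hypothetical stand-in for a model: replays a fixed script, then emits EOS (2)."""
    script = [56, 12, 345, 2]
    return script[len(prefix) - 1] if len(prefix) <= len(script) else 2


print(greedy_decode(scripted_model, start_id=0, eos_id=2))  # [0, 56, 12, 345, 2]
```

The max_new_tokens limit matters in practice: without it, a model that never emits the end-of-sequence token would loop forever.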
Alternative Approaches
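    # Alternative 1: googletrans, an unofficial Google Translate client (needs network access).
    # Its API differs between releases; in some newer versions translate() is a
    # coroutine and must be awaited, so check the installed version.
    from googletrans import Translator

    translator = Translator()
    translated = translator.translate('Hello, how are you?', dest='fr')
    print(translated.text)

    # Alternative 2: fairseq.
    # 'path_to_fairseq_model' is a placeholder for a directory holding a trained checkpoint.
    import torch
    from fairseq.models.transformer import TransformerModel

    model = TransformerModel.from_pretrained('path_to_fairseq_model')
    print(model.translate('Hello, how are you?'))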
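Complexity: approximately O(n) time, O(n) space
Time Complexity
For short inputs, translation time grows roughly linearly with the number of tokens, mainly because the decoder produces output tokens one at a time. This is an approximation: self-attention compares every token with every other token, so its cost grows quadratically on long sequences.
Space Complexity
Memory for the tokenized input and output is proportional to sequence length. The model weights add a fixed cost that does not depend on the input, and attention needs working memory that also grows with sequence length.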
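Rather than relying on the asymptotic estimate, the scaling can be measured directly. The sketch below times any translation function on inputs of different lengths; time_by_length is a hypothetical helper, and the numbers it returns depend entirely on the hardware and the model.

```python
import time
from typing import Callable, List, Tuple


def time_by_length(
    translate: Callable[[str], str], texts: List[str]
) -> List[Tuple[int, float]]:
    """Return (word count, elapsed seconds) for each input text."""
    results = []
    for text in texts:
        start = time.perf_counter()
        translate(text)
        elapsed = time.perf_counter() - start
        results.append((len(text.split()), elapsed))
    return results
```

For example, time_by_length(translator, ['Hello. ' * k for k in (1, 10, 50)]) gives three data points for the translator function defined in the Code section. Inputs longer than the model's maximum length would need truncation or splitting first.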
Which Approach is Fastest?
Running a pretrained transformer model locally is fast for short texts once the model has been downloaded and loaded; online APIs add network delay but need no local model or compute.
| Approach | Time | Space | Best For |
|---|---|---|---|
| Hugging Face Transformers | O(n) | O(n) | Offline, customizable translation |
| Googletrans API | O(n) + network delay | O(1) | Quick online translation with minimal setup |
| Fairseq Models | O(n) | O(n) | Research and custom model use |
