
NLP Program to Translate Text Using Hugging Face Transformers

Use Hugging Face's transformers library with a pretrained translation model such as Helsinki-NLP/opus-mt-en-fr: tokenize the input, call model.generate() on the tokens, and decode the result. Wrapped in a small helper function, the call looks like translator('Hello, how are you?').
📋

Examples

Input: Hello, how are you?
Output: Bonjour, comment ça va ?
Input: This is a simple test.
Output: Ceci est un test simple.
Input: (empty string)
Output: (empty string)
🧠

How to Think About It

To translate text using NLP, first choose a pretrained translation model that knows both source and target languages. Then, convert the input text into tokens the model understands, run the model to generate translated tokens, and finally convert those tokens back into readable text.
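The library also bundles these stages behind its pipeline helper, which can be handy for a first experiment. The sketch below is a minimal example under that assumption; build_translator and first_translation are hypothetical helper names introduced here, and the model is only downloaded when build_translator is actually called.

```python
def first_translation(pipeline_output):
    """Pull the translated string out of the pipeline's list-of-dicts output."""
    if not pipeline_output:
        return ""
    return pipeline_output[0]["translation_text"]


def build_translator(model_name="Helsinki-NLP/opus-mt-en-fr"):
    """Return a function that translates one string with the given model."""
    # Imported lazily so that defining this helper does not require the
    # library; calling it downloads the model weights on first use.
    from transformers import pipeline

    pipe = pipeline("translation", model=model_name)

    def translate(text):
        # The pipeline returns a list with one dict per input string.
        return first_translation(pipe(text))

    return translate


# Usage (needs the transformers package and a network connection):
# translate = build_translator()
# print(translate("Hello, how are you?"))
```

The pipeline hides tokenization and decoding, so it is convenient for quick checks, while the explicit tokenizer and model calls give more control over batching and generation settings.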
📐

Algorithm

1. Get the input text to translate.
2. Load a pretrained translation model and tokenizer.
3. Tokenize the input text into model-readable format.
4. Use the model to generate translated tokens.
5. Decode the translated tokens back to text.
6. Return the translated text.
💻

Code

python
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

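def translator(text):
    # Return early for empty input instead of running the model on nothing.
    if not text:
        return ''
    # Convert the text to PyTorch tensors of token IDs.
    tokens = tokenizer(text, return_tensors='pt', padding=True)
    # Generate the token IDs of the French translation.
    translated_tokens = model.generate(**tokens)
    # Decode the first (and only) sequence, dropping markers such as </s> and <pad>.
    translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return translated_text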

print(translator('Hello, how are you?'))
Output
Bonjour, comment ça va ?
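The padding=True argument only matters once several sentences are translated together. The following is a rough sketch of a batched variant, not part of the original program: translate_batch is a hypothetical helper, the tokenizer and model are passed in as arguments (for example the MarianTokenizer and MarianMTModel loaded above), and max_length=512 is an assumed limit that should be checked against the model in use.

```python
def translate_batch(texts, tokenizer, model, max_length=512):
    """Translate a list of strings with one generate() call; blanks stay blank."""
    # Only non-blank strings are sent to the model.
    positions = [i for i, t in enumerate(texts) if t and t.strip()]
    results = [""] * len(texts)
    if not positions:
        return results

    batch = tokenizer(
        [texts[i] for i in positions],
        return_tensors="pt",
        padding=True,           # pad shorter sentences to the longest one
        truncation=True,        # cut inputs longer than max_length
        max_length=max_length,  # assumed limit, not taken from the article
    )
    generated = model.generate(**batch)
    decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)

    # Put each translation back at the position of its source sentence.
    for i, translation in zip(positions, decoded):
        results[i] = translation
    return results


# Usage with the objects loaded above:
# print(translate_batch(["Hello, how are you?", "This is a simple test."],
#                       tokenizer, model))
```

Batching amortizes the per-call overhead, at the price of padding every sentence to the length of the longest one in the batch.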
🔍

Dry Run

Let's trace translating 'Hello, how are you?' through the code.

1. Input Text: 'Hello, how are you?'
2. Tokenization: tokens = tokenizer('Hello, how are you?', return_tensors='pt', padding=True) produces a tensor of input IDs.
3. Model Generation: translated_tokens = model.generate(**tokens) produces a tensor of output token IDs.
4. Decoding: translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True) returns 'Bonjour, comment ça va ?'.

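| Step | Operation | Value |
| --- | --- | --- |
| 1 | Input Text | Hello, how are you? |
| 2 | Token IDs | [[ 71, 232, 389, 345, 30]] |
| 3 | Generated Token IDs | [[ 56, 12, 345, 45, 67, 2]] |
| 4 | Output Text | Bonjour, comment ça va ? |

The token IDs in steps 2 and 3 are illustrative; the actual values depend on the model's vocabulary.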
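To see the real IDs for a given model rather than sample values, the tokenizer itself can be asked. The helper below is a small sketch; describe_encoding is a hypothetical name, and it relies only on calling the tokenizer and on its convert_ids_to_tokens method.

```python
def describe_encoding(tokenizer, text):
    """Return (token_id, token_string) pairs for text, special tokens included."""
    # Without return_tensors, input_ids is a plain list of integers.
    ids = tokenizer(text)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(ids)
    return list(zip(ids, tokens))


# Usage (downloads the tokenizer files on first run):
# from transformers import MarianTokenizer
# tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
# for token_id, token in describe_encoding(tok, "Hello, how are you?"):
#     print(token_id, token)
```

Printing the pairs side by side makes it easy to spot the end-of-sequence marker the tokenizer appends to the input.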
💡

Why This Works

Step 1: Load Pretrained Model

We use a pretrained model that already knows how to translate English to French, so we don't need to train from scratch.

Step 2: Tokenize Input

The tokenizer splits the input text into subword tokens and maps each one to an integer ID that the model can process.

Step 3: Generate Translation

The model reads the input token IDs and generates the token IDs of the translation one at a time until it emits an end-of-sequence marker.

Step 4: Decode Output

We convert the output tokens back into readable text to get the final translation.

🔄

Alternative Approaches

Using Googletrans Library
python
from googletrans import Translator
translator = Translator()
translated = translator.translate('Hello, how are you?', dest='fr')
print(translated.text)
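googletrans is an unofficial client for the Google Translate web service. It is easy to set up but needs an internet connection, may be rate-limited, and can break when the service changes. Depending on the installed version, translate() may be asynchronous and need to be awaited.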
Using Fairseq Pretrained Models
python
import torch
from fairseq.models.transformer import TransformerModel
model = TransformerModel.from_pretrained('path_to_fairseq_model')
print(model.translate('Hello, how are you?'))
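Fairseq offers powerful models but requires more setup: a local checkpoint together with the dictionary and tokenization files it was trained with. Here 'path_to_fairseq_model' is a placeholder for that directory, and extra arguments such as the checkpoint file name may be needed.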

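Complexity: roughly O(n²) time and space in sequence length

Time Complexity

For n tokens, self-attention compares every token with every other token, so the cost of each layer grows quadratically rather than linearly. The encoder reads the whole input in parallel; the decoder then produces the output one token at a time, so latency also grows with the length of the translation. For short sentences the difference from linear growth is barely noticeable.

Space Complexity

Memory is dominated by the model weights, which are fixed. On top of that, the token tensors grow linearly with n, and a standard attention implementation stores score matrices that grow quadratically with n.

Which Approach is Fastest?

A local pretrained transformer model is fast for short texts once it is loaded. Online APIs add network delay but need no local compute or model download.

| Approach | Time | Local memory | Best For |
| --- | --- | --- | --- |
| Hugging Face Transformers | Roughly O(n²) in sequence length | Model weights plus activations | Offline, customizable translation |
| Googletrans API | Server-side compute plus network delay | Minimal (no local model) | Quick online translation with minimal setup |
| Fairseq Models | Roughly O(n²) in sequence length | Model weights plus activations | Research and custom model use |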
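Asymptotic estimates only go so far; on a particular machine it is usually more informative to measure. The harness below is a minimal sketch using the standard library; time_translation and compare_lengths are hypothetical names, and translate_fn stands for any of the translation functions shown in this article.

```python
import statistics
import time


def time_translation(translate_fn, text, repeats=5):
    """Return the median wall-clock seconds translate_fn takes on text."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        translate_fn(text)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def compare_lengths(translate_fn, sentence, multipliers=(1, 2, 4, 8)):
    """Time inputs built by repeating one sentence, to see how latency scales."""
    return {
        m: time_translation(translate_fn, " ".join([sentence] * m))
        for m in multipliers
    }


# Usage with the translator function defined earlier:
# for m, seconds in compare_lengths(translator, "This is a simple test.").items():
#     print(f"{m}x input: {seconds:.3f} s")
```

The median is used instead of the mean so that a single slow first call, for example while weights are still being loaded into memory, does not dominate the result.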
💡
Always check if your input text is empty before translating to avoid errors.
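A slightly stricter guard than a plain truthiness check can also catch None and whitespace-only strings. The wrapper below is one possible sketch; safe_translate is a hypothetical helper, and translate_fn is whatever translation function is being protected.

```python
def safe_translate(text, translate_fn):
    """Call translate_fn only for non-blank strings; return '' otherwise."""
    if text is None:
        return ""
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    if not text.strip():
        # Whitespace-only input has nothing to translate.
        return ""
    return translate_fn(text)


# Usage with the translator function defined earlier:
# print(safe_translate("   ", translator))                  # prints an empty line
# print(safe_translate("Hello, how are you?", translator))
```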
⚠️
Beginners often forget to pass skip_special_tokens=True when decoding, which leaves markers such as <pad> and </s> in the output.
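The effect is easy to see with a toy decoder. The sketch below does not use the real tokenizer: the vocabulary and IDs are invented purely for illustration, and toy_decode only mimics the role that the skip_special_tokens flag plays in tokenizer.decode.

```python
# Toy vocabulary; these IDs are invented for illustration only.
TOY_VOCAB = {
    0: "</s>", 1: "<pad>", 2: "Bonjour", 3: ",",
    4: "comment", 5: "ça", 6: "va", 7: "?",
}
SPECIAL_IDS = {0, 1}


def toy_decode(ids, skip_special_tokens=False):
    """Join the tokens for ids, optionally leaving out the special markers."""
    pieces = [
        TOY_VOCAB[i]
        for i in ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return " ".join(pieces)


# A generated sequence that starts with <pad> and ends with </s>.
generated = [1, 2, 3, 4, 5, 6, 7, 0]
print(toy_decode(generated))
# <pad> Bonjour , comment ça va ? </s>
print(toy_decode(generated, skip_special_tokens=True))
# Bonjour , comment ça va ?
```

The real tokenizer also merges subword pieces and fixes spacing around punctuation, which this toy version leaves out.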