
NLP Program to Translate Text Using Hugging Face Transformers

Use Hugging Face's transformers library with a pretrained translation model such as Helsinki-NLP/opus-mt-en-fr: tokenize the input, call model.generate() on the tokens, and decode the result. Wrapped in a small helper function, a translation is then a single call, e.g., translator('Hello, how are you?').

📋

Examples

Input: Hello, how are you?

Output: Bonjour, comment ça va ?

Input: This is a simple test.

Output: Ceci est un test simple.

Input: (empty string)

Output: (empty string)

🧠

How to Think About It

To translate text using NLP, first choose a pretrained translation model trained on your source and target language pair (here, English to French). Then convert the input text into tokens the model understands, run the model to generate translated tokens, and finally convert those tokens back into readable text.

📐

Algorithm

Get the input text to translate.

Load a pretrained translation model and tokenizer.

Tokenize the input text into model-readable format.

Use the model to generate translated tokens.

Decode the translated tokens back to text.

Return the translated text.

💻

Code

python

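# Requires: pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

# English-to-French Marian model published by Helsinki-NLP
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translator(text):
    # Nothing to translate: return early instead of running the model
    if not text:
        return ''
    # Encode the text as PyTorch tensors (input IDs and attention mask)
    tokens = tokenizer(text, return_tensors='pt', padding=True)
    # Generate the token IDs of the translation
    translated_tokens = model.generate(**tokens)
    # Decode the first (and only) sequence, dropping markers such as </s>
    translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return translated_text

print(translator('Hello, how are you?'))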

Output

Bonjour, comment ça va ?
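The padding=True argument only matters when several sentences are tokenized together, because shorter sentences must be padded to the length of the longest one. The following is a minimal sketch of batch translation under that assumption. translate_batch is a hypothetical helper name, and tokenizer and model are assumed to be the objects loaded in the code above; truncation=True is an added assumption so that overly long inputs are cut rather than rejected.

```python
def translate_batch(texts, tokenizer, model):
    """Translate a list of strings with one generate() call, keeping order."""
    results = [""] * len(texts)
    # Remember where the non-empty inputs sit; empty ones stay as ''.
    positions = [i for i, t in enumerate(texts) if t and t.strip()]
    if not positions:
        return results
    # Tokenize all non-empty sentences together; padding aligns their lengths.
    batch = tokenizer(
        [texts[i] for i in positions],
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    generated = model.generate(**batch)
    # batch_decode turns every generated row back into a string.
    decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
    for i, translation in zip(positions, decoded):
        results[i] = translation
    return results
```

With the real model loaded, a call would look like translate_batch(['Hello, how are you?', 'This is a simple test.'], tokenizer, model).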

🔍

Dry Run

Let's trace the translation of 'Hello, how are you?' through the code.

Input Text

'Hello, how are you?'

Tokenization

tokens = tokenizer('Hello, how are you?', return_tensors='pt', padding=True) produces a tensor of input IDs together with an attention mask.

Model Generation

translated_tokens = model.generate(**tokens) produces a tensor of output token IDs.

Decoding

translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True) results in 'Bonjour, comment ça va ?'

Step	Operation	Value
1	Input Text	Hello, how are you?
2	Token IDs (illustrative values)	[[71, 232, 389, 345, 30]]
3	Generated Token IDs (illustrative values)	[[56, 12, 345, 45, 67, 2]]
4	Output Text	Bonjour, comment ça va ?
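The actual token IDs depend on the tokenizer's vocabulary, so it can help to print the real intermediate values yourself. The following is a minimal sketch, assuming tokenizer and model are the objects loaded in the Code section; trace_translation is a hypothetical helper name, not part of the library.

```python
def trace_translation(text, tokenizer, model):
    """Return each intermediate value of one translation as a dict."""
    # Step 2: text -> input IDs (first and only row of the batch)
    encoded = tokenizer(text, return_tensors="pt")
    input_ids = [int(i) for i in encoded["input_ids"][0]]
    # Step 3: input IDs -> generated IDs
    generated = model.generate(**encoded)
    generated_ids = [int(i) for i in generated[0]]
    return {
        "input_text": text,
        "input_ids": input_ids,
        # Subword pieces behind each input ID
        "input_tokens": tokenizer.convert_ids_to_tokens(input_ids),
        "generated_ids": generated_ids,
        # Step 4: generated IDs -> readable text
        "output_text": tokenizer.decode(generated_ids, skip_special_tokens=True),
    }
```

Printing the returned dict for 'Hello, how are you?' shows the same four steps as the table, with the values your installed model really produces.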

💡

Why This Works

Step 1: Load Pretrained Model

We use a pretrained model that already knows how to translate English to French, so we don't need to train from scratch.

Step 2: Tokenize Input

The tokenizer converts the input text into numeric token IDs that the model can process.

Step 3: Generate Translation

The model generates the translated token IDs one at a time, each conditioned on the input tokens and on the tokens generated so far.

Step 4: Decode Output

We convert the output tokens back into readable text to get the final translation.
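The four steps above can also be bundled into a single call with the library's pipeline helper, which handles tokenizing, generating, and decoding internally. The following is a minimal sketch of that option; build_pipeline_translator and first_translation are hypothetical helper names introduced here for illustration, and the import is placed inside the function only so that the small parsing helper can be used on its own.

```python
def first_translation(pipeline_output):
    """Pull the translated string out of a translation pipeline result.

    The pipeline returns a list with one dict per input, each holding
    the text under the 'translation_text' key.
    """
    return pipeline_output[0]["translation_text"]


def build_pipeline_translator(model_name="Helsinki-NLP/opus-mt-en-fr"):
    """Return a translate(text) function backed by a translation pipeline."""
    from transformers import pipeline  # loaded lazily; downloads the model on first use

    pipe = pipeline("translation", model=model_name)

    def translate(text):
        if not text:
            return ""
        return first_translation(pipe(text))

    return translate
```

Usage would be translate = build_pipeline_translator() followed by translate('Hello, how are you?'). The explicit tokenizer and model version shown earlier gives more control over each step; the pipeline version is shorter.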

🔄

Alternative Approaches

Using Googletrans Library

python

from googletrans import Translator
translator = Translator()
translated = translator.translate('Hello, how are you?', dest='fr')
print(translated.text)

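Googletrans is an unofficial client for the Google Translate web service. It is easy to use, but it requires internet access, may be rate limited, and can break when the service changes. Depending on the installed version, translate() may be a coroutine that has to be awaited.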
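Because an online translator can fail when the network or the service is unavailable, one option is to fall back to a local model. The following is a minimal sketch of that idea; translate_with_fallback is a hypothetical helper, and primary and fallback are assumed to be any functions that take a string and return a string.

```python
def translate_with_fallback(text, primary, fallback):
    """Try the primary translator; use the fallback if it raises."""
    if not text:
        return ""
    try:
        return primary(text)
    except Exception:
        # Broad catch on purpose in this sketch: network errors, rate
        # limits, and parsing failures all lead to the same fallback.
        return fallback(text)
```

For example, primary could wrap the Googletrans call and fallback could be the local translator function built on Hugging Face Transformers.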

Using Fairseq Pretrained Models

python

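from fairseq.models.transformer import TransformerModel

# 'path_to_fairseq_model' is a placeholder for a local directory that holds
# the checkpoint (model.pt by default) and the model's dictionary files
model = TransformerModel.from_pretrained('path_to_fairseq_model')
print(model.translate('Hello, how are you?'))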

Fairseq offers powerful models but requires more setup and local model files.

⚡

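Complexity: roughly O(n²) time and space in sequence length

Time Complexity

Self-attention compares every token with every other token, so the encoder's cost grows roughly quadratically with the number of input tokens n. The decoder then generates the output one token at a time, so total time also grows with the length of the translation. For short sentences, the fixed cost of running the model dominates.

Space Complexity

The model weights take a fixed amount of memory regardless of the input. On top of that, in a standard implementation the attention scores need memory that grows roughly quadratically with sequence length, while the token tensors themselves grow linearly.

Which Approach is Fastest?

A local pretrained transformer is fast for short texts once the model is loaded; online APIs add network delay but need almost no local resources.

Approach	Time	Space	Best For
Hugging Face Transformers	~O(n²)	~O(n²) plus model weights	Offline, customizable translation
Googletrans API	Network-bound	O(n) locally	Quick online translation with minimal setup
Fairseq Models	~O(n²)	~O(n²) plus model weights	Research and custom model use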
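To see how translation time changes with input length on your own machine, you can measure it directly. The following is a minimal sketch; time_translations is a hypothetical helper, and translate is assumed to be any function that takes a string, such as the translator function defined in the Code section.

```python
import time


def time_translations(translate, texts, repeats=3):
    """Return (word_count, best_seconds) for each text."""
    rows = []
    for text in texts:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            translate(text)
            elapsed = time.perf_counter() - start
            # Keep the fastest run to reduce noise from other processes.
            best = min(best, elapsed)
        rows.append((len(text.split()), best))
    return rows
```

Passing the same sentence repeated 1, 2, 4, and 8 times gives a rough picture of how runtime scales for a given approach.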

💡

Always check whether your input text is empty before translating, to avoid errors or meaningless output.
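A plain emptiness check lets through None values passed by mistake and strings that contain only whitespace. The following is a minimal sketch of a stricter check; normalize_input is a hypothetical helper, and collapsing runs of whitespace into single spaces is an assumption that suits single sentences but discards line breaks.

```python
def normalize_input(text):
    """Return cleaned text, or '' when there is nothing to translate."""
    if text is None:
        return ""
    if not isinstance(text, str):
        raise TypeError("text must be a string, got %s" % type(text).__name__)
    # split() drops leading, trailing, and repeated whitespace.
    return " ".join(text.split())


print(repr(normalize_input("  Hello,   how are you? ")))  # 'Hello, how are you?'
print(repr(normalize_input("   \n")))                     # ''
```

The cleaned string can then be passed to the translation function, which can skip the model entirely when the result is empty.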

⚠️

Beginners often forget to pass skip_special_tokens=True when decoding, which leaves markers such as <pad> and </s> in the output.
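The effect of that flag can be illustrated without loading a model. The following is a toy sketch only: TOY_VOCAB, SPECIAL_TOKENS, and toy_decode are made-up stand-ins, not the real Marian vocabulary or decoder, and the real tokenizer also rejoins subword pieces and punctuation more carefully.

```python
# Made-up vocabulary for illustration; real IDs differ.
TOY_VOCAB = {
    0: "</s>", 1: "<pad>", 2: "Bonjour", 3: ",",
    4: "comment", 5: "ça", 6: "va", 7: "?",
}
SPECIAL_TOKENS = {"<pad>", "</s>"}


def toy_decode(ids, skip_special_tokens=False):
    """Map IDs back to tokens, optionally dropping special markers."""
    tokens = [TOY_VOCAB[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens)


ids = [1, 2, 3, 4, 5, 6, 7, 0]
print(toy_decode(ids))                            # <pad> Bonjour , comment ça va ? </s>
print(toy_decode(ids, skip_special_tokens=True))  # Bonjour , comment ça va ?
```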