
Translation with Hugging Face in NLP - ML Experiment: Train & Evaluate

Experiment - Translation with Hugging Face
Problem: You want to translate English sentences into French using a pre-trained Hugging Face transformer model.
Current Metrics: The model translates sentences but sometimes produces incorrect or incomplete translations. Example: 'Hello, how are you?' translated as 'Bonjour, comment'.
Issue: The model is not fine-tuned on your specific dataset and sometimes truncates or misses parts of the translation.
Your Task
Improve translation quality by fine-tuning the pre-trained model on a small English-French sentence dataset and evaluate translation accuracy.
Use Hugging Face transformers and datasets libraries.
Fine-tune only for 3 epochs to keep training short.
Use a small dataset of 100 sentence pairs.
Do not change the model architecture.
Solution
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Small sample dataset
data = {
    'en': [
        'Hello, how are you?',
        'I love machine learning.',
        'This is a test sentence.',
        'The weather is nice today.',
        'Can you help me?',
        'What is your name?',
        'I am learning to translate.',
        'This is fun!',
        'Have a great day.',
        'See you tomorrow.'
    ] * 10,  # 100 sentences
    'fr': [
        'Bonjour, comment ça va?',
        "J'aime l'apprentissage automatique.",
        "Ceci est une phrase de test.",
        "Il fait beau aujourd'hui.",
        "Pouvez-vous m'aider?",
        "Comment vous appelez-vous?",
        "J'apprends à traduire.",
        "C'est amusant!",
        "Bonne journée.",
        "À demain."
    ] * 10
}

# Create Hugging Face Dataset
dataset = Dataset.from_dict(data)

# Load tokenizer and model
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize function
def preprocess_function(examples):
    inputs = examples['en']
    targets = examples['fr']
    # text_target tokenizes the French side and stores it under 'labels';
    # it replaces the deprecated as_target_tokenizer() context manager.
    model_inputs = tokenizer(inputs, text_target=targets, max_length=40, truncation=True)
    return model_inputs

# Prepare dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',  # no evaluation runs during training (the default)
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,
    logging_steps=10
)

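# Trainer
# The collator pads inputs and labels per batch; without it, sentences of
# different lengths cannot be stacked into tensors.
from transformers import DataCollatorForSeq2Seq

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model)
)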

# Train model
trainer.train()

# Test translation
test_sentences = [
    'Hello, how are you?',
    'I love machine learning.',
    'Can you help me?'
]

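# After training the model may sit on a GPU, so move the inputs to the same device
inputs = tokenizer(test_sentences, return_tensors='pt', padding=True, truncation=True).to(model.device)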
outputs = model.generate(**inputs, max_length=40)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print('Translations:')
for en, fr in zip(test_sentences, translations):
    print(f'EN: {en}')
    print(f'FR: {fr}')
    print('---')
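The tokenized sentences have different lengths, so they must be padded before they can be stacked into a batch. The sketch below shows in plain Python roughly what a seq2seq collator such as `DataCollatorForSeq2Seq` does: inputs are padded with the tokenizer's pad id, and labels are padded with -100 so the loss ignores those positions. `pad_batch` is a hypothetical helper and the token ids are made up for illustration.

```python
from typing import Dict, List

LABEL_PAD_ID = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_batch(features: List[Dict[str, List[int]]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad input_ids and labels to the longest example in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = len(f["input_ids"])
        n_lab = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch


# Made-up token ids, for illustration only
example_batch = pad_batch(
    [{"input_ids": [5, 6, 7], "labels": [9]},
     {"input_ids": [5], "labels": [9, 8]}],
    pad_token_id=0,
)
print(example_batch)
```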
Added a small English-French dataset of 100 sentence pairs by repeating 10 unique pairs.
Tokenized inputs and targets properly for seq2seq training.
Used Hugging Face Trainer API to fine-tune the pre-trained 'Helsinki-NLP/opus-mt-en-fr' model for 3 epochs.
Kept the model architecture unchanged: fine-tuning only updates the existing weights on the new sentence pairs.
Added 'truncation=True' to tokenizer call for test sentences to avoid potential length issues.
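Because the 100 training pairs are 10 unique pairs repeated, any test sentence drawn from the same list has already been seen during training. A minimal sketch of one way to hold out some unique pairs before repeating the rest is shown below. `split_unique_pairs` is a hypothetical helper, and the 20% test fraction is an assumption, not part of the original task.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]


def split_unique_pairs(pairs: List[Pair], test_fraction: float = 0.2, seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Deduplicate sentence pairs, then split them into train and held-out test sets."""
    unique = list(dict.fromkeys(pairs))  # drops duplicates, keeps first-seen order
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]


# Ten placeholder pairs repeated ten times, mirroring the shape of the dataset above
pairs = [(f"sentence {i}", f"phrase {i}") for i in range(10)] * 10
train_pairs, test_pairs = split_unique_pairs(pairs)
print(len(train_pairs), len(test_pairs))
```

The training pairs can then be repeated as before, while the held-out pairs are used only for checking translations.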
Results Interpretation

Before fine-tuning: Translations were sometimes incomplete or incorrect, e.g., 'Hello, how are you?' -> 'Bonjour, comment'.

After fine-tuning: Translations became more accurate and complete, e.g., 'Hello, how are you?' -> 'Bonjour, comment ça va?'.

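Fine-tuning a pre-trained translation model on a small, relevant dataset adapts it to that dataset's style and vocabulary. Note that the three test sentences here also appear in the training data, so the improved output mostly shows that the model has learned those specific pairs; judging real translation quality requires sentences held out from training.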
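The task asks for translation accuracy to be evaluated, but the comparison above is done by eye. A minimal sketch of two simple scores is given below: exact match over a list of sentences, and a whitespace-token F1 for a single sentence. Both functions are hypothetical helpers written for this example; they are rough stand-ins, and a standard metric such as BLEU would normally be preferred for real evaluation.

```python
from collections import Counter
from typing import List


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions identical to their reference (ignoring outer whitespace)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def token_f1(prediction: str, reference: str) -> float:
    """F1 over whitespace-separated tokens; partial translations get partial credit."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# The truncated translation from the example scores below 1.0
print(token_f1('Bonjour, comment', 'Bonjour, comment ça va?'))
print(exact_match(['Bonjour, comment', 'À demain.'], ['Bonjour, comment ça va?', 'À demain.']))
```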
Bonus Experiment
Try fine-tuning the model on a different language pair, such as English to German, using a similar small dataset.
💡 Hint
Use the model 'Helsinki-NLP/opus-mt-en-de' and prepare English-German sentence pairs for fine-tuning.
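As a starting point for the bonus experiment, the data can be prepared in the same shape as the English-French dictionary, with a 'de' column in place of 'fr'. The sentence pairs below are illustrative examples written for this sketch, and `build_data` is a hypothetical helper; the preprocessing function would then need to read `examples['de']` as its targets.

```python
# Model named in the hint above
MODEL_NAME = 'Helsinki-NLP/opus-mt-en-de'

# Illustrative English-German pairs (not from the original exercise)
PAIRS = [
    ('Hello, how are you?', 'Hallo, wie geht es dir?'),
    ('I love machine learning.', 'Ich liebe maschinelles Lernen.'),
    ('Can you help me?', 'Kannst du mir helfen?'),
    ('What is your name?', 'Wie heißt du?'),
    ('See you tomorrow.', 'Bis morgen.'),
]


def build_data(pairs, repeats):
    """Return a column-oriented dict suitable for Dataset.from_dict."""
    return {
        'en': [en for en, _ in pairs] * repeats,
        'de': [de for _, de in pairs] * repeats,
    }


data = build_data(PAIRS, repeats=20)  # 5 unique pairs x 20 = 100 rows
print(len(data['en']), len(data['de']))
```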

Practice

1. What is the main purpose of using the Hugging Face translation pipeline?
easy
A. To train a new language model from scratch
B. To automatically convert text from one language to another
C. To analyze the sentiment of a text
D. To generate random text in the same language

Solution

  1. Step 1: Understand the translation pipeline purpose

    The translation pipeline is designed to convert text from one language to another automatically.
  2. Step 2: Compare with other options

    Training models, sentiment analysis, and text generation are different tasks not handled by this pipeline.
  3. Final Answer:

    To automatically convert text from one language to another -> Option B
  4. Quick Check:

    Translation pipeline = convert text languages [OK]
Hint: Translation pipeline means changing language automatically [OK]
Common Mistakes:
  • Confusing translation with training a model
  • Thinking it analyzes sentiment
  • Assuming it generates random text
2. Which of the following is the correct way to create a translation pipeline using Hugging Face in Python?
easy
A. translator = pipeline('translation_en_to_fr')
B. translator = pipeline('sentiment-analysis')
C. translator = pipeline('text-generation')
D. translator = pipeline('image-classification')

Solution

  1. Step 1: Identify the pipeline task for translation

    The correct task name for English to French translation is 'translation_en_to_fr'.
  2. Step 2: Eliminate unrelated pipeline tasks

    Sentiment analysis, text generation, and image classification are unrelated to translation.
  3. Final Answer:

    translator = pipeline('translation_en_to_fr') -> Option A
  4. Quick Check:

    Translation pipeline uses 'translation_en_to_fr' [OK]
Hint: Use 'translation_en_to_fr' for English to French translation [OK]
Common Mistakes:
  • Using sentiment-analysis instead of translation
  • Confusing text-generation with translation
  • Using image-classification for text tasks
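The task name follows the pattern `translation_<source>_to_<target>`. A small sketch of building that name and creating the pipeline with an explicit model is shown below. `translation_task` and `build_translator` are hypothetical helpers, and the model name in the usage comment is the one used in the experiment above; calling `build_translator` downloads the model, so it is left as a comment here.

```python
def translation_task(src: str, tgt: str) -> str:
    """Build a pipeline task name such as 'translation_en_to_fr'."""
    return f'translation_{src}_to_{tgt}'


def build_translator(src: str, tgt: str, model_name: str):
    """Create a translation pipeline for the given language pair and model."""
    # Imported lazily so translation_task can be used without transformers installed
    from transformers import pipeline
    return pipeline(translation_task(src, tgt), model=model_name)


print(translation_task('en', 'fr'))

# Usage (downloads the model on first call):
# translator = build_translator('en', 'fr', 'Helsinki-NLP/opus-mt-en-fr')
# print(translator('Good morning')[0]['translation_text'])
```

Passing `model=` explicitly avoids relying on whichever default model the library picks for the task.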
3. What will be the output of the following code snippet?
from transformers import pipeline
translator = pipeline('translation_en_to_de')
result = translator('Hello, how are you?')
print(result[0]['translation_text'])
medium
A. Bonjour, comment ça va?
B. Hello, how are you?
C. Hallo, wie geht es dir?
D. Hola, ¿cómo estás?

Solution

  1. Step 1: Understand the pipeline task

    The pipeline is set to translate English to German ('translation_en_to_de').
  2. Step 2: Translate the input text

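    A typical German rendering of 'Hello, how are you?' is 'Hallo, wie geht es dir?'. The exact wording depends on the model (it may, for example, use the formal 'Wie geht es Ihnen?'), but among the options only C is German.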
  3. Final Answer:

    Hallo, wie geht es dir? -> Option C
  4. Quick Check:

    English to German translation = 'Hallo, wie geht es dir?' [OK]
Hint: Check language codes: en_to_de means English to German [OK]
Common Mistakes:
  • Expecting output in English (no translation)
  • Confusing German with French or Spanish
  • Printing the whole result list instead of text
4. Identify the error in this code snippet for translating English to Spanish using Hugging Face:
from transformers import pipeline
translator = pipeline('translation_en_to_es')
result = translator('Good morning')
print(result['translation_text'])
medium
A. Using wrong pipeline task name
B. Incorrect input text format
C. Missing import statement
D. Accessing result as a dictionary instead of a list

Solution

  1. Step 1: Check the output type of translator()

    The translator returns a list of dictionaries, not a single dictionary, so indexing it with a string key raises a TypeError.
  2. Step 2: Correct the way to access translation text

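    We should access the first element of the list, then the 'translation_text' key: result[0]['translation_text']. Separately, the library's default translation model may not cover English to Spanish, so in practice you may also need to pass an explicit model= argument to pipeline.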
  3. Final Answer:

    Accessing result as a dictionary instead of a list -> Option D
  4. Quick Check:

    Output is list, not dict [OK]
Hint: Remember translator returns list of dicts, use result[0]['translation_text'] [OK]
Common Mistakes:
  • Trying to access result['translation_text'] directly
  • Using wrong pipeline task name
  • Forgetting to import pipeline
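The difference between the two access patterns can be seen without loading a model. The sketch below uses a hand-written list of dictionaries as a stand-in for a pipeline result (the Spanish text is illustrative), and `extract_translations` is a hypothetical helper.

```python
def extract_translations(result):
    """Pull the translated strings out of a list of {'translation_text': ...} dicts."""
    return [item['translation_text'] for item in result]


# Hand-written stand-in with the same shape as a translation pipeline result
result = [{'translation_text': 'Buenos días'}]

error = None
try:
    result['translation_text']  # wrong: a list cannot be indexed with a string
except TypeError as exc:
    error = exc

print(type(error).__name__)
print(result[0]['translation_text'])  # right: index the list first, then the key
print(extract_translations(result))
```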
5. You want to translate a list of English sentences to French using Hugging Face. Which approach correctly handles multiple sentences efficiently?
from transformers import pipeline
translator = pipeline('translation_en_to_fr')
sentences = ['Good night', 'See you later', 'Thank you']
# What is the best way to translate all sentences?
hard
A. Call translator once with the whole list: translator(sentences)
B. Use a loop: [translator(sentence)[0]['translation_text'] for sentence in sentences]
C. Join sentences into one string and translate: translator(' '.join(sentences))
D. Translate only the first sentence: translator(sentences[0])

Solution

  1. Step 1: Understand batch support in pipelines

    Hugging Face pipelines accept a list of strings in a single call and return one result per sentence. To run several sentences per forward pass, also pass batch_size, since it defaults to 1.
  2. Step 2: Eliminate incorrect approaches

    A Python loop works but calls the pipeline once per sentence and can never be batched; C loses sentence boundaries; D ignores all but the first sentence.
  3. Final Answer:

    Call translator once with the whole list: translator(sentences) -> Option A
  4. Quick Check:

    translator(sentences) handles the whole list in one call [OK]
Hint: Pass the list directly: translator(sentences), adding batch_size for larger inputs [OK]
Common Mistakes:
  • Using a loop (less efficient than batching)
  • Joining sentences into one string and translating
  • Translating only the first sentence
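A minimal sketch of the single-call approach is shown below. `translate_all` is a hypothetical helper that works with any callable following the pipeline's calling convention, and `FakeTranslator` is a stand-in written for this example so the code runs without downloading a model; the batch size of 8 is an assumption.

```python
def translate_all(translator, sentences, batch_size=8):
    """Translate a list of sentences in one call and return plain strings."""
    results = translator(sentences, batch_size=batch_size)
    return [r['translation_text'] for r in results]


class FakeTranslator:
    """Stand-in for a translation pipeline; it upper-cases text instead of translating."""

    def __init__(self):
        self.calls = 0

    def __call__(self, sentences, batch_size=1):
        self.calls += 1
        return [{'translation_text': s.upper()} for s in sentences]


fake = FakeTranslator()
sentences = ['Good night', 'See you later', 'Thank you']
outputs = translate_all(fake, sentences)
print(outputs)
print(fake.calls)  # the translator is invoked once for the whole list
```

With a real pipeline, `translator = pipeline('translation_en_to_fr')` would take the place of `FakeTranslator()`.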