import numpy as np
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)
# Toy parallel corpus: 10 EN/FR sentence pairs, repeated 10x -> 100 examples.
_pairs = [
    ('Hello, how are you?', 'Bonjour, comment ça va?'),
    ('I love machine learning.', "J'aime l'apprentissage automatique."),
    ('This is a test sentence.', "Ceci est une phrase de test."),
    ('The weather is nice today.', "Il fait beau aujourd'hui."),
    ('Can you help me?', "Pouvez-vous m'aider?"),
    ('What is your name?', "Comment vous appelez-vous?"),
    ('I am learning to translate.', "J'apprends à traduire."),
    ('This is fun!', "C'est amusant!"),
    ('Have a great day.', "Bonne journée."),
    ('See you tomorrow.', "À demain."),
]
data = {
    'en': [en for en, _ in _pairs] * 10,
    'fr': [fr for _, fr in _pairs] * 10,
}
# Wrap the parallel lists in a Hugging Face Dataset (one row per pair).
dataset = Dataset.from_dict(data)

# Pretrained MarianMT English->French checkpoint; the tokenizer and the
# seq2seq model are loaded from the same name so their vocabularies match.
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Tokenize function
def preprocess_function(examples):
    """Tokenize a batch of parallel sentences for seq2seq training.

    Args:
        examples: batch dict with 'en' (source) and 'fr' (target) lists of
            strings, as supplied by ``Dataset.map(batched=True)``.

    Returns:
        dict with ``input_ids``/``attention_mask`` for the English inputs
        and ``labels`` holding the French target token ids.
    """
    # text_target= tokenizes the targets with the target-side settings in a
    # single call and stores them under 'labels'; it replaces the deprecated
    # `with tokenizer.as_target_tokenizer():` context manager.
    return tokenizer(
        examples['en'],
        text_target=examples['fr'],
        max_length=40,
        truncation=True,
    )
# Tokenize the whole dataset. Drop the raw text columns so only tensor
# fields (input_ids / attention_mask / labels) are left for collation.
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=['en', 'fr'],
)
# Training hyperparameters. No eval dataset is provided, so evaluation is
# left at its default (disabled) instead of passing `evaluation_strategy`,
# which was renamed to `eval_strategy` in transformers 4.46 and is deprecated.
training_args = TrainingArguments(
    output_dir='./results',          # where checkpoints are written
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,              # keep only the most recent checkpoint
    logging_steps=10,
)
# Trainer. A seq2seq data collator is required: the tokenized examples are
# variable-length and unpadded, so the default collator cannot stack them
# into batches. DataCollatorForSeq2Seq dynamically pads input_ids and
# attention_mask, and pads labels with -100 so padding is ignored by the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
# Fine-tune the model.
trainer.train()
# Quick sanity check: translate a few sentences with the fine-tuned model.
test_sentences = [
    'Hello, how are you?',
    'I love machine learning.',
    'Can you help me?',
]
model.eval()  # disable dropout for deterministic generation
# Move the inputs to the model's device: Trainer may have placed the model
# on GPU, and generate() would fail on a CPU/GPU tensor mismatch otherwise.
inputs = tokenizer(
    test_sentences, return_tensors='pt', padding=True, truncation=True
).to(model.device)
outputs = model.generate(**inputs, max_length=40)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print('Translations:')
for en, fr in zip(test_sentences, translations):
    print(f'EN: {en}')
    print(f'FR: {fr}')
    print('---')