Bird
Raised Fist0
NLPml~20 mins

Multilingual models in NLP - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Experiment - Multilingual models
Problem:You want to build a text classification model that works well on multiple languages using a multilingual transformer model. Currently, the model performs well on English but poorly on Spanish and French texts.
Current Metrics:Training accuracy: 95%, Validation accuracy (English): 90%, Validation accuracy (Spanish): 65%, Validation accuracy (French): 60%
Issue:The model overfits English data and underperforms on other languages, showing poor generalization across languages.
Your Task
Reduce overfitting on English and improve validation accuracy on Spanish and French to at least 80%, while keeping English validation accuracy above 85%.
You can only adjust model training parameters and data preprocessing.
You cannot change the base multilingual transformer architecture.
You must keep training time reasonable (under 1 hour on a standard GPU).
Hint 1
Hint 2
Hint 3
Hint 4
Solution
NLP
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer, AutoConfig
from sklearn.model_selection import train_test_split
import numpy as np

# Load multilingual model and tokenizer
model_name = 'distilbert-base-multilingual-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example data: texts and labels in English, Spanish, French
texts = [
    'This is a positive review.', 'Esta es una reseña positiva.', 'Ceci est une critique positive.',
    'This is a negative review.', 'Esta es una reseña negativa.', 'Ceci est une critique négative.'
]
labels = [1, 1, 1, 0, 0, 0]

# Tokenize texts
encodings = tokenizer(texts, truncation=True, padding=True, max_length=64)

# Convert to TensorFlow dataset
dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels))

# Split dataset into train and validation
train_size = int(0.8 * len(texts))
train_dataset = dataset.take(train_size).batch(2)
val_dataset = dataset.skip(train_size).batch(2)

# Load model with dropout increased via config
config = AutoConfig.from_pretrained(model_name)
config.hidden_dropout_prob = 0.3
config.attention_probs_dropout_prob = 0.3
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, config=config, num_labels=2)

# Compile model with lower learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

# Early stopping callback
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Train model
history = model.fit(train_dataset, validation_data=val_dataset, epochs=10, callbacks=[early_stop])
Increased dropout rates in the model to 0.3 to reduce overfitting.
Lowered learning rate to 3e-5 for more stable training.
Added early stopping to prevent overfitting.
Used balanced multilingual data samples for training and validation.
Results Interpretation

Before: Training accuracy 95%, English val 90%, Spanish val 65%, French val 60%
After: Training accuracy 88%, English val 87%, Spanish val 82%, French val 80%

Increasing dropout and using early stopping helped reduce overfitting on English data and improved the model's ability to generalize to Spanish and French, demonstrating how regularization and balanced training improve multilingual model performance.
Bonus Experiment
Try fine-tuning the multilingual model with language-specific adapters to improve performance on each language separately.
💡 Hint
Use adapter layers or lightweight fine-tuning techniques to specialize the model per language without losing multilingual knowledge.

Practice

(1/5)
1. What is the main advantage of using a multilingual model in natural language processing?
easy
A. It can understand and process multiple languages with a single model.
B. It requires training a separate model for each language.
C. It only works for English language tasks.
D. It uses more resources than training individual models.

Solution

  1. Step 1: Understand the purpose of multilingual models

    Multilingual models are designed to handle many languages using one model instead of separate ones.
  2. Step 2: Compare advantages

    This approach saves time and resources by avoiding multiple models for different languages.
  3. Final Answer:

    It can understand and process multiple languages with a single model. -> Option A
  4. Quick Check:

    Multilingual model advantage = single model for many languages [OK]
Hint: Multilingual means one model for many languages [OK]
Common Mistakes:
  • Thinking multilingual models only work for English
  • Assuming separate models are needed per language
  • Believing multilingual models use more resources
2. Which of the following is the correct way to load a multilingual model using Hugging Face Transformers in Python?
easy
A. model = AutoModel.from_pretrained('xlm-roberta-base')
B. model = AutoModel.from_pretrained('gpt2')
C. model = AutoModel.from_pretrained('bert-base-uncased')
D. model = AutoModel.from_pretrained('bert-large-cased')

Solution

  1. Step 1: Identify multilingual model names

    'xlm-roberta-base' is a well-known multilingual model supporting many languages.
  2. Step 2: Check other options

    'bert-base-uncased' and 'bert-large-cased' are English-only models; 'gpt2' is a generative English model.
  3. Final Answer:

    model = AutoModel.from_pretrained('xlm-roberta-base') -> Option A
  4. Quick Check:

    Multilingual model name = 'xlm-roberta-base' [OK]
Hint: Look for 'xlm' or 'multilingual' in model name [OK]
Common Mistakes:
  • Choosing English-only models for multilingual tasks
  • Confusing generative models with multilingual encoders
  • Using model names without checking language support
3. Consider this Python code using Hugging Face Transformers:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
model = AutoModelForSequenceClassification.from_pretrained('xlm-roberta-base')
inputs = tokenizer('Bonjour, comment ça va?', return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits.shape)

What will be the printed output shape?
medium
A. torch.Size([1, 1])
B. torch.Size([1, 2])
C. torch.Size([1, 512])
D. torch.Size([1, 768])

Solution

  1. Step 1: Understand model type and output

    The model is for sequence classification, which outputs logits for each class. The default 'xlm-roberta-base' classification head has 2 classes.
  2. Step 2: Determine output shape

    Batch size is 1 (one sentence), so output logits shape is [1, 2].
  3. Final Answer:

    torch.Size([1, 2]) -> Option B
  4. Quick Check:

    Sequence classification logits shape = [batch, classes] = [1, 2] [OK]
Hint: Classification logits shape = batch size x number of classes [OK]
Common Mistakes:
  • Confusing hidden size with output logits shape
  • Assuming output shape matches input token length
  • Ignoring batch size dimension
4. You tried to use a multilingual model but got this error:
ValueError: Tokenizer does not have a pad token.
What is the best way to fix this error?
medium
A. Use a different model that does not require padding.
B. Add padding=True when calling the tokenizer.
C. Manually set the pad token with tokenizer.pad_token = tokenizer.eos_token.
D. Ignore the error and continue training.

Solution

  1. Step 1: Understand the error cause

    The tokenizer lacks a pad token, which is needed to pad sequences to the same length.
  2. Step 2: Fix by assigning pad token

    Assigning the pad token to an existing token like eos_token solves the issue.
  3. Final Answer:

    Manually set the pad token with tokenizer.pad_token = tokenizer.eos_token. -> Option C
  4. Quick Check:

    Set pad token manually to fix padding error [OK]
Hint: Set pad token manually if missing in tokenizer [OK]
Common Mistakes:
  • Ignoring padding requirement
  • Trying to skip padding without fixing tokenizer
  • Switching models unnecessarily
5. You want to build a multilingual sentiment analysis system supporting English, Spanish, and French. Which approach best balances accuracy and resource use?
hard
A. Train separate models for each language from scratch.
B. Use a rule-based system with language-specific sentiment dictionaries.
C. Use an English-only model and translate all inputs to English before analysis.
D. Use a single pretrained multilingual model fine-tuned on combined data from all three languages.

Solution

  1. Step 1: Consider resource and accuracy trade-offs

    Training separate models is resource-heavy; rule-based systems lack accuracy; translation adds errors.
  2. Step 2: Choose multilingual fine-tuning

    Fine-tuning one multilingual pretrained model on combined data leverages shared knowledge and saves resources.
  3. Final Answer:

    Use a single pretrained multilingual model fine-tuned on combined data from all three languages. -> Option D
  4. Quick Check:

    Multilingual fine-tuning balances accuracy and efficiency [OK]
Hint: Fine-tune one multilingual model on all languages together [OK]
Common Mistakes:
  • Training separate models wastes resources
  • Relying on translation reduces accuracy
  • Using rule-based methods limits performance