
Multilingual models in NLP - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
1. Fill in the blank
easy

Complete the code to load a multilingual transformer model using Hugging Face Transformers.

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('[1]')
A. gpt2
B. bert-base-uncased
C. xlm-roberta-base
D. distilbert-base-uncased
Common Mistakes
Choosing a monolingual model like bert-base-uncased.
2. Fill in the blank
medium

Complete the code to tokenize input text for a multilingual model.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('[1]')
inputs = tokenizer('Hello world', return_tensors='pt')
A. bert-base-cased
B. xlm-roberta-base
C. roberta-base
D. gpt2
Common Mistakes
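Pairing the model with a tokenizer from a different checkpoint. The token IDs then no longer match the model's vocabulary, which leads to wrong predictions or index errors.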
3. Fill in the blank
hard

Fix the error in the code to correctly predict with a multilingual model.

outputs = model(**[1])
predictions = outputs.logits.argmax(dim=1)
A. inputs
B. tokenizer
C. text
D. labels
Common Mistakes
Passing raw text or tokenizer object instead of tokenized inputs.
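To see what `argmax(dim=1)` computes on the logits, here is a minimal pure-Python sketch. It assumes the logits have been converted to nested lists (for example with `outputs.logits.tolist()`). The helper name `argmax_per_row` and the sample values are illustrative, not part of any library.

```python
def argmax_per_row(logits):
    """Return the index of the largest value in each row.

    Mirrors tensor.argmax(dim=1) for a [batch, num_classes] nested list.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Made-up logits for a batch of two inputs and two classes.
sample_logits = [[-0.2, 0.7], [1.3, 0.4]]
predictions = argmax_per_row(sample_logits)
print(predictions)  # [1, 0]
```

Each row corresponds to one input, so the result has one predicted class index per input.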
4. Fill in the blank
hard

Fill both blanks with slices to complete a dictionary comprehension that maps each language string to its three-letter ISO code.

language_codes = {lang[1]: lang[2] for lang in ['eng', 'fra', 'spa', 'deu']}
A. [:]
B. [::]
C. [:3]
D. [3:]
Common Mistakes
Using an incorrect slice such as [3:], which returns the characters from index 3 onward (an empty string for a three-letter code).
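Slice behavior is easy to check directly. The following small sketch applies each of the four slice options to one three-letter code; the variable names are illustrative.

```python
code = 'eng'

# Each slice option applied to a three-letter string.
slices = {
    '[:]': code[:],     # full copy of the string
    '[::]': code[::],   # full copy, default step
    '[:3]': code[:3],   # first three characters
    '[3:]': code[3:],   # from index 3 onward: empty for a 3-letter string
}
print(slices)
```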
5. Fill in the blank
hard

Fill all three blanks to build a dictionary that maps each word to its token count from a multilingual tokenizer.

token_counts = [1](([2], len(tokenizer.tokenize([2]))) for [3] in ['Hello', 'Bonjour', 'Hola'])
A. dict
B. word
C. w
D. text
Common Mistakes
Using one variable name in the loop and a different one in the expression.
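`dict()` builds a dictionary from an iterable of (key, value) pairs, while a comprehension in braces uses `key: value`. A minimal sketch of both forms follows. It uses a hypothetical stand-in, `toy_tokenize`, instead of a real multilingual tokenizer so that it runs without any downloads; real token counts will differ.

```python
def toy_tokenize(text):
    """Stand-in for tokenizer.tokenize: splits text into 3-character chunks."""
    return [text[i:i + 3] for i in range(0, len(text), 3)]


words = ['Hello', 'Bonjour', 'Hola']

# dict() consumes (key, value) tuples produced by a generator expression.
counts_from_pairs = dict((word, len(toy_tokenize(word))) for word in words)

# The equivalent dictionary comprehension uses key: value inside braces.
counts_from_comprehension = {word: len(toy_tokenize(word)) for word in words}

print(counts_from_pairs)
```

Both forms use the same variable name in the expression and in the `for` clause, which is the point of the task.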

Practice

1. What is the main advantage of using a multilingual model in natural language processing?
easy
A. It can understand and process multiple languages with a single model.
B. It requires training a separate model for each language.
C. It only works for English language tasks.
D. It uses more resources than training individual models.

Solution

  1. Step 1: Understand the purpose of multilingual models

    Multilingual models are designed to handle many languages using one model instead of separate ones.
  2. Step 2: Compare advantages

    This approach saves time and resources by avoiding multiple models for different languages.
  3. Final Answer:

    It can understand and process multiple languages with a single model. -> Option A
  4. Quick Check:

    Multilingual model advantage = single model for many languages [OK]
Hint: Multilingual means one model for many languages [OK]
Common Mistakes:
  • Thinking multilingual models only work for English
  • Assuming separate models are needed per language
  • Believing multilingual models use more resources
2. Which of the following is the correct way to load a multilingual model using Hugging Face Transformers in Python?
easy
A. model = AutoModel.from_pretrained('xlm-roberta-base')
B. model = AutoModel.from_pretrained('gpt2')
C. model = AutoModel.from_pretrained('bert-base-uncased')
D. model = AutoModel.from_pretrained('bert-large-cased')

Solution

  1. Step 1: Identify multilingual model names

    'xlm-roberta-base' is a well-known multilingual model supporting many languages.
  2. Step 2: Check other options

    'bert-base-uncased' and 'bert-large-cased' are English-only models; 'gpt2' is a generative English model.
  3. Final Answer:

    model = AutoModel.from_pretrained('xlm-roberta-base') -> Option A
  4. Quick Check:

    Multilingual model name = 'xlm-roberta-base' [OK]
Hint: Look for 'xlm' or 'multilingual' in model name [OK]
Common Mistakes:
  • Choosing English-only models for multilingual tasks
  • Confusing generative models with multilingual encoders
  • Using model names without checking language support
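The naming hint can be sketched as a small check. This is only a heuristic under the assumption that the checkpoint name reflects its training data; the helper `looks_multilingual` is hypothetical, and the model card remains the more reliable source for language support.

```python
def looks_multilingual(model_name):
    """Naive name check: True if the name hints at multilingual training."""
    name = model_name.lower()
    return any(marker in name for marker in ('xlm', 'multilingual'))


candidates = ['xlm-roberta-base', 'gpt2', 'bert-base-uncased',
              'bert-base-multilingual-cased']
for name in candidates:
    print(name, looks_multilingual(name))
```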
3. Consider this Python code using Hugging Face Transformers:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
model = AutoModelForSequenceClassification.from_pretrained('xlm-roberta-base')
inputs = tokenizer('Bonjour, comment ça va?', return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits.shape)

What will be the printed output shape?
medium
A. torch.Size([1, 1])
B. torch.Size([1, 2])
C. torch.Size([1, 512])
D. torch.Size([1, 768])

Solution

  1. Step 1: Understand model type and output

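    The model is for sequence classification, so it outputs one logit per class. 'xlm-roberta-base' ships without a classification head; AutoModelForSequenceClassification adds a newly initialized one, and the configuration's num_labels defaults to 2.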
  2. Step 2: Determine output shape

    Batch size is 1 (one sentence), so output logits shape is [1, 2].
  3. Final Answer:

    torch.Size([1, 2]) -> Option B
  4. Quick Check:

    Sequence classification logits shape = [batch, classes] = [1, 2] [OK]
Hint: Classification logits shape = batch size x number of classes [OK]
Common Mistakes:
  • Confusing hidden size with output logits shape
  • Assuming output shape matches input token length
  • Ignoring batch size dimension
4. You tried to use a multilingual model but got this error:
ValueError: Tokenizer does not have a pad token.
What is the best way to fix this error?
medium
A. Use a different model that does not require padding.
B. Add padding=True when calling the tokenizer.
C. Manually set the pad token with tokenizer.pad_token = tokenizer.eos_token.
D. Ignore the error and continue training.

Solution

  1. Step 1: Understand the error cause

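    The tokenizer has no pad token defined, so it cannot pad the sequences in a batch to the same length. This is typical of GPT-2-style tokenizers; 'xlm-roberta-base' already defines a pad token.
  2. Step 2: Fix by assigning pad token

    Assigning an existing special token such as eos_token as the pad token resolves the error.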
  3. Final Answer:

    Manually set the pad token with tokenizer.pad_token = tokenizer.eos_token. -> Option C
  4. Quick Check:

    Set pad token manually to fix padding error [OK]
Hint: Set pad token manually if missing in tokenizer [OK]
Common Mistakes:
  • Ignoring padding requirement
  • Trying to skip padding without fixing tokenizer
  • Switching models unnecessarily
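The fix can be wrapped in a small guard so that an existing pad token is never overwritten. This is a minimal sketch: `ensure_pad_token` is a hypothetical helper, and the `SimpleNamespace` object only stands in for a real tokenizer, assuming the tokenizer exposes `pad_token` and `eos_token` attributes as Hugging Face tokenizers do.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Reuse the EOS token for padding when no pad token is defined."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in object with the two attributes the helper touches.
fake_tokenizer = SimpleNamespace(pad_token=None, eos_token='</s>')
ensure_pad_token(fake_tokenizer)
print(fake_tokenizer.pad_token)  # </s>
```

With a real tokenizer, the same helper would be called once, right after loading and before any call that uses padding.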
5. You want to build a multilingual sentiment analysis system supporting English, Spanish, and French. Which approach best balances accuracy and resource use?
hard
A. Train separate models for each language from scratch.
B. Use a rule-based system with language-specific sentiment dictionaries.
C. Use an English-only model and translate all inputs to English before analysis.
D. Use a single pretrained multilingual model fine-tuned on combined data from all three languages.

Solution

  1. Step 1: Consider resource and accuracy trade-offs

    Training separate models from scratch is resource-heavy; rule-based systems tend to be less accurate; translating inputs first can introduce errors and adds an extra processing step.
  2. Step 2: Choose multilingual fine-tuning

    Fine-tuning one multilingual pretrained model on combined data leverages shared knowledge and saves resources.
  3. Final Answer:

    Use a single pretrained multilingual model fine-tuned on combined data from all three languages. -> Option D
  4. Quick Check:

    Multilingual fine-tuning balances accuracy and efficiency [OK]
Hint: Fine-tune one multilingual model on all languages together [OK]
Common Mistakes:
  • Training a separate model per language, which wastes resources
  • Relying on translation, which can reduce accuracy
  • Using rule-based methods, which limits performance