Which of the following is the main advantage of using a multilingual model over separate monolingual models?
Think about how sharing information between languages can help when some languages have less data.
Multilingual models share parameters across languages, letting them learn from the pooled data of all languages. This especially benefits low-resource languages. However, multilingual models do not always outperform monolingual models on every language, nor do they eliminate the need for preprocessing or remove bias.
Given the following code snippet using a multilingual tokenizer, what is the output tokens list?
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
text = 'Hello, 你好, مرحبا'
tokens = tokenizer.tokenize(text)
print(tokens)
Remember that BERT uses WordPiece tokenization which may split words into subwords marked by '##'.
The multilingual BERT tokenizer splits Chinese characters individually and Arabic words into subwords with '##'. So '你' and '好' are separate tokens, and 'مرحبا' is split into 'م', '##رح', '##با'.
You want to build a text classification model for a low-resource language with very little labeled data. Which model choice is best?
Consider how pretrained knowledge from many languages can help when labeled data is scarce.
Large multilingual pretrained models have learned shared representations across languages and can transfer knowledge to low-resource languages during fine-tuning, improving performance. Training from scratch or translating data may be less effective.
You evaluate a multilingual model on three languages and get these F1 scores: English 0.85, Spanish 0.80, Swahili 0.60. Which metric best summarizes overall performance fairly?
Think about how to fairly combine scores when languages have different data sizes.
Weighted-average F1 accounts for the number of samples per language, giving a fair overall performance measure when data sizes differ. Macro-average treats all languages equally regardless of size, micro-average pools all samples ignoring language, and max ignores other languages.
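As a hypothetical illustration (the per-language sample counts below are assumed for the example, not given in the question), macro- and weighted-average F1 can be computed like this:

```python
# Per-language F1 scores from the question.
f1 = {'English': 0.85, 'Spanish': 0.80, 'Swahili': 0.60}

# Hypothetical sample counts per language (assumed, for illustration only).
n = {'English': 1000, 'Spanish': 800, 'Swahili': 200}

# Macro-average: every language counts equally, regardless of size.
macro = sum(f1.values()) / len(f1)

# Weighted-average: each language contributes in proportion to its samples.
total = sum(n.values())
weighted = sum(f1[lang] * n[lang] / total for lang in f1)

print(f"macro={macro:.3f}, weighted={weighted:.3f}")
```

With these assumed counts the macro-average is 0.750 while the weighted-average is 0.805, showing how the two metrics diverge when language sizes differ.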
You fine-tuned a multilingual model on a dataset with English and French texts. The model performs well on English but poorly on French. Which is the most likely cause?
Check if the tokenizer matches the model and supports all languages in your data.
If the tokenizer is not the multilingual tokenizer that matches the model, French text may be segmented into unknown or poorly matched subwords, degrading the model's input and therefore its performance. Model architectures generally support every token in their tokenizer's vocabulary, and data exclusion or optimizer incompatibility are less likely causes.