BERT uses a special pre-training task called Masked Language Model (MLM). What is the main goal of MLM?
Think about how BERT learns from words hidden in the middle of sentences.
MLM trains BERT to predict missing words by looking at words before and after the masked word, helping it understand context deeply.
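The masking step can be sketched in a few lines. This is a simplified illustration, not BERT's actual pre-processing (real BERT masks about 15% of tokens and, of those, replaces 80% with [MASK], 10% with a random token, and 10% left unchanged; here every selected token simply becomes [MASK]):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace roughly mask_prob of the tokens with [MASK]; return the
    masked sequence plus a map from position to the original token."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model must recover this from both-side context
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens)
# the MLM loss is computed only at the positions stored in `targets`
```

The key point is that the original token is hidden from the input but kept as the training target, forcing the model to use the surrounding words on both sides.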
Besides MLM, BERT uses Next Sentence Prediction (NSP) during pre-training. What does NSP help BERT learn?
Think about how BERT understands relationships between two sentences.
NSP trains BERT to decide if a second sentence naturally follows the first, helping it learn sentence relationships useful for tasks like question answering.
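Building NSP training pairs can be sketched as follows. This is an illustrative simplification (the corpus and pairing logic here are toy assumptions, not BERT's actual data pipeline): roughly half the examples pair a sentence with its true successor (label 1), and half pair it with a random sentence (label 0):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples: about half use the
    real next sentence (label 1), half use a random one (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))      # true next sentence
        else:
            pairs.append((sentences[i], rng.choice(sentences), 0)) # random sentence
    return pairs

corpus = [
    "He opened the door.",
    "The room was dark.",
    "She asked a question.",
    "No one answered.",
]
pairs = make_nsp_pairs(corpus)
```

During pre-training, BERT classifies each pair using the [CLS] token's final hidden state, learning whether the second sentence plausibly follows the first.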
BERT can look at words before and after a masked word simultaneously. Which part of BERT's architecture allows this?
Think about which architecture processes all words at once with attention.
BERT uses Transformer encoder layers that attend to all words in the input simultaneously, enabling bidirectional context understanding.
Which metric is commonly used to measure BERT's performance on the Masked Language Model task during pre-training?
Focus on how well the model guesses the hidden words correctly.
Accuracy measures the percentage of masked tokens correctly predicted, directly reflecting MLM task performance.
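Masked-token accuracy is straightforward to compute; the sketch below (function name and dict-based format are illustrative assumptions, not a library API) scores only the masked positions, ignoring all unmasked tokens:

```python
def mlm_accuracy(predictions, targets):
    """Fraction of masked positions where the predicted token matches the
    original token. `targets` maps position -> original token; `predictions`
    maps position -> the model's top guess."""
    if not targets:
        return 0.0
    correct = sum(predictions.get(i) == tok for i, tok in targets.items())
    return correct / len(targets)

targets = {1: "cat", 4: "mat"}      # positions that were masked
predictions = {1: "cat", 4: "rug"}  # model guessed one of two correctly
acc = mlm_accuracy(predictions, targets)  # -> 0.5
```

Note that perplexity (derived from the cross-entropy loss) is also widely reported for language-model pre-training; accuracy is simply the most direct measure of how often the hidden words are guessed exactly.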
Suppose you accidentally feed BERT input sequences without masking any tokens during MLM pre-training. What is the most likely outcome?
Think about what happens if the model never has to guess missing words.
If no tokens are masked, the MLM loss is computed over zero positions, so it is trivially zero and produces no gradient signal; the model never has to predict anything and learns no useful contextual representations from the MLM objective.
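The failure mode above can be seen directly in a toy loss function (a simplified sketch, assuming log-probabilities are stored per masked position; this is not BERT's actual loss code): the cross-entropy is averaged only over masked positions, so an empty target set yields no terms at all.

```python
import math

def mlm_loss(log_probs, targets):
    """Cross-entropy averaged over masked positions only. With nothing
    masked there are no terms: the loss is trivially zero and no gradient
    signal ever reaches the model."""
    if not targets:
        return 0.0
    return -sum(log_probs[i][tok] for i, tok in targets.items()) / len(targets)

log_probs = {1: {"cat": math.log(0.9)}, 4: {"mat": math.log(0.5)}}
normal = mlm_loss(log_probs, {1: "cat", 4: "mat"})  # averaged over 2 masked slots
degenerate = mlm_loss(log_probs, {})                # no masks -> 0.0, nothing to learn
```

The degenerate case mirrors the unmasked-input scenario: the objective is optimized from the start, so pre-training accomplishes nothing.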