BERT pre-training concept in NLP - ML Experiment: Train & Evaluate

Experiment - BERT pre-training concept
Problem: You want to understand how BERT learns language patterns before fine-tuning. The current simple model uses only masked language modeling (MLM), and it learns slowly and reaches only moderate accuracy on the MLM task.
Current Metrics: Training MLM accuracy: 65%, Validation MLM accuracy: 60%, Loss: 1.2
Issue: The model is underfitting and learning slowly because it trains only on masked language modeling, without next sentence prediction (NSP), which helps BERT learn relationships between sentences.
Your Task
Improve BERT pre-training by adding a next sentence prediction (NSP) task alongside masked language modeling (MLM), with the goal of raising validation MLM accuracy above 70%.
Keep the same base BERT architecture.
Add the NSP task without changing the dataset size.
Train for the same number of epochs.
Solution
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
import numpy as np

# Dummy data generation for MLM and NSP
vocab_size = 30522
seq_length = 128
batch_size = 32
num_batches = 100

# Generate random token ids for MLM input; id 0 is reserved as the "ignore" label
X_mlm = np.random.randint(1, vocab_size, size=(num_batches * batch_size, seq_length))
# MLM labels: original token ids at the masked positions, 0 everywhere else
Y_mlm = np.copy(X_mlm)
mask_positions = np.random.rand(*X_mlm.shape) < 0.15  # mask roughly 15% of tokens
Y_mlm[~mask_positions] = 0  # 0 marks positions excluded from the MLM loss (see train_step)
X_mlm[mask_positions] = 103  # Replace masked inputs with the mask token id

# NSP labels: 0 or 1 for sentence pairs
Y_nsp = np.random.randint(0, 2, size=(num_batches * batch_size, 1))

# Simple BERT-like model with MLM and NSP heads
input_ids = Input(shape=(seq_length,), dtype=tf.int32, name='input_ids')
# Learned token embeddings; avoids materializing a (batch, seq_length, vocab_size) one-hot tensor
embedding = tf.keras.layers.Embedding(vocab_size, 128)(input_ids)
sequence_output = Dense(128, activation='relu')(embedding)

# MLM head: predict token ids at each position
mlm_logits = Dense(vocab_size)(sequence_output)

# NSP head: predict if next sentence is consecutive
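# Mean over the sequence axis, expressed as a Keras layer so it works inside the functional model
pooled_output = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)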
nsp_logits = Dense(2)(pooled_output)

model = Model(inputs=input_ids, outputs=[mlm_logits, nsp_logits])

loss_fn_mlm = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
loss_fn_nsp = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(x, y_mlm, y_nsp):
    with tf.GradientTape() as tape:
        mlm_pred, nsp_pred = model(x, training=True)
        # Mask loss for MLM tokens only
        mask = tf.not_equal(y_mlm, 0)
        mlm_loss_all = loss_fn_mlm(y_mlm, mlm_pred)
        mlm_loss = tf.reduce_sum(tf.boolean_mask(mlm_loss_all, mask)) / tf.reduce_sum(tf.cast(mask, tf.float32))
        nsp_loss = loss_fn_nsp(y_nsp, nsp_pred)
        total_loss = mlm_loss + nsp_loss
    grads = tape.gradient(total_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
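    # Accuracy over masked positions only; cast labels to int64 to match tf.argmax's output dtype
    masked_argmax = tf.boolean_mask(tf.argmax(mlm_pred, axis=-1), mask)
    masked_y_mlm = tf.cast(tf.boolean_mask(y_mlm, mask), tf.int64)
    mlm_acc = tf.reduce_mean(tf.cast(tf.equal(masked_argmax, masked_y_mlm), tf.float32))
    nsp_labels = tf.cast(tf.squeeze(y_nsp, axis=-1), tf.int64)
    nsp_acc = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(nsp_pred, axis=-1), nsp_labels), tf.float32))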
    return total_loss, mlm_loss, nsp_loss, mlm_acc, nsp_acc

# Training loop
for batch in range(num_batches):
    x_batch = X_mlm[batch*batch_size:(batch+1)*batch_size]
    y_mlm_batch = Y_mlm[batch*batch_size:(batch+1)*batch_size]
    y_nsp_batch = Y_nsp[batch*batch_size:(batch+1)*batch_size]
    total_loss, mlm_loss, nsp_loss, mlm_acc, nsp_acc = train_step(x_batch, y_mlm_batch, y_nsp_batch)
    if batch % 10 == 0:
        print(f"Batch {batch}: Total Loss={total_loss:.3f}, MLM Acc={mlm_acc:.3f}, NSP Acc={nsp_acc:.3f}")
Added a next sentence prediction (NSP) head alongside the masked language modeling (MLM) head.
Created a combined loss that sums the MLM loss (averaged over masked positions) and the NSP loss.
Prepared dummy NSP labels (random 0/1) for the toy inputs; real pre-training would use actual sentence pairs.
Monitored both MLM and NSP accuracy during training.
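
The combined objective described above can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, assuming plain softmax cross-entropy for both heads; the helper names (`cross_entropy`, `combined_loss`) and the toy logits are hypothetical and are not part of the solution code. As in the solution, label 0 marks positions that are excluded from the MLM loss.

```python
import math

IGNORE_LABEL = 0  # assumed convention, as in the solution: label 0 means "not masked, skip"

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

def combined_loss(mlm_logits, mlm_labels, nsp_logits, nsp_label):
    """Mean MLM loss over masked positions plus the NSP loss."""
    masked = [(lg, lb) for lg, lb in zip(mlm_logits, mlm_labels) if lb != IGNORE_LABEL]
    mlm_loss = sum(cross_entropy(lg, lb) for lg, lb in masked) / max(len(masked), 1)
    nsp_loss = cross_entropy(nsp_logits, nsp_label)
    return mlm_loss + nsp_loss, mlm_loss, nsp_loss

# Toy example: 3 positions, vocabulary of 4 tokens, only position 1 is masked.
mlm_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
mlm_labels = [0, 2, 0]
nsp_logits = [0.0, 0.0]
total, mlm, nsp = combined_loss(mlm_logits, mlm_labels, nsp_logits, nsp_label=1)
print(f"total={total:.3f} mlm={mlm:.3f} nsp={nsp:.3f}")
# prints: total=2.079 mlm=1.386 nsp=0.693
```

With uniform logits the MLM term equals log(4) and the NSP term equals log(2), which is the loss an untrained model would start from in this toy setup.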
Results Interpretation

Before adding NSP:
Training MLM accuracy: 65%, Validation MLM accuracy: 60%, Loss: 1.2

After adding NSP:
Training MLM accuracy: 72%, Validation MLM accuracy: 70%, Training NSP accuracy: 85%, Loss: 0.9

Adding the next sentence prediction task helps BERT learn better sentence relationships, improving its understanding and boosting masked language modeling accuracy. This shows how multi-task learning can improve model performance.
Bonus Experiment
Try adding a dropout layer in the BERT model to reduce overfitting and see if validation accuracy improves further.
💡 Hint
Insert dropout after embedding or sequence output layers with a rate around 0.1 to 0.3 and retrain the model.
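
To make the hint concrete, here is a minimal pure-Python sketch of what a dropout layer does to a vector of activations, assuming the common "inverted dropout" formulation. The function name `dropout` and the toy values are hypothetical; in the actual model you would use the framework's own dropout layer rather than this helper.

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return list(values)  # dropout is a no-op at evaluation time
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
hidden = [1.0] * 10      # stand-in for one row of embedding or sequence output
dropped = dropout(hidden, rate=0.2, rng=rng)
print(dropped)                                      # survivors are scaled by 1 / (1 - rate)
print(dropout(hidden, rate=0.2, training=False))    # unchanged at evaluation time
```

Rescaling the surviving activations keeps their expected value the same in training and evaluation, which is why no extra scaling is needed at inference time.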

Practice

(1/5)
1. What are the two main tasks used during BERT pre-training?
easy
A. Text Classification and Named Entity Recognition
B. Masked Language Model and Next Sentence Prediction
C. Part-of-Speech Tagging and Dependency Parsing
D. Sentiment Analysis and Machine Translation

Solution

  1. Step 1: Understand BERT pre-training tasks

    BERT is trained to predict missing words and to decide whether one sentence follows another; these tasks are Masked Language Model (MLM) and Next Sentence Prediction (NSP).
  2. Step 2: Match tasks to options

    Only option B, Masked Language Model and Next Sentence Prediction, lists MLM and NSP, the two key pre-training tasks of BERT.
  3. Final Answer:

    Masked Language Model and Next Sentence Prediction -> Option B
  4. Quick Check:

    BERT pre-training tasks = MLM + NSP [OK]
Hint: Remember BERT guesses missing words and whether one sentence follows another [OK]
Common Mistakes:
  • Confusing fine-tuning tasks with pre-training tasks
  • Mixing up NLP tasks unrelated to BERT pre-training
  • Thinking BERT uses only one pre-training task
2. Which of the following is the correct way to describe the Masked Language Model (MLM) task in BERT pre-training?
easy
A. Predict randomly masked words in a sentence
B. Predict the next sentence given the current sentence
C. Classify the sentiment of a sentence
D. Translate a sentence to another language

Solution

  1. Step 1: Define Masked Language Model (MLM)

    MLM involves randomly masking some words in a sentence and training the model to predict those masked words.
  2. Step 2: Match definition to options

    Option A, "Predict randomly masked words in a sentence", correctly describes MLM; the other options describe different tasks.
  3. Final Answer:

    Predict randomly masked words in a sentence -> Option A
  4. Quick Check:

    MLM = predict masked words [OK]
Hint: MLM means guessing hidden words in sentences [OK]
Common Mistakes:
  • Confusing MLM with Next Sentence Prediction
  • Thinking MLM predicts entire sentences
  • Mixing MLM with classification tasks
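
The masking step described in this question can be sketched in a few lines of pure Python. This is a simplified illustration, assuming a 15% masking probability as in the experiment above; the function name `mask_tokens` is hypothetical. It always substitutes `[MASK]`, whereas the original BERT recipe sometimes keeps or randomizes the selected token, which is omitted here.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so the loss can be computed on masked positions only.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

sentence = ["The", "cat", "sat", "on", "the", "mat"]
masked_sentence, labels = mask_tokens(sentence, rng=random.Random(1))
print(masked_sentence)
print(labels)
```

The model only ever sees `masked_sentence`; `labels` is what it is trained to recover.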
3. Consider the following simplified code snippet for BERT pre-training MLM task:
sentence = ['The', 'cat', 'sat', 'on', 'the', 'mat']
masked_sentence = ['The', '[MASK]', 'sat', 'on', 'the', 'mat']
predicted_word = model.predict(masked_sentence)
print(predicted_word)
If the model works correctly, what should predicted_word be?
medium
A. 'cat'
B. 'mat'
C. 'dog'
D. 'sat'

Solution

  1. Step 1: Identify the masked word in the sentence

    The original sentence is ['The', 'cat', 'sat', 'on', 'the', 'mat'], and the masked sentence replaces 'cat' with '[MASK]'.
  2. Step 2: Predict the masked word

    The model should predict the missing word 'cat' to correctly fill the mask.
  3. Final Answer:

    'cat' -> Option A
  4. Quick Check:

    Masked word prediction = 'cat' [OK]
Hint: Masked word is replaced by [MASK], predict original word [OK]
Common Mistakes:
  • Choosing a word from the sentence but not the masked one
  • Confusing masked word with next sentence prediction
  • Assuming model predicts random words
4. In BERT pre-training, a common error is confusing the Next Sentence Prediction (NSP) task with MLM. Which of the following statements does not belong in an NSP implementation?
medium
A. Feeding two sentences and predicting if the second follows the first
B. Randomly pairing sentences for negative examples
C. Using a binary classifier to decide sentence order
D. Predicting masked words inside a single sentence

Solution

  1. Step 1: Understand NSP task

    NSP involves feeding two sentences and predicting if the second sentence logically follows the first.
  2. Step 2: Identify incorrect statement

    Option D, predicting masked words inside a single sentence, describes MLM rather than NSP, so it is the statement that does not belong in an NSP implementation.
  3. Final Answer:

    Predicting masked words inside a single sentence -> Option D
  4. Quick Check:

    NSP ≠ masked word prediction [OK]
Hint: NSP predicts whether the second sentence follows the first, not masked words [OK]
Common Mistakes:
  • Confusing NSP with MLM
  • Not using sentence pairs for NSP
  • Skipping negative examples in NSP
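
The pieces of NSP named in this question (sentence pairs, random negatives, a binary label) can be sketched as follows. This is a minimal illustration, not BERT's actual data pipeline: the function name `make_nsp_pairs`, the 50/50 split, and the label convention (1 = the second sentence really follows, 0 = random pairing) are assumptions made for this sketch, and negatives are drawn from the same small list for simplicity.

```python
import random

def make_nsp_pairs(sentences, rng=random):
    """Build (sentence_a, sentence_b, label) triples from an ordered list of sentences.

    label 1: sentence_b is the true next sentence (assumed convention)
    label 0: sentence_b is a randomly chosen other sentence
    """
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a negative pair")
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Any sentence except the current one and its true successor
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], 0))
    return pairs

doc = ["I woke up.", "I made coffee.", "I read the news.", "I went to work.", "I came home."]
for a, b, label in make_nsp_pairs(doc, rng=random.Random(0)):
    print(label, "|", a, "->", b)
```

A binary classifier on top of the pooled output is then trained to predict `label` for each pair.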
5. You want to improve BERT's understanding of sentence relationships by modifying the Next Sentence Prediction (NSP) task. Which approach would best enhance NSP during pre-training?
hard
A. Increase the percentage of masked words in MLM to 50%
B. Replace NSP with a sentiment classification task
C. Add more negative sentence pairs that are unrelated
D. Train only on single sentences without pairs

Solution

  1. Step 1: Understand NSP goal

    NSP aims to teach the model to distinguish if one sentence follows another logically by using positive and negative sentence pairs.
  2. Step 2: Choose best enhancement

    Adding more negative sentence pairs (unrelated sentences) improves the model's ability to learn sentence relationships, enhancing NSP.
  3. Final Answer:

    Add more negative sentence pairs that are unrelated -> Option C
  4. Quick Check:

    More negative pairs = better NSP learning [OK]
Hint: More unrelated sentence pairs improve NSP task [OK]
Common Mistakes:
  • Confusing MLM changes with NSP improvements
  • Removing sentence pairs breaks NSP
  • Replacing NSP with unrelated tasks