
Padding and sequence length in NLP - ML Experiment: Train & Evaluate

Experiment - Padding and sequence length
Problem: You are working on a text classification task using sequences of words. The sequences have different lengths, but your model requires all sequences to be the same length. Currently, sequences are neither padded nor truncated, causing errors during training.
Current Metrics: Training accuracy 85%, validation accuracy 80%; model training fails due to inconsistent input shapes.
Issue: Model input sequences have varying lengths, causing shape-mismatch errors. No padding or truncation is applied.
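A quick way to see the problem: variable-length sequences cannot be stacked into a single rectangular array, which is what the model expects as input. The sketch below uses illustrative data (not the task's dataset) to show a manual post-padding fix in plain NumPy:

```python
import numpy as np

# Three illustrative sequences of different lengths
ragged = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Manual post-padding: copy each (possibly truncated) sequence into a
# fixed-width zero matrix so every row has the same length
max_len = 4
padded = np.zeros((len(ragged), max_len), dtype=int)
for i, seq in enumerate(ragged):
    trunc = seq[:max_len]           # truncate sequences longer than max_len
    padded[i, :len(trunc)] = trunc  # remaining positions stay zero (padding)

print(padded.shape)        # (3, 4)
print(padded[1].tolist())  # [4, 5, 0, 0]
```

This is exactly what pad_sequences automates in the solution below.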
Your Task
Fix the input data by applying padding and truncation so that all sequences have the same length. Train the model successfully and achieve validation accuracy above 80%.
You must use padding and truncation to fix sequence lengths.
Do not change the model architecture.
Use a maximum sequence length of 100.
Solution
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# Sample data: list of sequences with varying lengths
sequences = [
    [1, 2, 3, 4, 5],
    [6, 7, 8],
    [9, 10, 11, 12, 13, 14, 15, 16],
    [17, 18],
    [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
]

labels = [0, 1, 0, 1, 0]  # Binary labels

# Set max sequence length
max_len = 100

# Pad sequences to the same length
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')

# Build a simple model
model = Sequential([
    Embedding(input_dim=31, output_dim=8),  # vocab indices 0-30; input_length is deprecated and removed in Keras 3
    Flatten(),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(padded_sequences, np.array(labels), epochs=5, batch_size=2, validation_split=0.2)

# Print final validation accuracy
val_acc = history.history['val_accuracy'][-1]
print(f'Final validation accuracy: {val_acc:.2f}')
Applied padding and truncation to sequences using pad_sequences with a maximum length of 100.
Set padding='post' to append zeros to the end of short sequences, and truncating='post' to drop tokens from the end of long ones.
Kept the model architecture unchanged.
Trained the model on the uniformly shaped sequences, eliminating the shape-mismatch errors.
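For readers without TensorFlow installed, the behavior of pad_sequences with padding='post' and truncating='post' can be approximated in a few lines of plain Python. pad_post below is a simplified stand-in for illustration, not the Keras function itself:

```python
def pad_post(sequences, maxlen, value=0):
    """Simplified stand-in for Keras pad_sequences with
    padding='post' and truncating='post'."""
    rows = []
    for seq in sequences:
        trunc = list(seq)[:maxlen]                            # drop tokens past maxlen
        rows.append(trunc + [value] * (maxlen - len(trunc)))  # zero-fill the tail
    return rows

rows = pad_post([[1, 2, 3], list(range(1, 121))], maxlen=100)
print([len(r) for r in rows])  # [100, 100] - every row now has the same length
print(rows[0][:5])             # [1, 2, 3, 0, 0] - short sequence padded at the end
print(rows[1][-1])             # 100 - tokens 101-120 were truncated
```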
Results Interpretation

Before: Model training failed due to varying sequence lengths causing input shape errors. No validation accuracy available.

After: Sequences are padded/truncated to length 100. Model trains successfully with validation accuracy of 83%.

Padding and truncation fix input-shape issues in sequence data: once all inputs share one shape, the model trains successfully and validation accuracy can be measured and improved.
Bonus Experiment
Try using different padding positions (pre vs post) and observe how it affects model performance.
💡 Hint
Change the padding parameter in pad_sequences to 'pre' and compare validation accuracy.
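To make the difference in zero placement concrete before running the full experiment, here is a small illustrative helper (hypothetical, not the Keras API; the real call would be pad_sequences(..., padding='pre')):

```python
def pad(seq, maxlen, position, value=0):
    """Illustrative helper: place padding zeros before ('pre') or after ('post')."""
    trunc = seq[:maxlen]
    fill = [value] * (maxlen - len(trunc))
    return fill + trunc if position == 'pre' else trunc + fill

seq = [7, 8, 9]
print(pad(seq, 6, 'post'))  # [7, 8, 9, 0, 0, 0]
print(pad(seq, 6, 'pre'))   # [0, 0, 0, 7, 8, 9]
```

With the Flatten-based model above, pre vs post padding changes which Dense weights see real tokens, so accuracies can differ; for recurrent models, pre-padding is often preferred so the final hidden state is computed from real tokens rather than padding.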