
Padding and sequence length in NLP

Introduction

Padding makes all text sequences in a batch the same length by adding filler values, so a model can process them together. Sequence length is the number of tokens (or indexes) in each sequence.
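Conceptually, post-padding can be sketched in a few lines of plain Python (a simplified illustration, not the Keras implementation):

```python
# Post-padding: extend every sequence with a pad value until it
# reaches the length of the longest sequence in the batch.

def pad_post(sequences, pad_value=0):
    """Pad each sequence at the end so all have equal length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

print(pad_post([[1, 2, 3], [4, 5], [6]]))
# [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
```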

Common situations where padding is needed:
When you have sentences of different lengths and want to feed them into a machine learning model as a single batch.
When training a model in batches, since each batch must be a fixed-size tensor, whether the model is an RNN or a Transformer.
When batching multiple text samples together for faster processing.
When you want to compare or analyze text data uniformly.
When preparing data for models that do not handle variable-length input directly.
Syntax
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(sequences, maxlen=desired_length, padding='post', truncating='post', value=0)

sequences is a list of lists, where each inner list is a sequence of numbers (such as word indexes).

maxlen sets the fixed length for all sequences after padding or truncating. If maxlen is not given, sequences are padded to the length of the longest one.

padding and truncating choose where values are added or removed: 'pre' (the start, the default) or 'post' (the end).

value is the number used for padding (0 by default).

In recent TensorFlow releases, the same function is also available as tf.keras.utils.pad_sequences.

Examples
This pads sequences at the end with zeros to length 4.
padded = pad_sequences([[1, 2, 3], [4, 5]], maxlen=4, padding='post')
This cuts off extra elements from the start of any sequence longer than 3; shorter sequences are still padded to length 3 (at the start, by default).
padded = pad_sequences([[1, 2, 3], [4, 5]], maxlen=3, truncating='pre')
This pads sequences at the start with -1 to length 5.
padded = pad_sequences([[1, 2], [3, 4, 5]], maxlen=5, padding='pre', value=-1)
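To make the 'pre'/'post' distinction concrete, here is a plain-Python sketch (not the Keras internals) of the two truncation modes:

```python
# 'pre' drops elements from the start; 'post' drops them from the end.

def truncate(seq, maxlen, truncating="pre"):
    """Trim a sequence to maxlen, dropping from the chosen side."""
    if len(seq) <= maxlen:
        return seq
    return seq[-maxlen:] if truncating == "pre" else seq[:maxlen]

seq = [1, 2, 3, 4, 5]
print(truncate(seq, 3, "pre"))   # [3, 4, 5]  (start removed)
print(truncate(seq, 3, "post"))  # [1, 2, 3]  (end removed)
```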
Sample Model

This program shows how sequences of different lengths become the same length by adding zeros at the end.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample sequences of different lengths
sequences = [[10, 20, 30], [40, 50], [60]]

# Pad sequences to length 5, add zeros at the end
padded = pad_sequences(sequences, maxlen=5, padding='post', value=0)

print('Original sequences:')
print(sequences)
print('\nPadded sequences:')
print(padded)
Output:

Original sequences:
[[10, 20, 30], [40, 50], [60]]

Padded sequences:
[[10 20 30  0  0]
 [40 50  0  0  0]
 [60  0  0  0  0]]
Important Notes

The padding value is usually zero, but it can be changed if zero is a meaningful value in your data.
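If zero already carries meaning (for example, as a real word index), one common convention is to reserve index 0 for padding when building the vocabulary. The names below are illustrative:

```python
# Start word indexes at 1 so that 0 is free to act as the pad value.
words = ["the", "cat", "sat"]
word_to_index = {word: i + 1 for i, word in enumerate(words)}
print(word_to_index)  # {'the': 1, 'cat': 2, 'sat': 3}

encoded = [word_to_index[w] for w in ["cat", "sat"]]
padded = encoded + [0] * (4 - len(encoded))  # pad to length 4 with the reserved 0
print(padded)  # [2, 3, 0, 0]
```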

Truncating removes extra elements if sequences are longer than maxlen.

Consistent sequence length is important for batch processing in neural networks.
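Once all rows share one length, a simple 0/1 mask can record which positions hold real tokens, so a model can ignore padding. A minimal sketch, assuming 0 is the pad value:

```python
# Build a mask: 1 marks a real token, 0 marks a padded position.
batch = [[10, 20, 30, 0, 0],
         [40, 50, 0, 0, 0]]

mask = [[1 if token != 0 else 0 for token in row] for row in batch]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]
```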

Summary

Padding makes all sequences the same length by adding extra values.

Sequence length is the fixed size after padding or truncating.

This helps models process text data efficiently and correctly.