In natural language processing, why is padding used when preparing sequences for models?
Think about how models handle input data in batches.
Padding makes all sequences the same length, allowing batch processing. Without padding, sequences of different lengths can't be processed together efficiently.
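The idea can be sketched in a few lines of pure Python (no TensorFlow needed; `pad_batch` is a hypothetical helper that mimics what padding utilities do):

```python
# Hypothetical helper illustrating padding: every sequence in the batch
# is extended with a pad value until all share the same (max) length.
def pad_batch(sequences, pad_value=0):
    """Right-pad each sequence with pad_value to the batch's max length."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]

batch = [[1, 2, 3], [4, 5], [6]]
print(pad_batch(batch))  # [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
```

Once every row has the same length, the batch can be stacked into a single rectangular tensor and processed in one pass.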
Given the following code that pads sequences to a max length of 5, what is the output?
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2, 3], [4, 5], [6]]
padded = pad_sequences(sequences, maxlen=5, padding='post')
print(padded.tolist())
Check the padding='post' argument and maxlen=5.
padding='post' appends zeros after each sequence until it reaches length 5, so the output is [[1, 2, 3, 0, 0], [4, 5, 0, 0, 0], [6, 0, 0, 0, 0]].
You have text sequences of varying lengths from 10 to 100 tokens. You want to train an RNN model efficiently. Which sequence length choice is best?
Consider balancing information retention and training speed.
Padding to the median length balances information retention and training efficiency. Padding to the maximum length wastes computation on mostly-zero rows, truncating too aggressively loses information, and fully variable lengths complicate batching.
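A minimal sketch of this strategy, assuming hypothetical token counts and a hypothetical `pad_or_truncate` helper:

```python
import statistics

def pad_or_truncate(seq, target_len, pad_value=0):
    """Cut the sequence at target_len, or right-pad it with pad_value."""
    return seq[:target_len] + [pad_value] * max(0, target_len - len(seq))

lengths = [10, 25, 40, 60, 100]          # hypothetical per-sequence token counts
target = int(statistics.median(lengths))  # median length: 40
seqs = [list(range(n)) for n in lengths]
padded = [pad_or_truncate(s, target) for s in seqs]
print([len(p) for p in padded])  # every sequence is now exactly 40 tokens
```

Short sequences gain some padding, long outliers lose their tail, and the batch stays rectangular without paying for the 100-token outlier everywhere.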
When training a text classification model, how can excessive padding affect accuracy?
Think about how meaningless zeros might affect learning.
Excessive padding fills sequences with zeros that carry no information; unless the model masks them out, these padding tokens dilute the real signal and can reduce accuracy.
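To see how severe this can get, here is a small sketch (with hypothetical lengths and a hypothetical `padding_fraction` helper) computing what share of a padded batch is zeros when one long outlier forces the pad length up:

```python
def padding_fraction(lengths, target_len):
    """Fraction of tokens in the padded batch that are padding, not data."""
    total = target_len * len(lengths)            # tokens in the padded batch
    real = sum(min(n, target_len) for n in lengths)  # tokens carrying data
    return (total - real) / total

lengths = [10, 12, 15, 100]  # hypothetical: three short texts, one outlier
print(padding_fraction(lengths, max(lengths)))  # 0.6575 - mostly zeros
```

Padding everything to the outlier's length makes roughly two thirds of the batch meaningless zeros, which is exactly the situation where accuracy can suffer.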
Consider this code snippet:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2], [3, 4, 5, 6]]
padded = pad_sequences(sequences, maxlen=3, padding='post')
print(padded)
Why does this code truncate unexpectedly?
Check the maxlen parameter and sequence lengths.
When maxlen is smaller than a sequence's length, pad_sequences truncates it. The truncating parameter defaults to 'pre', so tokens are dropped from the beginning: [3, 4, 5, 6] becomes [4, 5, 6], while [1, 2] is padded to [1, 2, 0]. This front-truncation is unexpected if you assumed tokens would be removed from the end; pass truncating='post' to drop from the end instead.
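The difference between the two truncation modes can be reproduced without TensorFlow; this is a pure-Python sketch (the `pad_seqs` helper is hypothetical, mimicking pad_sequences' padding/truncating options):

```python
def pad_seqs(sequences, maxlen, padding='post', truncating='pre', value=0):
    """Sketch of pad_sequences-style behavior: truncate, then pad to maxlen."""
    out = []
    for s in sequences:
        if len(s) > maxlen:
            # 'pre' drops tokens from the front, 'post' from the back
            s = s[-maxlen:] if truncating == 'pre' else s[:maxlen]
        pad = [value] * (maxlen - len(s))
        out.append(pad + s if padding == 'pre' else s + pad)
    return out

seqs = [[1, 2], [3, 4, 5, 6]]
print(pad_seqs(seqs, 3))                     # [[1, 2, 0], [4, 5, 6]] - front dropped
print(pad_seqs(seqs, 3, truncating='post'))  # [[1, 2, 0], [3, 4, 5]] - back dropped
```

With the default 'pre' truncation the leading token 3 disappears, which is the surprising behavior the question describes.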