Challenge - 5 Problems
Text Preprocessing Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
intermediate · 2:00 remaining
Output of tokenization and padding
Given the following PyTorch code for tokenizing and padding sequences for an RNN, what is the output of the padded tensor?
PyTorch
import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
print(padded)
Attempts: 2 left
💡 Hint
Remember pad_sequence aligns sequences by padding shorter ones with zeros at the end when batch_first=True.
✗ Incorrect
pad_sequence with batch_first=True pads shorter sequences with the padding_value (0) at the end, so all sequences have the same length as the longest one (3 here).
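As a quick check, the quiz snippet can be run directly (assuming PyTorch is installed); the resulting tensor shows the trailing zero-padding described above:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
# All rows are padded to length 3, the longest sequence:
# tensor([[1, 2, 3],
#         [4, 5, 0],
#         [6, 0, 0]])
print(padded)
```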
🧠 Conceptual
intermediate · 1:30 remaining
Why use padding in RNN input sequences?
Why do we pad sequences to the same length before feeding them into an RNN?
Attempts: 2 left
💡 Hint
Think about how batches are processed in parallel.
✗ Incorrect
RNNs process batches of sequences in parallel, which requires all sequences in a batch to have the same length. Padding shorter sequences ensures uniform length.
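A minimal sketch of why uniform lengths matter: ragged tensors cannot be stacked into a single batch tensor, while padded ones can (assuming PyTorch is available):

```python
import torch

# Tensors of unequal length cannot be stacked into a single batch tensor:
ragged = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
stack_failed = False
try:
    torch.stack(ragged)
except RuntimeError:
    stack_failed = True  # stack requires all tensors to share one shape

# After padding to a common length, batching works:
padded = [torch.tensor([1, 2, 3]), torch.tensor([4, 5, 0])]
batch = torch.stack(padded)
print(batch.shape)  # torch.Size([2, 3])
```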
❓ Hyperparameter
advanced · 2:00 remaining
Choosing max sequence length for padding
When preprocessing text for an RNN, what is a common approach to decide the maximum sequence length for padding?
Attempts: 2 left
💡 Hint
Consider balancing memory use and information retention.
✗ Incorrect
Choosing a fixed max length based on domain knowledge or a percentile (e.g., 95th percentile) balances keeping most data and limiting padding overhead.
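The percentile approach can be sketched as follows (a toy corpus invented for illustration; any real dataset's token lengths would be used instead):

```python
import numpy as np

# Hypothetical corpus; in practice these would be tokenized training sentences
corpus = [
    "the cat sat",
    "a very much longer example sentence with many tokens",
    "short",
    "another sentence of moderate length here",
]
lengths = [len(s.split()) for s in corpus]

# Pick the 95th-percentile length as max_len: almost all sequences fit,
# and a few extreme outliers are truncated rather than padding everything
# to the single longest sentence.
max_len = int(np.percentile(lengths, 95))
print(max_len)
```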
🔧 Debug
advanced · 2:00 remaining
Error in embedding input shape for RNN
What error will this PyTorch code raise when feeding input to an RNN embedding layer?
import torch
import torch.nn as nn
embedding = nn.Embedding(10, 3)
inputs = torch.tensor([1, 2, 3, 4])
embedded = embedding(inputs)
rnn = nn.RNN(input_size=3, hidden_size=5, batch_first=True)
output, hidden = rnn(embedded)
Attempts: 2 left
💡 Hint
Check the shape of the tensor passed to the RNN.
✗ Incorrect
With batch_first=True, the RNN expects a 3D batched input of shape (batch_size, seq_len, input_size), but embedded has shape (seq_len, embedding_dim) = (4, 3), which is 2D. In PyTorch versions before 1.11 this raises a RuntimeError about the input dimensionality; since 1.11, nn.RNN also accepts unbatched 2D input, so on recent versions the code runs without error.
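One common fix is to add an explicit batch dimension with unsqueeze(0), which makes the input the 3D shape the batched API expects (a sketch assuming PyTorch is available):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 3)
rnn = nn.RNN(input_size=3, hidden_size=5, batch_first=True)

inputs = torch.tensor([1, 2, 3, 4])  # shape (4,)
embedded = embedding(inputs)         # shape (4, 3): 2D, no batch dimension

# Add a batch dimension so the input is (batch=1, seq_len=4, input_size=3):
output, hidden = rnn(embedded.unsqueeze(0))
print(output.shape)  # torch.Size([1, 4, 5])
```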
❓ Model Choice
expert · 2:30 remaining
Best preprocessing for variable-length text sequences in RNN training
You have a dataset of sentences with widely varying lengths. You want to train an RNN efficiently. Which preprocessing approach is best?
Attempts: 2 left
💡 Hint
Consider both efficiency and preserving sequence information.
✗ Incorrect
Using pack_padded_sequence with padded sequences allows the RNN to ignore padded tokens and process variable-length sequences efficiently.
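The pad-then-pack pipeline can be sketched end to end (random embeddings stand in for real data; lengths are sorted descending so the default enforce_sorted=True holds):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three pre-embedded sequences of lengths 3, 2, 1, each with feature size 4
sequences = [torch.randn(3, 4), torch.randn(2, 4), torch.randn(1, 4)]
lengths = torch.tensor([3, 2, 1])

padded = pad_sequence(sequences, batch_first=True)  # shape (3, 3, 4)
# Packing tells the RNN the true length of each row, so padded
# timesteps are skipped rather than processed:
packed = pack_padded_sequence(padded, lengths, batch_first=True)

rnn = nn.RNN(input_size=4, hidden_size=5, batch_first=True)
packed_out, hidden = rnn(packed)

# Unpack back to a padded tensor for downstream layers:
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape)   # torch.Size([3, 3, 5])
print(out_lengths)    # tensor([3, 2, 1])
```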