In natural language processing, what is the main effect of increasing the context window size when training a language model?
Think about how much text the model can see at once.
Increasing the context window size lets the model attend to more tokens at once, which helps it capture long-range dependencies: relationships between words that are far apart in the text.
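A minimal sketch of why window size matters (the sentence and window lengths here are illustrative assumptions, not from the original question):

```python
sentence = "The cat that chased the mouse was tired"
tokens = sentence.split()

# Suppose the model must predict the final word, "tired".
context = tokens[:-1]

# A small context window shows only the most recent tokens, so the
# true subject ("cat") falls out of view; a larger window keeps it.
small_window = context[-3:]
large_window = context[-7:]

print(small_window)  # ['the', 'mouse', 'was'] -- "cat" is out of view
print(large_window)  # includes 'cat', so the long-range link survives
```

With the three-token window the model cannot tell whether "was tired" refers to the cat or the mouse; the wider window resolves the ambiguity.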
What is the output of the following Python code that simulates a sliding context window over tokens?
tokens = ['I', 'love', 'machine', 'learning', 'and', 'AI']
window_size = 3
windows = [tokens[i:i+window_size] for i in range(len(tokens) - window_size + 1)]
print(windows)
Look at how the list comprehension slices the tokens with a fixed window size.
The code creates overlapping windows of size 3 that advance one token at a time, so it prints [['I', 'love', 'machine'], ['love', 'machine', 'learning'], ['machine', 'learning', 'and'], ['learning', 'and', 'AI']].
You want to build a model that can understand very long documents (thousands of words). Which model architecture is best suited to handle such long context windows efficiently?
Consider models designed to reduce computation for long sequences.
Transformers with sparse attention or memory mechanisms can handle longer contexts by focusing attention on important parts, reducing computation compared to standard transformers.
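One common sparse-attention pattern is a local (sliding-window) mask, where each token attends only to its neighbors. The function below is a simplified illustration of that idea (the name `local_attention_mask` and the parameters are assumptions for this sketch, not a specific library API):

```python
import numpy as np

def local_attention_mask(seq_len, window):
    # Each token attends only to tokens within `window` positions of
    # itself, so the number of attended pairs grows linearly with
    # sequence length instead of quadratically.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=8, window=2)
print(mask.astype(int))
print("attended pairs:", mask.sum(), "vs full attention:", 8 * 8)
```

For a sequence of 8 tokens and a window of 2, only 34 of the 64 possible token pairs are attended; the savings grow as sequences get longer.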
How does increasing the context window size affect the training time of a Transformer-based language model?
Think about how attention computation scales with sequence length.
Self-attention in transformers computes relationships between all token pairs, so computation grows roughly with the square of the context window size.
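The quadratic growth can be seen by simply counting the pairwise attention scores at a few context lengths (the lengths below are illustrative):

```python
# Every token attends to every token, so a context of n tokens
# requires n * n attention scores per layer and head.
for n in (512, 1024, 2048):
    pairs = n * n
    print(f"context {n:>5}: {pairs:>10,} attention scores")
```

Doubling the context window roughly quadruples the work of each attention layer, which is why training time rises sharply with longer contexts.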
What happens when the following code tries to create context windows larger than the token list? Does it raise an error?
tokens = ['a', 'b', 'c']
window_size = 4
windows = [tokens[i:i+window_size] for i in range(len(tokens) - window_size + 1)]
print(windows)
Check what happens when the range argument is negative.
No error is raised. The range becomes range(3 - 4 + 1) = range(0), which is empty, so the list comprehension produces no windows and the code prints the empty list [].
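If the silent empty result is undesirable, one option is to clamp the window to the number of available tokens. This guard is an illustrative choice, not part of the original snippet; padding the token list would be another reasonable fix:

```python
tokens = ['a', 'b', 'c']
window_size = 4

# Clamp the window so it never exceeds the token list; this yields
# one window containing all the tokens instead of an empty result.
effective = min(window_size, len(tokens))
windows = [tokens[i:i+effective] for i in range(len(tokens) - effective + 1)]
print(windows)  # [['a', 'b', 'c']]
```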