Transformer modeling in Simulink - Time & Space Complexity
We want to understand how the time needed to run a Transformer model changes as the input size grows.
Specifically, how does the model's processing time increase when we feed it longer sequences?
Analyze the time complexity of the following Transformer attention block in Simulink.
```matlab
% Transformer Attention Block (illustrative MATLAB-style pseudocode;
% Inport/Outport and DenseLayer stand in for Simulink blocks)
InputSequence = Inport('Sequence');          % n-by-d input: n tokens, d features
Q = DenseLayer(InputSequence, 'WeightsQ');   % queries, n-by-d
K = DenseLayer(InputSequence, 'WeightsK');   % keys,    n-by-d
V = DenseLayer(InputSequence, 'WeightsV');   % values,  n-by-d
AttentionScores = Q * K';                    % n-by-n score matrix: the O(n^2) step
AttentionWeights = Softmax(AttentionScores); % normalize each row to sum to 1
Output = AttentionWeights * V;               % weighted sum of values, n-by-d
Outport('Output', Output);
```
This code computes self-attention by multiplying queries and keys, then applying weights to values.
Look for repeated calculations that depend on input size.
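To make the data flow concrete, the block above can be sketched in plain NumPy. This is a minimal single-head sketch, not the Simulink block's actual implementation; the weight matrices and sizes (`n = 8`, `d = 4`) are made-up placeholders:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention on an (n, d) input sequence."""
    Q = x @ Wq                       # queries, shape (n, d)
    K = x @ Wk                       # keys,    shape (n, d)
    V = x @ Wv                       # values,  shape (n, d)
    scores = Q @ K.T                 # (n, n) score matrix: the quadratic step
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V               # (n, d) output

n, d = 8, 4                          # toy sizes for illustration
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (8, 4)
```

Note that the intermediate `scores` matrix is n-by-n regardless of `d`: that single line is where the quadratic cost lives.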
- Primary operation: the matrix product Q * K', which compares every query row with every key row.
- How many times: each of the n sequence positions is compared against all n positions, producing n × n = n² score entries. (Each entry itself costs d multiplications, but d is a fixed model dimension, so growth is driven by n².)
As the input sequence length grows, the number of operations grows quadratically, far faster than the sequence itself.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 100 (10×10) |
| 100 | 10,000 (100×100) |
| 1000 | 1,000,000 (1000×1000) |
Pattern observation: the operation count grows with the square of the input size, so doubling the input quadruples the work.
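A quick sanity check of the table and the doubling claim, counting one operation per score entry (a simplification that ignores the constant factor d):

```python
def attention_ops(n):
    """Approximate operation count: one per entry of the n-by-n score matrix."""
    return n * n

for n in (10, 100, 1000):
    print(n, attention_ops(n))   # matches the table: 100, 10000, 1000000

# Doubling the sequence length quadruples the work:
print(attention_ops(2000) / attention_ops(1000))  # 4.0
```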
Time Complexity: O(n²)
This means the time to compute attention grows with the square of the input length.
[X] Wrong: "The attention calculation grows linearly with input size because it processes each element once."
[OK] Correct: Each element compares with every other element, so the number of comparisons grows as n², not linearly.
Understanding how Transformer attention scales helps you explain model performance and design choices clearly in real projects.
What if we changed the attention to only look at a fixed number of nearby elements instead of all? How would the time complexity change?
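As a starting point for that question: if each position attends only to a fixed window of w neighbors, the score matrix shrinks from n × n to at most n × w entries. A hypothetical operation count, assuming a fixed window size `w` (the value 32 is an arbitrary illustration):

```python
def full_attention_ops(n):
    # every position attends to every position
    return n * n

def windowed_attention_ops(n, w=32):
    # each of the n positions attends to at most w nearby positions
    return n * min(w, n)

for n in (100, 1000, 10000):
    print(n, full_attention_ops(n), windowed_attention_ops(n))

# With w fixed, the windowed count grows linearly in n: O(n * w) = O(n),
# so doubling the input only doubles the work instead of quadrupling it.
print(windowed_attention_ops(2000) / windowed_attention_ops(1000))  # 2.0
```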