PyTorch · ML · ~15 mins

Positional encoding in PyTorch - Deep Dive

Overview - Positional encoding
What is it?
Positional encoding is a way to add information about the order of words or tokens in a sequence to a model. Since some models, like transformers, do not process data in order, positional encoding helps them understand the position of each token. It creates a unique pattern for each position that the model can learn from. This allows the model to use the order of words to make better predictions.
Why it matters
Without positional encoding, models that process sequences all at once would treat the input tokens as if they were unordered, like a bag of words. This would make it impossible to understand sentences or time series where order matters. Positional encoding solves this by giving the model a sense of position, enabling it to learn relationships that depend on order, such as grammar or time dependencies.
Where it fits
Before learning positional encoding, you should understand basic neural networks and the transformer architecture. After mastering positional encoding, you can explore advanced transformer models, attention mechanisms, and sequence modeling tasks like language translation or time series forecasting.
Mental Model
Core Idea
Positional encoding adds unique position information to each token so models can understand the order in sequences that process data all at once.
Think of it like...
It's like adding a numbered label to each book on a shelf so you know their order, even if you look at all books at once.
Sequence: [Token1] [Token2] [Token3] ...
Positions:   1        2        3   ...

Positional Encoding:
  Token1 + Pos1 → Vector with unique pattern
  Token2 + Pos2 → Vector with unique pattern
  Token3 + Pos3 → Vector with unique pattern

Model input = Token embedding + Positional encoding
Build-Up - 7 Steps
1
Foundation: Why order matters in sequences
🤔
Concept: Sequences have order, and understanding this order is key to meaning.
Imagine the sentence 'I love cats' versus 'Cats love I'. The words are the same but the meaning changes because of order. Models need to know this order to understand language or time series data.
Result
Recognizing that order changes meaning helps us see why models must know token positions.
Understanding that sequence order carries meaning is the foundation for why positional encoding is necessary.
2
Foundation: Limitations of transformer input processing
🤔
Concept: Transformers process all tokens simultaneously, losing natural order information.
Unlike RNNs that read tokens one by one, transformers look at all tokens at once. This parallel processing is fast but means the model doesn't know which token came first unless we add position info.
Result
Seeing that transformers lack built-in order awareness shows the need for positional encoding.
Knowing how transformers process input reveals why positional encoding is essential to restore order information.
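This order-blindness can be demonstrated directly. The toy sketch below uses bare scaled dot-product attention with no projections and no positional encoding (an illustrative simplification, not a full transformer layer): permuting the input tokens merely permutes the outputs in exactly the same way, so attention by itself carries no notion of which token came first.

```python
import torch

torch.manual_seed(0)

def self_attention(x):
    # Plain scaled dot-product self-attention: no projections, no position info
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(5, 8)            # 5 tokens, 8 dims
perm = torch.randperm(5)         # a random reordering of the tokens

out = self_attention(x)
out_perm = self_attention(x[perm])

# Shuffling the tokens just shuffles the outputs identically:
print(torch.allclose(out[perm], out_perm, atol=1e-5))  # True
```

Because the output for each token depends only on the set of tokens, not their order, any order information has to be injected explicitly, which is exactly what positional encoding does.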
3
Intermediate: How positional encoding represents positions
🤔 Before reading on: do you think positional encoding uses learned numbers or fixed math formulas? Commit to your answer.
Concept: Positional encoding creates unique vectors for each position using sine and cosine functions at different frequencies.
Instead of learning position vectors, the original transformer uses fixed sine and cosine waves to encode positions. Each dimension of the positional vector uses a different frequency, creating unique patterns for each position.
Result
Each position gets a unique, continuous vector that the model can add to token embeddings.
Understanding the use of sine and cosine functions explains how positional encoding can represent arbitrarily many positions without learning a separate vector for each one.
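Concretely, the original transformer defines PE(pos, 2k) = sin(pos / 10000^(2k/d_model)) and PE(pos, 2k+1) = cos(pos / 10000^(2k/d_model)). A minimal scalar sketch of these formulas (toy d_model of 8 chosen for illustration):

```python
import math

def pe_value(pos, i, d_model):
    """PE(pos, 2k) = sin(pos / 10000^(2k/d_model)); PE(pos, 2k+1) = cos of the same angle."""
    angle = pos / (10000 ** (2 * (i // 2) / d_model))
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)

# Two different positions produce two different vectors:
v0 = [pe_value(0, i, 8) for i in range(8)]  # position 0 → [0, 1, 0, 1, ...]
v1 = [pe_value(1, i, 8) for i in range(8)]
print(v0 != v1)  # True
```

Each pair of dimensions oscillates at its own frequency, so the combination of values across all dimensions is unique per position.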
4
Intermediate: Adding positional encoding to token embeddings
🤔 Before reading on: do you think positional encoding replaces token embeddings or is combined with them? Commit to your answer.
Concept: Positional encoding vectors are added to token embeddings to create input vectors with both content and position info.
For each token, its embedding vector is summed with its positional encoding vector. This combined vector is then fed into the transformer layers, allowing the model to use both word meaning and position.
Result
The model input contains both what the token is and where it is in the sequence.
Knowing that positional encoding is added, not replaced, preserves token meaning while adding order information.
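A short sketch of this sum in PyTorch (toy sizes and arbitrary token ids, chosen only for illustration):

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, seq_len, d_model = 100, 6, 16

# Token embeddings: learned, carry meaning
embedding = nn.Embedding(vocab_size, d_model)

# Sinusoidal positional encoding: fixed, carries position
position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

tokens = torch.tensor([[3, 14, 15, 9, 2, 6]])  # a batch of one sequence
x = embedding(tokens) + pe                     # sum, not replacement
print(x.shape)  # torch.Size([1, 6, 16])
```

The addition broadcasts over the batch dimension, so the same position table is reused for every sequence in the batch.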
5
Intermediate: Implementing positional encoding in PyTorch
🤔 Before reading on: do you think positional encoding is a fixed tensor or learned parameter in PyTorch? Commit to your answer.
Concept: Positional encoding can be implemented as a fixed tensor using sine and cosine functions, added to embeddings during training.
Here is a PyTorch example creating positional encoding:

import math
import torch

def get_positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
    return pe

pos_encoding = get_positional_encoding(50, 512)  # 50 tokens, 512 dims

This tensor is added to token embeddings before they enter the model.
Result
A fixed positional encoding tensor is created that can be reused for any input batch.
Seeing the code demystifies how positional encoding vectors are generated and applied in practice.
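In practice this table is often packaged as an nn.Module that registers the encoding as a buffer, so it is saved with the model and moved to the right device, but never trained. A sketch of that common pattern (max_len and the sizes below are illustrative):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds a fixed sinusoidal positional encoding to a batch of embeddings."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Buffer: part of the model's state, but not a trainable parameter
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        # x: (batch, seq_len, d_model); slice the table to the input length
        return x + self.pe[:, : x.size(1)]

enc = PositionalEncoding(d_model=512)
y = enc(torch.zeros(2, 50, 512))
print(y.shape)  # torch.Size([2, 50, 512])
```

Slicing the precomputed table in forward means one module instance serves any sequence length up to max_len.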
6
Advanced: Learned vs. fixed positional encoding tradeoffs
🤔 Before reading on: do you think learned positional encodings always perform better than fixed ones? Commit to your answer.
Concept: Positional encoding can be fixed (sine/cosine) or learned parameters, each with pros and cons.
Fixed encodings provide smooth, continuous position info and generalize to longer sequences. Learned encodings let the model adapt position info but may not generalize well beyond training length. Some models combine both or use relative position encodings.
Result
Choosing encoding type affects model flexibility and generalization.
Understanding these tradeoffs helps in designing models for different tasks and sequence lengths.
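The generalization limit of learned encodings is easy to see: an nn.Embedding table has exactly max_len rows, so positions past that length cannot even be looked up (sizes below are illustrative):

```python
import torch
import torch.nn as nn

max_len, d_model = 32, 16
pos_embedding = nn.Embedding(max_len, d_model)  # learned: one row per absolute position

short = pos_embedding(torch.arange(10))         # within the trained range: fine
print(short.shape)                              # torch.Size([10, 16])

# Positions beyond max_len have no row to look up:
try:
    pos_embedding(torch.arange(40))
    extrapolates = True
except IndexError:
    extrapolates = False
print(extrapolates)  # False: learned embeddings cannot extend past max_len
```

A sinusoidal table, by contrast, can simply be computed for as many positions as the input needs.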
7
Expert: Positional encoding in relative attention models
🤔 Before reading on: do you think relative positional encoding encodes absolute positions or differences? Commit to your answer.
Concept: Relative positional encoding encodes the distance between tokens rather than their absolute positions.
Instead of adding fixed position vectors, relative positional encoding modifies attention scores based on how far apart tokens are. This allows models to focus on relative order, improving performance on longer or variable-length sequences.
Result
Models better capture relationships independent of absolute position, improving generalization.
Knowing relative positional encoding reveals how models can flexibly understand order beyond fixed positions, a key advance in transformer design.
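A simplified single-head sketch in the spirit of T5-style additive relative position bias. The clipping window and the random stand-in for the learned bias table are illustrative assumptions, not any specific model's implementation:

```python
import torch

torch.manual_seed(0)
seq_len, d_model, max_dist = 6, 16, 4

q = torch.randn(seq_len, d_model)
k = torch.randn(seq_len, d_model)

# Relative distance matrix: rel[i, j] = j - i, clipped to a window,
# then shifted so it can index a table of size 2 * max_dist + 1
pos = torch.arange(seq_len)
rel = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist) + max_dist

# One (normally learned) scalar bias per relative distance, added to scores
bias_table = torch.randn(2 * max_dist + 1)
scores = q @ k.T / d_model ** 0.5 + bias_table[rel]

attn = torch.softmax(scores, dim=-1)
print(attn.shape)  # torch.Size([6, 6])
```

Because the bias depends only on j - i, the same pattern applies at every absolute position, which is why relative schemes transfer more gracefully to longer sequences.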
Under the Hood
Positional encoding works by creating a unique vector for each position using sine and cosine waves at different frequencies. These vectors are added to token embeddings, so the model input contains both token meaning and position info. The model's attention mechanism then uses these combined vectors to learn relationships that depend on token order. Internally, the sine and cosine functions produce smooth, continuous signals that help the model distinguish positions and interpolate between them.
Why designed this way?
The original transformer used fixed sine and cosine functions to avoid learning position embeddings, which could overfit or fail to generalize to longer sequences. The choice of different frequencies allows encoding positions uniquely and continuously. Alternatives like learned embeddings were possible but risked losing generalization. This design balances simplicity, generalization, and performance.
Input tokens → Token embeddings ─┐
                                 │
                                 +→ Add positional encoding → Model input

Positional encoding:
Position index → Sine & Cosine functions at multiple frequencies → Position vector

Model input = Token embedding + Position vector

Transformer layers process combined input preserving order info.
Myth Busters - 4 Common Misconceptions
Quick: Does positional encoding change the token meaning vectors themselves? Commit yes or no.
Common Belief: Positional encoding replaces the token embeddings with position information.
Reality: Positional encoding is added to token embeddings, not substituted for them, so token meaning is preserved alongside position info.
Why it matters: Replacing embeddings would lose the original token meaning, harming model understanding and performance.
Quick: Do learned positional encodings always outperform fixed ones? Commit yes or no.
Common Belief: Learned positional encodings are always better because the model can adapt them.
Reality: Learned encodings can overfit and fail to generalize to longer sequences, while fixed encodings generalize better.
Why it matters: Choosing learned encodings blindly can reduce model robustness on unseen sequence lengths.
Quick: Is positional encoding only needed for transformers? Commit yes or no.
Common Belief: Only transformer models need positional encoding because other models process sequences differently.
Reality: Any model that processes sequences without inherent order awareness, like some convolutional models, may benefit from positional encoding.
Why it matters: Ignoring positional encoding in other architectures can limit their ability to learn order-dependent patterns.
Quick: Does positional encoding encode absolute positions only? Commit yes or no.
Common Belief: Positional encoding always encodes absolute positions of tokens.
Reality: Some models use relative positional encoding, which encodes distances between tokens rather than absolute positions.
Why it matters: Understanding relative encoding is key to grasping advanced transformer improvements and better sequence generalization.
Expert Zone
1
Positional encoding vectors form a continuous space allowing interpolation for unseen positions, which helps models generalize beyond training lengths.
2
The choice of frequencies in sine/cosine functions affects how well the model can distinguish close versus distant positions.
3
Relative positional encoding modifies attention scores directly, which can reduce memory usage and improve efficiency in long sequences.
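The first two points can be checked numerically: with the sinusoidal table from the implementation step, the dot product between position vectors tends to be larger for nearby positions than for distant ones, which is what gives the encoding its smooth, distance-aware structure (the sizes and positions below are illustrative):

```python
import math
import torch

def get_positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = get_positional_encoding(100, 64)
near = torch.dot(pe[10], pe[11])  # adjacent positions
far = torch.dot(pe[10], pe[60])   # distant positions
print(near > far)                 # tensor(True)
```

This falls out of the math: the dot product between two position vectors is a sum of cosines of the frequency-scaled position offset, so it shrinks as the offset grows.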
When NOT to use
Positional encoding is less useful in models that inherently process sequences in order, like RNNs or LSTMs, which track position through their recurrent steps. For tasks where absolute position is irrelevant, such as bag-of-words models, positional encoding is unnecessary.
Production Patterns
In production, fixed positional encoding is common for standard transformers due to simplicity and generalization. Learned positional embeddings are used when training data is large and fixed sequence length is guaranteed. Relative positional encoding is popular in large language models and speech recognition systems to handle variable-length inputs efficiently.
Connections
Fourier Transform
Positional encoding uses sine and cosine waves similar to Fourier basis functions.
Understanding Fourier transforms helps grasp why sine and cosine functions can uniquely represent positions as continuous signals.
Time Series Analysis
Both positional encoding and time series methods rely on order and temporal patterns in data.
Knowing how time series models use time steps clarifies why positional encoding is crucial for sequence models to capture temporal dependencies.
Music Notation
Positional encoding is like musical notes having timing and order to create melody.
Recognizing that order and timing create meaning in music helps understand why models need position info to interpret sequences.
Common Pitfalls
#1: Using learned positional embeddings without enough data or sequence-length variety.
Wrong approach:
  pos_embedding = torch.nn.Embedding(max_len, d_model)  # learned from limited data only
Correct approach:
  pos_encoding = get_positional_encoding(max_len, d_model)  # fixed sine/cosine encoding generalizes better
Root cause: Assuming learned embeddings always outperform fixed ones ignores overfitting and poor generalization risks.
#2: Replacing token embeddings with positional encoding vectors.
Wrong approach:
  model_input = positional_encoding  # ignores token embeddings entirely
Correct approach:
  model_input = token_embeddings + positional_encoding  # combine both
Root cause: Misunderstanding that positional encoding supplements rather than replaces token meaning.
#3: Applying positional encoding to unordered data where position is irrelevant.
Wrong approach: Adding positional encoding to bag-of-words model inputs.
Correct approach: Skip positional encoding for unordered data, or use models designed for unordered inputs.
Root cause: Not recognizing when sequence order is meaningful leads to unnecessary complexity.
Key Takeaways
Positional encoding gives models a way to understand the order of tokens in sequences that are processed all at once.
It works by adding unique position vectors, often created with sine and cosine functions, to token embeddings.
This combined input lets models learn relationships that depend on token order, essential for language and time series tasks.
There are fixed and learned positional encodings, each with tradeoffs in generalization and flexibility.
Advanced models use relative positional encoding to focus on distances between tokens, improving performance on variable-length sequences.