Building Intuition from First Principles
Shuffle the words in "The cat sat on the mat" and a transformer sees no difference. It processes tokens in parallel with no notion of order. So how do models like GPT know that "cat" comes before "sat"?
PE(pos, 2i) = sin(pos / 10000^(2i/d))
That's the formula. But why sine? Why 10000? Why alternate sine and cosine? We'll derive it from scratch, and each choice will feel inevitable.
A transformer's self-attention treats its input as a set, not a sequence. Unlike RNNs, which process tokens one by one, transformers see all tokens simultaneously, with no built-in notion of "first" or "last."
This is powerful for parallelization, but it creates a fundamental problem: "The cat sat on the mat" and "mat the on sat cat The" produce identical attention patterns.
Let's prove this. We'll create three token embeddings and compute their attention scores:
Now shuffle the token order and recompute:
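A minimal sketch of this experiment in NumPy, using toy 4-dimensional embeddings and plain dot-product attention with no learned projections (the numbers are purely illustrative):

```python
import numpy as np

def attention_scores(x):
    """Softmax-normalized dot-product attention scores (no learned projections)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))        # three tokens, 4-dim embeddings

original = attention_scores(tokens)

# Shuffle the token order and recompute.
perm = np.array([2, 0, 1])
shuffled = attention_scores(tokens[perm])

# The shuffled attention matrix is just the original with its rows and columns permuted.
print(np.allclose(shuffled, original[perm][:, perm]))   # True
```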
Identical attention patterns, just permuted. The model genuinely cannot distinguish these orderings.
Our goal: inject position information into the embeddings so that each position gets a distinct signature, and positions that are close in the sequence end up close in embedding space. The model can then learn to attend based on both content and position.
The simplest idea: add the position number directly. Position 0 adds 0, position 1 adds 1, and so on.
Here's a small example — three token embeddings with values around [-1, 1]:
Adding positions 0, 1, 2 works fine for short sequences. But what about position 9999?
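A small sketch of the failure mode, using toy 4-dimensional embeddings (the values and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
cat = rng.uniform(-1, 1, size=4)        # "cat" embedding, values in [-1, 1]
dog = rng.uniform(-1, 1, size=4)        # "dog" embedding

# Positions 0, 1, 2: the position signal is the same order of magnitude as the content.
for pos in range(3):
    print(pos, cat + pos)

# Position 9999: the content is a rounding error on top of the position.
cat_far = cat + 9999
dog_far = dog + 9999
print(np.linalg.norm(cat_far - dog_far) / np.linalg.norm(cat_far))  # tiny relative difference
```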
The position signal drowns out the semantics. "Cat" and "dog" at position 9999 become indistinguishable.
We could normalize to [0, 1], but then "position 5" means different things in different length sequences. We need something better.
Let's think about what a good positional encoding requires:

- Bounded: values shouldn't grow with position and drown out the token's meaning.
- Unique: every position should get a distinct encoding.
- Smooth: nearby positions should get similar encodings, so the model can generalize across positions.
- Consistent: the encoding for a given position shouldn't depend on the sequence length.
Sine is bounded [-1, 1], smooth, and consistent regardless of sequence length. Let's try sin(position):
Positions 0, 1, 2 get values 0, 0.84, 0.91 — distinct and bounded:
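A quick numeric check (a small sketch), printing a few extra positions so the wrap-around is visible too:

```python
import numpy as np

for pos in range(14):
    print(f"position {pos:2d}: sin = {np.sin(pos):+.3f}")
```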
Promising! But sine has a period of 2π ≈ 6.28. What happens at position 6?
Position 6 sits almost a full period past position 0, so their encodings end up closer together than positions 0 and 1 are, and beyond that the values simply keep cycling. For sequences longer than ~6 tokens, we're back to ambiguity.
We can control how fast sine repeats with a frequency multiplier ω, encoding each position as sin(ω · position):
Lower ω = slower oscillation = longer before it repeats. Let's try ω = 0.1:
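The same check at ω = 0.1 (a small sketch):

```python
import numpy as np

omega = 0.1
for pos in range(7):
    print(f"position {pos}: sin({omega} * {pos}) = {np.sin(omega * pos):+.3f}")
```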
No more collisions at position 6! But now positions 0 and 1 are almost identical:
High frequency means good local discrimination but bad global. Low frequency is the opposite. We need both.
Think of a clock: the hour hand (slow), minute hand (medium), and second hand (fast) together uniquely identify any moment. We can do the same with position: use multiple dimensions, each oscillating at a different frequency.
Two dimensions — one high frequency (ω=1) for local, one low (ω=0.1) for global:
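A sketch of this two-dimensional encoding, printing the (fast, slow) pair for each position (the frequencies 1 and 0.1 come from the discussion above):

```python
import numpy as np

def encode(pos, frequencies=(1.0, 0.1)):
    """Positional signature: one sine value per frequency."""
    return np.array([np.sin(w * pos) for w in frequencies])

for pos in range(7):
    fast, slow = encode(pos)
    print(f"position {pos}: fast = {fast:+.3f}, slow = {slow:+.3f}")
```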
Now each position has a unique 2D signature. Position 0 ≠ position 6 because even though the high-frequency component repeats, the low-frequency one doesn't:
Plotting both frequencies shows how they complement each other:
There's still a subtle issue: sin(θ) = sin(π - θ). Two different positions can have the same sine value. The fix? Add cosine — which is 90° out of phase.
When sine values collide, cosine values differ:
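For instance, at ω = 0.1 the positions 14 and 17 sit almost symmetrically around π/2, so their sines nearly collide while their cosines clearly differ (a small sketch):

```python
import numpy as np

omega = 0.1
for pos in (14, 17):
    theta = omega * pos
    print(f"position {pos}: sin = {np.sin(theta):+.3f}, cos = {np.cos(theta):+.3f}")
# sin is ~0.99 for both, but cos has opposite signs: the (sin, cos) pair still separates them.
```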
Geometrically, (sin(θ), cos(θ)) traces a circle. Each position is a unique point — no collisions possible:
With sin/cos pairs at multiple frequencies, we have a complete encoding scheme:
The original Transformer paper uses a specific formula for the frequencies:

ω_i = 1 / 10000^(2i/d)

Where i is the dimension index and d is the model dimension. This creates a geometric progression of wavelengths:
| Dim (i) | Frequency ω | Wavelength (positions) | Purpose |
|---|---|---|---|
| 0 | 1.0 | ~6 positions | Fine-grained: adjacent tokens |
| d/4 | 0.01 | ~600 positions | Medium: sentence structure |
| d/2 | 0.0001 | ~60,000 positions | Coarse: document structure |
Why 10000? This "base" sets the slowest frequency: the lowest-frequency dimension completes a full cycle only after 2π · 10000 ≈ 63,000 positions, so encodings stay unique over contexts of roughly that length. In effect, the base determines the maximum context length the encoding can handle before patterns start repeating.
Let's compute these frequencies for a small embedding dimension:
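A sketch for a small model dimension, say d = 8 (four sin/cos pairs), using ω_i = 1 / 10000^(2i/d):

```python
import numpy as np

d = 8                                   # small model dimension: four sin/cos pairs
i = np.arange(d // 2)
omega = 1.0 / 10000 ** (2 * i / d)      # geometric progression of frequencies
wavelength = 2 * np.pi / omega          # positions until each pair repeats

for idx, (w, lam) in enumerate(zip(omega, wavelength)):
    print(f"pair {idx}: omega = {w:.6f}, wavelength ~ {lam:,.0f} positions")
```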
High frequency dimensions change rapidly (good for nearby positions). Low frequency dimensions barely change (but stay unique over long distances):
Explore how different dimension pairs behave:
High frequency: changes rapidly, good for distinguishing nearby positions. Low frequency: changes slowly, good for staying distinct across distant positions.
The complete formula from "Attention Is All You Need":

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Here is the complete implementation:
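One way to write it in NumPy (a sketch; the function name and the vectorized layout are our own choices):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings.

    Assumes d_model is even: even columns hold sines, odd columns hold cosines.
    """
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = positions / 10000 ** (2 * i / d_model)        # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)        # (50, 16)
```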
Adding positional encoding to token embeddings:
The same token at different positions now has different representations:
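A quick check, reusing `sinusoidal_positional_encoding` from the sketch above (the token embedding here is random, purely for illustration):

```python
import numpy as np

# Uses sinusoidal_positional_encoding defined above.
d_model = 16
pe = sinusoidal_positional_encoding(seq_len=50, d_model=d_model)

rng = np.random.default_rng(0)
cat = rng.normal(size=d_model)           # the same "cat" embedding...

cat_at_0 = cat + pe[0]                   # ...at position 0
cat_at_7 = cat + pe[7]                   # ...at position 7

print(np.allclose(cat_at_0, cat_at_7))       # False: the representations now differ
print(np.linalg.norm(cat_at_0 - cat_at_7))   # by exactly ||pe[0] - pe[7]||
```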
The positional encoding forms a beautiful pattern. Each row is a position, each column is a dimension. Notice the different wavelengths across dimensions:
Explore the matrix interactively. Adjust sequence length and model dimension:
Looking at specific dimension pairs to see the frequency differences:
Does our encoding satisfy the requirements we set out? Let's check:
Bounded: Values stay in [-1, 1] regardless of position:
Unique: Each position gets a distinct encoding (measuring pairwise distances):
Smooth: Nearby positions have similar encodings (small distance), far positions differ more:
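A numeric spot-check of all three properties, again reusing `sinusoidal_positional_encoding` from above (the dimensions are illustrative):

```python
import numpy as np

# Uses sinusoidal_positional_encoding defined above.
pe = sinusoidal_positional_encoding(seq_len=100, d_model=16)

# Bounded: every value lies in [-1, 1].
print(pe.min() >= -1 and pe.max() <= 1)                      # True

# Unique: the smallest distance between two different positions is nonzero.
dists = np.linalg.norm(pe[:, None, :] - pe[None, :, :], axis=-1)
off_diag = dists[~np.eye(len(pe), dtype=bool)]
print(off_diag.min() > 0)                                     # True

# Smooth: adjacent positions are much closer than positions 50 apart.
print(dists[0, 1], dists[0, 50])                              # small vs noticeably larger
```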
A beautiful property: PE(i) · PE(j) depends only on |i - j|, the relative distance between positions, not on the absolute positions themselves.
Computing dot products between all position pairs:
PE(5) · PE(8) ≈ PE(10) · PE(13) because both pairs have distance 3:
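Verifying that claim directly (a sketch reusing the same function):

```python
import numpy as np

# Uses sinusoidal_positional_encoding defined above.
pe = sinusoidal_positional_encoding(seq_len=64, d_model=16)

print(pe[5] @ pe[8])                                  # pairs at distance 3...
print(pe[10] @ pe[13])                                # ...give the same dot product
print(np.isclose(pe[5] @ pe[8], pe[10] @ pe[13]))     # True
print(pe[5] @ pe[20])                                 # distance 15: a smaller value
```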
Explore how dot product varies with relative distance:
Notice: The dot product is highest at the reference position (purple) and decays smoothly with distance. Positions at equal distance from the reference have similar dot products.
Plotting dot product as a function of relative distance:
A clean PyTorch implementation you can use in your models:
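One such implementation (a sketch; the class name, the `max_len` default, and the buffer handling are our own choices):

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sinusoidal positional encodings to a batch of token embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                      # even dims: sine
        pe[:, 1::2] = torch.cos(position * div_term)                      # odd dims: cosine
        self.register_buffer("pe", pe)                                    # saved with the model, not trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

# Usage
enc = SinusoidalPositionalEncoding(d_model=512)
tokens = torch.randn(2, 10, 512)        # batch of 2 sequences, 10 tokens each
out = enc(tokens)                        # same shape, position information added
```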
Try computing positional encodings yourself:
Sinusoidal PE works well, but it has a subtle issue. What we ideally want is for attention scores to cleanly separate into a pure content term plus a pure relative-position term.
But when we add positional encodings to the embeddings, the attention score between positions i and j becomes (q + pe_i) · (k + pe_j).
Let's see what this expansion actually produces:
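A numeric sketch with random vectors standing in for q, k, pe_i, and pe_j (the values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)        # query / key content
pe_i, pe_j = rng.normal(size=d), rng.normal(size=d)  # positional encodings (stand-ins)

full = (q + pe_i) @ (k + pe_j)
terms = [q @ k, q @ pe_j, pe_i @ k, pe_i @ pe_j]

print(full, sum(terms))       # identical: the score is exactly the sum of the four terms
```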
This expands to four terms:
| Term | What it measures |
|---|---|
| q · k | Pure semantic similarity — how related are these tokens? |
| q · pe_j | How much the query content "likes" position j |
| pe_i · k | How much position i "likes" the key content |
| pe_i · pe_j | Pure positional relationship (depends on i - j) |
The cross terms mix semantic content with position. The model has to learn to extract the useful relative position signal from pe_i · pe_j while disentangling it from the content-position interactions.
One alternative is concatenation instead of addition:
But concatenation doubles the dimension, increasing computation:
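A rough way to see the cost, counting parameters in the attention projections (a sketch; it assumes the positional part has the same width as the token embedding and that the concatenated width is carried through the projections):

```python
d_model, d_concat = 512, 512 + 512      # add PE vs. concatenate PE of the same width

# Each attention projection (W_q, W_k, W_v, W_o) is a square matrix on the working dimension.
params_add    = 4 * d_model ** 2
params_concat = 4 * d_concat ** 2

print(params_add, params_concat, params_concat / params_add)   # concatenation: ~4x the parameters
```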
We've built up sinusoidal PE piece by piece: bounded values → multiple frequencies → sin/cos pairs → the 10000 base. It works! But position and content get entangled in attention.
Here's a thought: the (sin, cos) pairs we're using... those define a rotation. What if instead of adding position to embeddings, we rotated the query and key vectors?