Building Intuition from First Principles
Shuffle the words in "The cat sat on the mat" and a transformer sees no difference. It processes tokens in parallel with no notion of order. So how do models like GPT know that "cat" comes before "sat"?
PE(pos, 2i) = sin(pos / 10000^(2i/d))
That's the formula. But why sine? Why 10000? Why alternate sine and cosine? We'll derive it from scratch, and each choice will feel inevitable.
A transformer's self-attention treats its input as a set, not a sequence. Unlike RNNs, which process tokens one by one, transformers see all tokens simultaneously, with no built-in notion of "first" or "last."
This is powerful for parallelization, but it creates a fundamental problem: "The cat sat on the mat" and "mat the on sat cat The" produce identical attention patterns.
Let's prove this. We'll create three token embeddings and compute their attention scores:
Now shuffle the token order and recompute:
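A minimal sketch of this experiment in NumPy, using toy 4-dimensional embeddings and plain dot-product attention with no learned projections (the numbers are purely illustrative):

```python
import numpy as np

def attention_scores(x):
    """Softmax-normalized dot-product attention scores (no learned projections)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))        # three tokens, 4-dim embeddings

original = attention_scores(tokens)

# Shuffle the token order and recompute.
perm = np.array([2, 0, 1])
shuffled = attention_scores(tokens[perm])

# The shuffled attention matrix is just the original with its rows and columns permuted.
print(np.allclose(shuffled, original[perm][:, perm]))   # True
```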
Identical attention patterns, just permuted. The model genuinely cannot distinguish these orderings.
Our goal: inject position information into the embeddings so that each position gets a distinct signature, and positions that are close in the sequence end up close in embedding space. The model can then learn to attend based on both content and position.
The simplest idea: add the position number directly. Position 0 adds 0, position 1 adds 1, and so on.
Here's a small example — three token embeddings with values around [-1, 1]:
Adding positions 0, 1, 2 works fine for short sequences. But what about position 9999?
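A small sketch of the failure mode, using toy 4-dimensional embeddings (the values and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
cat = rng.uniform(-1, 1, size=4)        # "cat" embedding, values in [-1, 1]
dog = rng.uniform(-1, 1, size=4)        # "dog" embedding

# Positions 0, 1, 2: the position signal is the same order of magnitude as the content.
for pos in range(3):
    print(pos, cat + pos)

# Position 9999: the content is a rounding error on top of the position.
cat_far = cat + 9999
dog_far = dog + 9999
print(np.linalg.norm(cat_far - dog_far) / np.linalg.norm(cat_far))  # tiny relative difference
```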
The position signal drowns out the semantics. "Cat" and "dog" at position 9999 become indistinguishable.
We could normalize to [0, 1], but then "position 5" means different things in different length sequences. We need something better.
Let's think about what a good positional encoding requires:

- Bounded: values shouldn't grow with position and drown out the token's meaning.
- Unique: every position should get a distinct encoding.
- Smooth: nearby positions should get similar encodings, so the model can generalize across positions.
- Consistent: the encoding for a given position shouldn't depend on the sequence length.
Sine is bounded [-1, 1], smooth, and consistent regardless of sequence length. Let's try sin(position):
Positions 0, 1, 2 get values 0, 0.84, 0.91 — distinct and bounded:
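A quick numeric check (a small sketch), printing a few extra positions so the wrap-around is visible too:

```python
import numpy as np

for pos in range(14):
    print(f"position {pos:2d}: sin = {np.sin(pos):+.3f}")
```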
Promising! But sine has a period of 2π ≈ 6.28. What happens at position 6?
Position 6 sits almost a full period past position 0, so their encodings end up closer together than positions 0 and 1 are, and beyond that the values simply keep cycling. For sequences longer than ~6 tokens, we're back to ambiguity.
We can control how fast sine repeats with a frequency multiplier ω, encoding each position as sin(ω · position):
Lower ω = slower oscillation = longer before it repeats. Let's try ω = 0.1:
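The same check at ω = 0.1 (a small sketch):

```python
import numpy as np

omega = 0.1
for pos in range(7):
    print(f"position {pos}: sin({omega} * {pos}) = {np.sin(omega * pos):+.3f}")
```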
No more collisions at position 6! But now positions 0 and 1 are almost identical:
High frequency means good local discrimination but bad global. Low frequency is the opposite. We need both.
Think of a clock: the hour hand (slow), minute hand (medium), and second hand (fast) together uniquely identify any moment. We can do the same with position: use multiple dimensions, each oscillating at a different frequency.
Two dimensions — one high frequency (ω=1) for local, one low (ω=0.1) for global:
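A sketch of this two-dimensional encoding, printing the (fast, slow) pair for each position (the frequencies 1 and 0.1 come from the discussion above):

```python
import numpy as np

def encode(pos, frequencies=(1.0, 0.1)):
    """Positional signature: one sine value per frequency."""
    return np.array([np.sin(w * pos) for w in frequencies])

for pos in range(7):
    fast, slow = encode(pos)
    print(f"position {pos}: fast = {fast:+.3f}, slow = {slow:+.3f}")
```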
Now each position has a unique 2D signature. Position 0 ≠ position 6 because even though the high-frequency component repeats, the low-frequency one doesn't:
Plotting both frequencies shows how they complement each other:
There's still a subtle issue: sin(θ) = sin(π - θ). Two different positions can have the same sine value. The fix? Add cosine — which is 90° out of phase.
When sine values collide, cosine values differ:
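For instance, at ω = 0.1 the positions 14 and 17 sit almost symmetrically around π/2, so their sines nearly collide while their cosines clearly differ (a small sketch):

```python
import numpy as np

omega = 0.1
for pos in (14, 17):
    theta = omega * pos
    print(f"position {pos}: sin = {np.sin(theta):+.3f}, cos = {np.cos(theta):+.3f}")
# sin is ~0.99 for both, but cos has opposite signs: the (sin, cos) pair still separates them.
```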
Geometrically, (sin(θ), cos(θ)) traces a circle. Each position is a unique point — no collisions possible:
With sin/cos pairs at multiple frequencies, we have a complete encoding scheme:
The original Transformer paper uses a specific formula for the frequencies:

ω_i = 1 / 10000^(2i/d)

Where i is the dimension index and d is the model dimension. This creates a geometric progression of wavelengths:
| Dim (i) | Frequency ω | Wavelength (positions) | Purpose |
|---|---|---|---|
| 0 | 1.0 | ~6 positions | Fine-grained: adjacent tokens |
| d/4 | 0.01 | ~600 positions | Medium: sentence structure |
| d/2 | 0.0001 | ~60,000 positions | Coarse: document structure |
Why 10000? This "base" sets the slowest frequency: the lowest-frequency dimension completes a full cycle only after 2π · 10000 ≈ 63,000 positions, so encodings stay unique over contexts of roughly that length. In effect, the base determines the maximum context length the encoding can handle before patterns start repeating.
Let's compute these frequencies for a small embedding dimension:
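A sketch for a small model dimension, say d = 8 (four sin/cos pairs), using ω_i = 1 / 10000^(2i/d):

```python
import numpy as np

d = 8                                   # small model dimension: four sin/cos pairs
i = np.arange(d // 2)
omega = 1.0 / 10000 ** (2 * i / d)      # geometric progression of frequencies
wavelength = 2 * np.pi / omega          # positions until each pair repeats

for idx, (w, lam) in enumerate(zip(omega, wavelength)):
    print(f"pair {idx}: omega = {w:.6f}, wavelength ~ {lam:,.0f} positions")
```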
High frequency dimensions change rapidly (good for nearby positions). Low frequency dimensions barely change (but stay unique over long distances):
Explore how different dimension pairs behave:
High frequency: changes rapidly, good for distinguishing nearby positions. Low frequency: changes slowly, good for staying distinct across distant positions.
The complete formula from "Attention Is All You Need":

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Here is the complete implementation:
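One way to write it in NumPy (a sketch; the function name and the vectorized layout are our own choices):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings.

    Assumes d_model is even: even columns hold sines, odd columns hold cosines.
    """
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = positions / 10000 ** (2 * i / d_model)        # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)        # (50, 16)
```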
Adding positional encoding to token embeddings:
The same token at different positions now has different representations:
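A quick check, reusing `sinusoidal_positional_encoding` from the sketch above (the token embedding here is random, purely for illustration):

```python
import numpy as np

# Uses sinusoidal_positional_encoding defined above.
d_model = 16
pe = sinusoidal_positional_encoding(seq_len=50, d_model=d_model)

rng = np.random.default_rng(0)
cat = rng.normal(size=d_model)           # the same "cat" embedding...

cat_at_0 = cat + pe[0]                   # ...at position 0
cat_at_7 = cat + pe[7]                   # ...at position 7

print(np.allclose(cat_at_0, cat_at_7))       # False: the representations now differ
print(np.linalg.norm(cat_at_0 - cat_at_7))   # by exactly ||pe[0] - pe[7]||
```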
The positional encoding forms a beautiful pattern. Each row is a position, each column is a dimension. Notice the different wavelengths across dimensions:
Explore the matrix interactively. Adjust sequence length and model dimension:
Looking at specific dimension pairs to see the frequency differences:
Does our encoding satisfy the requirements we set out? Let's check:
Bounded: Values stay in [-1, 1] regardless of position:
Unique: Each position gets a distinct encoding (measuring pairwise distances):
Smooth: Nearby positions have similar encodings (small distance), far positions differ more:
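A numeric spot-check of all three properties, again reusing `sinusoidal_positional_encoding` from above (the dimensions are illustrative):

```python
import numpy as np

# Uses sinusoidal_positional_encoding defined above.
pe = sinusoidal_positional_encoding(seq_len=100, d_model=16)

# Bounded: every value lies in [-1, 1].
print(pe.min() >= -1 and pe.max() <= 1)                      # True

# Unique: the smallest distance between two different positions is nonzero.
dists = np.linalg.norm(pe[:, None, :] - pe[None, :, :], axis=-1)
off_diag = dists[~np.eye(len(pe), dtype=bool)]
print(off_diag.min() > 0)                                     # True

# Smooth: adjacent positions are much closer than positions 50 apart.
print(dists[0, 1], dists[0, 50])                              # small vs noticeably larger
```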
A beautiful property: PE(i) · PE(j) depends only on |i - j|, the relative distance between positions, not on the absolute positions themselves.
Computing dot products between all position pairs:
PE(5) · PE(8) ≈ PE(10) · PE(13) because both pairs have distance 3:
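Verifying that claim directly (a sketch reusing the same function):

```python
import numpy as np

# Uses sinusoidal_positional_encoding defined above.
pe = sinusoidal_positional_encoding(seq_len=64, d_model=16)

print(pe[5] @ pe[8])                                  # pairs at distance 3...
print(pe[10] @ pe[13])                                # ...give the same dot product
print(np.isclose(pe[5] @ pe[8], pe[10] @ pe[13]))     # True
print(pe[5] @ pe[20])                                 # distance 15: a smaller value
```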
Explore how dot product varies with relative distance:
Notice: The dot product is highest at the reference position (purple) and decays smoothly with distance. Positions at equal distance from the reference have similar dot products.
Plotting dot product as a function of relative distance:
A clean PyTorch implementation you can use in your models:
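One such implementation (a sketch; the class name, the `max_len` default, and the buffer handling are our own choices):

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sinusoidal positional encodings to a batch of token embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                      # even dims: sine
        pe[:, 1::2] = torch.cos(position * div_term)                      # odd dims: cosine
        self.register_buffer("pe", pe)                                    # saved with the model, not trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

# Usage
enc = SinusoidalPositionalEncoding(d_model=512)
tokens = torch.randn(2, 10, 512)        # batch of 2 sequences, 10 tokens each
out = enc(tokens)                        # same shape, position information added
```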
Try computing positional encodings yourself:
Sinusoidal PE works well, but it has a subtle issue. What we ideally want is for attention scores to cleanly separate into a pure content term plus a pure relative-position term.
But when we add positional encodings to the embeddings, the attention score between positions i and j becomes (q + pe_i) · (k + pe_j).
Let's see what this expansion actually produces:
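A numeric sketch with random vectors standing in for q, k, pe_i, and pe_j (the values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)        # query / key content
pe_i, pe_j = rng.normal(size=d), rng.normal(size=d)  # positional encodings (stand-ins)

full = (q + pe_i) @ (k + pe_j)
terms = [q @ k, q @ pe_j, pe_i @ k, pe_i @ pe_j]

print(full, sum(terms))       # identical: the score is exactly the sum of the four terms
```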
This expands to four terms:
| Term | What it measures |
|---|---|
| q · k | Pure semantic similarity — how related are these tokens? |
| q · pe_j | How much the query content "likes" position j |
| pe_i · k | How much position i "likes" the key content |
| pe_i · pe_j | Pure positional relationship (depends on i - j) |
The cross terms mix semantic content with position. The model has to learn to extract the useful relative position signal from pe_i · pe_j while disentangling it from the content-position interactions.
One alternative is concatenation instead of addition:
But concatenation doubles the dimension, increasing computation:
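A rough way to see the cost, counting parameters in the attention projections (a sketch; it assumes the positional part has the same width as the token embedding and that the concatenated width is carried through the projections):

```python
d_model, d_concat = 512, 512 + 512      # add PE vs. concatenate PE of the same width

# Each attention projection (W_q, W_k, W_v, W_o) is a square matrix on the working dimension.
params_add    = 4 * d_model ** 2
params_concat = 4 * d_concat ** 2

print(params_add, params_concat, params_concat / params_add)   # concatenation: ~4x the parameters
```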
We've built up sinusoidal PE piece by piece: bounded values → multiple frequencies → sin/cos pairs → the 10000 base. It works! But position and content get entangled in attention.
Here's a thought: the (sin, cos) pairs we're using... those define a rotation. What if instead of adding position to embeddings, we rotated the query and key vectors?