Tied Embeddings

Why Sharing Weights Isn't Always Principled

In deep learning, we constantly trade accuracy for efficiency. Quantization sacrifices precision for speed. Distillation trades accuracy for smaller, faster models. Weight sharing reduces parameters at the cost of expressivity.

Tied embeddings are one such trade: using the same weight matrix for both input embeddings and output predictions. For a typical LLM, this saves roughly 200M parameters. For smaller models, it works surprisingly well.

But there's a fundamental mathematical reason why tied embeddings can't capture something as basic as "New York" being common while "York New" is rare. This isn't a training issue or a matter of more data. It's linear algebra.

The Two Matrices

Transformers have two matrices that deal with tokens:

  • Embedding matrix W_E: converts tokens to vectors (input)
  • Unembedding matrix W_U: converts the residual stream to logits (output)

With tied embeddings, we use the same weights: W_U = W_E^T. This seems elegant. Fewer parameters, shared structure, and a nice symmetry between input and output.

Let's make this concrete with a tiny three-token vocabulary: New, York, and City. Tokens are represented as one-hot vectors, where a single 1 indicates which token we're referring to.

The embedding matrix W_E has shape (vocab_size, embedding_dim). Each row is the learned embedding for one token, so multiplying a token's one-hot vector by W_E just selects the corresponding row.

At the output, the unembedding matrix converts a residual stream vector to logits. With tied embeddings, this is W_E^T: each logit is the dot product of the residual with a token's embedding, and a higher dot product means the model thinks that token is more likely.
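Here's a minimal NumPy sketch of that pipeline, assuming a hypothetical three-token vocabulary and a 4-dimensional embedding (all values are random, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["New", "York", "City"]               # hypothetical tiny vocabulary
vocab_size, d_model = len(tokens), 4

W_E = rng.normal(size=(vocab_size, d_model))   # embedding matrix: (vocab_size, embedding_dim)

# One-hot vector for "New": a single 1 at that token's index.
one_hot = np.zeros(vocab_size)
one_hot[tokens.index("New")] = 1.0

# Input side: one_hot @ W_E just selects the row for "New".
embedding = one_hot @ W_E
assert np.allclose(embedding, W_E[tokens.index("New")])

# Output side (tied): the unembedding is W_E^T, so each logit is the
# dot product of the residual stream vector with a token's embedding.
residual = embedding                            # pretend the residual stream is unchanged
logits = residual @ W_E.T
print(dict(zip(tokens, logits.round(3))))
```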

The Direct Path

In a transformer, information flows through attention and MLP layers. But because of residual connections, there's always a direct linear path from input to output:

token (one-hot) → W_E → [residual stream] → W_U → logits

Even in a deep model, this path exists. If we ignore all the attention/MLP layers, the model computes:

\text{logits} = \text{one\_hot} \cdot W_E \cdot W_U

(Figure: high-level transformer architecture showing the direct path from input embeddings through the residual stream to output logits.)

The matrix W_E \cdot W_U is a vocab_size × vocab_size matrix. Entry [i, j] tells us: given input token i, what's the logit for output token j?

Let's interpret what this matrix means for our vocabulary:
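As a rough sketch (W_E and W_U below are random placeholders, so the numbers themselves are meaningless; what matters is the shape and how to read an entry):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["New", "York", "City"]
vocab_size, d_model = len(tokens), 4

W_E = rng.normal(size=(vocab_size, d_model))   # input embeddings
W_U = rng.normal(size=(d_model, vocab_size))   # output unembedding

direct_path = W_E @ W_U                        # shape: (vocab_size, vocab_size)

# Entry [i, j]: the direct-path logit for next token j, given current token i.
for i, cur in enumerate(tokens):
    for j, nxt in enumerate(tokens):
        print(f"{cur:>4} -> {nxt:<4}: {direct_path[i, j]: .3f}")
```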

What Should This Path Learn?

If a model had no attention or MLP layers, the only thing it could learn is: "Given the current token, what's the most likely next token?"

This is exactly bigram statistics: conditional probabilities P(next_token | current_token). A bigram is any two consecutive tokens in text.

For our vocabulary, realistic bigram probabilities are lopsided: P(York | New) is high, because "New York" is a common phrase, while P(New | York) is close to zero, because "York New" almost never occurs.

This is the critical observation: bigram statistics are asymmetric. The order matters, so the bigram matrix is NOT symmetric across the diagonal.
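A sketch with made-up numbers for this three-token vocabulary, just to illustrate the asymmetry:

```python
import numpy as np

tokens = ["New", "York", "City"]

# Hypothetical P(next | current): rows = current token, columns = next token.
# "New" is very likely followed by "York"; "York" is almost never followed by "New".
bigram = np.array([
    [0.05, 0.80, 0.15],   # after "New"
    [0.05, 0.05, 0.90],   # after "York"
    [0.60, 0.20, 0.20],   # after "City"
])

print("P(York | New) =", bigram[0, 1])               # high
print("P(New | York) =", bigram[1, 0])               # low
print("Symmetric?", np.allclose(bigram, bigram.T))   # False: order matters
```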

The Problem: Forced Symmetry

Here's the critical issue. With tied embeddings, W_U = W_E^T, so the direct path matrix becomes:

W_E \cdot W_U = W_E \cdot W_E^T

And W_E \cdot W_E^T is always symmetric. Let's see why.

Let's compute W_E \cdot W_E^T with our embedding matrix and verify that the result is symmetric across the diagonal.
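A minimal sketch of that check, using a small random W_E (any values would do):

```python
import numpy as np

rng = np.random.default_rng(0)
W_E = rng.normal(size=(3, 4))        # (vocab_size, d_model)

tied_direct_path = W_E @ W_E.T       # (vocab_size, vocab_size)

print(tied_direct_path.round(3))
# Entry [i, j] equals entry [j, i]: the matrix is symmetric.
print("Symmetric?", np.allclose(tied_direct_path, tied_direct_path.T))   # True
```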

Why is W_E \cdot W_E^T Always Symmetric?

This follows from basic linear algebra. The (i, j) entry of W_E \cdot W_E^T is:

(W_E \cdot W_E^T)_{ij} = \text{row}_i(W_E) \cdot \text{row}_j(W_E) = \text{embedding}_i \cdot \text{embedding}_j

And dot products are commutative: a \cdot b = b \cdot a. Therefore:

(W_E \cdot W_E^T)_{ij} = (W_E \cdot W_E^T)_{ji}

No matter what values W_E has, W_E \cdot W_E^T is always symmetric.

Different Purposes, Same Weights?

Token Embedding (W_E) needs to encode:
  • Syntactic type (noun/verb/etc.)
  • Morphology
  • Semantic meaning
  • Style and register
  • Attentional compatibility

Logit Vector (W_U) needs to encode:
  • Predictive distribution
  • Context-dependent likelihood
  • Grammar constraints
  • Topic flow
  • Memorized facts

Key insight: these serve fundamentally different purposes. Tying them forces a single representation to do both jobs.

Why SGD Can't Fix This

You might wonder: "SGD is powerful. Can't it find embeddings such that W_E \cdot W_E^T approximates the bigram probabilities?"

The answer is no. This isn't an optimization issue or a matter of training longer. The constraint is mathematical:

The set of symmetric matrices is a subspace. No matter how SGD adjusts W_E, the product W_E \cdot W_E^T will always land in this subspace. It can never reach an asymmetric target.

Let's verify with random matrices. Every single one produces a symmetric result:
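A minimal sketch of that check (sizes and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

for trial in range(1000):
    vocab_size = rng.integers(2, 50)
    d_model = rng.integers(1, 50)
    W_E = rng.normal(size=(vocab_size, d_model))
    product = W_E @ W_E.T
    # This assertion never fires: the product is symmetric for every W_E.
    assert np.allclose(product, product.T)

print("1000 random embedding matrices: every W_E @ W_E^T was symmetric")
```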

For example, one concrete choice of W_E for the tokens New, York, and City gives:

W_E @ W_E^T    New     York    City
New            1.25    1.25    0.90
York           1.25    1.45    1.32
City           0.90    1.32    1.53

Symmetry check: [New, York] = [York, New] = 1.250, [New, City] = [City, New] = 0.900, [York, City] = [City, York] = 1.320. No matter what values W_E holds, W_E @ W_E^T is always symmetric.

What About Untied Embeddings?

With untied embeddings, W_U is a separate learnable matrix. Now the direct path is:

W_E \cdot W_U

where W_E and W_U are independent. This product can be any matrix, including asymmetric ones.

With untied embeddings, we can solve for a W_U whose direct path W_E \cdot W_U approximates the bigram statistics, so the untied result can represent asymmetric relationships. The key comparison: can the direct path represent "New→York ≠ York→New"? Untied, yes. Tied, never.
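A sketch of that comparison, assuming hypothetical log-bigram targets (the same made-up probabilities as above) and a least-squares fit for W_U:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["New", "York", "City"]
vocab_size, d_model = len(tokens), 4

# Hypothetical asymmetric targets: log of the made-up bigram probabilities.
bigram = np.array([
    [0.05, 0.80, 0.15],
    [0.05, 0.05, 0.90],
    [0.60, 0.20, 0.20],
])
target = np.log(bigram)

W_E = rng.normal(size=(vocab_size, d_model))

# Untied: least-squares solve for W_U so that W_E @ W_U approximates the targets.
W_U = np.linalg.lstsq(W_E, target, rcond=None)[0]
untied_path = W_E @ W_U

# Tied: the direct path is forced to be W_E @ W_E^T.
tied_path = W_E @ W_E.T

print("Untied path symmetric?", np.allclose(untied_path, untied_path.T))   # False
print("Tied path symmetric?  ", np.allclose(tied_path, tied_path.T))       # True
print("Untied New->York vs York->New:",
      untied_path[0, 1].round(3), untied_path[1, 0].round(3))
```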

How Small Models Cope

If tied embeddings have this limitation, why do smaller models still use them?

The answer: MLP₀ can break the symmetry. Instead of the direct path being just W_E \cdot W_E^T, it becomes:

W_E \rightarrow \text{MLP}_0 \rightarrow W_E^T

MLP₀ learns a transformation M, making the effective path W_E \cdot M \cdot W_E^T, which CAN be asymmetric.

Let's see how adding an MLP transformation breaks the symmetry:
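A sketch using a single linear map M as a stand-in for whatever MLP₀ learns (a real MLP is nonlinear, but a linear stand-in already shows the constraint disappearing):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 3, 4

W_E = rng.normal(size=(vocab_size, d_model))
M = rng.normal(size=(d_model, d_model))     # stand-in for the transformation MLP_0 learns

without_mlp = W_E @ W_E.T                   # forced to be symmetric
with_mlp = W_E @ M @ W_E.T                  # no longer constrained to be symmetric

print("Without MLP_0 symmetric?", np.allclose(without_mlp, without_mlp.T))   # True
print("With MLP_0 symmetric?   ", np.allclose(with_mlp, with_mlp.T))         # False (generically)
```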

But this workaround has a cost: MLP capacity that could be used for reasoning or knowledge is instead spent "undoing" the embedding constraint.

The parameter savings from tied embeddings, for a 50,000-token vocabulary and an embedding dimension of 4,096:

  • Tied parameters (W_E only): 204.8M
  • Untied parameters (W_E + W_U): 409.6M
  • Savings from tying: 204.8M (50% of the embedding parameters)
  • Savings as a share of total model size: 20.5% of a 1B model, 2.9% of a 7B model, 0.3% of a 70B model
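A tiny sketch of the arithmetic behind these numbers:

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool) -> int:
    """Parameters spent on token embeddings (and the unembedding, if untied)."""
    return vocab_size * d_model * (1 if tied else 2)

vocab_size, d_model = 50_000, 4_096
tied = embedding_params(vocab_size, d_model, tied=True)      # 204.8M
untied = embedding_params(vocab_size, d_model, tied=False)   # 409.6M
savings = untied - tied

print(f"Tied: {tied/1e6:.1f}M   Untied: {untied/1e6:.1f}M   Savings: {savings/1e6:.1f}M")
for total in (1e9, 7e9, 70e9):
    print(f"Savings as share of a {total/1e9:.0f}B model: {savings/total:.1%}")
```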

When to Tie, When Not to Tie

Aspect                  Tied                        Untied
Direct path matrix      W_E @ W_E^T (symmetric)     W_E @ W_U (any matrix)
Can represent bigrams?  Not directly                Yes
Parameters              Fewer (shared)              More (separate)
Memory                  Less                        More

When Tied Embeddings Work

  • Small models (<8B parameters): MLP₀ can compensate
  • Training efficiency: Fewer parameters = faster training
  • Memory constrained: Sharing weights reduces footprint

When to Use Untied Embeddings

  • Large models: The direct path becomes more important
  • Maximum performance: Untied gives more expressivity
  • SOTA models: Most large LLMs (GPT-4, Claude, etc.) use untied

The Bottom Line

Tied embeddings are a practical tradeoff, not a principled design choice. They work because small models don't rely heavily on the direct path, and MLP₀ can partially compensate. But mathematically, tying embeddings forces the direct path to be symmetric when language is fundamentally asymmetric.