The pragmatic tradeoff of tied embeddings

In deep learning, we commonly trade compute for accuracy. Quantization sacrifices precision for speed. Distillation trades model size for latency. Weight sharing reduces parameters at the cost of expressivity.

Tied embeddings are one such tradeoff.

The idea comes from a simple observation: "we have a 617 million parameter embedding matrix on both sides of our network (in GPT-3). Why not just make them the same matrix?"

In other words: since the embedding matrix encodes the semantic meaning of words, it can serve roughly the same purpose on both the input and output sides.

Two sides of the same coin

Transformers have two matrices that deal with tokens:

  • an embedding matrix W_E, which converts tokens to vectors (input), and
  • an unembedding matrix W_U, which converts the residual stream to logits (output).

With tied embeddings, we use the same weights for both: W_U = W_E^T. It appeals to the ML brain because it appears to be an elegant way to reduce parameters and add symmetry.

Let's make this concrete with a toy example vocabulary:

Tokens are represented as one-hot vectors, where a single 1 indicates which token we're referring to:

The embedding matrix W_E has shape (vocab_size, embedding_dim). Each row is the learned embedding for one token (these values are made up):

To get a token's embedding, we multiply its one-hot vector by W_E. This just selects the corresponding row:

At the output, the unembedding matrix converts a residual stream vector to logits. With tied embeddings, this is W_E^T:

Each logit is the dot product of the residual with a token's embedding. Higher dot product means the model thinks that token is more likely:
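The steps above can be sketched numerically. This is a minimal sketch with a hypothetical three-token vocabulary and invented embedding values (the figures in the original are interactive):

```python
import numpy as np

# Hypothetical toy vocabulary; embedding values are invented for illustration
vocab = ["New", "York", "City"]
W_E = np.array([
    [1.0, 0.5],   # embedding for "New"
    [0.5, 1.1],   # embedding for "York"
    [0.3, 1.2],   # embedding for "City"
])                # shape: (vocab_size=3, embedding_dim=2)

# A one-hot vector times W_E just selects that token's row
one_hot = np.array([0.0, 1.0, 0.0])   # "York"
embedding = one_hot @ W_E             # same as W_E[1]

# With tied embeddings the unembedding is W_E^T, so each logit is the
# dot product of the residual stream with a token's embedding
residual = embedding                  # pretend the layers changed nothing
logits = residual @ W_E.T             # logits[j] = residual · embedding_j
print(logits)
```

Note that when the layers do nothing, the logits here are just one row of W_E @ W_E^T, the matrix this article is about.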

The residual stream

In a transformer, information flows through attention and MLP layers. But residual connections mean that there's always a direct linear path from input to output:

token (one-hot) → W_E → [residual stream] → W_U → logits

Even in a deep model, this path exists. If we ignore all the attention/MLP layers, the model is simply computing:

\text{logits} = \text{one\_hot} \cdot W_E \cdot W_U

What the residual stream learns

The matrix W_E \cdot W_U is a vocab_size × vocab_size matrix. Entry [i, j] tells us: given input token i, what's the logit for output token j?

In other words, the direct path answers: "Given the current token, what's the most likely next token?"

This is called bigram statistics: the conditional probabilities P(\text{next\_token} \mid \text{current\_token}). A bigram is any two consecutive tokens in text.

For our vocabulary, realistic bigram probabilities look like this:

Notice the critical observation: bigrams are asymmetric. The order matters:

Visualizing this asymmetry. Notice how the matrix is NOT symmetric across the diagonal:
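To make the asymmetry concrete, here is a sketch that estimates bigram probabilities from a tiny made-up corpus (corpus and counts are invented for illustration):

```python
from collections import Counter

# Hypothetical tiny corpus to illustrate bigram asymmetry
corpus = "new york city is in new york state . new york city is big".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of (current, next) pairs
unigrams = Counter(corpus[:-1])              # counts of current tokens

def p(nxt, cur):
    """P(next_token | current_token), estimated from counts."""
    return bigrams[(cur, nxt)] / unigrams[cur]

# "york" always follows "new" here, but "new" never follows "york"
print(p("york", "new"), p("new", "york"))  # → 1.0 0.0
```

Order matters: swapping the arguments gives a completely different probability, which is exactly what a symmetric matrix cannot represent.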

The problem: forced symmetry

Here's the critical issue. With tied embeddings, W_U = W_E^T, so the direct path matrix becomes:

W_E \cdot W_U = W_E \cdot W_E^T

And W_E \cdot W_E^T is always symmetric. Let's see why.

This follows from basic linear algebra. The (i, j) entry of W_E \cdot W_E^T is:

(W_E \cdot W_E^T)_{ij} = \text{row}_i(W_E) \cdot \text{row}_j(W_E) = \text{embedding}_i \cdot \text{embedding}_j

And dot products are commutative: a \cdot b = b \cdot a. Therefore:

(W_E \cdot W_E^T)_{ij} = (W_E \cdot W_E^T)_{ji}

No matter what values W_E has, W_E \cdot W_E^T is always symmetric.
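A quick numerical check of this fact, using arbitrary random values for W_E:

```python
import numpy as np

rng = np.random.default_rng(42)
W_E = rng.normal(size=(10, 4))  # any values at all

M = W_E @ W_E.T
print(np.allclose(M, M.T))  # → True, for every possible W_E
```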

Symmetry Explorer

(Interactive widget; one example result for the tokens "New", "York", "City".)

| W_E @ W_E^T | New  | York | City |
|-------------|------|------|------|
| New         | 1.25 | 1.25 | 0.90 |
| York        | 1.25 | 1.45 | 1.32 |
| City        | 0.90 | 1.32 | 1.53 |

Symmetry check: [New, York] = [York, New] = 1.250; [New, City] = [City, New] = 0.900; [York, City] = [City, York] = 1.320. No matter what values you enter, W_E @ W_E^T is always symmetric.

This is bad: we lose expressivity, because embedding and unembedding serve fundamentally different purposes. Tying them forces a single representation to do both jobs.

Different Purposes, Same Weights?

The token embedding (W_E) needs to encode:
  • syntactic type (noun/verb/etc.)
  • morphology
  • semantic meaning
  • style and register
  • attentional compatibility

The logit vector (W_U) needs to encode:
  • the predictive distribution
  • context-dependent likelihood
  • grammar constraints
  • topic flow
  • memorized facts

With untied embeddings, W_U is a separate learnable matrix. Now the direct path is

W_E \cdot W_U

where W_E and W_U are independent. This product can be any matrix, including asymmetric ones.

With untied embeddings, we can solve for a W_U that approximates the bigram probabilities:

The untied result can now represent asymmetric relationships:
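One way to see this: with W_E fixed, we can solve a least-squares problem for a W_U whose direct path matches an arbitrary asymmetric target of bigram logits. A sketch with invented shapes, where embedding_dim equals vocab_size so the fit comes out exact:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim = 4, 4   # hypothetical shapes; dim >= vocab_size makes the fit exact
W_E = rng.normal(size=(vocab_size, dim))

# An invented asymmetric target: the bigram logits we want the direct path to produce
target = rng.normal(size=(vocab_size, vocab_size))

# Solve W_E @ W_U ≈ target in the least-squares sense
W_U, *_ = np.linalg.lstsq(W_E, target, rcond=None)

approx = W_E @ W_U
print(np.max(np.abs(approx - target)))  # essentially zero here
```

No symmetric W_E @ W_E^T could hit this target; an independent W_U can.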

Why it still might make sense to tie embeddings

Smaller models still tie their embeddings because the nonlinearity added by the attention and MLP layers is enough to compensate: in practice, the model doesn't bump against the symmetry constraint on the direct path.

Some smaller models also break the symmetry directly, by adding an MLP layer between the embedding and the residual stream:

W_E \rightarrow \text{MLP}_0 \rightarrow W_E^T

MLP₀ learns a transformation M, making the effective path W_E \cdot M \cdot W_E^T, which can be asymmetric.

Let's see how adding an MLP transformation breaks the symmetry:
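A sketch of the idea, using a plain linear map M as a stand-in for MLP₀ (a real MLP is nonlinear, but even a linear M already breaks the forced symmetry):

```python
import numpy as np

rng = np.random.default_rng(2)
W_E = rng.normal(size=(5, 3))  # hypothetical shapes

# Linear stand-in for the MLP's learned transformation
M = rng.normal(size=(3, 3))

tied = W_E @ W_E.T             # direct path with tied embeddings: always symmetric
with_mlp = W_E @ M @ W_E.T     # with a transformation in between: generically not

print(np.allclose(tied, tied.T), np.allclose(with_mlp, with_mlp.T))
```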

Explore the parameter savings from tied embeddings:

Parameter Calculator

(Interactive widget; example values for vocab_size = 50,000 and embedding_dim = 4,096.)

  • Tied parameters (W_E only): 204.8M
  • Untied parameters (W_E + W_U): 409.6M
  • Savings from tying: 204.8M (50%)
  • As a percentage of total model size: 20.5% of a 1B model, 2.9% of a 7B model, 0.3% of a 70B model

For today's larger models, this isn't a worthwhile tradeoff.

Remember, the entirety of GPT-3 is ~175 billion parameters. 617 million embedding parameters is a drop in the bucket if untying buys more expressivity.
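The arithmetic behind these percentages is straightforward. A small sketch using the calculator's numbers above and rough GPT-3-like shapes (both hypothetical):

```python
def tying_savings(vocab_size, dim, total_params):
    """Parameters saved by tying (one shared W_E instead of W_E + W_U),
    as a raw count and as a percentage of total model size."""
    saved = vocab_size * dim
    return saved, 100 * saved / total_params

# Calculator numbers: 50K vocab x 4,096 dims in a 1B model
saved, pct = tying_savings(50_000, 4_096, 1e9)
print(saved, round(pct, 1))   # 204.8M saved, ~20.5% of a 1B model

# Rough GPT-3-like shapes: ~50K vocab x 12,288 dims, 175B total
_, pct3 = tying_savings(50_257, 12_288, 175e9)
print(round(pct3, 2))         # a fraction of a percent
```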

To tie or not to tie

| Aspect                  | Tied                     | Untied                |
|-------------------------|--------------------------|-----------------------|
| Direct path matrix      | W_E @ W_E^T (symmetric)  | W_E @ W_U (any matrix)|
| Can represent bigrams?  | Not directly             | Yes                   |
| Parameters              | Fewer (shared)           | More (separate)       |
| Memory                  | Less                     | More                  |

Tied embeddings are a practical tradeoff, not a principled design choice. They save parameters and memory, which matters for smaller models where early layers can compensate for the symmetry constraint. But for large models chasing maximum performance, the math is clear: language is asymmetric, and untied embeddings can represent that.

  • Tied: GPT-2, Gemma, original Transformer
  • Untied: LLaMA 1/2/3, Mistral
  • Unknown: GPT-3/4, Claude (not publicly disclosed)

References

[1] Press, O., & Wolf, L. (2017). "Using the Output Embedding to Improve Language Models." EACL 2017. arXiv:1608.05859
Introduced tied embeddings, showing parameter reduction with minimal performance loss.

[2] Inan, H., Khosravi, K., & Socher, R. (2017). "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling." ICLR 2017. arXiv:1611.01462
Concurrent work providing theoretical justification for weight tying.

[3] Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
Original Transformer used tied embeddings between input, output, and pre-softmax layers.

[4] Yang, Z., Dai, Z., Salakhutdinov, R., & Cohen, W. (2018). "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model." ICLR 2018. arXiv:1711.03953
Shows expressiveness limits from low-rank embeddings—relevant to why untying helps.

[5] Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). "Transformer Feed-Forward Layers Are Key-Value Memories." EMNLP 2021. arXiv:2012.14913
Shows early layers capture surface patterns, suggesting how models compensate for tied embeddings.