The pragmatic tradeoff of tied embeddings

In deep learning, we commonly trade compute for accuracy. Quantization sacrifices precision for speed. Distillation trades model size for latency. Weight sharing reduces parameters at the cost of expressivity.

Tied embeddings are one such tradeoff.

The idea comes from a simple observation: "we have a 617 million parameter embedding matrix on both sides of our network (in GPT-3). Why not just make them the same matrix?"

In other words: since the embedding matrix encodes the semantic meaning of words, it can serve roughly the same purpose on both the input and output sides.

Two sides of the same coin

Transformers have two matrices that deal with tokens:

  • an embedding matrix W_E, which converts tokens to vectors (input), and
  • an unembedding matrix W_U, which converts the residual stream to logits (output).

With tied embeddings, we use the same weights for both: W_U = W_E^T. It appeals to the ML brain because it appears to be an elegant way to reduce parameters and add symmetry.

Let's make this concrete with a toy example vocabulary:

Tokens are represented as one-hot vectors, where a single 1 indicates which token we're referring to:

The embedding matrix W_E has shape (vocab_size, embedding_dim). Each row is the learned embedding for one token (these values are made up):

To get a token's embedding, we multiply its one-hot vector by W_E. This just selects the corresponding row:

At the output, the unembedding matrix converts a residual stream vector to logits. With tied embeddings, this is W_E^T:

Each logit is the dot product of the residual with a token's embedding. Higher dot product means the model thinks that token is more likely:
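The steps above can be sketched numerically. This is a minimal sketch with a hypothetical three-token vocabulary and invented embedding values (the figures in the original are interactive):

```python
import numpy as np

# Hypothetical toy vocabulary; embedding values are invented for illustration
vocab = ["New", "York", "City"]
W_E = np.array([
    [1.0, 0.5],   # embedding for "New"
    [0.5, 1.1],   # embedding for "York"
    [0.3, 1.2],   # embedding for "City"
])                # shape: (vocab_size=3, embedding_dim=2)

# A one-hot vector times W_E just selects that token's row
one_hot = np.array([0.0, 1.0, 0.0])   # "York"
embedding = one_hot @ W_E             # same as W_E[1]

# With tied embeddings the unembedding is W_E^T, so each logit is the
# dot product of the residual stream with a token's embedding
residual = embedding                  # pretend the layers changed nothing
logits = residual @ W_E.T             # logits[j] = residual · embedding_j
print(logits)
```

Note that when the layers do nothing, the logits here are just one row of W_E @ W_E^T, the matrix this article is about.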

The residual stream

In a transformer, information flows through attention and MLP layers. But residual connections mean that there's always a direct linear path from input to output:

token (one-hot) → W_E → [residual stream] → W_U → logits

Even in a deep model, this path exists. If we ignore all the attention/MLP layers, the model is simply computing:

\text{logits} = \text{one\_hot} \cdot W_E \cdot W_U

What the residual stream learns

The matrix W_E \cdot W_U is a vocab_size × vocab_size matrix. Entry [i, j] tells us: given input token i, what's the logit for output token j?

In other words, the direct path answers: "Given the current token, what's the most likely next token?"

This is called bigram statistics: the conditional probabilities P(\text{next\_token} \mid \text{current\_token}). A bigram is any two consecutive tokens in text.

For our vocabulary, realistic bigram probabilities look like this:

Notice the critical observation: bigrams are asymmetric. The order matters:

Visualizing this asymmetry. Notice how the matrix is NOT symmetric across the diagonal:
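To make the asymmetry concrete, here is a sketch that estimates bigram probabilities from a tiny made-up corpus (corpus and counts are invented for illustration):

```python
from collections import Counter

# Hypothetical tiny corpus to illustrate bigram asymmetry
corpus = "new york city is in new york state . new york city is big".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of (current, next) pairs
unigrams = Counter(corpus[:-1])              # counts of current tokens

def p(nxt, cur):
    """P(next_token | current_token), estimated from counts."""
    return bigrams[(cur, nxt)] / unigrams[cur]

# "york" always follows "new" here, but "new" never follows "york"
print(p("york", "new"), p("new", "york"))  # → 1.0 0.0
```

Order matters: swapping the arguments gives a completely different probability, which is exactly what a symmetric matrix cannot represent.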

The problem: forced symmetry

Here's the critical issue. With tied embeddings, W_U = W_E^T, so the direct path matrix becomes:

W_E \cdot W_U = W_E \cdot W_E^T

And W_E \cdot W_E^T is always symmetric. Let's see why.

This follows from basic linear algebra. The (i, j) entry of W_E \cdot W_E^T is:

(W_E \cdot W_E^T)_{ij} = \text{row}_i(W_E) \cdot \text{row}_j(W_E) = \text{embedding}_i \cdot \text{embedding}_j

And dot products are commutative: a \cdot b = b \cdot a. Therefore:

(W_E \cdot W_E^T)_{ij} = (W_E \cdot W_E^T)_{ji}

No matter what values W_E has, W_E \cdot W_E^T is always symmetric.
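A quick numerical check of this fact, using arbitrary random values for W_E:

```python
import numpy as np

rng = np.random.default_rng(42)
W_E = rng.normal(size=(10, 4))  # any values at all

M = W_E @ W_E.T
print(np.allclose(M, M.T))  # → True, for every possible W_E
```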

Symmetry Explorer

(Interactive widget; one example result for the tokens "New", "York", "City".)

| W_E @ W_E^T | New  | York | City |
|-------------|------|------|------|
| New         | 1.25 | 1.25 | 0.90 |
| York        | 1.25 | 1.45 | 1.32 |
| City        | 0.90 | 1.32 | 1.53 |

Symmetry check: [New, York] = [York, New] = 1.250; [New, City] = [City, New] = 0.900; [York, City] = [City, York] = 1.320. No matter what values you enter, W_E @ W_E^T is always symmetric.

This is bad: we lose expressivity, because embedding and unembedding serve fundamentally different purposes. Tying them forces a single representation to do both jobs.

Different Purposes, Same Weights?

The token embedding (W_E) needs to encode:
  • syntactic type (noun/verb/etc.)
  • morphology
  • semantic meaning
  • style and register
  • attentional compatibility

The logit vector (W_U) needs to encode:
  • the predictive distribution
  • context-dependent likelihood
  • grammar constraints
  • topic flow
  • memorized facts

With untied embeddings, W_U is a separate learnable matrix. Now the direct path is

W_E \cdot W_U

where W_E and W_U are independent. This product can be any matrix, including asymmetric ones.

With untied embeddings, we can solve for a W_U that approximates the bigram probabilities:

The untied result can now represent asymmetric relationships:
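One way to see this: with W_E fixed, we can solve a least-squares problem for a W_U whose direct path matches an arbitrary asymmetric target of bigram logits. A sketch with invented shapes, where embedding_dim equals vocab_size so the fit comes out exact:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim = 4, 4   # hypothetical shapes; dim >= vocab_size makes the fit exact
W_E = rng.normal(size=(vocab_size, dim))

# An invented asymmetric target: the bigram logits we want the direct path to produce
target = rng.normal(size=(vocab_size, vocab_size))

# Solve W_E @ W_U ≈ target in the least-squares sense
W_U, *_ = np.linalg.lstsq(W_E, target, rcond=None)

approx = W_E @ W_U
print(np.max(np.abs(approx - target)))  # essentially zero here
```

No symmetric W_E @ W_E^T could hit this target; an independent W_U can.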

Why it still might make sense to tie embeddings

Smaller models still tie their embeddings because the nonlinearity added by the attention and MLP layers is enough to compensate: in practice, the model doesn't bump against the symmetry constraint on the direct path.

Some smaller models also break the symmetry directly, by adding an MLP layer between the embedding and the residual stream:

W_E \rightarrow \text{MLP}_0 \rightarrow W_E^T

MLP₀ learns a transformation M, making the effective path W_E \cdot M \cdot W_E^T, which can be asymmetric.

Let's see how adding an MLP transformation breaks the symmetry:
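A sketch of the idea, using a plain linear map M as a stand-in for MLP₀ (a real MLP is nonlinear, but even a linear M already breaks the forced symmetry):

```python
import numpy as np

rng = np.random.default_rng(2)
W_E = rng.normal(size=(5, 3))  # hypothetical shapes

# Linear stand-in for the MLP's learned transformation
M = rng.normal(size=(3, 3))

tied = W_E @ W_E.T             # direct path with tied embeddings: always symmetric
with_mlp = W_E @ M @ W_E.T     # with a transformation in between: generically not

print(np.allclose(tied, tied.T), np.allclose(with_mlp, with_mlp.T))
```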

Explore the parameter savings from tied embeddings:

Parameter Calculator

(Interactive widget; example values for vocab_size = 50,000 and embedding_dim = 4,096.)

  • Tied parameters (W_E only): 204.8M
  • Untied parameters (W_E + W_U): 409.6M
  • Savings from tying: 204.8M (50%)
  • As a percentage of total model size: 20.5% of a 1B model, 2.9% of a 7B model, 0.3% of a 70B model

For today's larger models, this isn't a worthwhile tradeoff.

Remember, the entirety of GPT-3 is ~175 billion parameters. 617 million embedding parameters is a drop in the bucket if untying buys more expressivity.
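The arithmetic behind these percentages is straightforward. A small sketch using the calculator's numbers above and rough GPT-3-like shapes (both hypothetical):

```python
def tying_savings(vocab_size, dim, total_params):
    """Parameters saved by tying (one shared W_E instead of W_E + W_U),
    as a raw count and as a percentage of total model size."""
    saved = vocab_size * dim
    return saved, 100 * saved / total_params

# Calculator numbers: 50K vocab x 4,096 dims in a 1B model
saved, pct = tying_savings(50_000, 4_096, 1e9)
print(saved, round(pct, 1))   # 204.8M saved, ~20.5% of a 1B model

# Rough GPT-3-like shapes: ~50K vocab x 12,288 dims, 175B total
_, pct3 = tying_savings(50_257, 12_288, 175e9)
print(round(pct3, 2))         # a fraction of a percent
```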

To tie or not to tie

| Aspect                  | Tied                     | Untied                |
|-------------------------|--------------------------|-----------------------|
| Direct path matrix      | W_E @ W_E^T (symmetric)  | W_E @ W_U (any matrix)|
| Can represent bigrams?  | Not directly             | Yes                   |
| Parameters              | Fewer (shared)           | More (separate)       |
| Memory                  | Less                     | More                  |

Tied embeddings are a practical tradeoff, not a principled design choice. They save parameters and memory, which matters for smaller models where early layers can compensate for the symmetry constraint. But for large models chasing maximum performance, the math is clear: language is asymmetric, and untied embeddings can represent that.

  • Tied: GPT-2, Gemma, original Transformer
  • Untied: LLaMA 1/2/3, Mistral
  • Unknown: GPT-3/4, Claude (not publicly disclosed)

References

[1] Press, O., & Wolf, L. (2017). "Using the Output Embedding to Improve Language Models." EACL 2017. arXiv:1608.05859
Introduced tied embeddings, showing parameter reduction with minimal performance loss.

[2] Inan, H., Khosravi, K., & Socher, R. (2017). "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling." ICLR 2017. arXiv:1611.01462
Concurrent work providing theoretical justification for weight tying.

[3] Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
Original Transformer used tied embeddings between input, output, and pre-softmax layers.

[4] Yang, Z., Dai, Z., Salakhutdinov, R., & Cohen, W. (2018). "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model." ICLR 2018. arXiv:1711.03953
Shows expressiveness limits from low-rank embeddings—relevant to why untying helps.

[5] Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). "Transformer Feed-Forward Layers Are Key-Value Memories." EMNLP 2021. arXiv:2012.14913
Shows early layers capture surface patterns, suggesting how models compensate for tied embeddings.