In deep learning, we commonly trade compute for accuracy. Quantization sacrifices precision for speed. Distillation trades model size for latency. Weight sharing reduces parameters at the cost of expressivity.
Tied embeddings are one such tradeoff.
The idea comes from a simple observation: "we have a 617 million parameter embedding matrix* on both sides of our network. Why not just make them the same matrix?" *(in GPT-3)
In other words: since the embedding matrix encodes the semantic meaning of words, it can serve roughly the same purpose for both input embedding and output prediction.
Transformers have two matrices that deal with tokens: an embedding matrix W_E, which maps input tokens into the residual stream, and an unembedding matrix W_U, which maps the residual stream back to logits over the vocabulary.
With tied embeddings, we use the same weights: W_U = W_E^T. It appeals to the ML brain because it appears to be an elegant way to reduce parameters and add symmetry.
Let's make this concrete with a toy example vocabulary:
Tokens are represented as one-hot vectors, where a single 1 indicates which token we're referring to:
The embedding matrix has shape (vocab_size, embedding_dim). Each row is the learned embedding for one token (these are made up):
To get a token's embedding, we multiply its one-hot vector by W_E. This just selects the corresponding row:
At the output, the unembedding matrix W_U converts a residual stream vector to logits. With tied embeddings, this is W_E^T:
Each logit is the dot product of the residual with a token's embedding. Higher dot product means the model thinks that token is more likely:
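Here's a minimal numpy sketch of both steps. The vocabulary and the embedding values are made up, as above:

```python
import numpy as np

# Toy setup: 4-token vocabulary, 3-dimensional embeddings (made-up values)
vocab = ["the", "cat", "sat", "mat"]
W_E = np.array([
    [0.2, -0.1, 0.5],   # "the"
    [0.7,  0.3, -0.2],  # "cat"
    [-0.4, 0.6, 0.1],   # "sat"
    [0.1,  0.8, 0.4],   # "mat"
])

# One-hot vector for "cat" (index 1): multiplying by W_E selects row 1
one_hot = np.array([0.0, 1.0, 0.0, 0.0])
embedding = one_hot @ W_E
assert np.allclose(embedding, W_E[1])

# At the output, the tied unembedding is W_E^T: each logit is the dot
# product of the residual stream vector with one token's embedding
residual = np.array([0.5, 0.2, -0.1])
logits = residual @ W_E.T   # one logit per vocab token
```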
In a transformer, information flows through attention and MLP layers. But residual connections mean that there's always a direct linear path from input to output:
Even in a deep model, this path exists. If we ignore all the attention/MLP layers, the model is simply computing logits = one_hot @ W_E @ W_U.
The matrix W_E @ W_U is a vocab_size × vocab_size matrix. Entry [i, j] tells us: given input token i, what's the logit for output token j?
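We can see this directly with small random matrices (the shapes here are illustrative, not from any real model):

```python
import numpy as np

# Illustrative sizes: tiny vocab, tiny residual stream
rng = np.random.default_rng(0)
vocab_size, d_model = 4, 3
W_E = rng.normal(size=(vocab_size, d_model))   # embedding
W_U = rng.normal(size=(d_model, vocab_size))   # untied unembedding

# Ignoring attention/MLP, the direct path from input token to output logits
direct_path = W_E @ W_U            # shape (vocab_size, vocab_size)

# Entry [i, j]: given input token i, the logit for output token j
i, j = 1, 2
assert np.isclose(direct_path[i, j], W_E[i] @ W_U[:, j])
```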
"Given the current token, what's the most likely next token?"
This is called bigram statistics: the conditional probabilities P(next token | current token). A bigram is any two consecutive tokens in text.
For our vocabulary, realistic bigram probabilities look like this:
Notice the critical observation: bigrams are asymmetric. The order matters:
Visualizing this asymmetry. Notice how the matrix is NOT symmetric across the diagonal:
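A made-up bigram table for the toy vocabulary makes the asymmetry concrete (rows are the current token, columns the next token; the probabilities are invented for illustration):

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat"]
# Made-up bigram probabilities: rows = current token, cols = next token
bigram = np.array([
    [0.05, 0.60, 0.05, 0.30],  # after "the": mostly "cat" or "mat"
    [0.10, 0.05, 0.80, 0.05],  # after "cat": mostly "sat"
    [0.20, 0.05, 0.05, 0.70],  # after "sat": mostly "mat"
    [0.70, 0.10, 0.10, 0.10],  # after "mat": often "the"
])

# Order matters: P(cat | the) is large, P(the | cat) is small
p_cat_given_the = bigram[0, 1]   # 0.60
p_the_given_cat = bigram[1, 0]   # 0.10

# The matrix is NOT symmetric across the diagonal
assert not np.allclose(bigram, bigram.T)
```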
Here's the critical issue. With tied embeddings, W_U = W_E^T, so the direct path matrix becomes W_E @ W_E^T.
And W_E @ W_E^T is always symmetric. Let's see why.
This follows from basic linear algebra. The (i, j) entry of W_E @ W_E^T is the dot product of embedding rows i and j: (W_E @ W_E^T)[i, j] = e_i · e_j.
And dot products are commutative: e_i · e_j = e_j · e_i. Therefore (W_E @ W_E^T)[i, j] = (W_E @ W_E^T)[j, i].
No matter what values W_E has, W_E @ W_E^T is always symmetric.
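You can check this numerically for any random W_E (sizes here are arbitrary):

```python
import numpy as np

# Any embedding matrix at all: arbitrary vocab size and dimension
rng = np.random.default_rng(42)
W_E = rng.normal(size=(50, 8))

# The tied direct path W_E @ W_E^T is symmetric, always:
# entry (i, j) is the dot product of rows i and j, and dots commute
tied_direct_path = W_E @ W_E.T
assert np.allclose(tied_direct_path, tied_direct_path.T)
```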
This is bad. We lose expressivity because the input embedding and the output unembedding serve fundamentally different purposes. Tying them forces a single representation to do both jobs.
With untied embeddings, W_U is a separate learnable matrix. Now the direct path is W_E @ W_U,
where W_E and W_U are independent. This product can be any matrix, including asymmetric ones.
With untied embeddings, we can solve for a W_U that approximates the bigram probabilities:
The untied result can now represent asymmetric relationships:
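A sketch of that solve, using least squares against the (made-up) bigram table from earlier. With a full-rank square W_E the fit is exact; in a real model W_U is learned by gradient descent, not solved directly:

```python
import numpy as np

# Target: log-probabilities of the (made-up) asymmetric bigram matrix
bigram = np.array([
    [0.05, 0.60, 0.05, 0.30],
    [0.10, 0.05, 0.80, 0.05],
    [0.20, 0.05, 0.05, 0.70],
    [0.70, 0.10, 0.10, 0.10],
])
target_logits = np.log(bigram)

rng = np.random.default_rng(0)
W_E = rng.normal(size=(4, 4))   # full-rank square, so an exact solve exists

# Untied: solve W_E @ W_U = target_logits for W_U via least squares
W_U, *_ = np.linalg.lstsq(W_E, target_logits, rcond=None)

# The untied direct path reproduces the asymmetric targets...
assert np.allclose(W_E @ W_U, target_logits)
# ...while the tied path W_E @ W_E^T can never be asymmetric
assert np.allclose(W_E @ W_E.T, (W_E @ W_E.T).T)
```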
Smaller models still use tied embeddings because the nonlinearity added through the attention and MLP layers is enough to compensate, so in practice they don't bump hard against the symmetry constraint of the direct path.
Some smaller models also handle this by adding an MLP layer right after the embedding, before the residual stream, which breaks the symmetry as well.
MLP₀ learns a transformation M, making the effective path W_E @ M @ W_E^T, which CAN be asymmetric.
Let's see how adding an MLP transformation breaks the symmetry:
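A minimal sketch, standing in a random matrix M for whatever transformation the MLP learns:

```python
import numpy as np

rng = np.random.default_rng(1)
W_E = rng.normal(size=(4, 3))

# Tied path alone: always symmetric
tied = W_E @ W_E.T
assert np.allclose(tied, tied.T)

# Insert a transformation M (random here, learned by MLP₀ in a real model)
# between embedding and unembedding: W_E @ M @ W_E^T need not be symmetric
M = rng.normal(size=(3, 3))
with_mlp = W_E @ M @ W_E.T
assert not np.allclose(with_mlp, with_mlp.T)
```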
Explore the parameter savings from tied embeddings:
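Back-of-the-envelope at GPT-3 scale, using the vocab size and model width from the GPT-3 paper:

```python
# GPT-3 scale: vocab size and model width from the GPT-3 paper
vocab_size = 50257
d_model = 12288

# One embedding matrix: this is the ~617M figure quoted above
embedding_params = vocab_size * d_model
print(f"{embedding_params:,}")  # 617,558,016

# Tying W_U = W_E^T saves exactly one such matrix
total_params = 175e9
print(f"savings: {embedding_params / total_params:.2%} of all parameters")
```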
For today's larger models, this isn't a worthwhile tradeoff.
Remember, the entirety of GPT-3 is ~175 billion parameters. 617 million embedding parameters is a drop in the bucket if untying buys more expressivity.
| Aspect | Tied | Untied |
|---|---|---|
| Direct path matrix | W_E @ W_E^T (symmetric) | W_E @ W_U (any matrix) |
| Can represent bigrams? | Not directly | Yes |
| Parameters | Fewer (shared) | More (separate) |
| Memory | Less | More |
Tied embeddings are a practical tradeoff, not a principled design choice. They save parameters and memory, which matters for smaller models where early layers can compensate for the symmetry constraint. But for large models chasing maximum performance, the math is clear: language is asymmetric, and untied embeddings can represent that.