Why Sharing Weights Isn't Always Principled
In deep learning, we constantly trade compute for accuracy. Quantization sacrifices precision for speed. Distillation sacrifices some accuracy for a smaller, faster model. Weight sharing reduces parameters at the cost of expressivity.
Tied embeddings are one such trade: using the same weight matrix for both input embeddings and output predictions. It saves roughly 200M parameters for a typical LLM. For smaller models, this works surprisingly well.
But there's a fundamental mathematical reason why tied embeddings can't capture something as basic as "New York" being common while "York New" is rare. This isn't a training issue or a matter of more data. It's linear algebra.
Transformers have two matrices that deal with tokens: the embedding matrix W_E, which maps input tokens to vectors, and the unembedding matrix W_U, which maps the final residual stream back to logits over the vocabulary.
With tied embeddings, we use the same weights: W_U = W_E^T. This seems elegant. Fewer parameters, shared structure, and a nice symmetry between input and output.
Let's make this concrete with a tiny vocabulary:
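A minimal NumPy sketch; the vocabulary and every number below are invented for illustration:

```python
import numpy as np

# A toy four-token vocabulary, invented for this walkthrough
vocab = ["New", "York", "the", "city"]
vocab_size = len(vocab)   # 4
embedding_dim = 3         # tiny, just for the example
```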
Tokens are represented as one-hot vectors, where a single 1 indicates which token we're referring to:
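For example, the one-hot vector for "York" in this toy vocabulary:

```python
# One-hot vector for "York": a single 1 at York's index, zeros elsewhere
york_one_hot = np.zeros(vocab_size)
york_one_hot[vocab.index("York")] = 1.0
print(york_one_hot)   # [0. 1. 0. 0.]
```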
The embedding matrix W_E has shape (vocab_size, embedding_dim). Each row is the learned embedding for one token:
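Here's a made-up W_E for our four tokens (a real model learns these values):

```python
# Embedding matrix W_E with shape (vocab_size, embedding_dim); values are invented
W_E = np.array([
    [ 0.9,  0.1, -0.3],   # "New"
    [ 0.8,  0.2, -0.1],   # "York"
    [-0.2,  0.7,  0.5],   # "the"
    [ 0.1, -0.6,  0.8],   # "city"
])
print(W_E.shape)   # (4, 3)
```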
To get a token's embedding, we multiply its one-hot vector by W_E. This just selects the corresponding row:
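Continuing the sketch from above:

```python
# Multiplying the one-hot vector by W_E selects the matching row
york_embedding = york_one_hot @ W_E
print(york_embedding)             # [ 0.8  0.2 -0.1]
print(W_E[vocab.index("York")])   # the same row, indexed directly
```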
At the output, the unembedding matrix W_U converts a residual stream vector to logits. With tied embeddings, this matrix is W_E^T, so logits = residual @ W_E^T.
Each logit is the dot product of the residual with a token's embedding. Higher dot product means the model thinks that token is more likely:
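For instance, with an invented residual-stream vector:

```python
# An invented residual-stream vector at the final position
residual = np.array([0.7, 0.3, -0.2])

# With tied embeddings, the logits are dot products with each row of W_E
logits = residual @ W_E.T
for token, logit in zip(vocab, logits):
    print(f"{token:>5}: {logit:+.3f}")
```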
In a transformer, information flows through attention and MLP layers. But because of residual connections, there's always a direct linear path from input to output:
Even in a deep model, this path exists. If we ignore all the attention/MLP layers, the model computes:

logits = one_hot(token) @ W_E @ W_U
The matrix W_E @ W_U is a vocab_size × vocab_size matrix. Entry [i, j] tells us: given input token i, what's the logit for output token j?
Let's interpret what this matrix means for our vocabulary:
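To read it off easily, here's a small helper (my own, not from any model code) that labels rows with the input token and columns with the output token; we'll reuse it below:

```python
def show_matrix(M, vocab):
    # Print a (vocab_size, vocab_size) matrix with token labels:
    # rows = current (input) token, columns = next (output) token
    print("        " + "".join(f"{t:>7}" for t in vocab))
    for token, row in zip(vocab, M):
        print(f"{token:>7} " + "".join(f"{v:7.2f}" for v in row))
```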
If a model had no attention or MLP layers, the only thing it could learn is: "Given the current token, what's the most likely next token?"
This is exactly bigram statistics: conditional probabilities P(next_token | current_token). A bigram is any two consecutive tokens in text.
For our vocabulary, realistic bigram probabilities look like this:
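These numbers are invented, but they capture the pattern we care about: "New" is almost always followed by "York", while "York" is almost never followed by "New":

```python
# Illustrative bigram probabilities P(next | current); each row sums to 1
bigram_probs = np.array([
    #  New   York   the   city
    [0.01, 0.90, 0.04, 0.05],   # after "New"
    [0.02, 0.01, 0.47, 0.50],   # after "York"
    [0.30, 0.05, 0.05, 0.60],   # after "the"
    [0.10, 0.10, 0.60, 0.20],   # after "city"
])
show_matrix(bigram_probs, vocab)
```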
Notice the critical observation: bigrams are asymmetric. The order matters: P(York | New) is high, while P(New | York) is close to zero.
Visualizing this asymmetry. Notice how the matrix is NOT symmetric across the diagonal:
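A quick matplotlib heatmap stands in for the original figure (the plotting helper is my own):

```python
import matplotlib.pyplot as plt

def show_heatmap(M, vocab, title):
    # Heatmap of a (vocab_size, vocab_size) matrix with token labels
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.imshow(M)
    ax.set_xticks(range(len(vocab)))
    ax.set_xticklabels(vocab)
    ax.set_yticks(range(len(vocab)))
    ax.set_yticklabels(vocab)
    ax.set_xlabel("next token")
    ax.set_ylabel("current token")
    ax.set_title(title)
    plt.show()

show_heatmap(bigram_probs, vocab, "Bigram probabilities (asymmetric)")
```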
Here's the critical issue. With tied embeddings, W_U = W_E^T, so the direct path matrix becomes W_E @ W_E^T.
And W_E @ W_E^T is always symmetric. Let's see why.
Computing W_E @ W_E^T with our embedding matrix:
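With the toy W_E from above:

```python
tied_direct = W_E @ W_E.T
show_matrix(tied_direct, vocab)
```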
Let's verify: is W_E @ W_E^T symmetric?
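A one-line check:

```python
print(np.allclose(tied_direct, tied_direct.T))   # True: equal to its own transpose
```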
Visualizing the tied result. Notice the symmetry across the diagonal:
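Reusing the heatmap helper from earlier:

```python
show_heatmap(tied_direct, vocab, "W_E @ W_E^T (symmetric)")
```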
This follows from basic linear algebra. The (i, j) entry of W_E @ W_E^T is the dot product of token i's embedding with token j's embedding:

(W_E @ W_E^T)[i, j] = embedding_i · embedding_j

And dot products are commutative: embedding_i · embedding_j = embedding_j · embedding_i. Therefore:

(W_E @ W_E^T)[i, j] = embedding_j · embedding_i = (W_E @ W_E^T)[j, i]
No matter what values W_E has, W_E @ W_E^T is always symmetric.
You might wonder: "SGD is powerful. Can't it find embeddings W_E such that W_E @ W_E^T approximates the bigram probabilities?"
The answer is no. This isn't an optimization issue or a matter of training longer. The constraint is mathematical:
The set of symmetric matrices is a subspace. No matter how SGD adjusts W_E, the product W_E @ W_E^T will always land in this subspace. It can never reach an asymmetric target.
Let's verify with random matrices. Every single one produces a symmetric result:
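A quick check with a handful of random embedding matrices (seeded for reproducibility):

```python
rng = np.random.default_rng(0)
for trial in range(5):
    W = rng.normal(size=(vocab_size, embedding_dim))
    product = W @ W.T
    print(f"trial {trial}: symmetric = {np.allclose(product, product.T)}")   # always True
```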
With untied embeddings, W_U is a separate learnable matrix. Now the direct path is:

logits = one_hot(token) @ W_E @ W_U

where W_E and W_U are independent. This product can be any matrix (up to the rank limit set by the embedding dimension), including asymmetric ones.
With untied embeddings, we can solve for a W_U that approximates the bigram probabilities:
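One way to do this in the toy setup: keep W_E fixed, treat the log bigram probabilities as target logits, and fit W_U by least squares. This recipe is just for illustration; a real model learns both matrices jointly.

```python
# Fit W_U so that W_E @ W_U approximates the log bigram probabilities
target_logits = np.log(bigram_probs)
W_U, *_ = np.linalg.lstsq(W_E, target_logits, rcond=None)
untied_direct = W_E @ W_U
print(W_U.shape)   # (3, 4)
```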
The untied result can now represent asymmetric relationships:
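Printing the fitted direct path:

```python
show_matrix(untied_direct, vocab)
print(np.allclose(untied_direct, untied_direct.T))   # False: no symmetry constraint
```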
The key comparison. Can we represent "New→York ≠ York→New"?
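In the toy example:

```python
i, j = vocab.index("New"), vocab.index("York")
print(f"tied:   {tied_direct[i, j]:.3f} vs {tied_direct[j, i]:.3f}")      # identical by construction
print(f"untied: {untied_direct[i, j]:.3f} vs {untied_direct[j, i]:.3f}")  # free to differ
```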
If tied embeddings have this limitation, why do smaller models still use them?
The answer: MLP₀ can break the symmetry. Instead of the direct path being just W_E @ W_E^T, it becomes:

W_E → MLP₀ → W_E^T

MLP₀ learns a transformation M, making the effective path W_E @ M @ W_E^T, which CAN be asymmetric.
Let's see how adding an MLP transformation breaks the symmetry:
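A rough sketch: a random square matrix M stands in for the (linearized) effect of MLP₀ on the residual stream:

```python
# Any non-symmetric transformation between W_E and W_E^T breaks the symmetry
M = rng.normal(size=(embedding_dim, embedding_dim))
mlp_path = W_E @ M @ W_E.T
print(np.allclose(mlp_path, mlp_path.T))   # False: the product is no longer symmetric
```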
But this workaround has a cost. MLP capacity that could be used for reasoning or knowledge is instead spent "undoing" the embedding constraint.
To get a rough sense of the parameter savings from tied embeddings:
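A back-of-the-envelope calculation with assumed (not measured) model sizes:

```python
# Hypothetical model dimensions, chosen only to illustrate the scale
vocab_size_llm = 100_000
d_model = 2_048

saved = vocab_size_llm * d_model   # tying removes one full copy of the embedding matrix
print(f"Tying saves about {saved / 1e6:.0f}M parameters")   # ~205M
```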
| Aspect | Tied | Untied |
|---|---|---|
| Direct path matrix | W_E @ W_E^T (symmetric) | W_E @ W_U (any matrix) |
| Can represent bigrams? | Not directly | Yes |
| Parameters | Fewer (shared) | More (separate) |
| Memory | Less | More |
Tied embeddings are a practical tradeoff, not a principled design choice. They work because small models don't rely heavily on the direct path, and MLP₀ can partially compensate. But mathematically, tying embeddings forces the direct path to be symmetric when language is fundamentally asymmetric.