Tied Embeddings
Weight tying means the input embedding matrix and the output head (lm_head) share the same physical parameters: they point to the same tensor in memory. The technique was introduced by Press & Wolf (2017) and used in GPT-2 and BERT. The benefit: you save vocab_size × hidden_size parameters. For Llama 3.1 8B (vocab 128,256, hidden 4,096), that would save ~525M parameters (~1.05 GB in BF16). The config field tie_word_embeddings controls this.
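In PyTorch terms, tying is just pointing the head's weight at the embedding's parameter. A minimal sketch with toy dimensions (the names embed_tokens / lm_head follow Llama-style conventions; this is an illustration, not any model's actual code):

```python
import torch.nn as nn

# Toy sizes; Llama 3.1 8B would be vocab_size=128256, hidden_size=4096.
vocab_size, hidden_size = 1000, 64

embed_tokens = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Tie: the head now reuses the embedding's parameter tensor.
lm_head.weight = embed_tokens.weight

# Same storage, so only one vocab-sized matrix exists.
assert lm_head.weight.data_ptr() == embed_tokens.weight.data_ptr()

# Counting unique parameters (by storage pointer) confirms the saving:
params = {p.data_ptr(): p for m in (embed_tokens, lm_head) for p in m.parameters()}
total = sum(p.numel() for p in params.values())
print(total)  # 64000 = vocab_size * hidden_size, not 2x
```

Untying is simply skipping the assignment, which leaves two independently trained vocab-sized matrices.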
Current Practice
Large modern open-source LLMs like Llama 3.1 and Mistral do NOT tie embeddings — they use tie_word_embeddings: false (smaller models often still tie, since the savings matter more at small scale). The embedding and lm_head are separate tensors. This gives the model more capacity: the input embedding can specialize in encoding token meaning, while the output head can specialize in predicting the next token.
Config Examples
// Llama 3.1 8B config.json:
"tie_word_embeddings": false
// → Two separate tensors in the file:
// model.embed_tokens.weight [128256, 4096]
// lm_head.weight [128256, 4096]
// Total: ~2 GB for both
// GPT-2 / smaller models:
"tie_word_embeddings": true
// → One tensor stored, shared:
// wte.weight [50257, 768]  (GPT-2's embedding name)
// lm_head.weight → same memory, not stored separately
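The "~2 GB for both" figure follows directly from the shapes above; quick arithmetic:

```python
# Each vocab-sized tensor in Llama 3.1 8B: [128256, 4096] in BF16 (2 bytes/param).
vocab_size, hidden_size, bytes_per_param = 128256, 4096, 2

one_tensor_bytes = vocab_size * hidden_size * bytes_per_param
print(one_tensor_bytes / 1e9)       # ~1.05 GB per tensor
print(2 * one_tensor_bytes / 1e9)   # ~2.10 GB for embed_tokens + lm_head untied
```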
Key insight: when you see tie_word_embeddings: false, expect two large vocab-sized tensors in the checkpoint. When it's true, the safetensors header will contain only one, and the framework re-creates the tie at load time by pointing the other module at the same tensor.
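You can check this without loading any weights by reading just the safetensors header: per the safetensors format, a file starts with an 8-byte little-endian length followed by that many bytes of JSON metadata. A sketch (the helper name and the example file path are my own, not a library API):

```python
import json
import struct

def safetensors_tensor_names(path):
    """Return the tensor names stored in a .safetensors file.

    Safetensors layout: first 8 bytes are a little-endian uint64 header
    length, followed by that many bytes of JSON describing each tensor.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [k for k in header if k != "__metadata__"]

# Hypothetical usage (file name is an assumption; large models are sharded,
# so embed_tokens and lm_head may live in different shards):
# names = safetensors_tensor_names("model-00001-of-00004.safetensors")
# Untied checkpoint → both the embedding and "lm_head.weight" appear;
# tied checkpoint → only the embedding tensor is stored.
```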