Why Models Are Sharded
A 70B-parameter model in BF16 is ~140 GB. That's too large for a single file on many filesystems (FAT32 has a 4 GB limit), too slow to download in one chunk, and impractical for multi-GPU serving where different GPUs need different layers. Sharding splits the model across multiple files — each shard holds a subset of tensors. An index file maps every tensor name to its shard.
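The arithmetic above can be sketched directly. This is a back-of-the-envelope estimate, not any library's actual sharding logic; the 5 GB per-shard cap is an illustrative value (it matches a common default in saving tools, but check your tooling):

```python
import math

# Rough size and shard-count math for a 70B-parameter model in BF16.
params = 70e9
bytes_per_param = 2               # BF16 = 2 bytes per parameter
total_bytes = params * bytes_per_param

max_shard = 5 * 1024**3           # illustrative 5 GiB cap per shard
num_shards = math.ceil(total_bytes / max_shard)

print(f"total: {total_bytes / 1e9:.0f} GB across ~{num_shards} shards")
```

Real checkpoints round shard boundaries to whole tensors, so actual shard counts and sizes vary slightly from this estimate.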
Index File Structure
// model.safetensors.index.json
{
  "metadata": {
    "total_size": 16060514304
  },
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.16.mlp.gate_proj.weight": "model-00003-of-00004.safetensors"
    // ... every tensor → shard mapping
  }
}
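Resolving a tensor name to its shard is a plain dictionary lookup on `weight_map`. A minimal sketch, assuming an index file shaped like the one above (the function name is ours, not part of any library):

```python
import json

def shard_for(index_path: str, tensor_name: str) -> str:
    """Return the shard file that should contain tensor_name,
    according to the index's weight_map."""
    with open(index_path) as f:
        index = json.load(f)
    weight_map = index["weight_map"]
    if tensor_name not in weight_map:
        raise KeyError(f"{tensor_name!r} not listed in {index_path}")
    return weight_map[tensor_name]
```

For the example index, `shard_for(path, "model.layers.16.mlp.gate_proj.weight")` would return `"model-00003-of-00004.safetensors"`.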
Key insight: When you hit "missing key" errors while loading a model, check the index file first — it tells you exactly which shard should contain each tensor. Layers are typically packed contiguously: layers 0-7 in shard 1, layers 8-15 in shard 2, and so on.
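When debugging, it can help to invert the weight map and see which layer numbers each shard covers. A hedged sketch: it assumes the conventional `model.layers.<n>.` naming seen in the index above, and the helper name is ours:

```python
import json
import re
from collections import defaultdict

def layers_per_shard(index_path: str) -> dict:
    """Group transformer-layer numbers by shard file, so you can
    see at a glance which shard should hold a given layer."""
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]

    shards = defaultdict(set)
    for name, shard in weight_map.items():
        m = re.search(r"model\.layers\.(\d+)\.", name)
        if m:  # skip non-layer tensors like embed_tokens
            shards[shard].add(int(m.group(1)))
    return {s: sorted(nums) for s, nums in sorted(shards.items())}
```

If a shard's layer range has a gap, or a layer you expect is absent from every shard, the checkpoint is likely incomplete or the index is stale.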