Ch 2 — File Formats: Safetensors, GGUF, and PyTorch

Three container formats, their binary layouts, and why the choice matters
Safetensors: The Modern Standard
Designed by Hugging Face for security and speed
Why Safetensors Exists
Safetensors was created by Hugging Face to replace PyTorch's pickle-based format. The key insight: a model file should contain only data, never executable code. Safetensors achieves this with a dead-simple layout: an 8-byte header length, a JSON header describing every tensor, and then raw binary tensor data. No code, no pickle, no surprises. The format passed a Trail of Bits security audit confirming it cannot execute arbitrary code.
Binary Layout
// Safetensors file structure:
Bytes 0-7:    uint64 LE  // Header size (N)
Bytes 8..8+N: JSON       // Tensor metadata
Bytes 8+N..:  Binary     // Raw tensor data
// That's it. Three sections. No code.
Key insight: The first 8 bytes tell you exactly how long the JSON header is. You can parse the entire file structure without reading a single weight — enabling fast metadata extraction via HTTP Range requests.
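The layout is simple enough to parse by hand. A minimal Python sketch, with a fabricated in-memory file and a hypothetical tensor name `w`:

```python
import json
import struct

def read_safetensors_header(blob: bytes) -> dict:
    # Bytes 0-7: little-endian uint64 giving the JSON header length
    (n,) = struct.unpack("<Q", blob[:8])
    # Bytes 8..8+n: the JSON header describing every tensor
    return json.loads(blob[8:8 + n])

# Build a tiny in-memory file: 8-byte length, JSON header, 16 bytes of "weights"
header = {"__metadata__": {"format": "pt"},
          "w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
hdr_bytes = json.dumps(header).encode("utf-8")
blob = struct.pack("<Q", len(hdr_bytes)) + hdr_bytes + bytes(16)

print(read_safetensors_header(blob)["w"]["shape"])  # → [2, 2]
```

This is also why Range-request metadata extraction works: fetch 8 bytes, learn N, fetch N more bytes, and you have the full tensor listing without downloading any weights.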
Inside the Safetensors JSON Header
Each tensor entry maps name → dtype, shape, and byte offsets
Header Structure
The JSON header is a dictionary where each key is a tensor name and each value describes that tensor's dtype (BF16, F16, F32, etc.), shape (dimension array), and data_offsets (start and end byte positions within the data section). A special __metadata__ key holds format info like {"format": "pt"}. This is everything the loader needs to mmap the file and extract any tensor without reading the others.
Header Example
{
  "__metadata__": { "format": "pt" },
  "model.embed_tokens.weight": {
    "dtype": "BF16",
    "shape": [128256, 4096],
    "data_offsets": [0, 1050673152]
  },
  "model.layers.0.self_attn.q_proj.weight": {
    "dtype": "BF16",
    "shape": [4096, 4096],
    "data_offsets": [1050673152, 1084227584]
  }
}
Key insight: Zero-copy memory mapping (mmap) means the OS maps the file directly into virtual memory. The framework can access any tensor by pointer arithmetic using the offsets — no deserialization, no copy. This makes loading a 16 GB model nearly instant.
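The offsets make tensor extraction pure pointer arithmetic. A stdlib-only sketch using a zero-copy memoryview slice in place of a real mmap (the tensor name and values are fabricated):

```python
import json
import struct

# Fabricate a file holding one 2x2 F32 tensor with values 0..3
header = {"w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
hdr = json.dumps(header).encode("utf-8")
data = struct.pack("<4f", 0.0, 1.0, 2.0, 3.0)
blob = struct.pack("<Q", len(hdr)) + hdr + data

# Parse the header, then slice the tensor out without copying
meta = json.loads(blob[8:8 + len(hdr)])
start, end = meta["w"]["data_offsets"]  # offsets are relative to...
base = 8 + len(hdr)                     # ...the start of the data section
w = struct.unpack("<4f", memoryview(blob)[base + start:base + end])
print(w)  # → (0.0, 1.0, 2.0, 3.0)
```

With a real file, `mmap.mmap` stands in for `blob`, and the OS pages each tensor's bytes in only when they are touched.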
GGUF: The Self-Contained Format
Built for llama.cpp — everything in one file
What Makes GGUF Different
GGUF (GPT-Generated Unified Format) was designed for the llama.cpp ecosystem and CPU/edge inference. Unlike Safetensors which stores weights alongside separate config and tokenizer files, GGUF is entirely self-contained — the tokenizer vocabulary, model architecture metadata, quantization info, and weight data are all in a single file. You need nothing else to run the model.
Key Design Goals
Self-contained: One file has everything.
Memory-mappable: Tensor data is aligned to 32-byte boundaries.
Extensible: The KV metadata system can store arbitrary key-value pairs.
Quantization-native: Built-in support for dozens of quantization types (Q4_0, Q4_K_M, Q8_0, etc.).
GGUF Binary Layout
// GGUF file structure (version 3):
Bytes 0-3:   "GGUF"  // Magic: 0x46554747
Bytes 4-7:   uint32  // Version (3)
Bytes 8-15:  uint64  // Tensor count
Bytes 16-23: uint64  // KV pair count
Then: KV pairs       // Metadata (variable)
Then: Tensor info    // Name, dims, type, offset
Then: Tensor data    // Aligned to 32 bytes
Key insight: The 24-byte header tells you immediately how many tensors and metadata entries to expect. The KV system is how GGUF embeds the tokenizer, architecture name, quantization type, and everything else that Safetensors stores in separate files.
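The fixed 24-byte header decodes with two struct calls. A sketch, fed a fabricated header that matches the counts in the hexdump later in this chapter:

```python
import struct

def parse_gguf_header(blob: bytes) -> tuple:
    # Bytes 0-3: magic "GGUF"; bytes 4-7: uint32 version
    magic, version = struct.unpack("<4sI", blob[:8])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    # Bytes 8-15: uint64 tensor count; bytes 16-23: uint64 KV pair count
    n_tensors, n_kv = struct.unpack("<QQ", blob[8:24])
    return version, n_tensors, n_kv

hdr = b"GGUF" + struct.pack("<IQQ", 3, 291, 23)  # fabricated example header
print(parse_gguf_header(hdr))  # → (3, 291, 23)
```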
GGUF’s KV Metadata System
How architecture, tokenizer, and quantization are all embedded
KV Pair Categories
GGUF stores metadata as typed key-value pairs with standardized key names. Each pair has a string key, a type tag (uint8 through float64, string, array), and a typed value. The keys follow a dot-separated namespace convention: general.architecture, llama.attention.head_count, tokenizer.ggml.model. This is how GGUF achieves self-containment — the tokenizer vocabulary, merge rules, and all config parameters live inside these KV pairs.
Common KV Keys
// Architecture metadata:
general.architecture: "llama"
general.name: "Meta-Llama-3.1-8B"
general.file_type: 15  // Q4_K_M

// Model dimensions:
llama.embedding_length: 4096
llama.block_count: 32
llama.attention.head_count: 32
llama.attention.head_count_kv: 8

// Tokenizer (embedded!):
tokenizer.ggml.model: "gpt2"
tokenizer.ggml.tokens: ["!", "\"", ...]
tokenizer.ggml.merges: ["Ġ t", ...]
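On disk, each pair is a length-prefixed key string, a uint32 type tag, then a typed value. A sketch that decodes one string-valued pair (type tag 8 is STRING in the GGUF spec; the sample bytes are fabricated):

```python
import struct

GGUF_STRING = 8  # GGUF value-type tag for strings

def read_gguf_string(buf: bytes, pos: int):
    # GGUF string: uint64 length, then that many UTF-8 bytes
    (n,) = struct.unpack_from("<Q", buf, pos)
    return buf[pos + 8:pos + 8 + n].decode("utf-8"), pos + 8 + n

def read_string_kv(buf: bytes, pos: int):
    key, pos = read_gguf_string(buf, pos)
    (vtype,) = struct.unpack_from("<I", buf, pos)
    if vtype != GGUF_STRING:
        raise NotImplementedError("sketch handles only string values")
    value, pos = read_gguf_string(buf, pos + 4)
    return key, value, pos

# Fabricated pair: general.architecture = "llama"
raw = (struct.pack("<Q", 20) + b"general.architecture"
       + struct.pack("<I", GGUF_STRING)
       + struct.pack("<Q", 5) + b"llama")
print(read_string_kv(raw, 0)[:2])  # → ('general.architecture', 'llama')
```

A full reader loops this over the KV count from the header, dispatching on the type tag; in practice the `gguf` Python package does this for you.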
PyTorch .bin: The Legacy Format
Pickle serialization — powerful but dangerous
How PyTorch Files Work
PyTorch's .bin format uses Python's pickle protocol to serialize tensors. Pickle is not a declarative data format like JSON — it's a stack-based programming language that can execute arbitrary Python functions during deserialization. When you call torch.load(), it runs the Pickle Virtual Machine (PVM), which can reconstruct objects by calling any Python callable.
The Security Problem
Attackers exploit pickle's __reduce__ protocol: a malicious class can return os.system with shell commands as arguments. When the file is loaded, the commands execute silently. Loading an untrusted .bin file is equivalent to running an untrusted Python script.
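The mechanism is easy to demonstrate harmlessly. Here `__reduce__` injects a call to the builtin `len`; swap in `os.system` with a shell command and the same load becomes an exploit:

```python
import pickle

class Malicious:
    def __reduce__(self):
        # At load time the PVM calls this callable with these args.
        # A real attack returns (os.system, ("<shell command>",)) instead.
        return (len, ("executed during unpickling",))

blob = pickle.dumps(Malicious())
result = pickle.loads(blob)  # no Malicious object comes back...
print(result)                # ...just the injected call's result: 26
```

Note that loading never reconstructs a `Malicious` instance at all; the attacker fully controls what the deserializer does.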
Why It Still Exists
# The dangerous path:
import torch
model = torch.load("untrusted_model.bin")
# ↑ This runs arbitrary code during load!

# The safe alternative:
from safetensors.torch import load_file
model = load_file("model.safetensors")
# ↑ Pure data read. No code execution.
Why it matters: PyTorch .bin persists due to ecosystem inertia — many older models and training scripts still produce it. Hugging Face now defaults to Safetensors and flags pickle-based uploads with security warnings, and since PyTorch 2.6 torch.load defaults to weights_only=True, which refuses to unpickle arbitrary objects. Always prefer Safetensors for models from untrusted sources.
How Sharding Works
Splitting large models across multiple files
Why Models Are Sharded
A 70B-parameter model in BF16 is ~140 GB. That's too large for a single file on many filesystems (FAT32 has a 4 GB limit), too slow to download in one chunk, and impractical for multi-GPU serving where different GPUs need different layers. Sharding splits the model across multiple files — each shard holds a subset of tensors. An index file maps every tensor name to its shard.
Index File Structure
// model.safetensors.index.json
{
  "metadata": { "total_size": 16060514304 },
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.16.mlp.gate_proj.weight": "model-00003-of-00004.safetensors"
    // ... every tensor → shard mapping
  }
}
Key insight: When you see "missing key" errors loading a model, the index file is the first place to check. It tells you exactly which shard should contain each tensor. Layers are typically packed contiguously — layers 0-7 in shard 1, 8-15 in shard 2, etc.
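Resolving a tensor to its shard is a single dictionary lookup in the index. A sketch against an abbreviated, hypothetical index:

```python
import json

index_json = """{
  "metadata": {"total_size": 16060514304},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
    "model.layers.16.mlp.gate_proj.weight": "model-00003-of-00004.safetensors"
  }
}"""
index = json.loads(index_json)

def shard_for(tensor_name: str) -> str:
    # A KeyError here is the loader's "missing key" error
    return index["weight_map"][tensor_name]

print(shard_for("model.layers.16.mlp.gate_proj.weight"))
# → model-00003-of-00004.safetensors
```

Loaders invert this mapping (shard → list of tensors) so each file is opened exactly once.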
Reading the First Bytes
How to identify a format from its magic bytes
Safetensors First 16 Bytes
// hexdump -C model.safetensors | head -1
00000000  a8 6a 00 00 00 00 00 00   // 0x6aa8 = 27304
// ↑ Header is 27,304 bytes of JSON
// Followed immediately by the JSON header
00000008  7b 22 ...                 // '{"' = JSON start
GGUF First 24 Bytes
// hexdump -C model.gguf | head -2
00000000  47 47 55 46               // Magic: "GGUF"
00000004  03 00 00 00               // Version: 3
00000008  23 01 00 00 00 00 00 00   // 291 tensors
00000010  17 00 00 00 00 00 00 00   // 23 KV pairs
PyTorch First 16 Bytes
// hexdump -C pytorch_model.bin | head -1
00000000  80 02   // Pickle protocol 2
// Or for newer files:
00000000  50 4b   // "PK" = ZIP archive
// PyTorch wraps pickle in a ZIP container
// Contains: archive/data.pkl + tensor files
Key insight: You can identify any model file format from its first 4 bytes: JSON brace {" after 8-byte header = Safetensors, GGUF = GGUF, PK or 80 02 = PyTorch pickle. This is useful when files are misnamed or lack extensions.
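Those rules condense into a small sniffer. A sketch (`\x80\x02` is a protocol-2 pickle; any raw pickle starts with `\x80`, and ZIP-wrapped PyTorch files start with `PK`):

```python
def sniff_format(head: bytes) -> str:
    """Guess a model file's format from its first ~16 bytes."""
    if head[:4] == b"GGUF":
        return "gguf"
    if head[:2] == b"PK":
        return "pytorch (zip-wrapped pickle)"
    if head[:1] == b"\x80":
        return "pytorch (raw pickle)"
    if len(head) > 8 and head[8:9] == b"{":
        return "safetensors"  # uint64 header length, then JSON brace
    return "unknown"

print(sniff_format(b"GGUF\x03\x00\x00\x00"))                    # → gguf
print(sniff_format(bytes.fromhex("a86a000000000000") + b'{"'))  # → safetensors
```

Handy when a file is misnamed, extension-less, or downloaded from a script you did not write.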
Choosing the Right Format
When to use each format — the decision matrix
Format Comparison
// Feature        | Safetensors | GGUF   | PyTorch
Security:          Safe ✓        Safe ✓   Unsafe ✗
Self-contained:    No ✗          Yes ✓    No ✗
Memory mapping:    Yes ✓         Yes ✓    No ✗
Quantization:      Limited       Native   No
GPU inference:     Best ✓        Slow     Good
CPU inference:     Poor          Best ✓   Poor
Sharding:          Yes ✓         Split    Yes ✓
Framework:         Any           ggml     PyTorch
Decision Rules
GPU inference with Hugging Face/vLLM? → Safetensors. It's the default, fastest to load, and framework-agnostic.

CPU/laptop inference with llama.cpp or Ollama? → GGUF. Self-contained with native quantization support, designed for edge/CPU workloads.

Existing pipeline that only outputs .bin? → Convert to Safetensors using huggingface_hub's convert utility. Never distribute .bin for untrusted consumption.

Training output? → Safetensors. Transformers' save_pretrained writes .safetensors by default, and safetensors.torch.save_file handles raw PyTorch state dicts.
Rule of thumb: Safetensors for the GPU cloud, GGUF for the laptop, and avoid PyTorch .bin unless you trust the source completely and have no alternative.