Ch 2 — File Formats: Safetensors, GGUF, and PyTorch

Three container formats, their binary layouts, and why the choice matters
Safetensors: The Modern Standard
Designed by Hugging Face for security and speed
Why Safetensors Exists
Safetensors was created by Hugging Face to replace PyTorch's pickle-based format. The key insight: a model file should contain only data, never executable code. Safetensors achieves this with a dead-simple layout: an 8-byte header length, a JSON header describing every tensor, and then raw binary tensor data. No code, no pickle, no surprises. The format passed a Trail of Bits security audit confirming it cannot execute arbitrary code.
Binary Layout
// Safetensors file structure:
Bytes 0-7:    uint64 LE  // Header size (N)
Bytes 8..8+N: JSON       // Tensor metadata
Bytes 8+N..:  Binary     // Raw tensor data
// That's it. Three sections. No code.
Key insight: The first 8 bytes tell you exactly how long the JSON header is. You can parse the entire file structure without reading a single weight — enabling fast metadata extraction via HTTP Range requests.
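The layout is simple enough to parse by hand. A minimal Python sketch, with a fabricated in-memory file and a hypothetical tensor name `w`:

```python
import json
import struct

def read_safetensors_header(blob: bytes) -> dict:
    # Bytes 0-7: little-endian uint64 giving the JSON header length
    (n,) = struct.unpack("<Q", blob[:8])
    # Bytes 8..8+n: the JSON header describing every tensor
    return json.loads(blob[8:8 + n])

# Build a tiny in-memory file: 8-byte length, JSON header, 16 bytes of "weights"
header = {"__metadata__": {"format": "pt"},
          "w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
hdr_bytes = json.dumps(header).encode("utf-8")
blob = struct.pack("<Q", len(hdr_bytes)) + hdr_bytes + bytes(16)

print(read_safetensors_header(blob)["w"]["shape"])  # → [2, 2]
```

This is also why Range-request metadata extraction works: fetch 8 bytes, learn N, fetch N more bytes, and you have the full tensor listing without downloading any weights.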
Inside the Safetensors JSON Header
Each tensor entry maps name → dtype, shape, and byte offsets
Header Structure
The JSON header is a dictionary where each key is a tensor name and each value describes that tensor's dtype (BF16, F16, F32, etc.), shape (dimension array), and data_offsets (start and end byte positions within the data section). A special __metadata__ key holds format info like {"format": "pt"}. This is everything the loader needs to mmap the file and extract any tensor without reading the others.
Header Example
{
  "__metadata__": { "format": "pt" },
  "model.embed_tokens.weight": {
    "dtype": "BF16",
    "shape": [128256, 4096],
    "data_offsets": [0, 1050673152]
  },
  "model.layers.0.self_attn.q_proj.weight": {
    "dtype": "BF16",
    "shape": [4096, 4096],
    "data_offsets": [1050673152, 1084227584]
  }
}
Key insight: Zero-copy memory mapping (mmap) means the OS maps the file directly into virtual memory. The framework can access any tensor by pointer arithmetic using the offsets — no deserialization, no copy. This makes loading a 16 GB model nearly instant.
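The offsets make tensor extraction pure pointer arithmetic. A stdlib-only sketch using a zero-copy memoryview slice in place of a real mmap (the tensor name and values are fabricated):

```python
import json
import struct

# Fabricate a file holding one 2x2 F32 tensor with values 0..3
header = {"w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
hdr = json.dumps(header).encode("utf-8")
data = struct.pack("<4f", 0.0, 1.0, 2.0, 3.0)
blob = struct.pack("<Q", len(hdr)) + hdr + data

# Parse the header, then slice the tensor out without copying
meta = json.loads(blob[8:8 + len(hdr)])
start, end = meta["w"]["data_offsets"]  # offsets are relative to...
base = 8 + len(hdr)                     # ...the start of the data section
w = struct.unpack("<4f", memoryview(blob)[base + start:base + end])
print(w)  # → (0.0, 1.0, 2.0, 3.0)
```

With a real file, `mmap.mmap` stands in for `blob`, and the OS pages each tensor's bytes in only when they are touched.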
GGUF: The Self-Contained Format
Built for llama.cpp — everything in one file
What Makes GGUF Different
GGUF (GPT-Generated Unified Format) was designed for the llama.cpp ecosystem and CPU/edge inference. Unlike Safetensors which stores weights alongside separate config and tokenizer files, GGUF is entirely self-contained — the tokenizer vocabulary, model architecture metadata, quantization info, and weight data are all in a single file. You need nothing else to run the model.
Key Design Goals
Self-contained: One file has everything.
Memory-mappable: Tensor data is aligned to 32-byte boundaries.
Extensible: The KV metadata system can store arbitrary key-value pairs.
Quantization-native: Built-in support for dozens of quantization types (Q4_0, Q4_K_M, Q8_0, etc.).
GGUF Binary Layout
// GGUF file structure (version 3):
Bytes 0-3:   "GGUF"  // Magic: 0x46554747
Bytes 4-7:   uint32  // Version (3)
Bytes 8-15:  uint64  // Tensor count
Bytes 16-23: uint64  // KV pair count
Then: KV pairs       // Metadata (variable)
Then: Tensor info    // Name, dims, type, offset
Then: Tensor data    // Aligned to 32 bytes
Key insight: The 24-byte header tells you immediately how many tensors and metadata entries to expect. The KV system is how GGUF embeds the tokenizer, architecture name, quantization type, and everything else that Safetensors stores in separate files.
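The fixed 24-byte header decodes with two struct calls. A sketch, fed a fabricated header that matches the counts in the hexdump later in this chapter:

```python
import struct

def parse_gguf_header(blob: bytes) -> tuple:
    # Bytes 0-3: magic "GGUF"; bytes 4-7: uint32 version
    magic, version = struct.unpack("<4sI", blob[:8])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    # Bytes 8-15: uint64 tensor count; bytes 16-23: uint64 KV pair count
    n_tensors, n_kv = struct.unpack("<QQ", blob[8:24])
    return version, n_tensors, n_kv

hdr = b"GGUF" + struct.pack("<IQQ", 3, 291, 23)  # fabricated example header
print(parse_gguf_header(hdr))  # → (3, 291, 23)
```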
GGUF’s KV Metadata System
How architecture, tokenizer, and quantization are all embedded
KV Pair Categories
GGUF stores metadata as typed key-value pairs with standardized key names. Each pair has a string key, a type tag (uint8 through float64, string, array), and a typed value. The keys follow a dot-separated namespace convention: general.architecture, llama.attention.head_count, tokenizer.ggml.model. This is how GGUF achieves self-containment — the tokenizer vocabulary, merge rules, and all config parameters live inside these KV pairs.
Common KV Keys
// Architecture metadata:
general.architecture: "llama"
general.name: "Meta-Llama-3.1-8B"
general.file_type: 15  // Q4_K_M

// Model dimensions:
llama.embedding_length: 4096
llama.block_count: 32
llama.attention.head_count: 32
llama.attention.head_count_kv: 8

// Tokenizer (embedded!):
tokenizer.ggml.model: "gpt2"
tokenizer.ggml.tokens: ["!", "\"", ...]
tokenizer.ggml.merges: ["Ġ t", ...]
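On disk, each pair is a length-prefixed key string, a uint32 type tag, then a typed value. A sketch that decodes one string-valued pair (type tag 8 is STRING in the GGUF spec; the sample bytes are fabricated):

```python
import struct

GGUF_STRING = 8  # GGUF value-type tag for strings

def read_gguf_string(buf: bytes, pos: int):
    # GGUF string: uint64 length, then that many UTF-8 bytes
    (n,) = struct.unpack_from("<Q", buf, pos)
    return buf[pos + 8:pos + 8 + n].decode("utf-8"), pos + 8 + n

def read_string_kv(buf: bytes, pos: int):
    key, pos = read_gguf_string(buf, pos)
    (vtype,) = struct.unpack_from("<I", buf, pos)
    if vtype != GGUF_STRING:
        raise NotImplementedError("sketch handles only string values")
    value, pos = read_gguf_string(buf, pos + 4)
    return key, value, pos

# Fabricated pair: general.architecture = "llama"
raw = (struct.pack("<Q", 20) + b"general.architecture"
       + struct.pack("<I", GGUF_STRING)
       + struct.pack("<Q", 5) + b"llama")
print(read_string_kv(raw, 0)[:2])  # → ('general.architecture', 'llama')
```

A full reader loops this over the KV count from the header, dispatching on the type tag; in practice the `gguf` Python package does this for you.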
PyTorch .bin: The Legacy Format
Pickle serialization — powerful but dangerous
How PyTorch Files Work
PyTorch's .bin format uses Python's pickle protocol to serialize tensors. Pickle is not a declarative data format like JSON — it's a stack-based programming language that can execute arbitrary Python functions during deserialization. When you call torch.load(), it runs the Pickle Virtual Machine (PVM), which can reconstruct objects by calling any Python callable.
The Security Problem
Attackers exploit pickle's __reduce__ protocol: a malicious class can return os.system with shell commands as arguments. When the file is loaded, the commands execute silently. Loading an untrusted .bin file is equivalent to running an untrusted Python script.
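The mechanism is easy to demonstrate harmlessly. Here `__reduce__` injects a call to the builtin `len`; swap in `os.system` with a shell command and the same load becomes an exploit:

```python
import pickle

class Malicious:
    def __reduce__(self):
        # At load time the PVM calls this callable with these args.
        # A real attack returns (os.system, ("<shell command>",)) instead.
        return (len, ("executed during unpickling",))

blob = pickle.dumps(Malicious())
result = pickle.loads(blob)  # no Malicious object comes back...
print(result)                # ...just the injected call's result: 26
```

Note that loading never reconstructs a `Malicious` instance at all; the attacker fully controls what the deserializer does.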
Why It Still Exists
# The dangerous path:
import torch
model = torch.load("untrusted_model.bin")
# ↑ This runs arbitrary code during load!

# The safe alternative:
from safetensors.torch import load_file
model = load_file("model.safetensors")
# ↑ Pure data read. No code execution.
Why it matters: PyTorch .bin persists due to ecosystem inertia — many older models and training scripts still produce it. Hugging Face now defaults to Safetensors and flags pickle-based uploads with security warnings, and since PyTorch 2.6 torch.load defaults to weights_only=True, which refuses to unpickle arbitrary objects. Always prefer Safetensors for models from untrusted sources.
How Sharding Works
Splitting large models across multiple files
Why Models Are Sharded
A 70B-parameter model in BF16 is ~140 GB. That's too large for a single file on many filesystems (FAT32 has a 4 GB limit), too slow to download in one chunk, and impractical for multi-GPU serving where different GPUs need different layers. Sharding splits the model across multiple files — each shard holds a subset of tensors. An index file maps every tensor name to its shard.
Index File Structure
// model.safetensors.index.json
{
  "metadata": { "total_size": 16060514304 },
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.16.mlp.gate_proj.weight": "model-00003-of-00004.safetensors"
    // ... every tensor → shard mapping
  }
}
Key insight: When you see "missing key" errors loading a model, the index file is the first place to check. It tells you exactly which shard should contain each tensor. Layers are typically packed contiguously — layers 0-7 in shard 1, 8-15 in shard 2, etc.
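Resolving a tensor to its shard is a single dictionary lookup in the index. A sketch against an abbreviated, hypothetical index:

```python
import json

index_json = """{
  "metadata": {"total_size": 16060514304},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
    "model.layers.16.mlp.gate_proj.weight": "model-00003-of-00004.safetensors"
  }
}"""
index = json.loads(index_json)

def shard_for(tensor_name: str) -> str:
    # A KeyError here is the loader's "missing key" error
    return index["weight_map"][tensor_name]

print(shard_for("model.layers.16.mlp.gate_proj.weight"))
# → model-00003-of-00004.safetensors
```

Loaders invert this mapping (shard → list of tensors) so each file is opened exactly once.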
Reading the First Bytes
How to identify a format from its magic bytes
Safetensors First 16 Bytes
// hexdump -C model.safetensors | head -1
00000000  a8 6a 00 00 00 00 00 00   // 0x6aa8 = 27304
// ↑ Header is 27,304 bytes of JSON
// Followed immediately by the JSON header
00000008  7b 22 ...                 // '{"' = JSON start
GGUF First 24 Bytes
// hexdump -C model.gguf | head -2
00000000  47 47 55 46               // Magic: "GGUF"
00000004  03 00 00 00               // Version: 3
00000008  23 01 00 00 00 00 00 00   // 291 tensors
00000010  17 00 00 00 00 00 00 00   // 23 KV pairs
PyTorch First 16 Bytes
// hexdump -C pytorch_model.bin | head -1
00000000  80 02   // Pickle protocol 2
// Or for newer files:
00000000  50 4b   // "PK" = ZIP archive
// PyTorch wraps pickle in a ZIP container
// Contains: archive/data.pkl + tensor files
Key insight: You can identify any model file format from its first 4 bytes: JSON brace {" after 8-byte header = Safetensors, GGUF = GGUF, PK or 80 02 = PyTorch pickle. This is useful when files are misnamed or lack extensions.
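Those rules condense into a small sniffer. A sketch (`\x80\x02` is a protocol-2 pickle; any raw pickle starts with `\x80`, and ZIP-wrapped PyTorch files start with `PK`):

```python
def sniff_format(head: bytes) -> str:
    """Guess a model file's format from its first ~16 bytes."""
    if head[:4] == b"GGUF":
        return "gguf"
    if head[:2] == b"PK":
        return "pytorch (zip-wrapped pickle)"
    if head[:1] == b"\x80":
        return "pytorch (raw pickle)"
    if len(head) > 8 and head[8:9] == b"{":
        return "safetensors"  # uint64 header length, then JSON brace
    return "unknown"

print(sniff_format(b"GGUF\x03\x00\x00\x00"))                    # → gguf
print(sniff_format(bytes.fromhex("a86a000000000000") + b'{"'))  # → safetensors
```

Handy when a file is misnamed, extension-less, or downloaded from a script you did not write.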
Choosing the Right Format
When to use each format — the decision matrix
Format Comparison
// Feature        | Safetensors | GGUF   | PyTorch
Security:          Safe ✓        Safe ✓   Unsafe ✗
Self-contained:    No ✗          Yes ✓    No ✗
Memory mapping:    Yes ✓         Yes ✓    No ✗
Quantization:      Limited       Native   No
GPU inference:     Best ✓        Slow     Good
CPU inference:     Poor          Best ✓   Poor
Sharding:          Yes ✓         Split    Yes ✓
Framework:         Any           ggml     PyTorch
Decision Rules
GPU inference with Hugging Face/vLLM? → Safetensors. It's the default, fastest to load, and framework-agnostic.

CPU/laptop inference with llama.cpp or Ollama? → GGUF. Self-contained with native quantization support, designed for edge/CPU workloads.

Existing pipeline that only outputs .bin? → Convert to Safetensors using huggingface_hub's convert utility. Never distribute .bin for untrusted consumption.

Training output? → Safetensors. Transformers' save_pretrained writes .safetensors by default, and safetensors.torch.save_file handles raw PyTorch state dicts.
Rule of thumb: Safetensors for the GPU cloud, GGUF for the laptop, and avoid PyTorch .bin unless you trust the source completely and have no alternative.