Ch 6 — Files & Versions

What’s in the repository — weight files, configs, tokenizers, and quantization variants
High Level: Overview → Config → Weights → Tokenizer → Quant → Signals
The Files Tab Overview
What you see when you click “Files and versions”
What You’ll Find
The “Files and versions” tab shows every file in the model’s Git repository (Hugging Face uses Git + Git LFS for large files). A typical model repo contains: README.md (the model card), config.json (architecture blueprint), tokenizer files (text encoding/decoding), model weight files (the actual trained parameters), and supporting files like generation_config.json, .gitattributes, and LICENSE. Large models may also have a USAGE_POLICY file.
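As a quick illustration, the files above can be bucketed by role with a few pattern rules. This is a sketch: the file names and rules below are typical examples, not tied to any specific repository.

```python
# Sketch: bucket a repo file listing by role. File names are illustrative.
from collections import defaultdict

ROLE_RULES = [
    ("card", lambda f: f == "README.md"),
    ("config", lambda f: f in ("config.json", "generation_config.json")),
    ("tokenizer", lambda f: f.startswith("tokenizer") or f == "special_tokens_map.json"),
    ("weights", lambda f: f.endswith((".safetensors", ".bin", ".pt", ".gguf"))),
]

def categorize(files):
    buckets = defaultdict(list)
    for f in files:
        role = next((r for r, match in ROLE_RULES if match(f)), "other")
        buckets[role].append(f)
    return dict(buckets)

files = ["README.md", "config.json", "tokenizer.json",
         "model-00001-of-00002.safetensors", ".gitattributes"]
print(categorize(files))
```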
Why It Matters
The Files tab is the engine room of the model. The Model Card tab tells you what the model does; the Files tab shows you what the model is at a technical level. This is where you verify the model’s actual size (by looking at file sizes), check the format (safetensors, GGUF, PyTorch), and confirm the architecture (by reading config.json). For quantized repos, this is where you find the specific variant you need.
Key insight: Think of the Model Card tab as the brochure and the Files tab as the parts list. Both are useful; the parts list tells you exactly what you’re getting.
Configuration Files
config.json, generation_config.json, and the architecture blueprint
config.json
The most important file after the weights. Loaded first by AutoModel.from_pretrained(). Contains: vocabulary size, hidden dimensions, number of layers, attention head counts, context length, normalization type, and position embedding configuration. Everything you need to understand the architecture is here. We covered the key fields in Chapter 3.
generation_config.json
Controls default inference behavior: temperature, top_p, repetition penalty, max new tokens, and stop sequences. This file tells you how the model creator intended the model to be used. A model with temperature: 0.0 was tuned for deterministic output. One with temperature: 0.7 was designed for creative generation.
Key insight: config.json tells you what the model is. generation_config.json tells you how the model should be run. Both are worth a quick scan before you start using the model.
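The quick scan of both files can be done with nothing but the JSON itself. The field names below are standard Transformers config keys; the values are made up for illustration.

```python
import json

# Illustrative config.json excerpt (values are invented for a small model).
config = json.loads("""{
  "hidden_size": 4096,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "vocab_size": 32000,
  "max_position_embeddings": 4096
}""")

# Illustrative generation_config.json excerpt.
gen = json.loads('{"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 512}')

print(f"{config['num_hidden_layers']} layers, hidden size {config['hidden_size']}, "
      f"context {config['max_position_embeddings']}")
# temperature 0.0 signals deterministic defaults; anything higher, creative defaults
print("creative defaults" if gen["temperature"] > 0.0 else "deterministic defaults")
```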
Weight Files: Safetensors vs. PyTorch vs. GGUF
Three formats, three use cases
The Three Formats
Safetensors (.safetensors): The modern standard. Fast loading, memory-mapped, and secure (no arbitrary code execution risk). Hugging Face recommends this format. Large models are sharded: model-00001-of-00005.safetensors, etc.

PyTorch (.bin or .pt): The legacy format. Uses Python pickle, which can execute arbitrary code when loaded — a security risk. Being phased out in favor of safetensors.

GGUF (.gguf): A self-contained format for llama.cpp and Ollama. Encodes both weights and metadata in one file. Designed for local/CPU inference. Supports quantized variants directly.
Which to Choose
GPU inference with Transformers: Safetensors. Always.
Local/CPU inference with Ollama or llama.cpp: GGUF.
PyTorch .bin: Only if safetensors isn’t available (older models).

The Hub has a built-in GGUF viewer that shows metadata and tensor information without downloading the file. Useful for inspecting quantization type and tensor shapes.
Key insight: Safetensors is the safe default for GPU inference. GGUF is the format for local/Ollama use. If you see only PyTorch .bin files and no safetensors, the model repo may be older or less maintained.
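The security difference is visible in the format itself. A .safetensors file starts with an 8-byte little-endian header length followed by a plain JSON header describing every tensor, so its metadata can be read without executing anything. The sketch below builds a minimal in-memory file to stay self-contained; the parsing logic follows the published safetensors layout.

```python
import json
import struct

# Build a minimal in-memory .safetensors blob so the sketch is self-contained:
# 8-byte little-endian header length, a JSON header, then raw tensor bytes.
header = {"weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
header_bytes = json.dumps(header).encode("utf-8")
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + b"\x00" * 16

def read_header(data):
    """Parse the JSON header of a safetensors blob without touching tensor data."""
    (n,) = struct.unpack("<Q", data[:8])
    return json.loads(data[8:8 + n].decode("utf-8"))

info = read_header(blob)
print(info["weight"]["shape"])  # tensor metadata, no deserialization of weights
```

Because the header is inert JSON rather than a pickle, inspecting it cannot run arbitrary code, which is exactly the property PyTorch .bin files lack.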
Tokenizer Files
The text encoding layer — how the model turns words into numbers
The Tokenizer Files
tokenizer.json: The main tokenizer file — contains the vocabulary and merge rules (for BPE tokenizers). This is what converts text to token IDs and back.

tokenizer_config.json: Configuration for special tokens (BOS, EOS, pad), chat template format, and preprocessing settings. The chat template defines how messages are formatted for instruction-tuned models.

special_tokens_map.json: Maps special token types (beginning-of-sequence, end-of-sequence) to specific tokens.
Why You Should Care
The tokenizer determines: vocabulary size (how many unique tokens the model knows), language coverage (does it handle non-English text efficiently?), and chat format (how to structure multi-turn conversations). If you’re using an instruction-tuned model, the chat template in tokenizer_config.json is critical — sending messages in the wrong format produces garbage output.
Key insight: The tokenizer is the model’s “ears” — if it doesn’t tokenize your language efficiently, the model will use more tokens (= higher cost, shorter effective context) and produce worse results. Check vocab size and language coverage.
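To see why the chat format matters, here is a minimal renderer. Real templates are stored as Jinja strings in tokenizer_config.json and applied with tokenizer.apply_chat_template(); this plain-Python version mimics a ChatML-style layout for illustration only.

```python
# Sketch of a ChatML-style chat template (illustrative, not any model's actual template).
def render_chatml(messages):
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    out.append("<|im_start|>assistant")  # cue the model to begin its reply
    return "\n".join(out)

msgs = [{"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi!"}]
print(render_chatml(msgs))
```

A model fine-tuned on this layout expects those exact delimiter tokens; sending a bare string instead is the "wrong format produces garbage output" failure mode described above.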
Quantization Variants
What Q4_K_M, Q5_K_S, GPTQ-Int4, and AWQ mean
GGUF Quantization Labels
Q4_K_M: 4-bit, K-quant, medium quality. Best balance of size and quality.
Q4_K_S: 4-bit, K-quant, small (lower quality). Saves ~0.5GB vs Q4_K_M.
Q5_K_M: 5-bit, higher quality than Q4. 15-20% larger than Q4_K_M.
Q8_0: 8-bit, near-lossless. Double the size of Q4 but minimal loss.
F16: Full 16-bit, no quantization. Original quality, largest size.
GPU Quantization Formats
GPTQ-Int4: 4-bit quantization optimized for CUDA GPUs. Uses calibration data to minimize quality loss. Load with AutoGPTQ or Transformers.

AWQ (Activation-Aware): 4-bit quantization that preserves important weights based on activation patterns. Retains ~95% of the original quality. Often the better choice for creative tasks.

The rule of thumb: Q4_K_M for GGUF/Ollama (best balance), GPTQ or AWQ for GPU inference (best throughput). Q5 or Q8 if you need higher quality and have the VRAM.
Key insight: At 4-bit quantization, models retain 90–98% of original quality while using 75% less storage. Q4_K_M is the community’s default recommendation for GGUF. For GPU: GPTQ for throughput, AWQ for quality.
Sharding & File Size
Why large models are split across multiple files
Sharded Weights
Large models split their weights across multiple files: model-00001-of-00005.safetensors through model-00005-of-00005.safetensors. This is necessary because Git LFS has file size limits and because sharding allows memory-efficient loading — you can load one shard at a time. The model.safetensors.index.json file maps which parameters live in which shard.
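The index file is plain JSON, so the shard layout can be inspected directly. The excerpt below is illustrative (parameter names and sizes are invented), but it follows the weight_map structure used by model.safetensors.index.json.

```python
import json

# Illustrative model.safetensors.index.json excerpt: weight_map ties each
# parameter name to the shard file that contains it.
index = json.loads("""{
  "metadata": {"total_size": 14000000000},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors"
  }
}""")

shards = sorted(set(index["weight_map"].values()))
print(f"{len(shards)} shards, {index['metadata']['total_size'] / 1e9:.1f} GB total")
```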
Reading File Sizes
The total size of all weight files tells you the actual download and storage cost. A 7B model at FP16 is about 14GB total. The same model as Q4_K_M GGUF is about 4.5GB. File size is a quick sanity check: if a repo claims to be a 70B model but the total weights are 8GB, something is off (it’s probably a heavily quantized variant or not actually 70B).
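The sanity check above is simple arithmetic: parameter count times bits per weight. The bits-per-weight figures below are assumptions for illustration (FP16 = 16; Q4_K_M lands near 4.8 effective bits per weight once K-quant overhead is included).

```python
# Back-of-envelope weight-file size estimate.
def estimated_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

print(f"7B @ FP16:   {estimated_gb(7e9, 16):.1f} GB")   # ~14 GB, matching the text
print(f"7B @ Q4_K_M: {estimated_gb(7e9, 4.8):.1f} GB")  # roughly 4-5 GB
```

Run the same arithmetic in reverse on a suspicious repo: 8GB of weights at 4 bits per weight implies roughly 16B parameters, nowhere near 70B.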
Key insight: File size = storage cost + download time + VRAM requirement. Before downloading, add up the weight files. If the total exceeds your available storage or VRAM, look for a quantized variant or a different repo.
Community Signals
Downloads, likes, discussions, and linked Spaces
Downloads
Download counts are tracked via specific “query files” (usually config.json) to avoid double-counting. Millions/month = widespread production use. Thousands/month = active community interest. Hundreds or less = niche or very new. GGUF files are an exception: each GGUF file counts individually because they’re self-contained. High download counts on a specific quantization variant tell you which variant the community prefers.
Other Signals
Likes: Community endorsement. Useful but can be inflated by hype. Community discussions: Active threads about bugs, performance, and usage tips are gold — they tell you real users are working with the model. Linked Spaces: Live demos where you can test the model without downloading. Last updated: A repo that hasn’t been updated in 6+ months may have unpatched issues or be superseded by newer versions.
Key insight: Community signals are the “Yelp reviews” of the model world. High downloads + active discussions + recent updates = healthy, production-tested model. Low downloads + no discussions + stale repo = proceed with caution.
Your Files Tab Checklist
What to verify before you download
The Quick Scan
1. Format: Safetensors (GPU), GGUF (local), or PyTorch .bin (legacy)? Match to your stack.

2. Total size: Add up weight files. Will it fit your storage and VRAM?

3. Quantization: If GGUF, which variant? Q4_K_M is the safe default. If GPU, GPTQ or AWQ?

4. Config check: Open config.json to verify architecture matches expectations (layers, heads, context).

5. Recency: When was the repo last updated? Stale repos may have issues.
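The five steps above can be sketched as a single check function. This is a hedged outline, not a real validator: the rules and thresholds are illustrative, and recency/quantization checks would need Hub metadata not shown here.

```python
# Sketch: the quick scan as code. Rules and names are illustrative only.
def quick_scan(files, total_weight_gb, vram_gb):
    issues = []
    if not any(f.endswith((".safetensors", ".gguf")) for f in files):
        issues.append("only legacy .bin/.pt weights (repo may be old)")
    if "config.json" not in files:
        issues.append("no config.json (cannot verify architecture)")
    if total_weight_gb > vram_gb:
        issues.append("weights exceed VRAM (look for a quantized variant)")
    return issues

print(quick_scan(["config.json", "model.safetensors"], 14.0, 8.0))
```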
Connecting the Dots
The Files tab completes the picture started by the Model Card tab. The card tells you what and why. The files tell you how and how much. Together, they give you everything needed to make an informed download decision. If the card is great but the files reveal only PyTorch .bin with no safetensors and no quantized variants, the model may not be ready for your use case.
Key insight: The Files tab is the “engine room” of a model page. config.json tells you the architecture, safetensors hold the weights, and the quant label tells you the tradeoff between quality and RAM. A 2-minute scan of this tab saves hours of debugging.