
Reading Model Cards Glossary

Core terminology for navigating AI model documentation.
A
Arena Elo
A rating derived from the Chatbot Arena platform where users compare model outputs side-by-side. Calculated like chess Elo ratings from millions of human votes; considered the most holistic single metric for real-world model quality.
AWQ (Activation-Aware Weight Quantization)
A GPU-optimized quantization method that identifies which weights are most important by analyzing activation patterns, then preserves those weights at higher precision. Retains more quality than GPTQ at the same bit width.
Apache 2.0
A permissive open-source license allowing commercial use, modification, and distribution with attribution. One of the most permissive licenses in the AI model ecosystem, alongside MIT.
B
base_model
A YAML metadata field specifying the upstream model that this model was fine-tuned from. Creates a lineage chain — the root model’s license applies to all derivatives.
Benchmark Saturation
When frontier models all score above 90% on a benchmark (e.g., original MMLU), making it unable to distinguish between them. Drives the creation of harder successors like MMLU-Pro.
BBH (BIG-Bench Hard)
A subset of challenging tasks from the BIG-Bench benchmark requiring multi-step reasoning. Part of the Open LLM Leaderboard v2 evaluation suite.
C
config.json
The architecture blueprint file in every model repository. Contains hidden_size, num_hidden_layers, num_attention_heads, num_key_value_heads, vocab_size, max_position_embeddings, and rope_theta — everything needed to reconstruct the model structure.
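An illustrative config.json fragment. The field names are the standard keys; the values sketched here resemble a Llama-3.1-8B-class model and are examples, not a reference:

```json
{
  "hidden_size": 4096,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "vocab_size": 128256,
  "max_position_embeddings": 131072,
  "rope_theta": 500000.0
}
```

Note num_key_value_heads < num_attention_heads here: that is the fingerprint of grouped-query attention (see GQA below).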
Context Length
The maximum number of tokens a model can process in a single pass, set by max_position_embeddings in config.json. Common values: 4K, 8K, 32K, 128K. Longer context = more memory usage and typically slower inference.
Community Card
A model card written by the community rather than the original model creator. Quality varies widely — always cross-check claims against the base model’s official card and independent evaluations.
D
Data Contamination
When benchmark test data leaks into a model’s training set, artificially inflating scores. A major red flag — v1 leaderboard benchmarks like MMLU are especially contaminated.
Dense Model
A model where all parameters are active for every token. Contrast with MoE models that only activate a subset. Llama 3.1, Gemma 2, and Qwen 2.5 are dense models.
F
Few-Shot (N-shot)
A benchmark testing condition where N examples are given in the prompt before the test question. 5-shot MMLU scores are typically 5–15 points higher than 0-shot. Always check the shot count when comparing scores.
FP16 (Half Precision)
16-bit floating point format, where each parameter uses 2 bytes. An 8B model at FP16 needs ~16GB of VRAM. The baseline precision for calculating memory requirements.
G
generation_config.json
A config file containing default inference parameters: temperature, top_p, top_k, max_new_tokens, repetition_penalty. Sets the model’s default behavior during text generation.
GGUF (GPT-Generated Unified Format)
A single-file model format designed for CPU and local inference with tools like llama.cpp and Ollama. Self-contained (includes tokenizer and config). Supports various quantization levels (Q4_K_M, Q5_K_S, Q8_0, etc.).
GPQA (Graduate-Level Google-Proof Q&A)
A PhD-level benchmark in physics, chemistry, and biology. Even domain experts score ~65%. The “hard” knowledge benchmark — model scores in the 40–60% range indicate strong reasoning.
GPTQ
A GPU-optimized post-training quantization method that compresses weights using calibration data. Faster throughput than AWQ but slightly lower quality. Common format for GPU deployment.
GQA (Grouped-Query Attention)
An attention mechanism where multiple query heads share a smaller number of key/value heads. The modern standard (Llama 3, Gemma 2, Qwen 2.5) — balances quality and memory efficiency. Identifiable when num_key_value_heads < num_attention_heads in config.json.
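The check described above can be sketched as a small helper. This is an illustrative classifier, not a library function — it just compares the two config.json fields; the example values are Llama-3.1-8B-style and assumed for demonstration:

```python
def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify the attention scheme from two config.json fields."""
    if num_key_value_heads == 1:
        return "MQA"  # all query heads share a single KV head
    if num_key_value_heads < num_attention_heads:
        return "GQA"  # query heads grouped over a smaller set of KV heads
    return "MHA"      # one KV head per query head (classic multi-head attention)

# 32 query heads sharing 8 KV heads -> GQA
print(attention_variant(32, 8))
```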
H
HumanEval
A code generation benchmark with 164 Python function-writing problems. Tests isolated function completion — a model can score 90%+ here but struggle with real-world multi-file coding tasks.
Hugging Face Hub
The largest open platform for sharing ML models, datasets, and Spaces. Each model has a Git-based repository with a README.md (model card), config files, and weight files.
I
IFEval (Instruction Following Eval)
A benchmark measuring how well a model follows explicit formatting and constraint instructions (“respond in JSON,” “use exactly 3 bullet points”). Part of the Open LLM Leaderboard v2.
Intended Use
The model card section describing what the model was designed for and its out-of-scope uses. Using a model outside its intended scope may produce unreliable results or violate the license.
L
License Laundering
Relabeling a fine-tuned model with a permissive license (e.g., Apache 2.0) while the base model has a restrictive license (e.g., cc-by-nc). The base model’s restrictions still apply legally.
Llama Community License
Meta’s custom license for Llama models. Allows commercial use but requires Meta’s permission for deployments serving 700M+ monthly active users. Not technically “open source” by OSI definition.
M
Model Card
A short, structured document accompanying an ML model, first proposed by Mitchell et al. (2019). The “nutrition label” for AI — describes what the model does, how it was trained, its limitations, and its intended use.
model-index
A YAML metadata field embedding benchmark results directly in the model card. Supports verified (independently confirmed), community (self-reported), and leaderboard badges.
Model Merge
Combining weights from multiple models without additional training, using methods like SLERP, TIES, or DARE (via tools like mergekit). Benchmark scores for merges are unreliable — always test directly.
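A minimal pure-Python sketch of the SLERP idea on two toy weight vectors — mergekit applies the same interpolation tensor-by-tensor across two models; this is a simplified illustration, not mergekit's implementation:

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation: t=0 returns v0, t=1 returns v1,
    intermediate t follows the arc between them rather than a straight line."""
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    dot = max(-1.0, min(1.0, dot))
    omega = math.acos(dot)
    if omega < eps:  # nearly parallel vectors: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Halfway between two orthogonal unit vectors stays on the unit circle
blended = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
```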
MMLU (Massive Multitask Language Understanding)
57 subjects, 16,000 multiple-choice questions. The “SAT for AI.” Now saturated above 90% for frontier models. Superseded by MMLU-Pro (10 choices, chain-of-thought required) for model differentiation.
MoE (Mixture of Experts)
An architecture where only a subset of parameters (experts) activates per token. Mixtral 8x7B: 47B total, 12.9B active. DeepSeek-R1: 671B total, 37B active. More capacity per FLOP but requires full parameter storage.
MQA (Multi-Query Attention)
An attention variant where all query heads share a single key/value head. Fastest inference but slightly lower quality than GQA. Used by some older/smaller models.
O
Open LLM Leaderboard
Hugging Face’s automated evaluation platform. v1 used saturated benchmarks (ARC, HellaSwag, MMLU). v2 uses harder tests (MMLU-Pro, GPQA, BBH, IFEval, MATH, MuSR). Results are independently evaluated and directly comparable.
Open Weight
A model that releases trained weights but not necessarily training code, data, or full reproducibility. Most models marketed as “open source” (like Llama) are technically open weight.
P
pipeline_tag
A YAML metadata field specifying the model’s primary task (e.g., text-generation, text-to-image, automatic-speech-recognition). Drives the Hugging Face inference widget and task-based filtering.
Q
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink model size and reduce memory requirements. A 4-bit quantized model retains 90–98% of quality at ~25% of the original size.
Q4_K_M
A GGUF quantization variant using 4-bit precision with K-quant medium strategy. Considered the best balance of quality and size for most GGUF use cases.
R
RAIL (Responsible AI License)
A license family with use-based restrictions — allows most uses but prohibits specific harmful applications (surveillance, weapons, etc.). More permissive than NC but more restrictive than Apache 2.0.
rope_theta
A config.json parameter controlling Rotary Position Embedding frequency. Higher values (500,000+) enable longer context lengths. A technical indicator of whether the model was trained for extended context.
S
Safetensors
The modern, secure weight file format for GPU deployment. No arbitrary code execution (unlike PyTorch .bin which uses pickle). Memory-mappable for fast loading. The recommended default format.
Sharding
Splitting large model weight files across multiple files (e.g., model-00001-of-00004.safetensors). An index file (model.safetensors.index.json) maps parameter names to shard files.
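A sketch of how the index file resolves a parameter to its shard. The dict below mirrors the structure of model.safetensors.index.json; the parameter names and total_size are hypothetical examples:

```python
# Hypothetical index, mirroring the structure of model.safetensors.index.json
index = {
    "metadata": {"total_size": 16060522496},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
        "lm_head.weight": "model-00004-of-00004.safetensors",
    },
}

def shard_for(param_name: str) -> str:
    """Look up which shard file holds a given parameter."""
    return index["weight_map"][param_name]

print(shard_for("lm_head.weight"))
```

Loaders read this mapping so they only open the shard(s) that contain the tensors they need.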
SWE-bench
A real-world coding benchmark that tests a model’s ability to resolve actual GitHub issues across popular Python repos. Much harder than HumanEval — frontier models score 20–50%. Tests end-to-end engineering, not just function writing.
Synthetic Data
Training data generated by other AI models. Not inherently bad, but may propagate upstream biases and errors. Look for dataset names like “UltraChat” or “Cosmopedia” as indicators.
System Card
Anthropic’s documentation format focused on safety evaluation — red teaming results, dangerous capabilities assessment, and risk mitigation. Complements model cards by focusing on “whether it’s safe” rather than “what it is.”
T
Tokenizer
The component that converts text to token IDs and back. Key files: tokenizer.json (vocabulary and merge rules), tokenizer_config.json (special tokens and chat template). The chat template defines conversation formatting.
V
VRAM
Video RAM on a GPU. The primary constraint for running models locally. Memory rule: Params × bytes-per-param + 25% overhead. An 8B model at Q4 needs ~5GB; at FP16 needs ~16GB.
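The memory rule above as a sketch, using decimal gigabytes. Note the often-quoted "~16GB for 8B at FP16" counts weights only; the 25% overhead term accounts for KV cache and activations at runtime:

```python
def vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 0.25) -> float:
    """Estimate total VRAM: weights (params x bytes-per-param) plus ~25% runtime overhead."""
    weights_gb = params_billion * bytes_per_param
    return weights_gb * (1 + overhead)

q4_total = vram_gb(8, 0.5)    # 4-bit is ~0.5 bytes/param -> ~5 GB total
fp16_total = vram_gb(8, 2.0)  # FP16 is 2 bytes/param -> 16 GB weights, ~20 GB total
```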
Y
YAML Metadata
The structured data block between --- delimiters at the top of a Hugging Face model card (README.md). Contains machine-readable fields like license, language, pipeline_tag, base_model, and model-index that power Hub search and filtering.
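An illustrative front-matter block. The field names are real Hub metadata keys; the model names and values are hypothetical:

```yaml
---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
base_model: meta-llama/Llama-3.1-8B
tags:
  - instruct
---
```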