Ch 2 — The YAML Header

Metadata that machines read — the 10 fields that define a model’s identity
High Level: Structure → Identity → Task → Lineage → Eval → Discovery
The YAML Block
Those three dashes at the top of README.md are the most important lines in the file
What It Looks Like
Every model card on Hugging Face is a README.md file. At the very top, between two lines of ---, sits a block of YAML (officially "YAML Ain't Markup Language"). This is the structured metadata that Hugging Face parses automatically; everything below the second --- is free-form Markdown for humans. The YAML block is what powers search, filtering, the inference widget, and the sidebar badges you see on every model page.
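The split between YAML header and Markdown body can be sketched in a few lines of Python. This is a naive illustration using the standard library, not Hugging Face's actual parser (which uses a real YAML library); it only isolates the block between the two --- lines:

```python
import re

def split_model_card(readme_text):
    """Split a README.md into its YAML front matter and Markdown body.

    Naive sketch: only isolates the text between the two '---' lines;
    real YAML parsing requires a YAML library.
    """
    match = re.match(r"^---\n(.*?)\n---\n?(.*)$", readme_text, re.DOTALL)
    if not match:
        return None, readme_text  # no YAML header at all
    return match.group(1), match.group(2)

card = """---
license: apache-2.0
pipeline_tag: text-generation
---
# My Model

Free-form Markdown for humans.
"""

yaml_block, body = split_model_card(card)
# yaml_block holds the machine-readable metadata; body is prose for humans.
```

A card with no front matter at all returns `None` for the YAML block, which is itself a useful signal: that model will be invisible to every Hub filter.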
A Real Example
---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
library_name: transformers
base_model: meta-llama/Llama-3.1-8B
datasets:
  - HuggingFaceTB/cosmopedia
tags:
  - text-generation
  - llm
---
Key insight: If the YAML is the model’s passport, each field is a visa stamp. Without it, the model exists on the Hub but is essentially invisible to anyone searching or filtering.
license and language
The two fields that filter out 90% of models immediately
license
This field determines whether you can legally use the model. Common values: apache-2.0 (fully permissive, commercial use OK), mit (permissive), llama3.1 (Meta's community license, which requires a separate agreement from Meta for products exceeding 700 million monthly active users), gemma (Google's terms with prohibited use cases), cc-by-nc-4.0 (non-commercial only). If you're building a product, this is the first field to check. A model with cc-by-nc-4.0 cannot be used commercially, period.
language
Uses ISO 639-1 codes: en for English, zh for Chinese, fr for French. A model listing [en, de, fr] was trained on those languages. A model listing only [en] may produce garbage output in other languages, even if it technically generates text. Multilingual models typically list many codes or use multilingual as a tag.
Key insight: These two fields answer the two most fundamental questions: “Am I allowed to use this?” (license) and “Does it speak my language?” (language). If either answer is no, stop reading and move on.
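The two go/no-go questions can be written as a single gate function. The license set and field names below mirror the examples in this section; treat them as an illustrative sketch, not an exhaustive license policy:

```python
# Non-commercial license identifiers as used on the Hub (illustrative subset).
NON_COMMERCIAL = {"cc-by-nc-4.0", "cc-by-nc-sa-4.0", "cc-by-nc-nd-4.0"}

def passes_first_gate(metadata, need_commercial, my_language):
    """Answer the two fundamental questions: may I use it, and does it speak my language?"""
    license_id = metadata.get("license", "")
    languages = metadata.get("language", [])
    if need_commercial and license_id in NON_COMMERCIAL:
        return False  # non-commercial only: stop reading
    if my_language not in languages:
        return False  # model was not trained on your language
    return True

meta = {"license": "cc-by-nc-4.0", "language": ["en"]}
# An English cc-by-nc-4.0 model fails the gate for any commercial product.
```

A real check would also handle custom licenses (llama3.1, gemma), which are neither fully permissive nor flatly non-commercial and require reading the actual terms.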
pipeline_tag and library_name
What task does this model do, and how do I load it?
pipeline_tag
This tells you the model’s primary task. Common values for LLMs: text-generation (autoregressive generation), text2text-generation (encoder-decoder like T5), fill-mask (BERT-style masked language modeling). For other modalities: text-to-image (Stable Diffusion), automatic-speech-recognition (Whisper), image-classification, feature-extraction (embedding models). The pipeline tag also powers the interactive widget on the model page — it tells HF what kind of input box to show.
library_name
How to load the model in code. transformers (Hugging Face’s main library), diffusers (for diffusion models), sentence-transformers (for embeddings), peft (for adapters/LoRA), gguf (for llama.cpp format). This tells you which import to use: a transformers model loads with AutoModelForCausalLM.from_pretrained(), while a gguf model loads with llama.cpp or Ollama.
Key insight: If you see pipeline_tag: text-generation and library_name: transformers, you know immediately: “This is a standard LLM I can load with HF Transformers.” If you see library_name: gguf, you know: “This is for local inference with llama.cpp.”
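That mental dispatch can be written down as a lookup table. The mapping of library_name values to entry points follows the examples in this section; it is a sketch, not a complete registry of Hub libraries:

```python
# library_name -> typical way to load the model (illustrative, not exhaustive).
LOADERS = {
    "transformers": "AutoModelForCausalLM.from_pretrained(...)",
    "diffusers": "DiffusionPipeline.from_pretrained(...)",
    "sentence-transformers": "SentenceTransformer(...)",
    "peft": "PeftModel.from_pretrained(base_model, ...)",
    "gguf": "llama.cpp / Ollama (not loaded via a Python import)",
}

def loading_hint(metadata):
    """Translate a card's library_name field into a loading entry point."""
    library = metadata.get("library_name")
    return LOADERS.get(library, "unknown library: read the card's usage section")
```

For example, `loading_hint({"library_name": "gguf"})` immediately tells you this is a local-inference artifact, not something to pass to Transformers.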
base_model — Tracing the Family Tree
Where this model came from and why lineage matters
What It Tells You
The base_model field links to the parent model this one was derived from. A fine-tuned model will point to its foundation model; a quantized variant will point to the full-precision original. Example: base_model: meta-llama/Llama-3.1-8B tells you this is a derivative of Meta’s Llama 3.1 8B. You can click through to the base model to see its original card, benchmarks, and training data.
Following the Chain
Models often form a chain: Base → Fine-tune → Quantized. For example: meta-llama/Llama-3.1-8B (base) → NousResearch/Hermes-3-Llama-3.1-8B (fine-tune) → bartowski/Hermes-3-Llama-3.1-8B-GGUF (quantized). Each link in the chain inherits the upstream model’s strengths, weaknesses, and license terms. A fine-tune of a Llama model still carries the Llama Community License, regardless of what license the fine-tuner claims.
Key insight: Always follow the base_model chain to the root. The original model’s license and training data disclosures apply to every downstream derivative. A model can’t be “MIT licensed” if its base model has a more restrictive license.
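Following the chain to the root is mechanical once you have each card's metadata. In this sketch the Hub is mocked as a plain dict; a real version would fetch each card's metadata over the network (e.g. with huggingface_hub), which this example deliberately avoids:

```python
def root_base_model(model_id, cards, max_depth=10):
    """Walk base_model links until reaching a card with no parent.

    `cards` maps model IDs to their YAML metadata dicts (a mock of the Hub).
    """
    seen = set()
    while model_id in cards and "base_model" in cards[model_id]:
        if model_id in seen or len(seen) >= max_depth:
            break  # guard against cycles or absurdly long chains
        seen.add(model_id)
        model_id = cards[model_id]["base_model"]
    return model_id

# The Base -> Fine-tune -> Quantized chain from the example above, mocked:
hub = {
    "bartowski/Hermes-3-Llama-3.1-8B-GGUF": {"base_model": "NousResearch/Hermes-3-Llama-3.1-8B"},
    "NousResearch/Hermes-3-Llama-3.1-8B": {"base_model": "meta-llama/Llama-3.1-8B"},
    "meta-llama/Llama-3.1-8B": {"license": "llama3.1"},
}
```

Whatever license the root card declares (here llama3.1) is the one that constrains every model downstream of it.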
datasets and tags
What was it trained on, and how can you find it?
datasets
Lists the Hugging Face dataset IDs used for training. Example: datasets: [HuggingFaceTB/cosmopedia, allenai/dolma]. This lets you click through to the actual training data and inspect it. For fine-tuned models, this usually lists the fine-tuning dataset, not the base model’s pre-training data. Watch for models that don’t list any datasets — either the data is proprietary or the model creator didn’t document it.
tags
Free-form labels that help with discovery. Common useful tags: chat (instruction-tuned for conversation), code (trained on code), math (math-focused), gguf, 4bit, lora. Tags are not validated — anyone can add any tag. They’re useful for broad filtering but should not be trusted as ground truth. Cross-check tags against the actual card content.
Key insight: The datasets field is where transparency lives. A model that lists its training data lets you assess data quality, check for contamination (did they train on the benchmark test set?), and understand domain coverage. Opaque training data is a risk factor.
model-index — Automated Evaluation Results
Benchmark scores embedded directly in the YAML
How It Works
The model-index field embeds benchmark results directly in the YAML header: each entry names the model, a task, a dataset, and one or more metric values. Hugging Face parses these entries and displays them automatically on the model page, with badges showing their provenance — "verified" (run on HF infrastructure), "community" (submitted via PR), or "leaderboard" (from the Open LLM Leaderboard).
What to Look For
Check the badge type: verified results are more trustworthy than self-reported ones. Check the benchmark names: do they cover the tasks you care about? And check whether the results are suspiciously high — if a 7B model claims 95% on MMLU, be skeptical. The evaluation results also link to the benchmark dataset’s leaderboard, letting you compare this model against others on the same benchmark.
Key insight: Verified evaluation results (with the “verified” badge) are worth more than self-reported numbers. They were run on Hugging Face’s infrastructure with reproducible configurations. Self-reported results may have been tested under favorable conditions.
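Once the YAML is parsed, pulling scores out of a model-index is a matter of walking its nested structure. The sample below approximates the Hub's model-index schema as described in this section; field names and values are illustrative:

```python
def extract_scores(model_index):
    """Flatten a parsed model-index structure into (dataset, metric, value) rows."""
    rows = []
    for entry in model_index:
        for result in entry.get("results", []):
            dataset = result.get("dataset", {}).get("name", "?")
            for metric in result.get("metrics", []):
                rows.append((dataset, metric.get("type"), metric.get("value")))
    return rows

# Hypothetical parsed model-index entry (structure approximated for illustration).
sample = [{
    "name": "my-model",
    "results": [{
        "task": {"type": "text-generation"},
        "dataset": {"name": "MMLU", "type": "cais/mmlu"},
        "metrics": [{"type": "accuracy", "value": 68.2}],
    }],
}]
```

A flattened table like this is exactly what you sanity-check: are the benchmarks ones you care about, and are any values implausibly high for the model's size?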
How Metadata Powers Discovery
Why good YAML makes models findable
The Filtering System
When you go to huggingface.co/models and use the filters on the left sidebar, every filter maps directly to a YAML field. Filter by “Text Generation”? That’s pipeline_tag. Filter by “English”? That’s language. Filter by “Apache 2.0”? That’s license. Filter by “transformers”? That’s library_name. A model with no YAML metadata is a model that doesn’t appear in any filtered search.
The Widget Connection
The interactive widget on the model page (the text box where you can type a prompt and see output) is powered by pipeline_tag. If the tag says text-generation, you get a text input. If it says text-to-image, you get an image generation interface. If it says automatic-speech-recognition, you get a file upload for audio. No pipeline_tag = no widget = no way to test the model in-browser before downloading.
Key insight: Well-filled YAML metadata is a quality signal in itself. A model with complete metadata (license, language, pipeline_tag, base_model, datasets) was created by someone who understands the ecosystem and cares about discoverability. Sparse metadata often correlates with sparse documentation everywhere else.
Your YAML Reading Checklist
The 30-second scan that tells you whether to keep reading
The Quick Scan
When you land on a model page, read the YAML in this order:

1. license — Can I use this? (If non-commercial and you need commercial, stop.)
2. pipeline_tag — Is this the right task type?
3. language — Does it support my language?
4. base_model — What family does it belong to?
5. library_name — Can I load it with my stack?
6. datasets — What was it trained on?
7. tags — Any useful context?
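The seven checks above can be bundled into one triage function. The field names match the YAML fields discussed in this chapter; the pass/fail rules are deliberately simplified for illustration (e.g. checks 4, 6, and 7 flag missing documentation rather than legal or technical blockers):

```python
def triage(metadata, *, need_commercial, task, language, libraries):
    """Run the 7-step scan; return the list of checks that fail or are undocumented."""
    non_commercial = {"cc-by-nc-4.0", "cc-by-nc-sa-4.0", "cc-by-nc-nd-4.0"}
    problems = []
    if need_commercial and metadata.get("license") in non_commercial:
        problems.append("license")        # 1. not usable commercially
    if metadata.get("pipeline_tag") != task:
        problems.append("pipeline_tag")   # 2. wrong task type
    if language not in metadata.get("language", []):
        problems.append("language")       # 3. language not supported
    if "base_model" not in metadata:
        problems.append("base_model")     # 4. lineage undocumented
    if metadata.get("library_name") not in libraries:
        problems.append("library_name")   # 5. not loadable with your stack
    if not metadata.get("datasets"):
        problems.append("datasets")       # 6. training data undocumented
    if not metadata.get("tags"):
        problems.append("tags")           # 7. no context tags
    return problems

# The example card from the start of the chapter passes all seven checks:
card = {
    "license": "apache-2.0",
    "language": ["en"],
    "pipeline_tag": "text-generation",
    "library_name": "transformers",
    "base_model": "meta-llama/Llama-3.1-8B",
    "datasets": ["HuggingFaceTB/cosmopedia"],
    "tags": ["text-generation", "llm"],
}
```

An empty result means "candidate worth investigating"; any non-empty result is your cue to move on to the next model.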
When to Go Deeper
If all 7 checks pass, you have a candidate worth investigating. Now you move beyond the YAML: read the prose description, check the benchmark tables (Chapter 4), look at the Files tab (Chapter 6), and scan the Community discussions. If any of the 7 checks fail, move on to the next model — there are over 2 million models on the Hub. Your time is better spent finding a model that fits than trying to make a misfit work.
Key insight: The YAML header is the model’s passport — 10 fields that determine whether it even shows up in search, what widget it gets, and whether you’re legally allowed to use it. Master these fields and you can triage models in 30 seconds.