What to Look For
The training data section tells you what the model learned from. For base models, look for: data sources (web crawl, books, code repositories, Wikipedia), data mix (what proportion was code vs. natural-language text vs. math), total token count (Llama 3.1 was trained on roughly 15 trillion tokens), and the knowledge cutoff date (the model knows nothing about events after that point). For fine-tuned models, look for the fine-tuning dataset and its size.
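The checklist above can be turned into a quick automated scan. The sketch below is a minimal, illustrative helper: the regex patterns and field names are my assumptions about how cards typically phrase these details, not a standard model-card schema, so treat misses as "check by hand" rather than "not documented."

```python
import re

def audit_training_data_section(card_text: str) -> dict:
    """Scan a model card's text for the training-data details worth checking.
    Patterns are heuristic assumptions about common card phrasing."""
    checks = {
        # Data sources: common source names mentioned in cards
        "data_sources": bool(re.search(
            r"web crawl|common ?crawl|books|code repositor|wikipedia",
            card_text, re.IGNORECASE)),
        # Data mix: a percentage next to "code", "text", or "math"
        "data_mix": bool(re.search(
            r"\d+\s*%\s*(code|text|math)", card_text, re.IGNORECASE)),
        # Total token count, e.g. "15 trillion tokens"
        "token_count": bool(re.search(
            r"\d+(\.\d+)?\s*(trillion|billion|[TB])\s*tokens",
            card_text, re.IGNORECASE)),
        # Knowledge cutoff, e.g. "Knowledge cutoff: December 2023"
        "cutoff_date": bool(re.search(r"cut-?off", card_text, re.IGNORECASE)),
    }
    # If nothing at all is documented, treat the card as opaque
    checks["opaque"] = not any(checks.values())
    return checks

sample = (
    "Trained on ~15 trillion tokens of publicly available data "
    "(web crawl, books, code repositories). Knowledge cutoff: December 2023."
)
print(audit_training_data_section(sample))
```

A card that trips none of these checks is exactly the "opaque training data" risk case: you cannot assess coverage or freshness, so you should assume less, not more.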
Synthetic Data Flag
Increasingly, models are trained on synthetic data — data generated by other AI models. Cards may mention “synthetically generated” or reference datasets like “UltraChat” or “Cosmopedia.” Synthetic data isn’t inherently bad, but it means the model may have inherited biases or errors from the model that generated the training data. It’s a form of model incest — one model’s mistakes propagating to the next.
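The synthetic-data flag can be checked the same way. The marker list below is an illustrative assumption (it includes the dataset names mentioned above, plus a few generic phrases); real cards will use other wordings too, so a hit means "investigate further," not a verdict.

```python
# Hypothetical marker list: illustrative, not exhaustive or authoritative.
SYNTHETIC_MARKERS = (
    "synthetically generated",
    "synthetic data",
    "model-generated",
    "ultrachat",
    "cosmopedia",
)

def flags_synthetic_data(card_text: str) -> list[str]:
    """Return which synthetic-data markers appear in the card text."""
    text = card_text.lower()
    return [m for m in SYNTHETIC_MARKERS if m in text]

hits = flags_synthetic_data(
    "Fine-tuned on UltraChat and other synthetically generated dialogues."
)
print(hits)
```

If this returns anything, the follow-up question is always the same: which model generated the data, and what errors or biases might it have passed along?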
Key insight: A model is only as good as its training data. Opaque training data (no details provided) is a risk factor. Transparent training data (linked datasets, documented mix) lets you assess domain coverage, freshness, and potential contamination.