What to Look For
The training data section tells you what the model learned from. For base models, look for: data sources (web crawl, books, code repositories, Wikipedia), data mix (what proportion was code vs. natural-language text vs. math), total token count (Llama 3.1 was trained on roughly 15 trillion tokens), and the knowledge cutoff date (the model knows nothing about events after that point). For fine-tuned models, look for the fine-tuning dataset and its size.
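The checklist above can be turned into a quick automated scan. The sketch below is a minimal, illustrative helper: the regex patterns and field names are my assumptions about how cards typically phrase these details, not a standard model-card schema, so treat misses as "check by hand" rather than "not documented."

```python
import re

def audit_training_data_section(card_text: str) -> dict:
    """Scan a model card's text for the training-data details worth checking.
    Patterns are heuristic assumptions about common card phrasing."""
    checks = {
        # Data sources: common source names mentioned in cards
        "data_sources": bool(re.search(
            r"web crawl|common ?crawl|books|code repositor|wikipedia",
            card_text, re.IGNORECASE)),
        # Data mix: a percentage next to "code", "text", or "math"
        "data_mix": bool(re.search(
            r"\d+\s*%\s*(code|text|math)", card_text, re.IGNORECASE)),
        # Total token count, e.g. "15 trillion tokens"
        "token_count": bool(re.search(
            r"\d+(\.\d+)?\s*(trillion|billion|[TB])\s*tokens",
            card_text, re.IGNORECASE)),
        # Knowledge cutoff, e.g. "Knowledge cutoff: December 2023"
        "cutoff_date": bool(re.search(r"cut-?off", card_text, re.IGNORECASE)),
    }
    # If nothing at all is documented, treat the card as opaque
    checks["opaque"] = not any(checks.values())
    return checks

sample = (
    "Trained on ~15 trillion tokens of publicly available data "
    "(web crawl, books, code repositories). Knowledge cutoff: December 2023."
)
print(audit_training_data_section(sample))
```

A card that trips none of these checks is exactly the "opaque training data" risk case: you cannot assess coverage or freshness, so you should assume less, not more.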
Synthetic Data Flag
Increasingly, models are trained on synthetic data — data generated by other AI models. Cards may mention “synthetically generated” or reference datasets like “UltraChat” or “Cosmopedia.” Synthetic data isn’t inherently bad, but it means the model may have inherited biases or errors from the model that generated the training data. It’s a form of model incest — one model’s mistakes propagating to the next.
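The synthetic-data flag can be checked the same way. The marker list below is an illustrative assumption (it includes the dataset names mentioned above, plus a few generic phrases); real cards will use other wordings too, so a hit means "investigate further," not a verdict.

```python
# Hypothetical marker list: illustrative, not exhaustive or authoritative.
SYNTHETIC_MARKERS = (
    "synthetically generated",
    "synthetic data",
    "model-generated",
    "ultrachat",
    "cosmopedia",
)

def flags_synthetic_data(card_text: str) -> list[str]:
    """Return which synthetic-data markers appear in the card text."""
    text = card_text.lower()
    return [m for m in SYNTHETIC_MARKERS if m in text]

hits = flags_synthetic_data(
    "Fine-tuned on UltraChat and other synthetically generated dialogues."
)
print(hits)
```

If this returns anything, the follow-up question is always the same: which model generated the data, and what errors or biases might it have passed along?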
Key insight: A model is only as good as its training data. Opaque training data (no details provided) is a risk factor. Transparent training data (linked datasets, documented mix) lets you assess domain coverage, freshness, and potential contamination.