Ch 7 — The Model Selection Workflow

Card to decision — a practical checklist and side-by-side comparison walkthrough
High-level workflow: 7 Questions → Filter → Compare → Providers → Community → Decide
The 7-Question Checklist
Answer these before downloading any model
The Questions
Q1. Does the license allow my use case?
Q2. Is the parameter count feasible for my hardware?
Q3. Does the context length fit my application?
Q4. Do the benchmarks cover my task type?
Q5. Is the training data appropriate?
Q6. Is there a quantized variant that fits?
Q7. How active is the community?
The Order Matters
These questions are ordered by elimination efficiency. Q1 (license) kills the most candidates with the least effort. Q2 (hardware) eliminates the next largest set. By Q4, you’re comparing 3–5 finalists instead of 50 candidates. Don’t start with benchmarks (Q4) — that’s the mistake most people make. Start with constraints (Q1–Q3), then compare quality (Q4–Q5), then verify logistics (Q6–Q7).
Key insight: The checklist is a funnel, not a wish list. Start with hard constraints (license, hardware, context) that immediately eliminate incompatible models. Then compare the survivors on quality and community signals.
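The funnel can be sketched as code. A minimal Python sketch over hypothetical candidate metadata — the model names, constraint values, and field names below are illustrative, not real Hub data:

```python
# Hypothetical candidates; fields mirror the checklist's hard constraints (Q1-Q3).
CANDIDATES = [
    {"name": "model-a", "license": "apache-2.0",   "params_b": 8,  "context": 131072},
    {"name": "model-b", "license": "research-only", "params_b": 14, "context": 32768},
    {"name": "model-c", "license": "apache-2.0",   "params_b": 70, "context": 131072},
]

ALLOWED_LICENSES = {"apache-2.0", "mit"}  # Q1: commercial use allowed
MAX_PARAMS_B = 35                         # Q2: rough ceiling for the hardware
MIN_CONTEXT = 32_000                      # Q3: application requirement

def funnel(candidates):
    """Apply hard constraints in elimination-efficiency order (Q1 -> Q3)."""
    survivors = [m for m in candidates if m["license"] in ALLOWED_LICENSES]  # Q1
    survivors = [m for m in survivors if m["params_b"] <= MAX_PARAMS_B]      # Q2
    survivors = [m for m in survivors if m["context"] >= MIN_CONTEXT]        # Q3
    return survivors

print([m["name"] for m in funnel(CANDIDATES)])  # -> ['model-a']
```

Only after this hard-constraint pass do the surviving candidates move on to the quality comparison (Q4–Q5).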
Using Hub Filters Effectively
Narrowing 2 million models to a shortlist of 10
The Filter Stack
On huggingface.co/models, apply filters in this order:

1. Task: Text Generation, Text-to-Image, etc. (pipeline_tag)
2. Library: Transformers, GGUF, diffusers (library_name)
3. License: Apache 2.0, MIT, or your requirement
4. Language: English, multilingual, etc.
5. Sort by: Most Downloads or Trending

This typically reduces millions of models to hundreds. Then scan the first 2–3 pages for models from reputable organizations (Meta, Mistral, Google, Microsoft, Alibaba, Nous Research, etc.).
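The same filter stack maps onto the Hub's URL query parameters, so a shortlist query can be scripted. A sketch — the parameter names below are an assumption based on how huggingface.co/models URLs appear in practice; the web UI is the canonical interface:

```python
from urllib.parse import urlencode

# Assumed query parameters for huggingface.co/models, one per filter-stack step.
filters = {
    "pipeline_tag": "text-generation",  # 1. Task
    "library": "transformers",          # 2. Library
    "license": "license:apache-2.0",    # 3. License (tag-style value)
    "language": "en",                   # 4. Language
    "sort": "downloads",                # 5. Sort by Most Downloads
}
url = "https://huggingface.co/models?" + urlencode(filters)
print(url)
```

Bookmarking a URL like this turns the filter stack into a one-click, repeatable search.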
Beyond Filters
Filters get you to a shortlist. From there, open the top 3–5 model cards in tabs and compare them side by side. Look for: recency (when was it released?), base model family (Llama, Mistral, Qwen?), benchmark table coverage, and quality of the model card itself. A thorough, well-written card is a signal of a well-built model.
Key insight: Don’t search for “the best model.” Search for “the best model for my constraints.” Filter by your requirements first, then compare quality within the constraint set.
Side-by-Side Comparison
Walking through a real model selection: coding assistant on a single GPU
Scenario
You need a coding assistant. Constraints: commercial use, single 24GB GPU, 32K+ context, strong Python/JavaScript performance. You’ve filtered and found three candidates.
The Comparison
            Model A    Model B    Model C
Params      8B         14B        32B
License     Apache     MIT        Apache
Context     128K       32K        128K
HumanEval   72%        82%        88%
FP16 VRAM   16GB       28GB       64GB
Q4 VRAM     5GB        9GB        20GB
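The VRAM figures above follow from a simple rule of thumb: weight memory is parameters times bytes per parameter. A sketch of that rule (weights only; real usage adds KV-cache, activations, and, for Q4, quantization metadata, so budget roughly 10–30% on top):

```python
# Bytes per parameter at common precisions (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}

def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Weight memory only: params x bytes-per-param.
    KV-cache, activations, and quantization scales come on top."""
    return params_billion * BYTES_PER_PARAM[precision]

# The FP16 rows of the comparison table fall straight out of the rule:
print(weight_vram_gb(8, "fp16"), weight_vram_gb(14, "fp16"), weight_vram_gb(32, "fp16"))
# -> 16.0 28.0 64.0
```

The Q4 rows sit above the bare `params x 0.5` figure (e.g. 5GB rather than 4GB for the 8B model) because quantized formats carry per-block scales and the runtime adds overhead.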
The Analysis
Model C: Best benchmarks, but at FP16 it needs 64GB (doesn’t fit). At Q4 it needs 20GB — fits, but tight with KV-cache. Only works with aggressive quantization.

Model B: Good benchmarks, but at FP16 it needs 28GB — over the 24GB budget. It fits comfortably at Q4 (9GB). The 32K context might be tight for full codebase analysis but is fine for file-level work.

Model A: Lower benchmarks, but fits easily at FP16 with room for long context. The 128K window enables whole-project analysis.

The decision: No single “right” answer. If coding quality is paramount: Model C at Q4, or Model B at Q4 with headroom to spare. If context length matters: Model A at FP16. If easy deployment: Model A — it’s the only candidate that fits at FP16 on this GPU.
Key insight: Model selection is always a tradeoff between quality, hardware requirements, and features. The “best” model is the one that gives you the best results within your constraints.
Provider-Specific Documentation
Where to find equivalent information for OpenAI, Anthropic, Google, Meta
Finding the Info
OpenAI: platform.openai.com/docs/models; Model Spec: openai.com/index/the-model-spec
Anthropic: docs.anthropic.com/en/docs/about-claude/models; System Cards: anthropic.com/system-cards
Google: ai.google.dev/gemini-api/docs/models; Gemma cards on Hugging Face
Meta: ai.meta.com/llama (overview); full cards on Hugging Face repos
What’s Different
For API-based models (GPT-4o, Claude, Gemini Pro), the model card equivalent is the pricing page + model docs + system card. You won’t find config.json or weight files — instead, look for: token limits, rate limits, pricing per million tokens, supported features (function calling, vision, streaming), and safety documentation. The same 7-question checklist applies; just replace “hardware” with “budget” and “quantization” with “pricing tier.”
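For API models, the hardware-fit check becomes a budget check against per-million-token prices. A sketch of that arithmetic — the traffic numbers and prices below are hypothetical placeholders, not any provider’s real pricing:

```python
def monthly_cost_usd(requests_per_day: int, in_tokens: int, out_tokens: int,
                     price_in_per_m: float, price_out_per_m: float,
                     days: int = 30) -> float:
    """Estimated monthly spend from per-million-token prices."""
    per_request = (in_tokens * price_in_per_m + out_tokens * price_out_per_m) / 1_000_000
    return requests_per_day * per_request * days

# Hypothetical workload: 1,000 requests/day, 2K input / 500 output tokens,
# at placeholder prices of $3 per 1M input and $15 per 1M output tokens.
print(round(monthly_cost_usd(1000, 2000, 500, 3.0, 15.0), 2))  # -> 405.0
```

Running this for each candidate’s real prices gives the API-side equivalent of the VRAM comparison table.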
Key insight: The skill of reading model documentation transfers across all providers. Whether it’s a Hugging Face model card, an OpenAI model spec, or an Anthropic system card — you’re asking the same questions, just finding answers in different places.
Community Signals as Tiebreakers
When two models look equal on paper, the community breaks the tie
Signals to Check
Download trajectory: Is usage growing or declining? A model with 500K downloads last month but 100K this month is being replaced.

Discussion quality: Are people reporting success stories or filing bugs? Active Q&A means the model is being used in production.

Linked Spaces: Can you test it before downloading? Live demos are worth more than benchmark tables.

Third-party reviews: Has the model been covered by independent benchmarkers, tech blogs, or the Hugging Face blog?
Red Flags in Community
No discussions: Either nobody is using it, or the creator disabled comments. Neither is a great sign.

Unresolved bug reports: Multiple users reporting the same issue with no response from the creator.

Hype without substance: 10,000 likes but only 100 downloads. The model was upvoted for novelty, not utility.

Stale repo: No updates in 6+ months. The model is likely superseded.
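These red flags are mechanical enough to script over repo metadata. A sketch — the thresholds are illustrative judgment calls, and the field names are assumptions about whatever metadata source you use:

```python
from datetime import datetime, timedelta

def red_flags(likes: int, downloads: int, last_modified: datetime,
              unanswered_bug_reports: int) -> list[str]:
    """Illustrative community red-flag checks; tune thresholds per domain."""
    flags = []
    if likes > 100 and downloads < likes:
        flags.append("hype-without-substance")   # upvoted far more than used
    if datetime.now() - last_modified > timedelta(days=180):
        flags.append("stale-repo")               # 6+ months without updates
    if unanswered_bug_reports >= 3:
        flags.append("unresolved-bugs")          # same issues, no creator response
    return flags

# 10,000 likes, 100 downloads, last touched 400 days ago:
print(red_flags(10_000, 100, datetime.now() - timedelta(days=400), 0))
# -> ['hype-without-substance', 'stale-repo']
```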
Key insight: Community signals tell you what benchmarks can’t: is this model actually being used successfully by real people? A model with mediocre benchmarks but 10 million downloads per month is doing something right.
Test Before You Commit
Your shortlist isn’t final until you’ve tested the top 2–3 candidates
Testing Strategy
Step 1: Quick test. Use the HF widget or a linked Space to test with 3–5 prompts representative of your use case. Takes 5 minutes. Eliminates models that clearly don’t work for your task.

Step 2: Local test. Download the top 2–3 and test with 20–30 prompts from your actual domain. Score outputs on your own criteria (accuracy, format compliance, tone). Takes 1–2 hours. This is where you find the winner.

Step 3: Integration test. Deploy the winner in your actual pipeline and measure real-world performance over 100+ queries. Takes 1–2 days.
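Step 2 can be as simple as a scoring loop over your own prompts. A minimal sketch — `fake_model`, the test set, and the keyword-based scorer are all stand-ins for your real inference call and criteria:

```python
def score_output(output: str, expected_keywords: list[str]) -> float:
    """Crude per-prompt score: fraction of required keywords present.
    Replace with your own criteria (accuracy, format compliance, tone)."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in output.lower())
    return hits / len(expected_keywords)

def evaluate(generate, test_set) -> float:
    """Average score over domain prompts. `generate` stands in for your
    inference call (a transformers pipeline, an API client, etc.)."""
    scores = [score_output(generate(prompt), kws) for prompt, kws in test_set]
    return sum(scores) / len(scores)

# Toy stand-in model and a 2-prompt "domain" test set:
fake_model = lambda prompt: "def add(a, b): return a + b"
tests = [("Write a Python add function", ["def", "return"]),
         ("Write a JS add function", ["function"])]
print(evaluate(fake_model, tests))  # -> 0.5
```

Run the same `evaluate` over each finalist and the winner falls out of a number you actually trust, because the prompts are yours.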
What Benchmarks Miss
Benchmarks can’t tell you: whether the model’s default tone matches your application, how it handles your specific domain (medical, legal, finance), whether it follows your output format requirements, or how it behaves with your system prompt. Only testing on your data answers these questions.
Key insight: The model card gets you to a shortlist. Testing gets you to a decision. Never skip testing — the card tells you the model’s potential; testing tells you its actual performance on your task.
When to Reassess
Model selection is not a one-time decision
Triggers for Reassessment
New model release: Every 3–6 months, a new generation appears. Set a calendar reminder to check whether your model has been superseded.

Use case change: Your requirements evolved — new language, longer context, different task type.

Cost pressure: A smaller model can now match your current model’s quality at lower cost.

Community decline: Your model’s repo goes stale, downloads drop, maintainer stops responding.
The Switching Cost
Switching models isn’t free. You need to re-test, update prompts, potentially change infrastructure. The checklist helps minimize unnecessary switching by making your selection criteria explicit. If your current model still passes all 7 questions, don’t switch just because something new came out. Switch when a new model is meaningfully better on the dimensions that matter to your use case.
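The switching rule above can be made explicit: switch only when the challenger’s weighted gain on the dimensions that matter clears a threshold representing re-testing and migration cost. A sketch — the dimensions, weights, scores, and 10% threshold are all illustrative:

```python
def should_switch(current: dict, challenger: dict, weights: dict,
                  switching_cost: float = 0.10) -> bool:
    """Switch only if the weighted relative improvement exceeds an
    (illustrative) switching-cost threshold."""
    gain = sum(w * (challenger[k] - current[k]) / current[k]
               for k, w in weights.items())
    return gain > switching_cost

# Hypothetical scores (higher is better) on the dimensions that matter here:
current    = {"coding": 0.72, "context_k": 128}
challenger = {"coding": 0.75, "context_k": 128}
weights    = {"coding": 0.7, "context_k": 0.3}

print(should_switch(current, challenger, weights))  # -> False (marginal gain)
```

A 72% → 75% coding bump is a ~3% weighted gain — below the threshold, so the new release is not worth the migration.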
Key insight: The model landscape changes every 3–6 months. Build your selection workflow as a repeatable process, not a one-time decision. The 7-question checklist is a tool you’ll use again and again.
The Decision Framework
A model card is an engineering specification — read it like one
The Complete Workflow
Phase 1: Filter (2 min)
License → Hardware fit → Context length → Task type

Phase 2: Compare (10 min)
Benchmarks → Training data → Architecture → Side-by-side table

Phase 3: Validate (5 min)
Community signals → Discussions → Linked Spaces → Recency

Phase 4: Test (1–2 hours)
Quick test (widget) → Local test (your prompts) → Integration test (your pipeline)
The Mindset
A model card is not a sales brochure. It is an engineering specification. Read it like you would read a spec sheet for a hardware component: check compatibility first, compare performance second, verify with testing third. The practitioners who evaluate models most effectively are the ones who have a systematic process — not the ones who grab whatever is trending.
Key insight: A model card is not a sales brochure — it is an engineering specification. Read it like a spec sheet, not a blog post. The 7-question checklist turns model selection from an overwhelming choice into a manageable workflow.