Ch 7 — The Model Selection Workflow

Card to decision — a practical checklist and side-by-side comparison walkthrough
High-level workflow: 7 Questions → Filter → Compare → Providers → Community → Decide
The 7-Question Checklist
Answer these before downloading any model
The Questions
Q1. Does the license allow my use case?
Q2. Is the parameter count feasible for my hardware?
Q3. Does the context length fit my application?
Q4. Do the benchmarks cover my task type?
Q5. Is the training data appropriate?
Q6. Is there a quantized variant that fits?
Q7. How active is the community?
The Order Matters
These questions are ordered by elimination efficiency. Q1 (license) kills the most candidates with the least effort. Q2 (hardware) eliminates the next largest set. By Q4, you’re comparing 3–5 finalists instead of 50 candidates. Don’t start with benchmarks (Q4) — that’s the mistake most people make. Start with constraints (Q1–Q3), then compare quality (Q4–Q5), then verify logistics (Q6–Q7).
Key insight: The checklist is a funnel, not a wish list. Start with hard constraints (license, hardware, context) that immediately eliminate incompatible models. Then compare the survivors on quality and community signals.
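The funnel can be sketched as code. A minimal Python sketch over hypothetical candidate metadata — the model names, constraint values, and field names below are illustrative, not real Hub data:

```python
# Hypothetical candidates; fields mirror the checklist's hard constraints (Q1-Q3).
CANDIDATES = [
    {"name": "model-a", "license": "apache-2.0",   "params_b": 8,  "context": 131072},
    {"name": "model-b", "license": "research-only", "params_b": 14, "context": 32768},
    {"name": "model-c", "license": "apache-2.0",   "params_b": 70, "context": 131072},
]

ALLOWED_LICENSES = {"apache-2.0", "mit"}  # Q1: commercial use allowed
MAX_PARAMS_B = 35                         # Q2: rough ceiling for the hardware
MIN_CONTEXT = 32_000                      # Q3: application requirement

def funnel(candidates):
    """Apply hard constraints in elimination-efficiency order (Q1 -> Q3)."""
    survivors = [m for m in candidates if m["license"] in ALLOWED_LICENSES]  # Q1
    survivors = [m for m in survivors if m["params_b"] <= MAX_PARAMS_B]      # Q2
    survivors = [m for m in survivors if m["context"] >= MIN_CONTEXT]        # Q3
    return survivors

print([m["name"] for m in funnel(CANDIDATES)])  # -> ['model-a']
```

Only after this hard-constraint pass do the surviving candidates move on to the quality comparison (Q4–Q5).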
Using Hub Filters Effectively
Narrowing 2 million models to a shortlist of 10
The Filter Stack
On huggingface.co/models, apply filters in this order:

1. Task: Text Generation, Text-to-Image, etc. (pipeline_tag)
2. Library: Transformers, GGUF, diffusers (library_name)
3. License: Apache 2.0, MIT, or your requirement
4. Language: English, multilingual, etc.
5. Sort by: Most Downloads or Trending

This typically reduces millions of models to hundreds. Then scan the first 2–3 pages for models from reputable organizations (Meta, Mistral, Google, Microsoft, Alibaba, Nous Research, etc.).
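The same filter stack maps onto the Hub's URL query parameters, so a shortlist query can be scripted. A sketch — the parameter names below are an assumption based on how huggingface.co/models URLs appear in practice; the web UI is the canonical interface:

```python
from urllib.parse import urlencode

# Assumed query parameters for huggingface.co/models, one per filter-stack step.
filters = {
    "pipeline_tag": "text-generation",  # 1. Task
    "library": "transformers",          # 2. Library
    "license": "license:apache-2.0",    # 3. License (tag-style value)
    "language": "en",                   # 4. Language
    "sort": "downloads",                # 5. Sort by Most Downloads
}
url = "https://huggingface.co/models?" + urlencode(filters)
print(url)
```

Bookmarking a URL like this turns the filter stack into a one-click, repeatable search.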
Beyond Filters
Filters get you to a shortlist. From there, open the top 3–5 model cards in tabs and compare them side by side. Look for: recency (when was it released?), base model family (Llama, Mistral, Qwen?), benchmark table coverage, and quality of the model card itself. A thorough, well-written card is a signal of a well-built model.
Key insight: Don’t search for “the best model.” Search for “the best model for my constraints.” Filter by your requirements first, then compare quality within the constraint set.
Side-by-Side Comparison
Walking through a real model selection: coding assistant on a single GPU
Scenario
You need a coding assistant. Constraints: commercial use, single 24GB GPU, 32K+ context, strong Python/JavaScript performance. You’ve filtered and found three candidates.
The Comparison
            Model A    Model B    Model C
Params      8B         14B        32B
License     Apache     MIT        Apache
Context     128K       32K        128K
HumanEval   72%        82%        88%
FP16 VRAM   16GB       28GB       64GB
Q4 VRAM     5GB        9GB        20GB
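The VRAM figures above follow from a simple rule of thumb: weight memory is parameters times bytes per parameter. A sketch of that rule (weights only; real usage adds KV-cache, activations, and, for Q4, quantization metadata, so budget roughly 10–30% on top):

```python
# Bytes per parameter at common precisions (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}

def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Weight memory only: params x bytes-per-param.
    KV-cache, activations, and quantization scales come on top."""
    return params_billion * BYTES_PER_PARAM[precision]

# The FP16 rows of the comparison table fall straight out of the rule:
print(weight_vram_gb(8, "fp16"), weight_vram_gb(14, "fp16"), weight_vram_gb(32, "fp16"))
# -> 16.0 28.0 64.0
```

The Q4 rows sit above the bare `params x 0.5` figure (e.g. 5GB rather than 4GB for the 8B model) because quantized formats carry per-block scales and the runtime adds overhead.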
The Analysis
Model C: Best benchmarks, but at FP16 it needs 64GB (doesn’t fit). At Q4 it needs 20GB — fits, but tight with KV-cache. Only works with aggressive quantization.

Model B: Good benchmarks, but at FP16 it needs 28GB — over the 24GB budget. It fits comfortably at Q4 (9GB). The 32K context might be tight for full codebase analysis but is fine for file-level work.

Model A: Lower benchmarks, but fits easily at FP16 with room for long context. The 128K window enables whole-project analysis.

The decision: No single “right” answer. If coding quality is paramount: Model C at Q4, or Model B at Q4 with headroom to spare. If context length matters: Model A at FP16. If easy deployment: Model A — it’s the only candidate that fits at FP16 on this GPU.
Key insight: Model selection is always a tradeoff between quality, hardware requirements, and features. The “best” model is the one that gives you the best results within your constraints.
Provider-Specific Documentation
Where to find equivalent information for OpenAI, Anthropic, Google, Meta
Finding the Info
OpenAI: platform.openai.com/docs/models; Model Spec: openai.com/index/the-model-spec
Anthropic: docs.anthropic.com/en/docs/about-claude/models; System Cards: anthropic.com/system-cards
Google: ai.google.dev/gemini-api/docs/models; Gemma cards on Hugging Face
Meta: ai.meta.com/llama (overview); full cards on Hugging Face repos
What’s Different
For API-based models (GPT-4o, Claude, Gemini Pro), the model card equivalent is the pricing page + model docs + system card. You won’t find config.json or weight files — instead, look for: token limits, rate limits, pricing per million tokens, supported features (function calling, vision, streaming), and safety documentation. The same 7-question checklist applies; just replace “hardware” with “budget” and “quantization” with “pricing tier.”
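For API models, the hardware-fit check becomes a budget check against per-million-token prices. A sketch of that arithmetic — the traffic numbers and prices below are hypothetical placeholders, not any provider’s real pricing:

```python
def monthly_cost_usd(requests_per_day: int, in_tokens: int, out_tokens: int,
                     price_in_per_m: float, price_out_per_m: float,
                     days: int = 30) -> float:
    """Estimated monthly spend from per-million-token prices."""
    per_request = (in_tokens * price_in_per_m + out_tokens * price_out_per_m) / 1_000_000
    return requests_per_day * per_request * days

# Hypothetical workload: 1,000 requests/day, 2K input / 500 output tokens,
# at placeholder prices of $3 per 1M input and $15 per 1M output tokens.
print(round(monthly_cost_usd(1000, 2000, 500, 3.0, 15.0), 2))  # -> 405.0
```

Running this for each candidate’s real prices gives the API-side equivalent of the VRAM comparison table.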
Key insight: The skill of reading model documentation transfers across all providers. Whether it’s a Hugging Face model card, an OpenAI model spec, or an Anthropic system card — you’re asking the same questions, just finding answers in different places.
Community Signals as Tiebreakers
When two models look equal on paper, the community breaks the tie
Signals to Check
Download trajectory: Is usage growing or declining? A model with 500K downloads last month but 100K this month is being replaced.

Discussion quality: Are people reporting success stories or filing bugs? Active Q&A means the model is being used in production.

Linked Spaces: Can you test it before downloading? Live demos are worth more than benchmark tables.

Third-party reviews: Has the model been covered by independent benchmarkers, tech blogs, or the Hugging Face blog?
Red Flags in Community
No discussions: Either nobody is using it, or the creator disabled comments. Neither is a great sign.

Unresolved bug reports: Multiple users reporting the same issue with no response from the creator.

Hype without substance: 10,000 likes but only 100 downloads. The model was upvoted for novelty, not utility.

Stale repo: No updates in 6+ months. The model is likely superseded.
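These red flags are mechanical enough to script over repo metadata. A sketch — the thresholds are illustrative judgment calls, and the field names are assumptions about whatever metadata source you use:

```python
from datetime import datetime, timedelta

def red_flags(likes: int, downloads: int, last_modified: datetime,
              unanswered_bug_reports: int) -> list[str]:
    """Illustrative community red-flag checks; tune thresholds per domain."""
    flags = []
    if likes > 100 and downloads < likes:
        flags.append("hype-without-substance")   # upvoted far more than used
    if datetime.now() - last_modified > timedelta(days=180):
        flags.append("stale-repo")               # 6+ months without updates
    if unanswered_bug_reports >= 3:
        flags.append("unresolved-bugs")          # same issues, no creator response
    return flags

# 10,000 likes, 100 downloads, last touched 400 days ago:
print(red_flags(10_000, 100, datetime.now() - timedelta(days=400), 0))
# -> ['hype-without-substance', 'stale-repo']
```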
Key insight: Community signals tell you what benchmarks can’t: is this model actually being used successfully by real people? A model with mediocre benchmarks but 10 million downloads per month is doing something right.
Test Before You Commit
Your shortlist isn’t final until you’ve tested the top 2–3 candidates
Testing Strategy
Step 1: Quick test. Use the HF widget or a linked Space to test with 3–5 prompts representative of your use case. Takes 5 minutes. Eliminates models that clearly don’t work for your task.

Step 2: Local test. Download the top 2–3 and test with 20–30 prompts from your actual domain. Score outputs on your own criteria (accuracy, format compliance, tone). Takes 1–2 hours. This is where you find the winner.

Step 3: Integration test. Deploy the winner in your actual pipeline and measure real-world performance over 100+ queries. Takes 1–2 days.
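Step 2 can be as simple as a scoring loop over your own prompts. A minimal sketch — `fake_model`, the test set, and the keyword-based scorer are all stand-ins for your real inference call and criteria:

```python
def score_output(output: str, expected_keywords: list[str]) -> float:
    """Crude per-prompt score: fraction of required keywords present.
    Replace with your own criteria (accuracy, format compliance, tone)."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in output.lower())
    return hits / len(expected_keywords)

def evaluate(generate, test_set) -> float:
    """Average score over domain prompts. `generate` stands in for your
    inference call (a transformers pipeline, an API client, etc.)."""
    scores = [score_output(generate(prompt), kws) for prompt, kws in test_set]
    return sum(scores) / len(scores)

# Toy stand-in model and a 2-prompt "domain" test set:
fake_model = lambda prompt: "def add(a, b): return a + b"
tests = [("Write a Python add function", ["def", "return"]),
         ("Write a JS add function", ["function"])]
print(evaluate(fake_model, tests))  # -> 0.5
```

Run the same `evaluate` over each finalist and the winner falls out of a number you actually trust, because the prompts are yours.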
What Benchmarks Miss
Benchmarks can’t tell you: whether the model’s default tone matches your application, how it handles your specific domain (medical, legal, finance), whether it follows your output format requirements, or how it behaves with your system prompt. Only testing on your data answers these questions.
Key insight: The model card gets you to a shortlist. Testing gets you to a decision. Never skip testing — the card tells you the model’s potential; testing tells you its actual performance on your task.
When to Reassess
Model selection is not a one-time decision
Triggers for Reassessment
New model release: Every 3–6 months, a new generation appears. Set a calendar reminder to check whether your model has been superseded.

Use case change: Your requirements evolved — new language, longer context, different task type.

Cost pressure: A smaller model can now match your current model’s quality at lower cost.

Community decline: Your model’s repo goes stale, downloads drop, maintainer stops responding.
The Switching Cost
Switching models isn’t free. You need to re-test, update prompts, potentially change infrastructure. The checklist helps minimize unnecessary switching by making your selection criteria explicit. If your current model still passes all 7 questions, don’t switch just because something new came out. Switch when a new model is meaningfully better on the dimensions that matter to your use case.
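The switching rule above can be made explicit: switch only when the challenger’s weighted gain on the dimensions that matter clears a threshold representing re-testing and migration cost. A sketch — the dimensions, weights, scores, and 10% threshold are all illustrative:

```python
def should_switch(current: dict, challenger: dict, weights: dict,
                  switching_cost: float = 0.10) -> bool:
    """Switch only if the weighted relative improvement exceeds an
    (illustrative) switching-cost threshold."""
    gain = sum(w * (challenger[k] - current[k]) / current[k]
               for k, w in weights.items())
    return gain > switching_cost

# Hypothetical scores (higher is better) on the dimensions that matter here:
current    = {"coding": 0.72, "context_k": 128}
challenger = {"coding": 0.75, "context_k": 128}
weights    = {"coding": 0.7, "context_k": 0.3}

print(should_switch(current, challenger, weights))  # -> False (marginal gain)
```

A 72% → 75% coding bump is a ~3% weighted gain — below the threshold, so the new release is not worth the migration.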
Key insight: The model landscape changes every 3–6 months. Build your selection workflow as a repeatable process, not a one-time decision. The 7-question checklist is a tool you’ll use again and again.
The Decision Framework
A model card is an engineering specification — read it like one
The Complete Workflow
Phase 1: Filter (2 min)
License → Hardware fit → Context length → Task type

Phase 2: Compare (10 min)
Benchmarks → Training data → Architecture → Side-by-side table

Phase 3: Validate (5 min)
Community signals → Discussions → Linked Spaces → Recency

Phase 4: Test (1–2 hours)
Quick test (widget) → Local test (your prompts) → Integration test (your pipeline)
The Mindset
A model card is not a sales brochure. It is an engineering specification. Read it like you would read a spec sheet for a hardware component: check compatibility first, compare performance second, verify with testing third. The practitioners who evaluate models most effectively are the ones who have a systematic process — not the ones who grab whatever is trending.
Key insight: A model card is not a sales brochure — it is an engineering specification. Read it like a spec sheet, not a blog post. The 7-question checklist turns model selection from an overwhelming choice into a manageable workflow.