Ch 4 — The Open Source Model Landscape

Llama, Mistral, Qwen, Gemma, DeepSeek, Phi — how to compare open models pragmatically
Foundation
Journey: Scan → Compare → License → Fit → Ship
The Open Model Landscape Today
Capability now spans from tiny edge models to frontier reasoning models that can run on private infrastructure.
Capability Coverage
Open models now support core product categories: chat assistants, retrieval-backed Q&A, structured extraction, code generation, document workflows, and increasingly multimodal tasks. For many teams, the baseline question is no longer "can open models do this?" but "which open model tier is sufficient?"
Deployment Surfaces
The same project can run across very different environments: laptop prototyping, private VPC inference, regulated on-prem clusters, and edge endpoints. This flexibility is why open-weight adoption accelerated in enterprise and developer tooling.
What Actually Changed
Three layers matured together: models (better quality-per-parameter), runtimes (faster serving and quantized inference), and evaluation (stronger task-specific testing). Progress in one layer amplified progress in the others.
Practical Implication
Model selection became an engineering discipline: define task requirements, shortlist candidates, evaluate against private data, and deploy with rollback strategy. Teams that skip this process usually overpay or overfit to hype.
Key Point: The constraint has shifted from model availability to model selection discipline.
Families and Positioning
Different model families optimize for different tradeoffs: speed, quality, context, and licensing.
Common Families
Llama/Qwen: broad ecosystem coverage and strong community tooling.
Mistral: efficient deployment profiles and practical serving behavior.
Gemma/Phi: strong small-model efficiency for constrained hardware.
DeepSeek variants: often prioritized for reasoning-heavy workloads.
Why Families Differ
Each family makes different tradeoffs in tokenizer design, context scaling strategy, instruction tuning, and release policy. That means two models with similar parameter counts can behave very differently on the same production task.
Selection Lens
Use a four-axis filter before testing: quality target (task success), latency budget (p95 constraints), memory fit (VRAM/RAM), and license fit (distribution/commercial terms). Then compare model families inside that constraint box.
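The four-axis filter can be sketched as a hard-constraint pass followed by a size-ordered comparison. The model names, scores, and thresholds below are hypothetical placeholders, not measurements:

```python
# Hypothetical illustration of the four-axis filter: apply hard constraints first,
# then compare only the models that land inside the constraint box.
candidates = [
    # name, private-eval quality, p95 latency (ms), VRAM (GB), commercial use allowed
    {"name": "model-a-70b", "quality": 0.91, "p95_ms": 1400, "vram_gb": 80, "commercial_ok": True},
    {"name": "model-b-8b",  "quality": 0.84, "p95_ms": 350,  "vram_gb": 10, "commercial_ok": True},
    {"name": "model-c-7b",  "quality": 0.86, "p95_ms": 400,  "vram_gb": 9,  "commercial_ok": False},
]

REQUIREMENTS = {"min_quality": 0.80, "max_p95_ms": 500, "max_vram_gb": 24}

def fits(m):
    return (m["quality"] >= REQUIREMENTS["min_quality"]
            and m["p95_ms"] <= REQUIREMENTS["max_p95_ms"]
            and m["vram_gb"] <= REQUIREMENTS["max_vram_gb"]
            and m["commercial_ok"])

# Prefer the smallest model that clears every threshold.
shortlist = sorted((m for m in candidates if fits(m)), key=lambda m: m["vram_gb"])
print([m["name"] for m in shortlist])  # ['model-b-8b']
```

Note that the quality leader (model-a-70b) is eliminated by latency and memory, and model-c-7b by license terms; the constraint box does the work before any leaderboard comparison happens.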
Avoid the Usual Mistake
Starting from brand preference leads to expensive dead ends. Start from workload requirements, then select the smallest model that clears quality and safety thresholds.
Key Point: Pick by workload fit, not leaderboard rank alone.
Benchmark Literacy
Benchmarks are useful but easy to misuse without context.
What Public Benchmarks Tell You
Benchmarks like MMLU, GPQA, HumanEval, and arena-style preference scores provide directional signal about broad capability classes: factual reasoning, code synthesis, and general response quality. They are useful for shortlisting, not for production sign-off.
What They Miss
Most benchmarks do not reflect your exact prompt format, retrieval chain behavior, safety policy, JSON schema requirements, or latency objectives. A model can score highly in public rankings and still fail your business-critical edge cases.
Private Eval Design
Build a reproducible internal set with representative task slices: normal cases, hard cases, and failure cases. Track both quality metrics (accuracy, groundedness, format adherence) and operational metrics (latency, token usage, timeout rate).
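A minimal eval summary along these lines might aggregate both metric groups from per-case records. The record fields and values here are toy data, not a standard schema:

```python
import json
import statistics

# Toy private-eval records: one per test case, mixing quality and operational signals.
results = [
    {"correct": True,  "valid_json": True,  "latency_ms": 320,  "tokens": 210},
    {"correct": True,  "valid_json": False, "latency_ms": 480,  "tokens": 305},
    {"correct": False, "valid_json": True,  "latency_ms": 2100, "tokens": 512},
    {"correct": True,  "valid_json": True,  "latency_ms": 290,  "tokens": 180},
]

def summarize(rs):
    lat = sorted(r["latency_ms"] for r in rs)
    # Nearest-rank p95 over the sorted latencies (small samples are coarse here).
    p95_idx = min(len(lat) - 1, round(0.95 * (len(lat) - 1)))
    return {
        "accuracy": sum(r["correct"] for r in rs) / len(rs),
        "format_adherence": sum(r["valid_json"] for r in rs) / len(rs),
        "p95_latency_ms": lat[p95_idx],
        "mean_tokens": statistics.mean(r["tokens"] for r in rs),
    }

print(json.dumps(summarize(results), indent=2))
```

Keeping quality and operational metrics in one summary makes regressions visible in a single diff when you re-run the set against a new model version.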
Decision Policy
Use public benchmarks to cut from many candidates to a shortlist. Use private evaluations to pick the default model and fallback model. This two-stage flow prevents both hype-driven choices and overfitting to tiny internal samples.
Key Point: Your private evaluation set should be the tie-breaker every time.
Licenses and Commercial Risk
License terms directly affect distribution and legal exposure.
Open Source vs Open Weight
Open-source software licenses and model-weight licenses are different legal objects. A permissive codebase does not imply permissive model redistribution rights. Always evaluate model terms directly from the model repository.
License Risk Surfaces
Review these before launch: commercial-use clauses, redistribution permissions, attribution requirements, acceptable-use restrictions, and derivative model constraints. These terms determine whether your deployment pattern is allowed.
Governance Pattern
Treat model licenses like third-party dependencies: capture terms in registry metadata, gate promotion to production on policy review, and maintain an approved model list by use case (internal tooling, external SaaS, embedded distribution).
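One way to sketch this governance pattern is a registry entry with license fields and a promotion gate per use case. The model ID, license name, and field names below are placeholders, not real license terms:

```python
# Sketch of registry metadata gating promotion by use case.
# All terms shown are illustrative; always read the actual model license.
REGISTRY = {
    "acme-model-7b-v2": {
        "license": "example-community-license",
        "commercial_use": True,
        "redistribution": False,
        "attribution_required": True,
        "approved_use_cases": {"internal_tooling", "external_saas"},
    },
}

def promotion_allowed(model_id, use_case):
    entry = REGISTRY.get(model_id)
    if entry is None:
        return False  # unknown models never reach production
    # Embedded distribution requires redistribution rights; SaaS requires commercial terms.
    if use_case == "embedded_distribution" and not entry["redistribution"]:
        return False
    return use_case in entry["approved_use_cases"] and entry["commercial_use"]

print(promotion_allowed("acme-model-7b-v2", "external_saas"))          # True
print(promotion_allowed("acme-model-7b-v2", "embedded_distribution"))  # False
```

The point of the gate is that the same weights can be fine for internal tooling yet disallowed for embedded distribution, so approval is per deployment pattern, not per model.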
Operational Habit
When upgrading model versions, re-check license and policy fields even within the same family. Terms can differ between variants, repos, or release generations.
Key Point: Licensing is a deployment decision, not a footnote.
Context Length and Memory Fit
Long-context models are useful only if they fit your runtime constraints.
Context Tradeoff
Larger context windows increase memory pressure and can reduce effective throughput. In practice, quality gains from more context flatten quickly when prompts are noisy or retrieval quality is weak.
Memory Planning
Capacity planning covers more than model weights: you must also budget for KV-cache growth, concurrent requests, and response length. A model that "fits" at low concurrency can fail under real traffic when sequence lengths spike.
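The KV-cache budget can be estimated with back-of-envelope arithmetic: two tensors (K and V) per layer, per token, sized by KV heads and head dimension. The example configuration below is an assumed 8B-class model (32 layers, 8 KV heads, head dim 128, fp16); substitute the numbers from your model's card:

```python
def kv_cache_gib(layers, kv_heads, head_dim, bytes_per_elem, seq_len, concurrency):
    """Back-of-envelope KV-cache size: 2 tensors (K and V) per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * concurrency / 2**30

# Assumed 8B-class config with grouped-query attention; real models vary.
print(kv_cache_gib(32, 8, 128, 2, seq_len=8192, concurrency=16))  # 16.0 GiB beyond weights
```

Even with grouped-query attention shrinking the KV heads, 16 concurrent 8K-token sessions consume tens of gigabytes on top of the weights, which is exactly the failure mode the load test above should catch.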
Practical Fit Test
Before rollout, run load tests across realistic prompt-length distributions: short, medium, and worst-case sessions. Track p95 latency, token throughput, and out-of-memory events under target concurrency.
Design Rule
Default to the smallest context and model size that passes acceptance tests. Increase context only when measurable task quality improves.
Key Point: Right-sized context usually beats max context in production.
Multimodal and Tool Use Readiness
Model capability should include structured outputs and tool reliability.
Beyond Chat Quality
If your product uses tool calls, structured JSON, or image/document input, evaluate those paths directly. Good free-form chat quality does not guarantee robust tool orchestration.
Common Failure Modes
Typical issues include malformed JSON, wrong tool selection, partial argument filling, hallucinated function names, and weak refusal behavior on unsafe requests. These are operational failures, not just answer-quality issues.
Reliability Tests
Track schema validity rate, tool-call precision/recall, retry rate, and policy-compliance rate under adversarial prompts. Add regression suites for multilingual and long-context variants of the same task.
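Schema validity rate, the simplest of these metrics, can be computed by attempting a strict parse of every raw output. The sample outputs and required keys below are invented for illustration:

```python
import json

# Hypothetical raw tool-call outputs from a model under test.
raw_outputs = [
    '{"tool": "search", "args": {"query": "llama license"}}',
    '{"tool": "search", "args": {"query": "context window"}',  # truncated JSON
    '{"tool": "lookup_weather"}',                               # missing required "args"
]

REQUIRED_KEYS = {"tool", "args"}

def is_valid(raw):
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

validity_rate = sum(is_valid(o) for o in raw_outputs) / len(raw_outputs)
print(f"schema validity rate: {validity_rate:.2f}")  # 0.33 on this toy sample
```

The same loop extends to tool-call precision/recall once each case carries an expected tool label to compare against.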
Production Safeguards
Use strict output parsers, constrained tool routing, and fallback policies for invalid outputs. The best model choice is the one that minimizes incident frequency in your full pipeline.
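A strict parser with constrained routing and a fallback might look like the sketch below. The tool names and fallback shape are assumptions for illustration, not a fixed API:

```python
import json

def parse_tool_call(raw, allowed_tools, fallback):
    """Strict parse with constrained routing: anything invalid falls back
    to a safe default instead of reaching downstream systems."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(call, dict):
        return fallback
    if call.get("tool") not in allowed_tools:   # blocks hallucinated function names
        return fallback
    if not isinstance(call.get("args"), dict):  # blocks partial argument filling
        return fallback
    return call

# Illustrative fallback: hand the turn back to the user rather than guess.
FALLBACK = {"tool": "ask_user", "args": {"reason": "could not interpret request"}}
print(parse_tool_call('{"tool": "delete_db", "args": {}}', {"search", "ask_user"}, FALLBACK))
# routed to fallback: "delete_db" is not an allowed tool
```

Because the parser is the last line of defense, its rejection rate is itself a useful model-comparison metric: the better model is the one that triggers the fallback least often.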
Key Point: Operational quality is as important as linguistic quality.
A Repeatable Model Selection Workflow
Use a simple lifecycle: shortlist, test, stage, observe, rotate.
Workflow
1) Scope: define task, latency SLO, budget, and policy constraints.
2) Shortlist: pick 3-5 candidates from model cards and benchmark signal.
3) Evaluate: run private quality + reliability + cost tests.
4) Stage: deploy top 2 models behind routing controls for canary traffic.
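The four steps above can be encoded as explicit stages on a run object, so every selection leaves an auditable record. Class and model names here are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class SelectionRun:
    """One pass through the lifecycle: scope -> shortlist -> evaluate -> stage."""
    task: str
    latency_slo_ms: int
    candidates: list = field(default_factory=list)    # 3-5 picks from model cards
    eval_results: dict = field(default_factory=dict)  # private quality/reliability/cost scores
    staged: list = field(default_factory=list)        # top 2 behind canary routing

    def shortlist(self, models):
        self.candidates = models[:5]

    def evaluate(self, scores):
        self.eval_results = {m: scores[m] for m in self.candidates if m in scores}

    def stage(self):
        ranked = sorted(self.eval_results, key=self.eval_results.get, reverse=True)
        self.staged = ranked[:2]  # winner plus warm fallback

run = SelectionRun(task="structured extraction", latency_slo_ms=500)
run.shortlist(["m-a", "m-b", "m-c", "m-d"])
run.evaluate({"m-a": 0.81, "m-b": 0.88, "m-c": 0.79, "m-d": 0.84})
run.stage()
print(run.staged)  # ['m-b', 'm-d']
```

Keeping both staged models in the record is what makes the later promotion and rollback steps mechanical rather than ad hoc.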
Promotion Gate
Promote only when one model clearly wins across acceptance thresholds and incident profile. Keep the runner-up as a warm fallback for rollback and surge handling.
Rotation Policy
Re-run evaluation on a cadence and on trigger events: major model release, performance drift, cost shift, or policy change. Version your evaluation set so model changes remain comparable over time.
Long-Term Discipline
Model choice should be observable and reversible. Treat every upgrade like a production dependency change with rollout plans, monitoring, and rollback criteria.
Key Point: Model selection is a process, not a one-time event.