Ch 4 — The Open Source Model Landscape

Llama, Mistral, Qwen, Gemma, DeepSeek, Phi — how to compare open models pragmatically
Foundation
Journey: Scan → Compare → License → Fit → Ship
The Open Model Landscape Today
Capability now spans from tiny edge models to frontier reasoning models that can run on private infrastructure.
Capability Coverage
Open models now support core product categories: chat assistants, retrieval-backed Q&A, structured extraction, code generation, document workflows, and increasingly multimodal tasks. For many teams, the baseline question is no longer "can open models do this?" but "which open model tier is sufficient?"
Deployment Surfaces
The same project can run across very different environments: laptop prototyping, private VPC inference, regulated on-prem clusters, and edge endpoints. This flexibility is why open-weight adoption accelerated in enterprise and developer tooling.
What Actually Changed
Three layers matured together: models (better quality-per-parameter), runtimes (faster serving and quantized inference), and evaluation (stronger task-specific testing). Progress in one layer amplified progress in the others.
Practical Implication
Model selection became an engineering discipline: define task requirements, shortlist candidates, evaluate against private data, and deploy with rollback strategy. Teams that skip this process usually overpay or overfit to hype.
Key Point: The constraint has shifted from model availability to model selection discipline.
Families and Positioning
Different model families optimize for different tradeoffs: speed, quality, context, and licensing.
Common Families
Llama/Qwen: broad ecosystem coverage and strong community tooling.
Mistral: efficient deployment profiles and practical serving behavior.
Gemma/Phi: strong small-model efficiency for constrained hardware.
DeepSeek variants: often prioritized for reasoning-heavy workloads.
Why Families Differ
Each family makes different tradeoffs in tokenizer design, context scaling strategy, instruction tuning, and release policy. That means two models with similar parameter counts can behave very differently on the same production task.
Selection Lens
Use a four-axis filter before testing: quality target (task success), latency budget (p95 constraints), memory fit (VRAM/RAM), and license fit (distribution/commercial terms). Then compare model families inside that constraint box.
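The four-axis filter can be sketched as a hard-constraint pass followed by a size-ordered comparison. The model names, scores, and thresholds below are hypothetical placeholders, not measurements:

```python
# Hypothetical illustration of the four-axis filter: apply hard constraints first,
# then compare only the models that land inside the constraint box.
candidates = [
    # name, private-eval quality, p95 latency (ms), VRAM (GB), commercial use allowed
    {"name": "model-a-70b", "quality": 0.91, "p95_ms": 1400, "vram_gb": 80, "commercial_ok": True},
    {"name": "model-b-8b",  "quality": 0.84, "p95_ms": 350,  "vram_gb": 10, "commercial_ok": True},
    {"name": "model-c-7b",  "quality": 0.86, "p95_ms": 400,  "vram_gb": 9,  "commercial_ok": False},
]

REQUIREMENTS = {"min_quality": 0.80, "max_p95_ms": 500, "max_vram_gb": 24}

def fits(m):
    return (m["quality"] >= REQUIREMENTS["min_quality"]
            and m["p95_ms"] <= REQUIREMENTS["max_p95_ms"]
            and m["vram_gb"] <= REQUIREMENTS["max_vram_gb"]
            and m["commercial_ok"])

# Prefer the smallest model that clears every threshold.
shortlist = sorted((m for m in candidates if fits(m)), key=lambda m: m["vram_gb"])
print([m["name"] for m in shortlist])  # ['model-b-8b']
```

Note that the quality leader (model-a-70b) is eliminated by latency and memory, and model-c-7b by license terms; the constraint box does the work before any leaderboard comparison happens.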
Avoid the Usual Mistake
Starting from brand preference leads to expensive dead ends. Start from workload requirements, then select the smallest model that clears quality and safety thresholds.
Key Point: Pick by workload fit, not leaderboard rank alone.
Benchmark Literacy
Benchmarks are useful but easy to misuse without context.
What Public Benchmarks Tell You
Benchmarks like MMLU, GPQA, HumanEval, and arena-style preference scores provide directional signal about broad capability classes: factual reasoning, code synthesis, and general response quality. They are useful for shortlisting, not for production sign-off.
What They Miss
Most benchmarks do not reflect your exact prompt format, retrieval chain behavior, safety policy, JSON schema requirements, or latency objectives. A model can score highly in public rankings and still fail your business-critical edge cases.
Private Eval Design
Build a reproducible internal set with representative task slices: normal cases, hard cases, and failure cases. Track both quality metrics (accuracy, groundedness, format adherence) and operational metrics (latency, token usage, timeout rate).
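A minimal eval summary along these lines might aggregate both metric groups from per-case records. The record fields and values here are toy data, not a standard schema:

```python
import json
import statistics

# Toy private-eval records: one per test case, mixing quality and operational signals.
results = [
    {"correct": True,  "valid_json": True,  "latency_ms": 320,  "tokens": 210},
    {"correct": True,  "valid_json": False, "latency_ms": 480,  "tokens": 305},
    {"correct": False, "valid_json": True,  "latency_ms": 2100, "tokens": 512},
    {"correct": True,  "valid_json": True,  "latency_ms": 290,  "tokens": 180},
]

def summarize(rs):
    lat = sorted(r["latency_ms"] for r in rs)
    # Nearest-rank p95 over the sorted latencies (small samples are coarse here).
    p95_idx = min(len(lat) - 1, round(0.95 * (len(lat) - 1)))
    return {
        "accuracy": sum(r["correct"] for r in rs) / len(rs),
        "format_adherence": sum(r["valid_json"] for r in rs) / len(rs),
        "p95_latency_ms": lat[p95_idx],
        "mean_tokens": statistics.mean(r["tokens"] for r in rs),
    }

print(json.dumps(summarize(results), indent=2))
```

Keeping quality and operational metrics in one summary makes regressions visible in a single diff when you re-run the set against a new model version.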
Decision Policy
Use public benchmarks to cut from many candidates to a shortlist. Use private evaluations to pick the default model and fallback model. This two-stage flow prevents both hype-driven choices and overfitting to tiny internal samples.
Key Point: Your private evaluation set should be the tie-breaker every time.
Licenses and Commercial Risk
License terms directly affect distribution and legal exposure.
Open Source vs Open Weight
Open-source software licenses and model-weight licenses are different legal objects. A permissive codebase does not imply permissive model redistribution rights. Always evaluate model terms directly from the model repository.
License Risk Surfaces
Review these before launch: commercial-use clauses, redistribution permissions, attribution requirements, acceptable-use restrictions, and derivative model constraints. These terms determine whether your deployment pattern is allowed.
Governance Pattern
Treat model licenses like third-party dependencies: capture terms in registry metadata, gate promotion to production on policy review, and maintain an approved model list by use case (internal tooling, external SaaS, embedded distribution).
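One way to sketch this governance pattern is a registry entry with license fields and a promotion gate per use case. The model ID, license name, and field names below are placeholders, not real license terms:

```python
# Sketch of registry metadata gating promotion by use case.
# All terms shown are illustrative; always read the actual model license.
REGISTRY = {
    "acme-model-7b-v2": {
        "license": "example-community-license",
        "commercial_use": True,
        "redistribution": False,
        "attribution_required": True,
        "approved_use_cases": {"internal_tooling", "external_saas"},
    },
}

def promotion_allowed(model_id, use_case):
    entry = REGISTRY.get(model_id)
    if entry is None:
        return False  # unknown models never reach production
    # Embedded distribution requires redistribution rights; SaaS requires commercial terms.
    if use_case == "embedded_distribution" and not entry["redistribution"]:
        return False
    return use_case in entry["approved_use_cases"] and entry["commercial_use"]

print(promotion_allowed("acme-model-7b-v2", "external_saas"))          # True
print(promotion_allowed("acme-model-7b-v2", "embedded_distribution"))  # False
```

The point of the gate is that the same weights can be fine for internal tooling yet disallowed for embedded distribution, so approval is per deployment pattern, not per model.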
Operational Habit
When upgrading model versions, re-check license and policy fields even within the same family. Terms can differ between variants, repos, or release generations.
Key Point: Licensing is a deployment decision, not a footnote.
Context Length and Memory Fit
Long-context models are useful only if they fit your runtime constraints.
Context Tradeoff
Larger context windows increase memory pressure and can reduce effective throughput. In practice, quality gains from more context flatten quickly when prompts are noisy or retrieval quality is weak.
Memory Planning
Capacity planning covers more than model weights: you must also budget for KV-cache growth, concurrent requests, and response length. A model that "fits" at low concurrency can fail under real traffic when sequence lengths spike.
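The KV-cache budget can be estimated with back-of-envelope arithmetic: two tensors (K and V) per layer, per token, sized by KV heads and head dimension. The example configuration below is an assumed 8B-class model (32 layers, 8 KV heads, head dim 128, fp16); substitute the numbers from your model's card:

```python
def kv_cache_gib(layers, kv_heads, head_dim, bytes_per_elem, seq_len, concurrency):
    """Back-of-envelope KV-cache size: 2 tensors (K and V) per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * concurrency / 2**30

# Assumed 8B-class config with grouped-query attention; real models vary.
print(kv_cache_gib(32, 8, 128, 2, seq_len=8192, concurrency=16))  # 16.0 GiB beyond weights
```

Even with grouped-query attention shrinking the KV heads, 16 concurrent 8K-token sessions consume tens of gigabytes on top of the weights, which is exactly the failure mode the load test above should catch.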
Practical Fit Test
Before rollout, run load tests across realistic prompt-length distributions: short, medium, and worst-case sessions. Track p95 latency, token throughput, and out-of-memory events under target concurrency.
Design Rule
Default to the smallest context and model size that passes acceptance tests. Increase context only when measurable task quality improves.
Key Point: Right-sized context usually beats max context in production.
Multimodal and Tool Use Readiness
Model capability should include structured outputs and tool reliability.
Beyond Chat Quality
If your product uses tool calls, structured JSON, or image/document input, evaluate those paths directly. Good free-form chat quality does not guarantee robust tool orchestration.
Common Failure Modes
Typical issues include malformed JSON, wrong tool selection, partial argument filling, hallucinated function names, and weak refusal behavior on unsafe requests. These are operational failures, not just answer-quality issues.
Reliability Tests
Track schema validity rate, tool-call precision/recall, retry rate, and policy-compliance rate under adversarial prompts. Add regression suites for multilingual and long-context variants of the same task.
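Schema validity rate, the simplest of these metrics, can be computed by attempting a strict parse of every raw output. The sample outputs and required keys below are invented for illustration:

```python
import json

# Hypothetical raw tool-call outputs from a model under test.
raw_outputs = [
    '{"tool": "search", "args": {"query": "llama license"}}',
    '{"tool": "search", "args": {"query": "context window"}',  # truncated JSON
    '{"tool": "lookup_weather"}',                               # missing required "args"
]

REQUIRED_KEYS = {"tool", "args"}

def is_valid(raw):
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

validity_rate = sum(is_valid(o) for o in raw_outputs) / len(raw_outputs)
print(f"schema validity rate: {validity_rate:.2f}")  # 0.33 on this toy sample
```

The same loop extends to tool-call precision/recall once each case carries an expected tool label to compare against.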
Production Safeguards
Use strict output parsers, constrained tool routing, and fallback policies for invalid outputs. The best model choice is the one that minimizes incident frequency in your full pipeline.
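A strict parser with constrained routing and a fallback might look like the sketch below. The tool names and fallback shape are assumptions for illustration, not a fixed API:

```python
import json

def parse_tool_call(raw, allowed_tools, fallback):
    """Strict parse with constrained routing: anything invalid falls back
    to a safe default instead of reaching downstream systems."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(call, dict):
        return fallback
    if call.get("tool") not in allowed_tools:   # blocks hallucinated function names
        return fallback
    if not isinstance(call.get("args"), dict):  # blocks partial argument filling
        return fallback
    return call

# Illustrative fallback: hand the turn back to the user rather than guess.
FALLBACK = {"tool": "ask_user", "args": {"reason": "could not interpret request"}}
print(parse_tool_call('{"tool": "delete_db", "args": {}}', {"search", "ask_user"}, FALLBACK))
# routed to fallback: "delete_db" is not an allowed tool
```

Because the parser is the last line of defense, its rejection rate is itself a useful model-comparison metric: the better model is the one that triggers the fallback least often.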
Key Point: Operational quality is as important as linguistic quality.
A Repeatable Model Selection Workflow
Use a simple lifecycle: shortlist, test, stage, observe, rotate.
Workflow
1) Scope: define task, latency SLO, budget, and policy constraints.
2) Shortlist: pick 3-5 candidates from model cards and benchmark signal.
3) Evaluate: run private quality + reliability + cost tests.
4) Stage: deploy top 2 models behind routing controls for canary traffic.
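The four steps above can be encoded as explicit stages on a run object, so every selection leaves an auditable record. Class and model names here are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class SelectionRun:
    """One pass through the lifecycle: scope -> shortlist -> evaluate -> stage."""
    task: str
    latency_slo_ms: int
    candidates: list = field(default_factory=list)    # 3-5 picks from model cards
    eval_results: dict = field(default_factory=dict)  # private quality/reliability/cost scores
    staged: list = field(default_factory=list)        # top 2 behind canary routing

    def shortlist(self, models):
        self.candidates = models[:5]

    def evaluate(self, scores):
        self.eval_results = {m: scores[m] for m in self.candidates if m in scores}

    def stage(self):
        ranked = sorted(self.eval_results, key=self.eval_results.get, reverse=True)
        self.staged = ranked[:2]  # winner plus warm fallback

run = SelectionRun(task="structured extraction", latency_slo_ms=500)
run.shortlist(["m-a", "m-b", "m-c", "m-d"])
run.evaluate({"m-a": 0.81, "m-b": 0.88, "m-c": 0.79, "m-d": 0.84})
run.stage()
print(run.staged)  # ['m-b', 'm-d']
```

Keeping both staged models in the record is what makes the later promotion and rollback steps mechanical rather than ad hoc.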
Promotion Gate
Promote only when one model clearly wins across acceptance thresholds and incident profile. Keep the runner-up as a warm fallback for rollback and surge handling.
Rotation Policy
Re-run evaluation on a cadence and on trigger events: major model release, performance drift, cost shift, or policy change. Version your evaluation set so model changes remain comparable over time.
Long-Term Discipline
Model choice should be observable and reversible. Treat every upgrade like a production dependency change with rollout plans, monitoring, and rollback criteria.
Key Point: Model selection is a process, not a one-time event.