Ch 14 — The Open Source AI Stack End-to-End

Reference architecture from model selection to application delivery and operations
Pipeline stages: Select → Adapt → Serve → Compose → Operate
Architecture Overview
End-to-end systems require alignment across model, serving, app, and ops layers.
Layer Model
Model and adapter choices should map cleanly to runtime and application constraints. Validate each layer with tests that reflect end-user behavior.
Failure Pattern
Many teams optimize one layer while ignoring bottlenecks in another. Assign clear ownership so incidents are resolved quickly.
System View
End-to-end quality depends on cross-layer alignment; strong models cannot compensate for weak retrieval, unstable serving, or missing observability. Track cross-layer metrics to prevent hidden bottlenecks.
Key Point: End-to-end performance is limited by the weakest operational layer.
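One way to make the weakest-layer idea concrete is to compare each layer against a shared budget and report the worst offender. The sketch below is illustrative only: the `LayerMetric` fields, the layer names, and the budget values are assumptions, not figures from this chapter.

```python
from dataclasses import dataclass

@dataclass
class LayerMetric:
    layer: str             # hypothetical layer name, e.g. "retrieval"
    p95_latency_ms: float  # 95th percentile latency observed for the layer
    error_rate: float      # fraction of failed requests, 0.0 to 1.0

def weakest_layer(metrics, latency_budget_ms, max_error_rate):
    """Return the layer that most exceeds its budget, or None if all pass."""
    def overshoot(m):
        # Ratio above 1.0 means the layer is over budget on that signal.
        return max(m.p95_latency_ms / latency_budget_ms,
                   m.error_rate / max_error_rate)
    worst = max(metrics, key=overshoot)
    return worst.layer if overshoot(worst) > 1.0 else None

metrics = [
    LayerMetric("retrieval", 120.0, 0.002),
    LayerMetric("serving", 900.0, 0.004),
    LayerMetric("application", 40.0, 0.001),
]
print(weakest_layer(metrics, latency_budget_ms=500.0, max_error_rate=0.01))
# prints "serving"
```

A single shared budget is a simplification; in practice each layer would likely carry its own targets.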
Model and Data Layer
Foundation starts with model choice and data strategy.
Inputs
Base model family, licensing posture, domain data, and evaluation criteria. Update runbooks as architecture and traffic patterns evolve.
Outputs
A validated model+adapter package with clear constraints and known behavior.
Data Contract
Define data ownership, refresh cadence, and quality checks so model behavior remains explainable as source data evolves.
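A data contract of this kind can be written down as a small, checkable record. The following is a minimal sketch under stated assumptions: the field names, the `support_tickets` source, the `data-platform` owner, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class DataContract:
    source: str         # hypothetical dataset name
    owner: str          # team accountable for the source
    max_age: timedelta  # refresh cadence: data older than this is stale
    min_rows: int       # a deliberately simple quality floor

def contract_violations(contract, last_refreshed, row_count, now):
    """Return a list of violated contract terms (empty if healthy)."""
    problems = []
    if now - last_refreshed > contract.max_age:
        problems.append("stale")
    if row_count < contract.min_rows:
        problems.append("too_few_rows")
    return problems

contract = DataContract("support_tickets", "data-platform",
                        timedelta(days=7), 1000)
now = datetime(2024, 1, 15, tzinfo=timezone.utc)
refreshed = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(contract_violations(contract, refreshed, 5000, now))
# prints ['stale']
```

Real quality checks would usually go beyond row counts (schema, null rates, distribution shift); the structure stays the same.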
Key Point: Version model artifacts with the same rigor as application releases.
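As one possible reading of this key point, a model+adapter package can be pinned by content hash so that any change to weights or config produces a new identifier. This is a sketch, not a prescribed format: `artifact_manifest`, `manifest_id`, and the file names are hypothetical.

```python
import hashlib
import json

def artifact_manifest(name, version, files):
    """Build a manifest; `files` maps file names to their bytes content."""
    entries = {fname: hashlib.sha256(data).hexdigest()
               for fname, data in sorted(files.items())}
    return {"name": name, "version": version, "files": entries}

def manifest_id(manifest):
    """Short, stable identifier derived from the manifest content."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

base = artifact_manifest("support-bot", "1.2.0",
                         {"adapter.bin": b"weights-v1", "config.json": b"{}"})
changed = artifact_manifest("support-bot", "1.2.0",
                            {"adapter.bin": b"weights-v2", "config.json": b"{}"})
# Same version label, different adapter bytes: the identifiers differ.
print(manifest_id(base) != manifest_id(changed))
```

Recording the identifier alongside each deployment makes it possible to tell exactly which artifact served a given request.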
Serving Layer
Serving translates model quality into user-facing responsiveness.
Core Decisions
Engine selection, autoscaling policy, caching strategy, and safety middleware placement.
Reliability
Implement request limits, timeouts, and fallback routes before broad traffic ramp-up.
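The three controls named here can be sketched in a few lines using only the standard library. Everything specific is an assumption for illustration: `generate_with_fallback`, the 8,000-character limit, and the stub model functions stand in for whatever serving engine is actually in use.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

MAX_PROMPT_CHARS = 8_000  # hypothetical request limit
_pool = ThreadPoolExecutor(max_workers=4)

def generate_with_fallback(prompt, primary, fallback, timeout_s=2.0):
    """Call `primary`; on timeout or error, route to `fallback` instead."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt exceeds request limit")
    future = _pool.submit(primary, prompt)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a thread that is already running is not interrupted.
        future.cancel()
        return fallback(prompt)
    except Exception:
        return fallback(prompt)

def slow_primary(prompt):   # stub standing in for a large model
    time.sleep(0.5)
    return "primary: " + prompt

def fast_fallback(prompt):  # stub standing in for a smaller model
    return "fallback: " + prompt

print(generate_with_fallback("hello", slow_primary, fast_fallback, timeout_s=0.05))
# prints "fallback: hello"
```

Because the timed-out call keeps occupying a worker thread, a production version would also need cancellation support in the client or a bounded queue in front of the pool.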
Serving Contract
Expose stable interfaces for models, retries, and fallback behavior so application teams can evolve safely without runtime surprises.
Key Point: Resilience patterns should be in place before growth, not after incidents.
Application Layer
Application logic orchestrates retrieval, tools, and response formatting.
Core Components
Prompt templates, retrieval flows, tool execution, and policy enforcement.
Design Rule
Keep business logic outside model prompts wherever possible for testability.
Application Risk
Prompt-only business logic is brittle under model updates. Critical rules should live in deterministic application code.
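To illustrate the difference, consider a rule such as a refund ceiling. Rather than stating it in the prompt and hoping the model complies, the application can check the model's proposed action in ordinary code. The rule, the limit, and the action schema below are hypothetical.

```python
MAX_AUTO_REFUND = 50.0  # hypothetical business rule, owned by application code

def apply_policy(proposed):
    """Enforce critical rules on a model-proposed action.

    `proposed` is a dict parsed from model output,
    e.g. {"action": "refund", "amount": 80.0}.
    """
    is_refund = proposed.get("action") == "refund"
    if is_refund and proposed.get("amount", 0.0) > MAX_AUTO_REFUND:
        # The model may suggest anything; the limit is enforced here.
        return {"action": "escalate", "reason": "refund_over_limit"}
    return proposed

print(apply_policy({"action": "refund", "amount": 80.0}))
# prints {'action': 'escalate', 'reason': 'refund_over_limit'}
```

A function like this can be unit-tested without calling a model, and it behaves the same after a model or prompt update.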
Key Point: Good application design reduces prompt brittleness.
Observability and Evaluation
Operational visibility turns AI systems from opaque to manageable.
Essential Signals
Latency, token usage, cost, quality scores, refusal rates, and incident traces.
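Several of these signals can be derived from one per-request record. The sketch below assumes a hypothetical `RequestRecord` shape and a flat per-token price; a self-hosted stack would more likely derive cost from GPU time.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class RequestRecord:
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    refused: bool  # True if the model declined to answer

def summarize(records, usd_per_1k_tokens):
    """Aggregate per-request records into a small set of dashboard signals."""
    total_tokens = sum(r.prompt_tokens + r.completion_tokens for r in records)
    return {
        "requests": len(records),
        "median_latency_ms": median(r.latency_ms for r in records),
        "total_tokens": total_tokens,
        "cost_usd": total_tokens / 1000 * usd_per_1k_tokens,
        "refusal_rate": sum(r.refused for r in records) / len(records),
    }

records = [
    RequestRecord(100.0, 400, 600, False),
    RequestRecord(200.0, 500, 500, True),
    RequestRecord(300.0, 700, 300, False),
]
summary = summarize(records, usd_per_1k_tokens=0.002)
print(summary["median_latency_ms"], summary["total_tokens"])
# prints "200.0 3000"
```

Quality scores and incident traces need their own sources (eval jobs, tracing), so they are left out of this record on purpose.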
Iteration Loop
Use offline evaluation and online telemetry together to guide model and prompt updates.
Eval Cadence
Run scheduled regression checks and event-driven checks after model, prompt, or data changes to catch drift before user impact grows.
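A regression check of this kind can be as simple as comparing candidate scores against a stored baseline with a tolerance. The suite names, scores, and the 0.02 tolerance here are illustrative assumptions.

```python
def regression_failures(baseline, candidate, tolerance=0.02):
    """Compare eval scores; both dicts map suite name to a score in [0, 1].

    A suite fails if it is missing from the candidate or if its score
    drops more than `tolerance` below the baseline.
    """
    failures = {}
    for suite, base_score in baseline.items():
        new_score = candidate.get(suite)
        if new_score is None or new_score < base_score - tolerance:
            failures[suite] = (base_score, new_score)
    return failures

baseline = {"qa_accuracy": 0.80, "safety": 0.95}
candidate = {"qa_accuracy": 0.79, "safety": 0.90}
print(regression_failures(baseline, candidate))
# prints {'safety': (0.95, 0.9)}
```

The same function can run on a schedule and as a gate in the pipeline that ships model, prompt, or data changes.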
Key Point: Observability is the control plane for sustained quality.
Cost and Capacity Planning
Cost efficiency is an architecture output, not a post-hoc fix.
Levers
Model size, quantization, batching strategy, cache hit rate, and routing policy all impact cost.
Planning
Model monthly spend across normal and peak traffic with explicit margin buffers.
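A rough spend model for a self-hosted deployment might size replicas from request rate and multiply by GPU price. All numbers below (throughput per replica, hourly price, 20% headroom, 730 hours per month) are assumptions chosen for illustration, not benchmarks.

```python
import math

def monthly_gpu_cost(rps, rps_per_replica, usd_per_gpu_hour,
                     headroom=0.2, hours_per_month=730):
    """Return (replicas, monthly cost in USD) for a given request rate.

    `headroom` is the explicit margin buffer added on top of the traffic.
    """
    replicas = math.ceil(rps * (1 + headroom) / rps_per_replica)
    return replicas, replicas * usd_per_gpu_hour * hours_per_month

normal = monthly_gpu_cost(rps=8, rps_per_replica=7, usd_per_gpu_hour=2.0)
peak = monthly_gpu_cost(rps=20, rps_per_replica=7, usd_per_gpu_hour=2.0)
print(normal)  # prints (2, 2920.0)
print(peak)    # prints (4, 5840.0)
```

The gap between the two results is the amount autoscaling can save if replicas are released outside peak hours.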
Cost Guardrail
Set budget thresholds with automated alerts and routing controls so cost anomalies are handled as operational incidents, not month-end surprises.
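One simple guardrail compares spend to date against a straight-line budget and maps the overshoot to an action. The ratios and action names in this sketch are hypothetical; a real system would wire them to an alerting tool and a router.

```python
def budget_action(spend_to_date, monthly_budget, day_of_month,
                  days_in_month=30, warn_ratio=1.1, throttle_ratio=1.5):
    """Return "ok", "alert", or "route_to_cheaper_model"."""
    # Straight-line expectation: budget consumed evenly across the month.
    expected = monthly_budget * day_of_month / days_in_month
    ratio = spend_to_date / expected
    if ratio >= throttle_ratio:
        return "route_to_cheaper_model"
    if ratio >= warn_ratio:
        return "alert"
    return "ok"

# With a 3000 USD budget, expected spend on day 10 is 1000 USD.
print(budget_action(900.0, 3000.0, day_of_month=10))   # prints "ok"
print(budget_action(1200.0, 3000.0, day_of_month=10))  # prints "alert"
print(budget_action(1600.0, 3000.0, day_of_month=10))  # prints "route_to_cheaper_model"
```

Straight-line pacing ignores weekly traffic cycles, so the warning ratio may need to be looser early in the month.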
Key Point: Capacity planning should include failure scenarios and peak bursts.
Reference Implementation Roadmap
Implement in phases to reduce risk and rework.
Phases
Prototype, staged validation, guarded production rollout, then continuous optimization.
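The phase sequence can be encoded so that promotion only happens when an explicit gate passes. This is a minimal sketch; the phase identifiers and the single boolean gate are simplifying assumptions.

```python
PHASES = [
    "prototype",
    "staged_validation",
    "guarded_rollout",
    "continuous_optimization",
]

def next_phase(current, gates_passed):
    """Advance one phase only if the current phase's gates passed."""
    idx = PHASES.index(current)  # raises ValueError for an unknown phase
    if gates_passed and idx < len(PHASES) - 1:
        return PHASES[idx + 1]
    return current

print(next_phase("staged_validation", gates_passed=True))   # prints "guarded_rollout"
print(next_phase("staged_validation", gates_passed=False))  # prints "staged_validation"
```

In practice the gate would be a bundle of checks, such as eval regressions, latency targets, and budget status, rather than one flag.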
Governance
Document ownership by layer so incidents and improvements move quickly.
Ownership Model
Assign clear accountable owners for model, retrieval, serving, and product layers to reduce handoff delays during incidents and upgrades.
Key Point: A phased roadmap keeps systems reliable while velocity stays high.