Ch 13 — Building Multimodal Applications

Architecture patterns, pipelines, production deployment, and real-world case studies
High-level architecture: Design → Ingest → Process → Output → Monitor → Scale
Architecture Patterns
Common patterns for multimodal applications
Pattern 1: Direct VLM Call
Simplest pattern: send image + prompt to a VLM API, get text response. Good for prototyping and low-volume use cases.

Use cases: Image captioning, visual Q&A, content moderation, screenshot analysis
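In code, Pattern 1 is little more than a request payload. A minimal sketch, assuming an OpenAI-style chat endpoint that accepts base64 data URLs (the model name and helper function are illustrative):

```python
import base64

def build_vision_message(image_bytes: bytes, prompt: str) -> dict:
    """Package an image and a prompt into the chat-message shape
    used by OpenAI-style vision endpoints (Pattern 1)."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }

# The actual call (requires an API key; shown for shape only):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[build_vision_message(open("photo.jpg", "rb").read(),
#                                    "Describe this image.")],
# )

msg = build_vision_message(b"\xff\xd8fake-jpeg", "Caption this image.")
```

Everything past this payload is ordinary API plumbing, which is why Pattern 1 is the right place to validate an idea.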
Pattern 2: Multimodal RAG
Retrieve relevant images/documents from a vector database, then send them to a VLM with the user’s query. Grounds responses in your data.

Use cases: Document Q&A, product search, knowledge bases with diagrams
Pattern 3: Pipeline Orchestration
Chain multiple models: OCR → extraction → validation → structured output. Each step uses the best model for that task.

Use cases: Invoice processing, medical image analysis, quality inspection
Pattern 4: Multimodal Agent
VLM with tool access: can take screenshots, browse the web, interact with UIs, and reason about visual feedback in a loop.

Use cases: Web automation, GUI testing, visual debugging, interactive assistants
Key insight: Start with Pattern 1 (direct VLM call) to validate your idea; move to Pattern 2 (RAG) when you need grounding, Pattern 3 (pipelines) when you need reliability, and Pattern 4 (agents) when you need autonomy.
Input Processing Pipeline
Handling images, documents, video, and audio at scale
Image Processing
// Image ingestion pipeline
1. Validate   Check format, size, corruption
2. Resize     Fit within model's max resolution
3. Optimize   Compress for API transfer (JPEG 85%)
4. Metadata   Extract EXIF, dimensions, file size
5. Cache      Store processed version for reuse

// Document processing
1. Parse      PDF → pages (pdf2image / PyMuPDF)
2. Detect     Identify text, tables, charts, images
3. Extract    OCR text, screenshot visual elements
4. Chunk      Split into semantic units
5. Embed      Generate multimodal embeddings
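Step 1 of the ingestion pipeline can be sketched with nothing but magic-byte checks, as a stand-in for a real validator such as Pillow's `Image.verify()` (the format set and size limit are illustrative):

```python
def validate_image(data: bytes, max_bytes: int = 20_000_000) -> str:
    """Ingestion step 1: reject empty, oversized, or unrecognized
    files before they reach the model API. Returns the detected
    format, or raises ValueError."""
    MAGIC = {
        b"\xff\xd8\xff": "jpeg",
        b"\x89PNG\r\n\x1a\n": "png",
        b"GIF8": "gif",
        b"RIFF": "webp",  # crude: RIFF also starts WAV/AVI containers
    }
    if not data:
        raise ValueError("empty file")
    if len(data) > max_bytes:
        raise ValueError("file exceeds size limit")
    for magic, fmt in MAGIC.items():
        if data.startswith(magic):
            return fmt
    raise ValueError("unrecognized or corrupted image")
```

A production validator would also decode the image fully (catching truncated files) before the resize and compress steps.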
Video & Audio Processing
Video: Extract key frames (1 per second or scene-change detection), process as image sequence. For long videos, use Gemini’s native video input.
Audio: Transcribe with Whisper, then process as text. For native audio understanding, use GPT-4o audio mode.
Multimodal documents: Presentations, web pages with embedded media — decompose into constituent modalities, process each, then recombine.
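The "one frame per second" strategy maps directly onto ffmpeg's `fps` video filter. A sketch that only builds the command (file paths are placeholders; run it via `subprocess.run`):

```python
def keyframe_cmd(video_path: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg command that samples `fps` frames per second,
    writing numbered JPEGs ready to feed a VLM as an image sequence."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",           # sample rate; scene detection is an alternative
        f"{out_dir}/frame_%04d.jpg",
    ]

cmd = keyframe_cmd("talk.mp4", "frames")
```

For scene-change detection instead of fixed-rate sampling, ffmpeg's `select` filter or a library like PySceneDetect are common choices.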
Key insight: Input processing is 60% of the work in multimodal applications. A robust ingestion pipeline that handles edge cases (corrupted files, unusual formats, very large images) is more important than the model choice.
Structured Output & Validation
Getting reliable, parseable results from VLMs
Structured Extraction
// Forcing structured output from VLMs

Approach 1: JSON mode
  GPT-4V:  response_format={"type": "json_object"}
  Gemini:  response_mime_type="application/json"

Approach 2: Schema enforcement
  Define a Pydantic model / JSON Schema
  Use the instructor library for validation
  Auto-retry on schema violations

Approach 3: Function calling
  Define tools with typed parameters
  VLM extracts data into function args
  Most reliable for complex schemas
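Approach 2 boils down to parse, validate, retry. A minimal stdlib sketch of that loop (in practice Pydantic plus the instructor library replace the hand-written checks; the invoice fields are illustrative):

```python
import json

# Hypothetical invoice schema: field name -> required type
REQUIRED = {"vendor": str, "total": (int, float), "invoice_date": str}

def parse_invoice(raw: str) -> dict:
    """Parse model output as JSON and enforce a minimal schema by hand."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    for field, typ in REQUIRED.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], typ):
            raise ValueError(f"wrong type for {field}")
    return data

def extract_with_retry(call_model, max_attempts: int = 3) -> dict:
    """Auto-retry on schema violations, as described above.
    `call_model` returns the raw model text; a real version would
    feed the error back into the re-prompt."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return parse_invoice(call_model())
        except ValueError as e:  # JSONDecodeError is a ValueError subclass
            last_err = e
    raise RuntimeError(f"schema never satisfied: {last_err}")
```

instructor wires the same loop into the OpenAI client with a Pydantic model as the schema, which is why the text recommends it.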
Validation Strategies
Schema validation: Ensure output matches expected JSON schema
Range checks: Extracted numbers within plausible ranges
Cross-field consistency: Line items sum to total, dates are sequential
Confidence scoring: Ask the model to rate its confidence (1–10)
Human-in-the-loop: Route low-confidence results to human review
Dual extraction: Run twice with different prompts, compare results
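The cross-field check from the list above ("line items sum to total") is a few lines of arithmetic. A sketch, with illustrative field names and tolerance:

```python
def check_invoice_consistency(extracted: dict, tol: float = 0.01) -> list[str]:
    """Cross-field validation: return a list of human-readable issues
    (empty list means the extraction passed)."""
    issues = []
    line_sum = sum(item["amount"] for item in extracted.get("line_items", []))
    total = extracted.get("total", 0.0)
    if abs(line_sum - total) > tol:
        issues.append(f"line items sum to {line_sum}, but total says {total}")
    if total < 0:
        issues.append("negative total is implausible")
    return issues
```

Routing any non-empty issue list to human review is the human-in-the-loop pattern from the same list.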
Key insight: Never trust VLM output without validation. Structured output + schema enforcement + confidence scoring + human review for edge cases is the production-grade pattern. The “instructor” library makes this easy in Python.
Real-World Case Studies
How companies use multimodal AI in production
Document Intelligence
Insurance claims processing: Upload claim photos + forms → VLM extracts damage assessment, policy numbers, and claim details → auto-populate claim system. Result: 80% reduction in manual data entry, 3x faster processing.
E-Commerce
Visual product search: Customer uploads a photo of a product they like → CLIP embeddings find similar products in catalog → VLM generates comparison descriptions. Result: 25% increase in conversion from search.
Manufacturing
Quality inspection: Camera captures product images on assembly line → fine-tuned VLM detects defects (scratches, misalignment, color variation) → auto-reject or flag for review. Result: 99.2% defect detection vs 95% human baseline.
Healthcare
Medical image triage: Upload X-ray/CT scan → VLM provides preliminary analysis and flags urgent findings → radiologist reviews flagged cases first. Result: 40% reduction in time-to-diagnosis for critical cases.
Key insight: The highest-value multimodal applications automate visual tasks that currently require expensive human experts: radiologists, quality inspectors, claims adjusters. The ROI is clearest when you can quantify the cost of human visual analysis.
Performance & Latency
Making multimodal apps fast enough for production
Latency Optimization
Image preprocessing: Resize and compress before sending to API (saves 50–80% transfer time)
Resolution selection: Use low-res for classification, high-res only when needed
Streaming: Stream VLM responses for perceived speed improvement
Caching: Cache results for identical or near-identical inputs
Parallel processing: Send multiple images concurrently
Edge preprocessing: Run lightweight models on-device, send only when needed
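The caching bullet can start as simply as hashing the (image, prompt) pair and reusing the stored response on an exact match. A sketch (near-duplicate matching would need perceptual hashing or embeddings instead):

```python
import hashlib

class VLMCache:
    """Exact-match result cache: identical (image, prompt) pairs reuse
    the earlier response instead of triggering a new API call."""
    def __init__(self):
        self._store = {}

    def _key(self, image_bytes: bytes, prompt: str) -> str:
        h = hashlib.sha256()
        h.update(image_bytes)
        h.update(prompt.encode())
        return h.hexdigest()

    def get_or_call(self, image_bytes: bytes, prompt: str, call):
        """`call(image_bytes, prompt)` is the expensive VLM request."""
        k = self._key(image_bytes, prompt)
        if k not in self._store:
            self._store[k] = call(image_bytes, prompt)
        return self._store[k]
```

In production this dictionary would be Redis or similar, with a TTL so stale results expire.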
Latency Targets by Use Case
// Typical latency targets
Real-time chat:       <2s first token
Document processing:  <10s per page
Batch analysis:       <60s per item
Quality inspection:   <500ms (edge)
Search:               <200ms

// Bottleneck breakdown (API call)
Image upload:  200-500ms    (depends on size)
Processing:    500-2000ms   (model inference)
Generation:    500-3000ms   (response length)
Total:         1.2-5.5s typical
Key insight: Image upload time is often the biggest latency contributor. Compress images to JPEG 85% quality and resize to the model’s native resolution before sending. This alone can cut total latency by 40%.
Monitoring & Observability
Tracking quality, cost, and performance in production
Key Metrics
Accuracy: Spot-check VLM outputs against ground truth (sample 1–5%)
Latency: P50, P95, P99 response times by endpoint
Cost: Tokens consumed per request, daily/monthly spend
Error rate: API failures, timeout rate, malformed outputs
Confidence distribution: Track model confidence scores over time
Human escalation rate: What % of requests need human review?
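P50/P95/P99 are easy to compute yourself before you have a metrics backend. A nearest-rank sketch over a list of recorded latencies:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: adequate for a latency dashboard.
    p is in percent, e.g. 95 for P95."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

latencies_ms = list(range(1, 101))  # pretend request latencies
p95 = percentile(latencies_ms, 95)
```

Grafana/Datadog compute these from histograms; the definition is the same.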
Observability Stack
LangSmith / LangFuse: LLM-specific tracing with prompt, response, and cost tracking
Weights & Biases: Experiment tracking for model evaluation
Custom dashboards: Grafana/Datadog for latency, error rates, cost
Alerting: Spike in error rate, latency degradation, cost anomalies
Data flywheel: Log inputs/outputs for continuous evaluation and fine-tuning
Key insight: The data flywheel is the most valuable long-term asset. Log every VLM input/output pair (with user consent). Use this data for evaluation, fine-tuning, and identifying failure patterns. Teams that build this flywheel improve 10x faster.
Scaling to Production
From prototype to millions of requests
Production Checklist
// Multimodal app production readiness
✓ Input validation   File type, size limits, corruption check
✓ Rate limiting      Per-user and global request limits
✓ Error handling     Retry logic, fallback models, graceful degradation
✓ Cost controls      Budget alerts, token limits per request
✓ Security           Input sanitization, output filtering, PII detection
✓ Monitoring         Latency, accuracy, cost, error rate dashboards
✓ Evaluation         Automated eval suite, regression testing
✓ Fallback           Degrade gracefully when the model is unavailable
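The error-handling and fallback rows of the checklist can be sketched as one loop over providers (the provider callables and retry counts are placeholders; a real version would add exponential backoff):

```python
def call_with_fallback(providers, request, max_retries: int = 2):
    """Try each provider up to max_retries times, then fall back to
    the next one; raise only when every provider has failed."""
    for provider in providers:
        for _ in range(max_retries):
            try:
                return provider(request)
            except Exception:
                continue  # retry this provider, then move on
    raise RuntimeError("all providers failed; degrade gracefully upstream")
```

Degrading gracefully upstream might mean returning a cached result or a "analysis unavailable" placeholder rather than a 500.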
Scaling Strategies
Model routing: Cheap model for easy tasks, expensive for hard ones
Async processing: Queue non-urgent requests for batch processing
Multi-provider: Use multiple API providers for redundancy and cost optimization
Caching layer: Cache results for repeated or similar inputs
Progressive enhancement: Start with fast/cheap analysis, upgrade if needed
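Model routing can begin life as a hard-coded policy. A toy sketch (the model names and thresholds are invented for illustration; real routers often use a classifier or the cheap model's own confidence):

```python
def route_model(task: str, image_count: int, needs_reasoning: bool) -> str:
    """Cheap model for easy tasks, expensive model for hard ones."""
    if task == "classification" and not needs_reasoning:
        return "small-vlm"      # fast, cheap, good enough for labels
    if image_count > 5 or needs_reasoning:
        return "frontier-vlm"   # multi-image or multi-step reasoning
    return "mid-vlm"            # sensible default
```

Progressive enhancement is the same idea applied in sequence: run "small-vlm" first and escalate only when its answer fails validation.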
Pro tip: The most common production failure is cost overrun, not technical failure. Set hard budget limits, implement model routing, and monitor cost per request from day one. A single misconfigured high-res mode can 10x your bill overnight.
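A hard budget limit, as the pro tip recommends, is a few lines of state. A sketch with illustrative per-token prices and budget:

```python
class BudgetGuard:
    """Hard spend ceiling: once estimated cost would cross the daily
    budget, refuse the request instead of overrunning."""
    def __init__(self, daily_budget_usd: float):
        self.budget = daily_budget_usd
        self.spent = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               in_price: float = 5e-6, out_price: float = 15e-6) -> None:
        """Record a request's estimated cost, or raise if over budget.
        (Per-token prices here are placeholders, not any vendor's rates.)"""
        cost = input_tokens * in_price + output_tokens * out_price
        if self.spent + cost > self.budget:
            raise RuntimeError("daily budget exceeded; request refused")
        self.spent += cost
```

Pair this with per-request token limits so one misconfigured high-res call cannot exhaust the day's budget alone.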
Key Takeaways
Building multimodal applications that work
Essential Concepts
1. Four patterns: Direct VLM call → Multimodal RAG → Pipeline orchestration → Multimodal agent

2. Input processing is 60% of the work: Robust ingestion pipelines matter more than model choice

3. Always validate outputs: Schema enforcement + confidence scoring + human review

4. Data flywheel: Log everything, use it for eval and fine-tuning

5. Cost is the #1 production risk: Model routing and budget controls from day one
The Build Playbook
Week 1: Prototype with API (GPT-4V/Gemini), validate the use case
Week 2–4: Build input pipeline, structured output, basic eval
Month 2: Add monitoring, cost controls, error handling
Month 3: Fine-tune open-source model if needed, implement model routing
Ongoing: Data flywheel → better evals → better models → better product
Next up: Chapter 14 explores multimodal agents — AI systems that can see, hear, reason, and take actions in the real world.