Ch 13 — Building Multimodal Applications

Architecture patterns, pipelines, production deployment, and real-world case studies
High-level architecture: Design → Ingest → Process → Output → Monitor → Scale
Architecture Patterns
Common patterns for multimodal applications
Pattern 1: Direct VLM Call
Simplest pattern: send image + prompt to a VLM API, get text response. Good for prototyping and low-volume use cases.

Use cases: Image captioning, visual Q&A, content moderation, screenshot analysis
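In code, Pattern 1 is little more than a request payload. A minimal sketch, assuming an OpenAI-style chat endpoint that accepts base64 data URLs (the model name and helper function are illustrative):

```python
import base64

def build_vision_message(image_bytes: bytes, prompt: str) -> dict:
    """Package an image and a prompt into the chat-message shape
    used by OpenAI-style vision endpoints (Pattern 1)."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }

# The actual call (requires an API key; shown for shape only):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[build_vision_message(open("photo.jpg", "rb").read(),
#                                    "Describe this image.")],
# )

msg = build_vision_message(b"\xff\xd8fake-jpeg", "Caption this image.")
```

Everything past this payload is ordinary API plumbing, which is why Pattern 1 is the right place to validate an idea.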
Pattern 2: Multimodal RAG
Retrieve relevant images/documents from a vector database, then send them to a VLM with the user’s query. Grounds responses in your data.

Use cases: Document Q&A, product search, knowledge bases with diagrams
Pattern 3: Pipeline Orchestration
Chain multiple models: OCR → extraction → validation → structured output. Each step uses the best model for that task.

Use cases: Invoice processing, medical image analysis, quality inspection
Pattern 4: Multimodal Agent
VLM with tool access: can take screenshots, browse the web, interact with UIs, and reason about visual feedback in a loop.

Use cases: Web automation, GUI testing, visual debugging, interactive assistants
Key insight: Start with Pattern 1 (direct VLM call) to validate your idea; move to Pattern 2 (RAG) when you need grounding, Pattern 3 (pipelines) when you need reliability, and Pattern 4 (agents) when you need autonomy.
Input Processing Pipeline
Handling images, documents, video, and audio at scale
Image Processing
// Image ingestion pipeline
1. Validate   Check format, size, corruption
2. Resize     Fit within model's max resolution
3. Optimize   Compress for API transfer (JPEG 85%)
4. Metadata   Extract EXIF, dimensions, file size
5. Cache      Store processed version for reuse

// Document processing
1. Parse      PDF → pages (pdf2image / PyMuPDF)
2. Detect     Identify text, tables, charts, images
3. Extract    OCR text, screenshot visual elements
4. Chunk      Split into semantic units
5. Embed      Generate multimodal embeddings
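Step 1 of the ingestion pipeline can be sketched with nothing but magic-byte checks, as a stand-in for a real validator such as Pillow's `Image.verify()` (the format set and size limit are illustrative):

```python
def validate_image(data: bytes, max_bytes: int = 20_000_000) -> str:
    """Ingestion step 1: reject empty, oversized, or unrecognized
    files before they reach the model API. Returns the detected
    format, or raises ValueError."""
    MAGIC = {
        b"\xff\xd8\xff": "jpeg",
        b"\x89PNG\r\n\x1a\n": "png",
        b"GIF8": "gif",
        b"RIFF": "webp",  # crude: RIFF also starts WAV/AVI containers
    }
    if not data:
        raise ValueError("empty file")
    if len(data) > max_bytes:
        raise ValueError("file exceeds size limit")
    for magic, fmt in MAGIC.items():
        if data.startswith(magic):
            return fmt
    raise ValueError("unrecognized or corrupted image")
```

A production validator would also decode the image fully (catching truncated files) before the resize and compress steps.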
Video & Audio Processing
Video: Extract key frames (1 per second or scene-change detection), process as image sequence. For long videos, use Gemini’s native video input.
Audio: Transcribe with Whisper, then process as text. For native audio understanding, use GPT-4o audio mode.
Multimodal documents: Presentations, web pages with embedded media — decompose into constituent modalities, process each, then recombine.
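The "one frame per second" strategy maps directly onto ffmpeg's `fps` video filter. A sketch that only builds the command (file paths are placeholders; run it via `subprocess.run`):

```python
def keyframe_cmd(video_path: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg command that samples `fps` frames per second,
    writing numbered JPEGs ready to feed a VLM as an image sequence."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",           # sample rate; scene detection is an alternative
        f"{out_dir}/frame_%04d.jpg",
    ]

cmd = keyframe_cmd("talk.mp4", "frames")
```

For scene-change detection instead of fixed-rate sampling, ffmpeg's `select` filter or a library like PySceneDetect are common choices.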
Key insight: Input processing is 60% of the work in multimodal applications. A robust ingestion pipeline that handles edge cases (corrupted files, unusual formats, very large images) is more important than the model choice.
Structured Output & Validation
Getting reliable, parseable results from VLMs
Structured Extraction
// Forcing structured output from VLMs

Approach 1: JSON mode
  GPT-4V:  response_format={"type": "json_object"}
  Gemini:  response_mime_type="application/json"

Approach 2: Schema enforcement
  Define a Pydantic model / JSON Schema
  Use the instructor library for validation
  Auto-retry on schema violations

Approach 3: Function calling
  Define tools with typed parameters
  VLM extracts data into function args
  Most reliable for complex schemas
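Approach 2 boils down to parse, validate, retry. A minimal stdlib sketch of that loop (in practice Pydantic plus the instructor library replace the hand-written checks; the invoice fields are illustrative):

```python
import json

# Hypothetical invoice schema: field name -> required type
REQUIRED = {"vendor": str, "total": (int, float), "invoice_date": str}

def parse_invoice(raw: str) -> dict:
    """Parse model output as JSON and enforce a minimal schema by hand."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    for field, typ in REQUIRED.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], typ):
            raise ValueError(f"wrong type for {field}")
    return data

def extract_with_retry(call_model, max_attempts: int = 3) -> dict:
    """Auto-retry on schema violations, as described above.
    `call_model` returns the raw model text; a real version would
    feed the error back into the re-prompt."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return parse_invoice(call_model())
        except ValueError as e:  # JSONDecodeError is a ValueError subclass
            last_err = e
    raise RuntimeError(f"schema never satisfied: {last_err}")
```

instructor wires the same loop into the OpenAI client with a Pydantic model as the schema, which is why the text recommends it.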
Validation Strategies
Schema validation: Ensure output matches expected JSON schema
Range checks: Extracted numbers within plausible ranges
Cross-field consistency: Line items sum to total, dates are sequential
Confidence scoring: Ask the model to rate its confidence (1–10)
Human-in-the-loop: Route low-confidence results to human review
Dual extraction: Run twice with different prompts, compare results
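The cross-field check from the list above ("line items sum to total") is a few lines of arithmetic. A sketch, with illustrative field names and tolerance:

```python
def check_invoice_consistency(extracted: dict, tol: float = 0.01) -> list[str]:
    """Cross-field validation: return a list of human-readable issues
    (empty list means the extraction passed)."""
    issues = []
    line_sum = sum(item["amount"] for item in extracted.get("line_items", []))
    total = extracted.get("total", 0.0)
    if abs(line_sum - total) > tol:
        issues.append(f"line items sum to {line_sum}, but total says {total}")
    if total < 0:
        issues.append("negative total is implausible")
    return issues
```

Routing any non-empty issue list to human review is the human-in-the-loop pattern from the same list.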
Key insight: Never trust VLM output without validation. Structured output + schema enforcement + confidence scoring + human review for edge cases is the production-grade pattern. The “instructor” library makes this easy in Python.
Real-World Case Studies
How companies use multimodal AI in production
Document Intelligence
Insurance claims processing: Upload claim photos + forms → VLM extracts damage assessment, policy numbers, and claim details → auto-populate claim system. Result: 80% reduction in manual data entry, 3x faster processing.
E-Commerce
Visual product search: Customer uploads a photo of a product they like → CLIP embeddings find similar products in catalog → VLM generates comparison descriptions. Result: 25% increase in conversion from search.
Manufacturing
Quality inspection: Camera captures product images on assembly line → fine-tuned VLM detects defects (scratches, misalignment, color variation) → auto-reject or flag for review. Result: 99.2% defect detection vs 95% human baseline.
Healthcare
Medical image triage: Upload X-ray/CT scan → VLM provides preliminary analysis and flags urgent findings → radiologist reviews flagged cases first. Result: 40% reduction in time-to-diagnosis for critical cases.
Key insight: The highest-value multimodal applications automate visual tasks that currently require expensive human experts: radiologists, quality inspectors, claims adjusters. The ROI is clearest when you can quantify the cost of human visual analysis.
Performance & Latency
Making multimodal apps fast enough for production
Latency Optimization
Image preprocessing: Resize and compress before sending to API (saves 50–80% transfer time)
Resolution selection: Use low-res for classification, high-res only when needed
Streaming: Stream VLM responses for perceived speed improvement
Caching: Cache results for identical or near-identical inputs
Parallel processing: Send multiple images concurrently
Edge preprocessing: Run lightweight models on-device, send only when needed
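The caching bullet can start as simply as hashing the (image, prompt) pair and reusing the stored response on an exact match. A sketch (near-duplicate matching would need perceptual hashing or embeddings instead):

```python
import hashlib

class VLMCache:
    """Exact-match result cache: identical (image, prompt) pairs reuse
    the earlier response instead of triggering a new API call."""
    def __init__(self):
        self._store = {}

    def _key(self, image_bytes: bytes, prompt: str) -> str:
        h = hashlib.sha256()
        h.update(image_bytes)
        h.update(prompt.encode())
        return h.hexdigest()

    def get_or_call(self, image_bytes: bytes, prompt: str, call):
        """`call(image_bytes, prompt)` is the expensive VLM request."""
        k = self._key(image_bytes, prompt)
        if k not in self._store:
            self._store[k] = call(image_bytes, prompt)
        return self._store[k]
```

In production this dictionary would be Redis or similar, with a TTL so stale results expire.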
Latency Targets by Use Case
// Typical latency targets
Real-time chat:       <2s first token
Document processing:  <10s per page
Batch analysis:       <60s per item
Quality inspection:   <500ms (edge)
Search:               <200ms

// Bottleneck breakdown (API call)
Image upload:  200-500ms    (depends on size)
Processing:    500-2000ms   (model inference)
Generation:    500-3000ms   (response length)
Total:         1.2-5.5s typical
Key insight: Image upload time is often the biggest latency contributor. Compress images to JPEG 85% quality and resize to the model’s native resolution before sending. This alone can cut total latency by 40%.
Monitoring & Observability
Tracking quality, cost, and performance in production
Key Metrics
Accuracy: Spot-check VLM outputs against ground truth (sample 1–5%)
Latency: P50, P95, P99 response times by endpoint
Cost: Tokens consumed per request, daily/monthly spend
Error rate: API failures, timeout rate, malformed outputs
Confidence distribution: Track model confidence scores over time
Human escalation rate: What % of requests need human review?
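P50/P95/P99 are easy to compute yourself before you have a metrics backend. A nearest-rank sketch over a list of recorded latencies:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: adequate for a latency dashboard.
    p is in percent, e.g. 95 for P95."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

latencies_ms = list(range(1, 101))  # pretend request latencies
p95 = percentile(latencies_ms, 95)
```

Grafana/Datadog compute these from histograms; the definition is the same.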
Observability Stack
LangSmith / LangFuse: LLM-specific tracing with prompt, response, and cost tracking
Weights & Biases: Experiment tracking for model evaluation
Custom dashboards: Grafana/Datadog for latency, error rates, cost
Alerting: Spike in error rate, latency degradation, cost anomalies
Data flywheel: Log inputs/outputs for continuous evaluation and fine-tuning
Key insight: The data flywheel is the most valuable long-term asset. Log every VLM input/output pair (with user consent). Use this data for evaluation, fine-tuning, and identifying failure patterns. Teams that build this flywheel improve 10x faster.
Scaling to Production
From prototype to millions of requests
Production Checklist
// Multimodal app production readiness
✓ Input validation   File type, size limits, corruption check
✓ Rate limiting      Per-user and global request limits
✓ Error handling     Retry logic, fallback models, graceful degradation
✓ Cost controls      Budget alerts, token limits per request
✓ Security           Input sanitization, output filtering, PII detection
✓ Monitoring         Latency, accuracy, cost, error rate dashboards
✓ Evaluation         Automated eval suite, regression testing
✓ Fallback           Degrade gracefully when the model is unavailable
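The error-handling and fallback rows of the checklist can be sketched as one loop over providers (the provider callables and retry counts are placeholders; a real version would add exponential backoff):

```python
def call_with_fallback(providers, request, max_retries: int = 2):
    """Try each provider up to max_retries times, then fall back to
    the next one; raise only when every provider has failed."""
    for provider in providers:
        for _ in range(max_retries):
            try:
                return provider(request)
            except Exception:
                continue  # retry this provider, then move on
    raise RuntimeError("all providers failed; degrade gracefully upstream")
```

Degrading gracefully upstream might mean returning a cached result or a "analysis unavailable" placeholder rather than a 500.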
Scaling Strategies
Model routing: Cheap model for easy tasks, expensive for hard ones
Async processing: Queue non-urgent requests for batch processing
Multi-provider: Use multiple API providers for redundancy and cost optimization
Caching layer: Cache results for repeated or similar inputs
Progressive enhancement: Start with fast/cheap analysis, upgrade if needed
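Model routing can begin life as a hard-coded policy. A toy sketch (the model names and thresholds are invented for illustration; real routers often use a classifier or the cheap model's own confidence):

```python
def route_model(task: str, image_count: int, needs_reasoning: bool) -> str:
    """Cheap model for easy tasks, expensive model for hard ones."""
    if task == "classification" and not needs_reasoning:
        return "small-vlm"      # fast, cheap, good enough for labels
    if image_count > 5 or needs_reasoning:
        return "frontier-vlm"   # multi-image or multi-step reasoning
    return "mid-vlm"            # sensible default
```

Progressive enhancement is the same idea applied in sequence: run "small-vlm" first and escalate only when its answer fails validation.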
Pro tip: The most common production failure is cost overrun, not technical failure. Set hard budget limits, implement model routing, and monitor cost per request from day one. A single misconfigured high-res mode can 10x your bill overnight.
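A hard budget limit, as the pro tip recommends, is a few lines of state. A sketch with illustrative per-token prices and budget:

```python
class BudgetGuard:
    """Hard spend ceiling: once estimated cost would cross the daily
    budget, refuse the request instead of overrunning."""
    def __init__(self, daily_budget_usd: float):
        self.budget = daily_budget_usd
        self.spent = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               in_price: float = 5e-6, out_price: float = 15e-6) -> None:
        """Record a request's estimated cost, or raise if over budget.
        (Per-token prices here are placeholders, not any vendor's rates.)"""
        cost = input_tokens * in_price + output_tokens * out_price
        if self.spent + cost > self.budget:
            raise RuntimeError("daily budget exceeded; request refused")
        self.spent += cost
```

Pair this with per-request token limits so one misconfigured high-res call cannot exhaust the day's budget alone.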
Key Takeaways
Building multimodal applications that work
Essential Concepts
1. Four patterns: Direct VLM call → Multimodal RAG → Pipeline orchestration → Multimodal agent

2. Input processing is 60% of the work: Robust ingestion pipelines matter more than model choice

3. Always validate outputs: Schema enforcement + confidence scoring + human review

4. Data flywheel: Log everything, use it for eval and fine-tuning

5. Cost is the #1 production risk: Model routing and budget controls from day one
The Build Playbook
Week 1: Prototype with API (GPT-4V/Gemini), validate the use case
Week 2–4: Build input pipeline, structured output, basic eval
Month 2: Add monitoring, cost controls, error handling
Month 3: Fine-tune open-source model if needed, implement model routing
Ongoing: Data flywheel → better evals → better models → better product
Next up: Chapter 14 explores multimodal agents — AI systems that can see, hear, reason, and take actions in the real world.