Ch 10 — Production Deployment & Serving

Model merging, quantization, vLLM, Ollama, LoRA hot-swapping, continuous fine-tuning, and cost analysis
High Level: Merging → Quantize → vLLM → Local → LoRA Swap → Continuous → Cost
Model Merging
Combining multiple fine-tuned models into one — no extra training needed
Why Merge Models?
The idea: You fine-tuned Model A for coding and Model B for medical Q&A. Merging combines their strengths into a single model without retraining. No GPU needed — merging is a CPU operation on the weight tensors.

Real-world impact: Merged models regularly top the Open LLM Leaderboard. The community has turned merging into a “sport” — sharing recipes and discovering that certain merge combinations outperform the individual models.

When to merge: You have multiple fine-tunes of the same base model, each specialized for a different task, and you want a single generalist model that handles all tasks.
Merging Methods
SLERP (Spherical Linear Interpolation): Smoothly interpolates between two models while preserving magnitude in high-dimensional weight space. Most popular method. Limited to 2 models at a time. One parameter: t (0.0–1.0, blend ratio).

TIES-Merging (Yadav et al. 2023): Handles 2+ models. Trims redundant parameters, resolves sign conflicts between models, then averages. Key parameter: density (0.2–1.0, how much to trim).

DARE (Yu et al. 2024): Randomly drops delta parameters before merging as regularization. Enables blending more models without interference. Works well with TIES.

Linear: Simple weighted average. merged = w1*A + w2*B. Fast but less sophisticated. Good baseline.
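The geometry behind SLERP can be sketched in a few lines. This is a minimal illustration of the interpolation formula on plain Python lists, not a production merger; real tools such as mergekit apply it per weight tensor across the whole model.

```python
import math

def slerp(t, a, b, eps=1e-6):
    """Spherical linear interpolation between two weight vectors.

    Unlike a plain weighted average, SLERP walks along the arc between
    the two points, preserving magnitude in high-dimensional space.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    if omega < eps:  # nearly parallel vectors: fall back to linear blend
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]

# t=0.0 recovers model A's weights; t=1.0 recovers model B's
print(slerp(0.0, [1.0, 0.0], [0.0, 1.0]))
print(slerp(0.5, [1.0, 0.0], [0.0, 1.0]))
```

Note that at t=0.5 the result keeps unit norm, whereas a linear average of these two unit vectors would shrink to length ~0.71.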
mergekit (Arcee AI) is the standard tool. Define a YAML config specifying models, method, and parameters. Run mergekit-yaml config.yaml ./output. Supports SLERP, TIES, DARE, linear, passthrough, and per-layer recipes. Works on CPU — no GPU required for merging.
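A mergekit SLERP config looks roughly like the following. Field names follow mergekit's documented YAML schema; the model names, layer count, and t value are placeholders for illustration.

```python
# Hypothetical mergekit config for a two-model SLERP merge.
# Model names are placeholders; layer_range assumes a 32-layer 7B model.
slerp_config = """\
slices:
  - sources:
      - model: org/coder-7b
        layer_range: [0, 32]
      - model: org/medqa-7b
        layer_range: [0, 32]
merge_method: slerp
base_model: org/coder-7b
parameters:
  t: 0.5          # blend ratio: 0.0 = all A, 1.0 = all B
dtype: bfloat16
"""

with open("config.yaml", "w") as f:
    f.write(slerp_config)
# then run: mergekit-yaml config.yaml ./output
```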
Quantization for Deployment
Shrinking model size 2–4x while preserving quality
The Three Formats
GGUF (llama.cpp): Cross-platform, single-file format. Runs on CPU, Apple Silicon, and GPU with offloading. No calibration needed. Best for local/edge deployment. Quantization levels: Q2_K through Q8_0. Q4_K_M is the sweet spot (~4.5 GB for 7B, ~97% quality retention).

GPTQ (Frantar et al. 2022): GPU-focused, designed for NVIDIA inference. Requires a calibration dataset (512–2048 samples). 20% faster throughput on NVIDIA GPUs. Supports 2/3/4/8-bit. Best for production GPU serving.

AWQ (Lin et al. 2024): Activation-aware — identifies and protects critical weights. Lowest accuracy loss of the three (~0.7% drop vs ~2.8% for GPTQ at 4-bit). Best for quality-critical applications and instruction-tuned models.
Size & Quality Comparison (7B Model)
bf16 (no quantization): ~14 GB, 100% quality
GGUF Q8_0: ~7.5 GB, ~99% quality
AWQ 4-bit: ~4 GB, ~98.5% quality
GGUF Q4_K_M: ~4.5 GB, ~97% quality
GPTQ 4-bit: ~4 GB, ~96% quality
GGUF Q2_K: ~3 GB, ~88% quality
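The sizes above follow directly from bits per weight. A rough estimator, with assumed effective bit rates (K-quants mix precisions, so e.g. Q4_K_M averages closer to 4.8 bits than 4; figures are ballpark, ignoring embedding tables and metadata):

```python
def quantized_size_gb(n_params, bits_per_weight):
    """Rough on-disk size: parameters x effective bits per weight.

    Ignores metadata, scales stored per block, and any layers kept at
    higher precision, so treat results as estimates only.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Assumed effective bit rates for a 7B model
for name, bits in [("bf16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q2_K", 2.6)]:
    print(f"{name:8s} ~{quantized_size_gb(7e9, bits):.1f} GB")
```

bf16 lands on exactly 14 GB for 7B parameters (2 bytes each), matching the table; the quantized estimates fall within a few hundred MB of the figures above.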
GGUF: CPU + GPU + Apple Silicon · no calibration · single file · best for local/edge
GPTQ: NVIDIA GPU only · needs calibration · fastest throughput · best for GPU serving
AWQ: GPU-focused · activation-aware · best accuracy · best for quality-critical
bf16/fp16: no quantization · full quality · 2x memory · best for when you can afford it
vLLM — Production GPU Serving
The standard inference engine for high-throughput LLM serving
Why vLLM?
PagedAttention: vLLM’s core innovation. Manages KV cache like virtual memory pages, eliminating memory waste from fragmentation. Achieves 2–4x higher throughput than naive HuggingFace inference.

Continuous batching: New requests join the batch as old ones finish, maximizing GPU utilization. No waiting for the slowest request in a batch.

OpenAI-compatible API: Drop-in replacement for the OpenAI API. /v1/chat/completions, /v1/completions, /v1/models. Switch from OpenAI to self-hosted by changing the base URL.

Quantization support: Serves AWQ, GPTQ, and bitsandbytes quantized models natively. Also supports FP8 on H100 GPUs.
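Because the API is OpenAI-compatible, a request to a self-hosted vLLM server is byte-for-byte the same shape as a request to OpenAI; only the base URL changes. A sketch using only the standard library (the host, port, and model name are assumptions: vLLM's default port is 8000, and the model field must match whatever the server was launched with):

```python
import json

# Chat request payload: identical shape to the OpenAI API.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize PagedAttention in one line."}],
    "max_tokens": 128,
}

def post_chat(payload, base_url="http://localhost:8000/v1"):
    """POST to /v1/chat/completions on a running vLLM server.

    Uses stdlib urllib so the sketch has no dependencies; the official
    `openai` client works the same way once base_url points here.
    """
    import urllib.request
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

print(json.dumps(payload, indent=2))
```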
Key Features
Tensor parallelism: Split a large model across multiple GPUs. Serve 70B models on 2x A100 or 4x A100.

Speculative decoding: Use a small draft model to propose tokens, verify with the large model. 2–3x faster for long outputs.

Prefix caching: Cache KV states for common system prompts. Eliminates redundant computation when many requests share the same prefix.

Structured output: Guided generation with JSON schemas, regex patterns, or grammar constraints. Guarantees valid output format.
vLLM is the recommended production engine. HuggingFace’s TGI (Text Generation Inference) entered maintenance mode in December 2025, with HuggingFace recommending vLLM and SGLang as replacements. vLLM has the largest community, best documentation, and broadest hardware support (NVIDIA, AMD, Intel, TPU).
Local Serving — Ollama & llama.cpp
Running your fine-tuned model on consumer hardware
Ollama
What: A command-line tool for running LLMs locally. One-command install, one-command run. Manages model downloads, quantization, and serving automatically.

API: OpenAI-compatible REST API on localhost:11434. Works with LangChain, LangGraph, and any OpenAI client library by changing the base URL.

GGUF support: Runs any of the 45,000+ GGUF models on HuggingFace Hub. Create a Modelfile to import your fine-tuned GGUF model.

Performance: 20+ tokens/second on Apple M-series. GPU offloading on NVIDIA. Runs on CPU (slower but works everywhere).
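Importing a fine-tuned GGUF into Ollama comes down to writing a small Modelfile. A minimal sketch; the file path, parameter values, and system prompt below are placeholders:

```python
# Hypothetical Modelfile for registering a local GGUF with Ollama.
modelfile = """\
FROM ./my-finetune-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM You are a helpful assistant fine-tuned for support tickets.
"""

with open("Modelfile", "w") as f:
    f.write(modelfile)
# then: ollama create my-finetune -f Modelfile
#       ollama run my-finetune
```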
llama.cpp
What: The C/C++ inference engine that powers Ollama, LM Studio, and many other local tools. Directly runs GGUF files. Maximum performance on CPU and Apple Silicon.

When to use directly: When you need maximum control, custom sampling parameters, or embedding extraction. Ollama is a wrapper around llama.cpp with a nicer UX.
Other Local Tools
LM Studio: Desktop GUI for running local models. Download, chat, and serve models with a visual interface. Good for non-developers.

Jan: Open-source ChatGPT alternative that runs locally. Clean UI, extension system, OpenAI-compatible API.
The local deployment path: Fine-tune with QLoRA on cloud GPU → merge LoRA into base model → export to GGUF (Q4_K_M) with Unsloth → import into Ollama → serve locally with zero ongoing cost. A 7B model in Q4_K_M runs comfortably on a MacBook with 16 GB RAM or any machine with a 6 GB+ GPU.
LoRA Hot-Swapping & Multi-Tenant Serving
Serving dozens of fine-tuned models from a single GPU
The Multi-Tenant Problem
Scenario: You have 10 customers, each with their own fine-tuned model. Loading 10 separate 8B models requires 10 GPUs. At $1/hr per GPU, that’s $7,200/month — and most GPUs sit idle 90% of the time.

The solution: LoRA adapters. Keep one base model in GPU memory. Load/unload tiny LoRA adapters (50–200 MB each) per request. 10 customers share 1 GPU. Cost drops from $7,200 to $720/month.
How vLLM Does It
LRU cache: vLLM maintains a least-recently-used cache of LoRA adapters in GPU memory. When a request arrives for a specific adapter, it’s loaded if not cached, or served immediately if cached.

Runtime loading: Load new adapters via API (/v1/load_lora_adapter) without restarting the server. Adapters can be loaded from local disk, S3, or HuggingFace Hub.

Per-request routing: Each API request specifies which adapter to use via the model field. The base model handles requests with no adapter specified.
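Per-request routing is just a string in the request body. A sketch, assuming adapter names were registered at server start (vLLM's --lora-modules flag) or loaded at runtime; the adapter and model names are placeholders:

```python
import json

def request_for(prompt, customer_adapter=None):
    """Build a chat request routed to a specific customer's LoRA.

    Naming an adapter in the `model` field makes vLLM apply that
    adapter on top of the shared base model; omitting it serves the
    bare base model.
    """
    return {
        "model": customer_adapter or "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
    }

print(json.dumps(request_for("Hello", "customer-acme-lora"), indent=2))
print(json.dumps(request_for("Hello"), indent=2))  # falls back to base model
```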
Economics
Base model memory: ~16 GB (8B model in bf16)
Per-adapter memory: ~50–200 MB (depending on rank and target modules)
Adapters per GPU: 50–100+ (limited by total GPU memory)

Cost comparison (10 customers, 8B model):
Separate models: 10 GPUs × $720/mo = $7,200/mo
LoRA hot-swap: 1 GPU × $720/mo = $720/mo
Savings: 90%
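The comparison above reduces to one multiplication. A toy cost model using the section's assumed numbers ($1/hr GPUs, 720 hours per month):

```python
def monthly_cost(n_customers, gpu_rate_per_hr=1.0, hours_per_mo=720, shared=False):
    """Monthly GPU bill: one GPU per customer, or one shared GPU
    serving every customer via hot-swapped LoRA adapters."""
    gpus = 1 if shared else n_customers
    return gpus * gpu_rate_per_hr * hours_per_mo

dedicated = monthly_cost(10)               # 10 dedicated GPUs
hot_swap = monthly_cost(10, shared=True)   # 1 GPU, adapters swapped per request
print(f"dedicated: ${dedicated:,.0f}/mo, hot-swap: ${hot_swap:,.0f}/mo, "
      f"savings: {1 - hot_swap / dedicated:.0%}")
# prints: dedicated: $7,200/mo, hot-swap: $720/mo, savings: 90%
```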
This is why LoRA is the default for production fine-tuning. Beyond training efficiency, LoRA adapters enable multi-tenant serving that is economically impossible with full fine-tuned models. Train with LoRA, serve with LoRA hot-swapping, and you get both cheap training and cheap inference.
Continuous Fine-Tuning & Model Lifecycle
Keeping your model fresh as data and requirements evolve
The Lifecycle
Fine-tuning is not a one-time event. Your data changes, user needs evolve, the base model gets updated, and you discover failure modes in production. You need a pipeline that supports iterative improvement.

The cycle:
1. Fine-tune on initial dataset
2. Deploy and monitor
3. Collect feedback and failure cases
4. Add new data to training set
5. Re-fine-tune (from base model, not from previous fine-tune)
6. Evaluate against previous version
7. Deploy if better, rollback if not
8. Repeat
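Step 7 is worth making explicit as a gate in the pipeline. A minimal sketch; the metric and margin are assumptions (any eval score where higher is better works, and the 0.01 margin guarding against promoting on noise is an illustrative default, not a recommendation):

```python
def promote(candidate_score, production_score, min_gain=0.01):
    """Deploy the re-fine-tuned model only if it beats the current
    production model on the eval suite by at least `min_gain`."""
    if candidate_score >= production_score + min_gain:
        return "deploy"
    return "rollback"

print(promote(0.86, 0.84))   # clearly better -> "deploy"
print(promote(0.845, 0.84))  # within the noise margin -> "rollback"
```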
Best Practices
Always fine-tune from the base model, not from a previous fine-tune. Stacking fine-tunes compounds errors and causes drift.

Version everything: Dataset version, base model version, hyperparameters, adapter weights. Use git for code, HuggingFace Hub for models, and WandB for experiments.

A/B test in production: Route 5–10% of traffic to the new model. Compare metrics (latency, user satisfaction, error rate) before full rollout.

Monitor for drift: Track output quality over time. If scores degrade, it may be time to retrain with fresh data or update the base model.
When to update the base model: When a new version of Llama/Mistral/Qwen is released, re-fine-tune on the new base. The new base is usually better, so your fine-tune benefits from the improved foundation. Keep your training data and pipeline ready for quick re-runs.
Cost Analysis & Decision Framework
The economics of self-hosted vs API-based inference
Self-Hosted Cost Breakdown
GPU rental (A100 80GB):
- Lambda Labs: ~$1.10/hr = ~$800/mo
- RunPod (secure): ~$1.64/hr = ~$1,180/mo
- RunPod (community): ~$0.44/hr = ~$320/mo

Throughput (8B model, A100):
- bf16: ~2,000 tokens/sec
- AWQ 4-bit: ~4,000 tokens/sec
- With vLLM continuous batching: 10–50x more concurrent users

Cost per 1M tokens (self-hosted, A100):
- bf16: ~$0.11/1M tokens
- AWQ 4-bit: ~$0.06/1M tokens

Compare to API pricing:
- GPT-4o-mini: $0.15/1M input, $0.60/1M output
- GPT-4o: $2.50/1M input, $10/1M output
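The per-token figures above follow from throughput and the hourly rate. The formula, with an illustrative rate of ~$0.80/hr (between the RunPod community and Lambda prices listed) that reproduces the ballpark numbers in the text:

```python
def cost_per_million_tokens(gpu_rate_per_hr, tokens_per_sec):
    """Cost of 1M generated tokens on a fully utilized GPU:
    hourly rate divided by millions of tokens produced per hour."""
    return gpu_rate_per_hr / (tokens_per_sec * 3600 / 1e6)

# Assumed $0.80/hr rate; throughput figures from the breakdown above.
print(f"bf16:      ${cost_per_million_tokens(0.80, 2000):.2f}/1M")
print(f"AWQ 4-bit: ${cost_per_million_tokens(0.80, 4000):.2f}/1M")
# prints: bf16:      $0.11/1M
#         AWQ 4-bit: $0.06/1M
```

Note these assume full utilization; a GPU idling half the time doubles the effective cost per token, which is exactly what continuous batching and multi-tenant serving are meant to prevent.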
Decision Framework
Use API (OpenAI/Anthropic) when:
- Low volume (<10M tokens/day)
- Need frontier model quality
- No ML infrastructure team
- Fast time-to-market matters most

Self-host with vLLM when:
- High volume (>10M tokens/day)
- Data privacy requirements (HIPAA, GDPR)
- Need custom fine-tuned models
- Cost optimization is critical
- Need LoRA multi-tenant serving

Use Ollama/local when:
- Development and testing
- Offline/air-gapped environments
- Personal use / side projects
- Zero ongoing cost after hardware
The crossover point: Self-hosting becomes cheaper than API at roughly 10–50M tokens/day, depending on the API model and your GPU costs. Below that, the operational overhead of managing GPUs outweighs the savings. Above that, self-hosting can be 5–20x cheaper. Always factor in engineering time, not just GPU costs.
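The crossover can be estimated directly. A back-of-the-envelope sketch; the blended API prices below are assumptions (a notional input/output mix, not official quotes), and it deliberately ignores engineering time and utilization:

```python
def breakeven_tokens_per_day(gpu_cost_per_mo, api_price_per_million):
    """Daily token volume at which a fixed-cost self-hosted GPU matches
    pay-per-token API pricing (30-day month assumed)."""
    tokens_per_month = gpu_cost_per_mo / api_price_per_million * 1e6
    return tokens_per_month / 30

# Assumed blended prices per 1M tokens (illustrative, not official):
for label, price in [("cheap API tier ($0.30/1M blended)", 0.30),
                     ("frontier tier ($5.00/1M blended)", 5.00)]:
    daily = breakeven_tokens_per_day(800, price)
    print(f"{label}: ~{daily / 1e6:.1f}M tokens/day")
```

Against a frontier-priced API the break-even arrives at a few million tokens per day; against a cheap API tier it is tens of millions, which is why the crossover is a range rather than a single number.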