Ch 10 — Production Deployment & Serving

Model merging, quantization, vLLM, Ollama, LoRA hot-swapping, continuous fine-tuning, and cost analysis
High Level: Merging → Quantize → vLLM → Local → LoRA Swap → Continuous → Cost
Model Merging
Combining multiple fine-tuned models into one — no extra training needed
Why Merge Models?
The idea: You fine-tuned Model A for coding and Model B for medical Q&A. Merging combines their strengths into a single model without retraining. No GPU needed — merging is a CPU operation on the weight tensors.

Real-world impact: Merged models regularly top the Open LLM Leaderboard. The community has turned merging into a “sport” — sharing recipes and discovering that certain merge combinations outperform the individual models.

When to merge: You have multiple fine-tunes of the same base model, each specialized for a different task, and you want a single generalist model that handles all tasks.
Merging Methods
SLERP (Spherical Linear Interpolation): Smoothly interpolates between two models while preserving magnitude in high-dimensional weight space. Most popular method. Limited to 2 models at a time. One parameter: t (0.0–1.0, blend ratio).

TIES-Merging (Yadav et al. 2023): Handles 2+ models. Trims redundant parameters, resolves sign conflicts between models, then averages. Key parameter: density (0.2–1.0, how much to trim).

DARE (Yu et al. 2024): Randomly drops delta parameters before merging as regularization. Enables blending more models without interference. Works well with TIES.

Linear: Simple weighted average. merged = w1*A + w2*B. Fast but less sophisticated. Good baseline.
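The geometry behind SLERP can be sketched in a few lines. This is a minimal illustration of the interpolation formula on plain Python lists, not a production merger; real tools such as mergekit apply it per weight tensor across the whole model.

```python
import math

def slerp(t, a, b, eps=1e-6):
    """Spherical linear interpolation between two weight vectors.

    Unlike a plain weighted average, SLERP walks along the arc between
    the two points, preserving magnitude in high-dimensional space.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    if omega < eps:  # nearly parallel vectors: fall back to linear blend
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]

# t=0.0 recovers model A's weights; t=1.0 recovers model B's
print(slerp(0.0, [1.0, 0.0], [0.0, 1.0]))
print(slerp(0.5, [1.0, 0.0], [0.0, 1.0]))
```

Note that at t=0.5 the result keeps unit norm, whereas a linear average of these two unit vectors would shrink to length ~0.71.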
mergekit (Arcee AI) is the standard tool. Define a YAML config specifying models, method, and parameters. Run mergekit-yaml config.yaml ./output. Supports SLERP, TIES, DARE, linear, passthrough, and per-layer recipes. Works on CPU — no GPU required for merging.
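A mergekit SLERP config looks roughly like the following. Field names follow mergekit's documented YAML schema; the model names, layer count, and t value are placeholders for illustration.

```python
# Hypothetical mergekit config for a two-model SLERP merge.
# Model names are placeholders; layer_range assumes a 32-layer 7B model.
slerp_config = """\
slices:
  - sources:
      - model: org/coder-7b
        layer_range: [0, 32]
      - model: org/medqa-7b
        layer_range: [0, 32]
merge_method: slerp
base_model: org/coder-7b
parameters:
  t: 0.5          # blend ratio: 0.0 = all A, 1.0 = all B
dtype: bfloat16
"""

with open("config.yaml", "w") as f:
    f.write(slerp_config)
# then run: mergekit-yaml config.yaml ./output
```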
Quantization for Deployment
Shrinking model size 2–4x while preserving quality
The Three Formats
GGUF (llama.cpp): Cross-platform, single-file format. Runs on CPU, Apple Silicon, and GPU with offloading. No calibration needed. Best for local/edge deployment. Quantization levels: Q2_K through Q8_0. Q4_K_M is the sweet spot (~4.5 GB for 7B, ~97% quality retention).

GPTQ (Frantar et al. 2022): GPU-focused, designed for NVIDIA inference. Requires a calibration dataset (512–2048 samples). 20% faster throughput on NVIDIA GPUs. Supports 2/3/4/8-bit. Best for production GPU serving.

AWQ (Lin et al. 2024): Activation-aware — identifies and protects critical weights. Lowest accuracy loss of the three (~0.7% drop vs ~2.8% for GPTQ at 4-bit). Best for quality-critical applications and instruction-tuned models.
Size & Quality Comparison (7B Model)
bf16 (no quantization): ~14 GB, 100% quality
GGUF Q8_0: ~7.5 GB, ~99% quality
AWQ 4-bit: ~4 GB, ~98.5% quality
GGUF Q4_K_M: ~4.5 GB, ~97% quality
GPTQ 4-bit: ~4 GB, ~96% quality
GGUF Q2_K: ~3 GB, ~88% quality
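The sizes above follow directly from bits per weight. A rough estimator, with assumed effective bit rates (K-quants mix precisions, so e.g. Q4_K_M averages closer to 4.8 bits than 4; figures are ballpark, ignoring embedding tables and metadata):

```python
def quantized_size_gb(n_params, bits_per_weight):
    """Rough on-disk size: parameters x effective bits per weight.

    Ignores metadata, scales stored per block, and any layers kept at
    higher precision, so treat results as estimates only.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Assumed effective bit rates for a 7B model
for name, bits in [("bf16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q2_K", 2.6)]:
    print(f"{name:8s} ~{quantized_size_gb(7e9, bits):.1f} GB")
```

bf16 lands on exactly 14 GB for 7B parameters (2 bytes each), matching the table; the quantized estimates fall within a few hundred MB of the figures above.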
GGUF: CPU + GPU + Apple Silicon · no calibration · single file · best for local/edge
GPTQ: NVIDIA GPU only · needs calibration · fastest throughput · best for GPU serving
AWQ: GPU-focused · activation-aware · best accuracy · best for quality-critical
bf16/fp16: no quantization · full quality · 2x memory · best for when you can afford it
vLLM — Production GPU Serving
The standard inference engine for high-throughput LLM serving
Why vLLM?
PagedAttention: vLLM’s core innovation. Manages KV cache like virtual memory pages, eliminating memory waste from fragmentation. Achieves 2–4x higher throughput than naive HuggingFace inference.

Continuous batching: New requests join the batch as old ones finish, maximizing GPU utilization. No waiting for the slowest request in a batch.

OpenAI-compatible API: Drop-in replacement for the OpenAI API. /v1/chat/completions, /v1/completions, /v1/models. Switch from OpenAI to self-hosted by changing the base URL.

Quantization support: Serves AWQ, GPTQ, and bitsandbytes quantized models natively. Also supports FP8 on H100 GPUs.
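Because the API is OpenAI-compatible, a request to a self-hosted vLLM server is byte-for-byte the same shape as a request to OpenAI; only the base URL changes. A sketch using only the standard library (the host, port, and model name are assumptions: vLLM's default port is 8000, and the model field must match whatever the server was launched with):

```python
import json

# Chat request payload: identical shape to the OpenAI API.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize PagedAttention in one line."}],
    "max_tokens": 128,
}

def post_chat(payload, base_url="http://localhost:8000/v1"):
    """POST to /v1/chat/completions on a running vLLM server.

    Uses stdlib urllib so the sketch has no dependencies; the official
    `openai` client works the same way once base_url points here.
    """
    import urllib.request
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

print(json.dumps(payload, indent=2))
```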
Key Features
Tensor parallelism: Split a large model across multiple GPUs. Serve 70B models on 2x A100 or 4x A100.

Speculative decoding: Use a small draft model to propose tokens, verify with the large model. 2–3x faster for long outputs.

Prefix caching: Cache KV states for common system prompts. Eliminates redundant computation when many requests share the same prefix.

Structured output: Guided generation with JSON schemas, regex patterns, or grammar constraints. Guarantees valid output format.
vLLM is the recommended production engine. HuggingFace’s TGI (Text Generation Inference) entered maintenance mode in December 2025, with HuggingFace recommending vLLM and SGLang as replacements. vLLM has the largest community, best documentation, and broadest hardware support (NVIDIA, AMD, Intel, TPU).
Local Serving — Ollama & llama.cpp
Running your fine-tuned model on consumer hardware
Ollama
What: A command-line tool for running LLMs locally. One-command install, one-command run. Manages model downloads, quantization, and serving automatically.

API: OpenAI-compatible REST API on localhost:11434. Works with LangChain, LangGraph, and any OpenAI client library by changing the base URL.

GGUF support: Runs any of the 45,000+ GGUF models on HuggingFace Hub. Create a Modelfile to import your fine-tuned GGUF model.

Performance: 20+ tokens/second on Apple M-series. GPU offloading on NVIDIA. Runs on CPU (slower but works everywhere).
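Importing a fine-tuned GGUF into Ollama comes down to writing a small Modelfile. A minimal sketch; the file path, parameter values, and system prompt below are placeholders:

```python
# Hypothetical Modelfile for registering a local GGUF with Ollama.
modelfile = """\
FROM ./my-finetune-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM You are a helpful assistant fine-tuned for support tickets.
"""

with open("Modelfile", "w") as f:
    f.write(modelfile)
# then: ollama create my-finetune -f Modelfile
#       ollama run my-finetune
```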
llama.cpp
What: The C/C++ inference engine that powers Ollama, LM Studio, and many other local tools. Directly runs GGUF files. Maximum performance on CPU and Apple Silicon.

When to use directly: When you need maximum control, custom sampling parameters, or embedding extraction. Ollama is a wrapper around llama.cpp with a nicer UX.
Other Local Tools
LM Studio: Desktop GUI for running local models. Download, chat, and serve models with a visual interface. Good for non-developers.

Jan: Open-source ChatGPT alternative that runs locally. Clean UI, extension system, OpenAI-compatible API.
The local deployment path: Fine-tune with QLoRA on cloud GPU → merge LoRA into base model → export to GGUF (Q4_K_M) with Unsloth → import into Ollama → serve locally with zero ongoing cost. A 7B model in Q4_K_M runs comfortably on a MacBook with 16 GB RAM or any machine with a 6 GB+ GPU.
LoRA Hot-Swapping & Multi-Tenant Serving
Serving dozens of fine-tuned models from a single GPU
The Multi-Tenant Problem
Scenario: You have 10 customers, each with their own fine-tuned model. Loading 10 separate 8B models requires 10 GPUs. At $1/hr per GPU, that’s $7,200/month — and most GPUs sit idle 90% of the time.

The solution: LoRA adapters. Keep one base model in GPU memory. Load/unload tiny LoRA adapters (50–200 MB each) per request. 10 customers share 1 GPU. Cost drops from $7,200 to $720/month.
How vLLM Does It
LRU cache: vLLM maintains a least-recently-used cache of LoRA adapters in GPU memory. When a request arrives for a specific adapter, it’s loaded if not cached, or served immediately if cached.

Runtime loading: Load new adapters via API (/v1/load_lora_adapter) without restarting the server. Adapters can be loaded from local disk, S3, or HuggingFace Hub.

Per-request routing: Each API request specifies which adapter to use via the model field. The base model handles requests with no adapter specified.
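Per-request routing is just a string in the request body. A sketch, assuming adapter names were registered at server start (vLLM's --lora-modules flag) or loaded at runtime; the adapter and model names are placeholders:

```python
import json

def request_for(prompt, customer_adapter=None):
    """Build a chat request routed to a specific customer's LoRA.

    Naming an adapter in the `model` field makes vLLM apply that
    adapter on top of the shared base model; omitting it serves the
    bare base model.
    """
    return {
        "model": customer_adapter or "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
    }

print(json.dumps(request_for("Hello", "customer-acme-lora"), indent=2))
print(json.dumps(request_for("Hello"), indent=2))  # falls back to base model
```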
Economics
Base model memory: ~16 GB (8B model in bf16)
Per-adapter memory: ~50–200 MB (depending on rank and target modules)
Adapters per GPU: 50–100+ (limited by total GPU memory)

Cost comparison (10 customers, 8B model):
Separate models: 10 GPUs × $720/mo = $7,200/mo
LoRA hot-swap: 1 GPU × $720/mo = $720/mo
Savings: 90%
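The comparison above reduces to one multiplication. A toy cost model using the section's assumed numbers ($1/hr GPUs, 720 hours per month):

```python
def monthly_cost(n_customers, gpu_rate_per_hr=1.0, hours_per_mo=720, shared=False):
    """Monthly GPU bill: one GPU per customer, or one shared GPU
    serving every customer via hot-swapped LoRA adapters."""
    gpus = 1 if shared else n_customers
    return gpus * gpu_rate_per_hr * hours_per_mo

dedicated = monthly_cost(10)               # 10 dedicated GPUs
hot_swap = monthly_cost(10, shared=True)   # 1 GPU, adapters swapped per request
print(f"dedicated: ${dedicated:,.0f}/mo, hot-swap: ${hot_swap:,.0f}/mo, "
      f"savings: {1 - hot_swap / dedicated:.0%}")
# prints: dedicated: $7,200/mo, hot-swap: $720/mo, savings: 90%
```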
This is why LoRA is the default for production fine-tuning. Beyond training efficiency, LoRA adapters enable multi-tenant serving that is economically impossible with full fine-tuned models. Train with LoRA, serve with LoRA hot-swapping, and you get both cheap training and cheap inference.
Continuous Fine-Tuning & Model Lifecycle
Keeping your model fresh as data and requirements evolve
The Lifecycle
Fine-tuning is not a one-time event. Your data changes, user needs evolve, the base model gets updated, and you discover failure modes in production. You need a pipeline that supports iterative improvement.

The cycle:
1. Fine-tune on initial dataset
2. Deploy and monitor
3. Collect feedback and failure cases
4. Add new data to training set
5. Re-fine-tune (from base model, not from previous fine-tune)
6. Evaluate against previous version
7. Deploy if better, rollback if not
8. Repeat
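Step 7 is worth making explicit as a gate in the pipeline. A minimal sketch; the metric and margin are assumptions (any eval score where higher is better works, and the 0.01 margin guarding against promoting on noise is an illustrative default, not a recommendation):

```python
def promote(candidate_score, production_score, min_gain=0.01):
    """Deploy the re-fine-tuned model only if it beats the current
    production model on the eval suite by at least `min_gain`."""
    if candidate_score >= production_score + min_gain:
        return "deploy"
    return "rollback"

print(promote(0.86, 0.84))   # clearly better -> "deploy"
print(promote(0.845, 0.84))  # within the noise margin -> "rollback"
```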
Best Practices
Always fine-tune from the base model, not from a previous fine-tune. Stacking fine-tunes compounds errors and causes drift.

Version everything: Dataset version, base model version, hyperparameters, adapter weights. Use git for code, HuggingFace Hub for models, and WandB for experiments.

A/B test in production: Route 5–10% of traffic to the new model. Compare metrics (latency, user satisfaction, error rate) before full rollout.

Monitor for drift: Track output quality over time. If scores degrade, it may be time to retrain with fresh data or update the base model.
When to update the base model: When a new version of Llama/Mistral/Qwen is released, re-fine-tune on the new base. The new base is usually better, so your fine-tune benefits from the improved foundation. Keep your training data and pipeline ready for quick re-runs.
Cost Analysis & Decision Framework
The economics of self-hosted vs API-based inference
Self-Hosted Cost Breakdown
GPU rental (A100 80GB):
- Lambda Labs: ~$1.10/hr = ~$800/mo
- RunPod (secure): ~$1.64/hr = ~$1,180/mo
- RunPod (community): ~$0.44/hr = ~$320/mo

Throughput (8B model, A100):
- bf16: ~2,000 tokens/sec
- AWQ 4-bit: ~4,000 tokens/sec
- With vLLM continuous batching: 10–50x more concurrent users

Cost per 1M tokens (self-hosted, A100):
- bf16: ~$0.11/1M tokens
- AWQ 4-bit: ~$0.06/1M tokens

Compare to API pricing:
- GPT-4o-mini: $0.15/1M input, $0.60/1M output
- GPT-4o: $2.50/1M input, $10/1M output
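The per-token figures above follow from throughput and the hourly rate. The formula, with an illustrative rate of ~$0.80/hr (between the RunPod community and Lambda prices listed) that reproduces the ballpark numbers in the text:

```python
def cost_per_million_tokens(gpu_rate_per_hr, tokens_per_sec):
    """Cost of 1M generated tokens on a fully utilized GPU:
    hourly rate divided by millions of tokens produced per hour."""
    return gpu_rate_per_hr / (tokens_per_sec * 3600 / 1e6)

# Assumed $0.80/hr rate; throughput figures from the breakdown above.
print(f"bf16:      ${cost_per_million_tokens(0.80, 2000):.2f}/1M")
print(f"AWQ 4-bit: ${cost_per_million_tokens(0.80, 4000):.2f}/1M")
# prints: bf16:      $0.11/1M
#         AWQ 4-bit: $0.06/1M
```

Note these assume full utilization; a GPU idling half the time doubles the effective cost per token, which is exactly what continuous batching and multi-tenant serving are meant to prevent.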
Decision Framework
Use API (OpenAI/Anthropic) when:
- Low volume (<10M tokens/day)
- Need frontier model quality
- No ML infrastructure team
- Fast time-to-market matters most

Self-host with vLLM when:
- High volume (>10M tokens/day)
- Data privacy requirements (HIPAA, GDPR)
- Need custom fine-tuned models
- Cost optimization is critical
- Need LoRA multi-tenant serving

Use Ollama/local when:
- Development and testing
- Offline/air-gapped environments
- Personal use / side projects
- Zero ongoing cost after hardware
The crossover point: Self-hosting becomes cheaper than API at roughly 10–50M tokens/day, depending on the API model and your GPU costs. Below that, the operational overhead of managing GPUs outweighs the savings. Above that, self-hosting can be 5–20x cheaper. Always factor in engineering time, not just GPU costs.
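The crossover can be estimated directly. A back-of-the-envelope sketch; the blended API prices below are assumptions (a notional input/output mix, not official quotes), and it deliberately ignores engineering time and utilization:

```python
def breakeven_tokens_per_day(gpu_cost_per_mo, api_price_per_million):
    """Daily token volume at which a fixed-cost self-hosted GPU matches
    pay-per-token API pricing (30-day month assumed)."""
    tokens_per_month = gpu_cost_per_mo / api_price_per_million * 1e6
    return tokens_per_month / 30

# Assumed blended prices per 1M tokens (illustrative, not official):
for label, price in [("cheap API tier ($0.30/1M blended)", 0.30),
                     ("frontier tier ($5.00/1M blended)", 5.00)]:
    daily = breakeven_tokens_per_day(800, price)
    print(f"{label}: ~{daily / 1e6:.1f}M tokens/day")
```

Against a frontier-priced API the break-even arrives at a few million tokens per day; against a cheap API tier it is tens of millions, which is why the crossover is a range rather than a single number.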