Ch 10 — Production Deployment & Serving — Under the Hood

mergekit configs, GPTQ/AWQ/GGUF quantization, vLLM setup, Ollama Modelfiles, LoRA multi-tenant, and deployment scripts
A. Model Merging with mergekit: YAML configs for SLERP, TIES, and DARE

1. SLERP Merge: two-model blend
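Step 1 maps to a mergekit YAML config like the following sketch; the model names and the per-layer `t` schedule are placeholders:

```yaml
# SLERP blend of two 32-layer models (names are placeholders)
slices:
  - sources:
      - model: org/model-a
        layer_range: [0, 32]
      - model: org/model-b
        layer_range: [0, 32]
merge_method: slerp
base_model: org/model-a
parameters:
  t:
    - filter: self_attn   # attention layers lean toward model-b mid-network
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5          # everything else: even blend
dtype: bfloat16
```

Run it with `mergekit-yaml slerp.yaml ./merged --cuda` to write the merged checkpoint to `./merged`.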
2. TIES / DARE: multi-model merge
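For step 2, a TIES merge of two fine-tunes onto a shared base might look like this sketch; the `density` and `weight` values are illustrative, not tuned:

```yaml
models:
  - model: org/base-model        # base contributes no task vector
  - model: org/finetune-math
    parameters:
      density: 0.5               # keep the top 50% of each task vector
      weight: 0.4
  - model: org/finetune-code
    parameters:
      density: 0.5
      weight: 0.4
merge_method: ties               # swap to dare_ties for DARE's random drop-and-rescale
base_model: org/base-model
parameters:
  normalize: true
dtype: bfloat16
```

The same structure serves both methods: changing `merge_method` is the only edit needed to move from TIES to DARE.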
compressQuantize: GPTQ (GPU), AWQ (accuracy), GGUF (cross-platform)
BQuantization for DeploymentGPTQ, AWQ, and GGUF conversion scripts
3
compress
GPTQ / AWQ
GPU quantization
or
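As a reference point for step 3, these are the commonly used 4-bit settings. This is a sketch, not a full conversion script: the dicts mirror the parameters accepted by AutoGPTQ's `BaseQuantizeConfig` and AutoAWQ's `quantize()`, and both tools additionally need a calibration pass over sample text.

```python
# Common 4-bit recipes (illustrative defaults, not tuned for any specific model).

# GPTQ: pass these fields to auto_gptq's BaseQuantizeConfig; fastest GPU kernels.
GPTQ_CONFIG = {
    "bits": 4,          # 4-bit weights
    "group_size": 128,  # quantize in groups of 128 columns
    "desc_act": True,   # activation-order quantization; slower, more accurate
}

# AWQ: pass as quant_config to AutoAWQForCausalLM.quantize();
# activation-aware scaling tends to preserve accuracy better at 4-bit.
AWQ_CONFIG = {
    "w_bit": 4,
    "q_group_size": 128,
    "zero_point": True,
    "version": "GEMM",  # GEMM kernels suit batched GPU serving
}
```

Group size 128 is the usual starting point for both; smaller groups trade file size for accuracy.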
4. GGUF Export: llama.cpp format
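Step 4 as shell commands, assuming a built llama.cpp checkout; the paths and the Q4_K_M choice are placeholders:

```bash
# Convert an HF checkpoint directory to an f16 GGUF file
python llama.cpp/convert_hf_to_gguf.py ./merged-model \
  --outfile merged-f16.gguf --outtype f16

# Quantize to Q4_K_M, a common size/quality trade-off for local serving
./llama.cpp/llama-quantize merged-f16.gguf merged-Q4_K_M.gguf Q4_K_M
```

The intermediate f16 file is only a staging artifact; it can be deleted once the quantized GGUF is verified.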
C. vLLM Production Serving: server setup, Docker, and multi-LoRA

5. vLLM Server: launch + config
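Step 5 can be as simple as the sketch below; the model path is a placeholder, and the flags shown are the ones most worth tuning first:

```bash
pip install vllm

# OpenAI-compatible server on :8000
vllm serve ./merged-model-awq \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000

# Smoke test against the OpenAI-style endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "./merged-model-awq", "prompt": "Hello", "max_tokens": 16}'
```

`--gpu-memory-utilization` caps how much VRAM vLLM claims for weights plus KV cache; lowering it leaves headroom for other processes on the same GPU.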
6. Multi-LoRA: hot-swap adapters
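For step 6, vLLM can serve one base model with several LoRA adapters registered at startup, and each request selects its adapter by name. Adapter names and paths below are placeholders:

```bash
vllm serve org/base-model \
  --enable-lora \
  --max-lora-rank 64 \
  --lora-modules tenant-a=./loras/tenant-a tenant-b=./loras/tenant-b

# The "model" field routes the request to a specific adapter
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tenant-a", "prompt": "Hello", "max_tokens": 16}'
```

This is the multi-tenant pattern: one copy of the base weights in VRAM, with per-tenant adapters applied per request.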
D. Local Serving & Containerization: Ollama, Docker, and Kubernetes deployment

Ollama: import the GGUF, create a Modelfile, serve locally.

7. Ollama Setup: Modelfile + serve
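Step 7 wraps the GGUF export in a Modelfile. A minimal sketch, where the file path, parameter values, and system prompt are placeholders:

```
# Modelfile
FROM ./merged-Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM "You are a concise assistant."
```

Then `ollama create merged-model -f Modelfile` registers it locally, and `ollama run merged-model` serves an interactive session.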
8. Docker Deploy: container + K8s
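For step 8, one approach is to build on vLLM's official image. This is a sketch: the image tag and paths are placeholders, and the container assumes a GPU is passed through at runtime (e.g. `docker run --gpus all`):

```dockerfile
FROM vllm/vllm-openai:latest

# Bake the quantized model into the image (or mount it as a volume instead)
COPY ./merged-model-awq /models/merged-model-awq

ENTRYPOINT ["vllm", "serve", "/models/merged-model-awq", \
            "--quantization", "awq", "--port", "8000"]
```

On Kubernetes, request `nvidia.com/gpu: 1` in the pod's resource limits and point a readiness probe at the server's `/health` endpoint so traffic only arrives after the weights are loaded.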
E. Monitoring & Full Pipeline: health checks, metrics, and end-to-end deployment

9. Monitoring: metrics + alerts
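For step 9, vLLM exposes a liveness endpoint at `/health` and Prometheus-format metrics at `/metrics`. Below is a minimal sketch of scrape-side alert logic; the parser, the metric names in the sample, and the thresholds are illustrative:

```python
def parse_prometheus(text: str) -> dict[str, float]:
    """Parse simple 'name value' lines from Prometheus text exposition."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore lines that are not name/number pairs
    return metrics

def should_alert(metrics: dict[str, float], key: str, threshold: float) -> bool:
    """Fire when a gauge exceeds its threshold (missing metrics stay quiet)."""
    return metrics.get(key, 0.0) > threshold

sample = """# HELP vllm:num_requests_waiting Number of waiting requests.
vllm:num_requests_waiting 12
vllm:gpu_cache_usage_perc 0.93
"""
m = parse_prometheus(sample)
```

In production this feeds a real alerting stack (Prometheus + Alertmanager); queue depth and KV-cache usage are the two gauges that most directly signal an overloaded server.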
10. Full Pipeline: train to deploy
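Step 10, end to end, might be a single script that gates each stage on the previous one. A sketch; every path, config name, and the smoke-test prompt are placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail  # abort the pipeline on the first failed stage

# 1) Merge fine-tuned checkpoints with mergekit
mergekit-yaml configs/merge.yaml ./merged --cuda

# 2) Convert and quantize for local serving
python llama.cpp/convert_hf_to_gguf.py ./merged --outfile merged-f16.gguf --outtype f16
./llama.cpp/llama-quantize merged-f16.gguf merged-Q4_K_M.gguf Q4_K_M

# 3) Register with Ollama and smoke test before exposing traffic
ollama create merged-model -f Modelfile
ollama run merged-model "Reply with the single word OK" | grep -qi ok
```

Because of `set -e`, a failed merge or a quantized model that cannot answer the smoke test stops the script before anything reaches users.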