Ch 5 — Compression and Quantization for Deployment

PTQ, QAT, pruning, and distillation without hidden quality regressions.
Pipeline: Compress → Calibrate → Validate → Benchmark → Release
PTQ vs QAT
Post-training and quantization-aware paths have different risk profiles.
PTQ Strength
Post-training quantization is fast to apply and useful for initial feasibility checks across candidate models. It can be sufficient when activation ranges are stable and task boundaries are simple.
QAT Strength
Quantization-aware training generally preserves quality better for sensitive tasks because the model learns with quantization effects in the training loop. It costs more training effort but often reduces surprise regressions later.
Practical Pattern
Treat compression experiments like controlled releases with versioned calibration data and benchmark reports. This keeps gains reproducible across teams and releases.
Key point: Use PTQ for fast screening and QAT when quality margins are tight.
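To make the PTQ path concrete, here is a minimal sketch of per-tensor affine quantization applied after training. The function names and the 8-bit signed range are illustrative assumptions, not a specific library's API.

```python
# Illustrative sketch of post-training affine quantization (PTQ) for one
# weight tensor. Names here are hypothetical, not a real framework API.

def quantize_tensor(values, num_bits=8):
    """Map floats to signed integers with a per-tensor scale and zero-point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)        # range must include zero
    scale = (hi - lo) / (qmax - qmin) or 1.0   # guard the all-zero case
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.2, 0.0, 0.4, 2.5]
q, s, z = quantize_tensor(weights)
restored = dequantize(q, s, z)
# per-element round-trip error stays within roughly one quantization step
```

The same scale/zero-point math underlies QAT as well; the difference is that QAT simulates this rounding inside the training loop so the model adapts to it.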
Integer-Only Inference
Integer kernels are central to tiny-device efficiency.
Why Integer Paths
Integer execution lowers compute and memory pressure on microcontrollers and many accelerators. It also improves determinism in constrained runtimes with fixed kernel support.
Calibration Risk
Bad calibration data can distort activation scales and hurt model behavior in real traffic. Calibration sets should mirror deployment distributions, including hard negatives and boundary cases.
Failure Pattern
Compression pipelines often fail when calibration data is narrow or outdated relative to deployment traffic. Calibration drift can erase expected gains.
Key point: Integer-only performance gains are real, but calibration quality determines whether accuracy survives.
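One way to reduce calibration sensitivity to outliers is a percentile clip on the observed activation range instead of raw min/max. This sketch uses illustrative names and a hypothetical clip fraction; real toolchains offer their own range-estimation options.

```python
# Sketch: derive an activation range from calibration samples using a
# percentile clip, so a few outliers do not blow up the quantization
# scale. Names and the clip fraction are illustrative assumptions.

def calibrated_range(samples, clip_pct=0.999):
    """Return (lo, hi) covering roughly clip_pct of observed activations."""
    ordered = sorted(samples)
    k = max(0, int(len(ordered) * (1 - clip_pct)))  # samples to clip per tail
    lo = min(ordered[k], 0.0)                        # range must include zero
    hi = max(ordered[len(ordered) - 1 - k], 0.0)
    return lo, hi

def activation_scale(samples, num_bits=8, clip_pct=0.999):
    lo, hi = calibrated_range(samples, clip_pct)
    return (hi - lo) / (2 ** num_bits - 1)
```

If the calibration set lacks the hard negatives and boundary cases seen in deployment, the range computed here will be wrong no matter how the clip is tuned, which is the drift failure described above.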
Pruning and Distillation
Use these techniques when quantization alone is not enough.
Pruning Use Case
Pruning can reduce model size and inference cost when redundant parameters are present, but aggressive pruning often destabilizes quality. Structured pruning, which removes whole channels or filters, is easier to operationalize in tiny runtimes than unstructured sparsity, which typically needs specialized kernel support to pay off.
Distillation Use Case
Distillation transfers behavior from a stronger teacher model to a compact student architecture. It is valuable when the student must satisfy strict constraints while retaining task-specific nuances.
Validation Signal
Track class-level regressions and threshold sensitivity after each compression change. These signals catch silent degradations before they become incidents.
Key point: Combine techniques only with tight regression controls; complexity without validation creates risk.
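As a minimal sketch of the pruning side, here is global magnitude pruning to a target sparsity. The function name and the flat weight list are illustrative simplifications of what a real framework applies per layer or per channel.

```python
# Sketch: global magnitude pruning to a target sparsity. A real pipeline
# would prune per layer or per channel; names here are illustrative.

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold is the magnitude of the n_prune-th smallest weight.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

After each pruning step, the class-level regression tracking described above should run before any further sparsity increase, since quality loss from pruning tends to appear suddenly past a model-specific threshold.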
Regression Validation
Compression changes must pass task and operations gates together.
Quality Gates
Evaluate class-level metrics, false-trigger rates, and threshold stability rather than a single aggregate score. Compression often affects edge-case behavior before it affects headline accuracy.
Ops Gates
Track latency, memory peak, startup time, and energy impacts for each compressed variant on real hardware. Promote only variants that improve efficiency without violating reliability targets.
Governance Rule
Require rollback-ready baseline artifacts for every promoted compressed model. Recovery speed matters as much as optimization speed in production.
Key point: Compression acceptance requires both quality and operational pass conditions.
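The joint quality-plus-ops gate can be expressed as a single pass/fail function. The metric names, dictionary layout, and thresholds below are illustrative assumptions; the point is that a candidate must clear every gate, not an aggregate.

```python
# Sketch: a joint quality + ops gate for a compressed candidate.
# Metric names and thresholds are illustrative assumptions.

def passes_gates(baseline, candidate, max_class_drop=0.01,
                 max_latency_ms=20.0, max_peak_kb=256):
    # Quality gate: no single class may regress more than max_class_drop,
    # even if the aggregate score looks fine.
    for cls, base_acc in baseline["class_acc"].items():
        if base_acc - candidate["class_acc"][cls] > max_class_drop:
            return False
    # Ops gates: latency and peak memory must stay inside budget.
    if candidate["latency_ms"] > max_latency_ms:
        return False
    if candidate["peak_kb"] > max_peak_kb:
        return False
    return True
```

In practice the ops numbers should come from measurements on real target hardware, since simulator latency and memory figures often diverge from device behavior.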
Release and Rollback Strategy
Treat compressed variants as managed artifacts, not ad-hoc exports.
Artifact Metadata
Record source checkpoint, quantization method, calibration set version, and benchmark results for each artifact. Strong metadata makes incident response and rollback fast and auditable.
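The metadata fields listed above can be captured in a small, tamper-evident record. The field names and the content-hash fingerprint are illustrative choices, not a prescribed registry schema.

```python
# Sketch: the minimum metadata to attach to a compressed artifact so
# rollback and incident triage stay fast. Field names are illustrative.

import hashlib
import json

def artifact_record(source_ckpt, method, calib_version, benchmarks):
    record = {
        "source_checkpoint": source_ckpt,
        "quantization_method": method,
        "calibration_set_version": calib_version,
        "benchmarks": benchmarks,
    }
    # A content hash over the sorted record makes it tamper-evident and
    # trivially comparable across registries.
    payload = json.dumps(record, sort_keys=True).encode()
    record["fingerprint"] = hashlib.sha256(payload).hexdigest()
    return record
```

Identical inputs produce identical fingerprints, so two teams can verify they are discussing the same artifact during an incident.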
Progressive Rollout
Roll out new compressed variants gradually with monitoring and fallback paths to the previous stable release. This limits blast radius when edge conditions reveal untested behavior.
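A gradual rollout with fallback can be reduced to one decision per monitoring window: advance the traffic share, hold, or roll back. The stage fractions and error-rate tolerance below are illustrative assumptions.

```python
# Sketch: one canary decision step. Advance traffic share only while the
# compressed variant's error rate stays within tolerance of the stable
# release. Stage fractions and tolerance are illustrative assumptions.

def next_traffic_share(current, candidate_err, stable_err,
                       tolerance=0.002, stages=(0.01, 0.05, 0.25, 1.0)):
    if candidate_err - stable_err > tolerance:
        return 0.0                 # fall back to the stable release
    for share in stages:
        if share > current:
            return share           # advance to the next stage
    return current                 # already at full rollout
```

Because the fallback target is the previous stable artifact, this only works if that artifact and its metadata were retained, which is why the governance rule above requires rollback-ready baselines.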
Handoff Artifact
Attach compression metadata and evaluation snapshots to each artifact in your model registry for auditability and incident triage. Review this record at each release checkpoint so its assumptions remain current.
Key point: Compression should plug into the same release discipline as firmware and runtime updates.
Compression Pitfalls in Practice
Most quality failures come from process shortcuts, not from the methods themselves.
Frequent Pitfalls
Common pitfalls include mixing calibration sets across model versions, skipping long-tail regression cases, and over-pruning to meet binary size targets. These shortcuts often create delayed production regressions.
Mitigation Plan
Use staged promotion with strict regression suites and mandatory artifact lineage checks. Controlled process discipline prevents compression work from destabilizing deployed systems.
Key point: Compression quality depends on process rigor as much as algorithm choice.
Release Checklist for Compressed Models
Use a standard go/no-go checklist before deployment.
Checklist Items
Confirm calibration freshness, class-level quality gates, latency and power metrics, and compatibility with target runtime kernels. Every item should reference a versioned report artifact.
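The requirement that every checklist item reference a versioned report can be enforced mechanically. The item names and report-id convention below are illustrative assumptions.

```python
# Sketch: a go/no-go evaluation over a release checklist, where each
# item must point at a versioned report artifact. Names are illustrative.

CHECKLIST = [
    "calibration_freshness",
    "class_level_quality",
    "latency_power_metrics",
    "runtime_kernel_compat",
]

def release_decision(reports):
    """reports maps each checklist item to a versioned report id (or None)."""
    missing = [item for item in CHECKLIST if not reports.get(item)]
    return ("GO", []) if not missing else ("NO-GO", missing)
```

Returning the list of missing items, rather than a bare boolean, gives the approval meeting an actionable gap list instead of a silent rejection.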
Approval Path
Require joint sign-off from model and platform owners when compression affects safety-critical or high-volume features. Shared accountability reduces operational surprises after launch.
Key point: A disciplined checklist makes compression a reliable engineering tool, not a risky optimization gamble.