Ch 5 — Compression and Quantization for Deployment

PTQ, QAT, pruning, and distillation without hidden quality regressions.
Pipeline: Compress → Calibrate → Validate → Benchmark → Release
PTQ vs QAT
Post-training and quantization-aware paths have different risk profiles.
PTQ Strength
Post-training quantization is fast to apply and useful for initial feasibility checks across candidate models. It can be sufficient when activation ranges are stable and task boundaries are simple.
QAT Strength
Quantization-aware training generally preserves quality better for sensitive tasks because the model learns with quantization effects in the training loop. It costs more training effort but often reduces surprise regressions later.
Practical Pattern
Treat compression experiments like controlled releases with versioned calibration data and benchmark reports. This keeps gains reproducible across teams and releases.
Key point: Use PTQ for fast screening and QAT when quality margins are tight.
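To make the PTQ path concrete, here is a minimal sketch of per-tensor affine quantization applied after training. The function names and the 8-bit signed range are illustrative assumptions, not a specific library's API.

```python
# Illustrative sketch of post-training affine quantization (PTQ) for one
# weight tensor. Names here are hypothetical, not a real framework API.

def quantize_tensor(values, num_bits=8):
    """Map floats to signed integers with a per-tensor scale and zero-point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)        # range must include zero
    scale = (hi - lo) / (qmax - qmin) or 1.0   # guard the all-zero case
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.2, 0.0, 0.4, 2.5]
q, s, z = quantize_tensor(weights)
restored = dequantize(q, s, z)
# per-element round-trip error stays within roughly one quantization step
```

The same scale/zero-point math underlies QAT as well; the difference is that QAT simulates this rounding inside the training loop so the model adapts to it.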
Integer-Only Inference
Integer kernels are central to tiny-device efficiency.
Why Integer Paths
Integer execution lowers compute and memory pressure on microcontrollers and many accelerators. It also improves determinism in constrained runtimes with fixed kernel support.
Calibration Risk
Bad calibration data can distort activation scales and hurt model behavior in real traffic. Calibration sets should mirror deployment distributions, including hard negatives and boundary cases.
Failure Pattern
Compression pipelines often fail when calibration data is narrow or outdated relative to deployment traffic. Calibration drift can erase expected gains.
Key point: Integer-only performance gains are real, but calibration quality determines whether accuracy survives.
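One way to reduce calibration sensitivity to outliers is a percentile clip on the observed activation range instead of raw min/max. This sketch uses illustrative names and a hypothetical clip fraction; real toolchains offer their own range-estimation options.

```python
# Sketch: derive an activation range from calibration samples using a
# percentile clip, so a few outliers do not blow up the quantization
# scale. Names and the clip fraction are illustrative assumptions.

def calibrated_range(samples, clip_pct=0.999):
    """Return (lo, hi) covering roughly clip_pct of observed activations."""
    ordered = sorted(samples)
    k = max(0, int(len(ordered) * (1 - clip_pct)))  # samples to clip per tail
    lo = min(ordered[k], 0.0)                        # range must include zero
    hi = max(ordered[len(ordered) - 1 - k], 0.0)
    return lo, hi

def activation_scale(samples, num_bits=8, clip_pct=0.999):
    lo, hi = calibrated_range(samples, clip_pct)
    return (hi - lo) / (2 ** num_bits - 1)
```

If the calibration set lacks the hard negatives and boundary cases seen in deployment, the range computed here will be wrong no matter how the clip is tuned, which is the drift failure described above.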
Pruning and Distillation
Use these techniques when quantization alone is not enough.
Pruning Use Case
Pruning can reduce model size and inference cost when redundant parameters are present, but aggressive pruning often destabilizes quality. Structured pruning, which removes whole channels or filters, is easier to operationalize in tiny runtimes than unstructured sparsity, which typically needs specialized kernel support to pay off.
Distillation Use Case
Distillation transfers behavior from a stronger teacher model to a compact student architecture. It is valuable when the student must satisfy strict constraints while retaining task-specific nuances.
Validation Signal
Track class-level regressions and threshold sensitivity after each compression change. These signals catch silent degradations before they become incidents.
Key point: Combine techniques only with tight regression controls; complexity without validation creates risk.
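As a minimal sketch of the pruning side, here is global magnitude pruning to a target sparsity. The function name and the flat weight list are illustrative simplifications of what a real framework applies per layer or per channel.

```python
# Sketch: global magnitude pruning to a target sparsity. A real pipeline
# would prune per layer or per channel; names here are illustrative.

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold is the magnitude of the n_prune-th smallest weight.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

After each pruning step, the class-level regression tracking described above should run before any further sparsity increase, since quality loss from pruning tends to appear suddenly past a model-specific threshold.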
Regression Validation
Compression changes must pass task and operations gates together.
Quality Gates
Evaluate class-level metrics, false-trigger rates, and threshold stability rather than a single aggregate score. Compression often affects edge-case behavior before it affects headline accuracy.
Ops Gates
Track latency, memory peak, startup time, and energy impacts for each compressed variant on real hardware. Promote only variants that improve efficiency without violating reliability targets.
Governance Rule
Require rollback-ready baseline artifacts for every promoted compressed model. Recovery speed matters as much as optimization speed in production.
Key point: Compression acceptance requires both quality and operational pass conditions.
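The joint quality-plus-ops gate can be expressed as a single pass/fail function. The metric names, dictionary layout, and thresholds below are illustrative assumptions; the point is that a candidate must clear every gate, not an aggregate.

```python
# Sketch: a joint quality + ops gate for a compressed candidate.
# Metric names and thresholds are illustrative assumptions.

def passes_gates(baseline, candidate, max_class_drop=0.01,
                 max_latency_ms=20.0, max_peak_kb=256):
    # Quality gate: no single class may regress more than max_class_drop,
    # even if the aggregate score looks fine.
    for cls, base_acc in baseline["class_acc"].items():
        if base_acc - candidate["class_acc"][cls] > max_class_drop:
            return False
    # Ops gates: latency and peak memory must stay inside budget.
    if candidate["latency_ms"] > max_latency_ms:
        return False
    if candidate["peak_kb"] > max_peak_kb:
        return False
    return True
```

In practice the ops numbers should come from measurements on real target hardware, since simulator latency and memory figures often diverge from device behavior.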
Release and Rollback Strategy
Treat compressed variants as managed artifacts, not ad-hoc exports.
Artifact Metadata
Record source checkpoint, quantization method, calibration set version, and benchmark results for each artifact. Strong metadata makes incident response and rollback fast and auditable.
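The metadata fields listed above can be captured in a small, tamper-evident record. The field names and the content-hash fingerprint are illustrative choices, not a prescribed registry schema.

```python
# Sketch: the minimum metadata to attach to a compressed artifact so
# rollback and incident triage stay fast. Field names are illustrative.

import hashlib
import json

def artifact_record(source_ckpt, method, calib_version, benchmarks):
    record = {
        "source_checkpoint": source_ckpt,
        "quantization_method": method,
        "calibration_set_version": calib_version,
        "benchmarks": benchmarks,
    }
    # A content hash over the sorted record makes it tamper-evident and
    # trivially comparable across registries.
    payload = json.dumps(record, sort_keys=True).encode()
    record["fingerprint"] = hashlib.sha256(payload).hexdigest()
    return record
```

Identical inputs produce identical fingerprints, so two teams can verify they are discussing the same artifact during an incident.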
Progressive Rollout
Roll out new compressed variants gradually with monitoring and fallback paths to the previous stable release. This limits blast radius when edge conditions reveal untested behavior.
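A gradual rollout with fallback can be reduced to one decision per monitoring window: advance the traffic share, hold, or roll back. The stage fractions and error-rate tolerance below are illustrative assumptions.

```python
# Sketch: one canary decision step. Advance traffic share only while the
# compressed variant's error rate stays within tolerance of the stable
# release. Stage fractions and tolerance are illustrative assumptions.

def next_traffic_share(current, candidate_err, stable_err,
                       tolerance=0.002, stages=(0.01, 0.05, 0.25, 1.0)):
    if candidate_err - stable_err > tolerance:
        return 0.0                 # fall back to the stable release
    for share in stages:
        if share > current:
            return share           # advance to the next stage
    return current                 # already at full rollout
```

Because the fallback target is the previous stable artifact, this only works if that artifact and its metadata were retained, which is why the governance rule above requires rollback-ready baselines.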
Handoff Artifact
Attach compression metadata and evaluation snapshots to each artifact in your model registry for auditability and incident triage. Review this record at each release checkpoint so its assumptions remain current.
Key point: Compression should plug into the same release discipline as firmware and runtime updates.
Compression Pitfalls in Practice
Most quality failures come from process shortcuts, not from the methods themselves.
Frequent Pitfalls
Common pitfalls include mixing calibration sets across model versions, skipping long-tail regression cases, and over-pruning to meet binary size targets. These shortcuts often create delayed production regressions.
Mitigation Plan
Use staged promotion with strict regression suites and mandatory artifact lineage checks. Controlled process discipline prevents compression work from destabilizing deployed systems.
Key point: Compression quality depends on process rigor as much as algorithm choice.
Release Checklist for Compressed Models
Use a standard go/no-go checklist before deployment.
Checklist Items
Confirm calibration freshness, class-level quality gates, latency and power metrics, and compatibility with target runtime kernels. Every item should reference a versioned report artifact.
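The requirement that every checklist item reference a versioned report can be enforced mechanically. The item names and report-id convention below are illustrative assumptions.

```python
# Sketch: a go/no-go evaluation over a release checklist, where each
# item must point at a versioned report artifact. Names are illustrative.

CHECKLIST = [
    "calibration_freshness",
    "class_level_quality",
    "latency_power_metrics",
    "runtime_kernel_compat",
]

def release_decision(reports):
    """reports maps each checklist item to a versioned report id (or None)."""
    missing = [item for item in CHECKLIST if not reports.get(item)]
    return ("GO", []) if not missing else ("NO-GO", missing)
```

Returning the list of missing items, rather than a bare boolean, gives the approval meeting an actionable gap list instead of a silent rejection.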
Approval Path
Require joint sign-off from model and platform owners when compression affects safety-critical or high-volume features. Shared accountability reduces operational surprises after launch.
Key point: A disciplined checklist makes compression a reliable engineering tool, not a risky optimization gamble.