Ch 10 — Benchmarking TinyML Correctly

Latency, memory, and energy measurement with reproducible methodology.
Performance pipeline: Metrics → Latency → Energy → Memory → Report.
Benchmark Intent
Benchmark goals must map to product outcomes, not generic leaderboard comparisons.
Goal Definition
Define which decisions the benchmark will support: architecture selection, runtime tuning, or release approval. Different decisions require different metric emphasis and workload construction.
Workload Fidelity
Use representative input distributions and sequence patterns rather than synthetic happy-path samples. Realistic workloads expose resource contention and tail behavior that static demos miss.
Practical Pattern
Use one benchmark harness per product class with versioned workload definitions and measurement scripts. Harness consistency enables trustworthy trend analysis.
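A minimal sketch of what a versioned workload definition might look like; the class and field names here are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

# Hypothetical workload definition; bump `version` whenever inputs,
# sequencing, or background load change, so trend lines stay honest.
@dataclass(frozen=True)
class WorkloadDefinition:
    name: str
    version: str
    input_manifest: tuple  # e.g. hashes of recorded sensor traces
    concurrency: int       # background firmware tasks active during the run

wl_v1 = WorkloadDefinition("keyword-spotting", "1.2.0", ("a1b2", "c3d4"), concurrency=2)
wl_v2 = WorkloadDefinition("keyword-spotting", "1.3.0", ("a1b2", "e5f6"), concurrency=2)

def same_workload(a: WorkloadDefinition, b: WorkloadDefinition) -> bool:
    # Trend analysis is only valid across runs of the same workload version.
    return a.name == b.name and a.version == b.version
```

The frozen dataclass makes a workload definition immutable once published, which is the property that makes version comparisons trustworthy.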
Key Point: A benchmark is useful only when it mirrors production decision context.
Latency Methodology
Measure both central tendency and tail under realistic concurrency.
Latency Metrics
Track cold-start, warm-start, p50, and p95 latency on target hardware with production preprocessing enabled. This captures user-facing responsiveness and startup constraints together.
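As a sketch, the cold/warm split and the percentile summary can be computed from a raw latency log like this; the sample values are invented for illustration:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile; adequate for benchmark reporting."""
    s = sorted(samples)
    k = math.ceil(q / 100.0 * len(s)) - 1
    return s[max(0, k)]

# Illustrative latency log (milliseconds); the first entry is the
# cold-start run, everything after it is warm.
runs_ms = [42.0, 8.1, 7.9, 8.3, 8.0, 12.5, 8.2, 8.1, 9.0, 8.4]
cold_start_ms = runs_ms[0]
warm_ms = runs_ms[1:]

summary = {
    "cold_start_ms": cold_start_ms,
    "p50_ms": percentile(warm_ms, 50),
    "p95_ms": percentile(warm_ms, 95),
}
```

Note how a single outlier in the warm runs dominates p95 while leaving p50 untouched, which is exactly why both are tracked.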
Stress Conditions
Run latency tests under concurrent firmware tasks and bursty input sequences to capture tail degradation. Tail metrics are often the first signal of deployment instability.
Failure Pattern
Benchmark drift happens when test inputs, firmware load, or measurement methodology change without version control. Drift makes historical comparisons unreliable.
Key Point: Tail latency under stress is a better release signal than average latency in isolation.
Energy Measurement
Energy-per-inference is critical for always-on and battery-constrained products.
Measurement Practice
Measure current draw across idle, acquisition, inference, and post-processing phases to isolate costly stages. Phase-level visibility enables targeted optimization instead of blind model changes.
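The phase-level accounting can be sketched with E = V · I · t per phase; the currents, durations, and supply voltage below are assumed example numbers, not measurements from the text:

```python
# Hypothetical per-phase measurements: (current in mA, duration in ms)
# on an assumed 3.3 V rail.
VOLTAGE_V = 3.3
phases = {
    "idle":        (0.05, 900.0),
    "acquisition": (4.0,   60.0),
    "inference":   (12.0,  30.0),
    "post":        (2.0,   10.0),
}

def phase_energy_mj(current_ma, duration_ms, voltage_v=VOLTAGE_V):
    # E = V * I * t; (V * mA * ms) / 1000 yields millijoules.
    return voltage_v * current_ma * duration_ms / 1000.0

energy_mj = {name: phase_energy_mj(i, t) for name, (i, t) in phases.items()}
total_mj = sum(energy_mj.values())
```

Breaking the total down this way shows immediately which phase to attack: here the inference phase dominates, so model-level optimization would pay off, whereas a duty-cycle dominated profile would point at the idle phase instead.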
Product Translation
Translate energy metrics into daily battery impact for expected event rates and duty cycles. This connects engineering metrics directly to user-facing battery-life promises.
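One way to make that translation concrete; battery capacity, event rate, and per-event energy below are all assumed product parameters for illustration:

```python
# Hypothetical product assumptions: an always-on sensor with a
# 200 mAh battery; none of these numbers come from the text.
BATTERY_MAH = 200.0
VOLTAGE_V = 3.3
EVENTS_PER_DAY = 5000
ENERGY_PER_EVENT_MJ = 2.2   # measured energy per detection event
IDLE_CURRENT_MA = 0.05      # baseline draw between events

def daily_drain_mah():
    # Convert event energy to charge: mAh = mJ / (V * 3600), since 1 mAh = 3.6 C.
    event_mah = EVENTS_PER_DAY * ENERGY_PER_EVENT_MJ / (VOLTAGE_V * 3600.0)
    idle_mah = IDLE_CURRENT_MA * 24.0
    return event_mah + idle_mah

def battery_life_days():
    return BATTERY_MAH / daily_drain_mah()
```

Expressed this way, a 10 % regression in energy-per-inference maps directly to days of battery life, which is the number a product team can actually evaluate.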
Validation Signal
Track benchmark confidence through repeated runs and variance reporting, not single-shot numbers. Variance spikes often indicate hidden system instability.
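A minimal sketch of variance-based confidence gating; the 5 % coefficient-of-variation limit and the sample data are illustrative assumptions:

```python
import statistics

def run_stats(samples):
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}

CV_LIMIT = 0.05  # assumed threshold: flag runs whose relative spread exceeds 5 %

def confident(samples):
    # A benchmark number is reportable only when repeated runs agree.
    return run_stats(samples)["cv"] <= CV_LIMIT

# Illustrative: ten repeated p95 measurements of the same workload version.
stable   = [12.1, 12.3, 12.0, 12.2, 12.1, 12.4, 12.2, 12.1, 12.3, 12.2]
unstable = [12.1, 18.9, 12.0, 25.3, 12.1, 12.4, 30.2, 12.1, 12.3, 12.2]
```

The unstable series has a similar p50 to the stable one, yet its variance reveals the hidden contention that a single-shot number would hide.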
Key Point: Energy accounting should be scenario-based, not a single static power number.
Memory and Startup Metrics
Memory peak and startup behavior are frequent hidden blockers for release.
Memory Profile
Capture RAM peak, arena usage, stack high-water marks, and flash footprint with all production services active. Memory metrics must include update and rollback partitions where relevant.
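A release gate over these metrics might look like the following sketch; the budgets, the measured values, and the 10 % headroom rule are all assumptions for illustration:

```python
# Hypothetical measured footprint (KB) with all production services active.
measured = {
    "ram_peak": 198.0,
    "tensor_arena": 96.0,
    "stack_high_water": 14.5,
    "flash_total": 900.0,   # includes update and rollback partitions
}
# Hypothetical per-metric budgets (KB).
budget = {
    "ram_peak": 256.0,
    "tensor_arena": 128.0,
    "stack_high_water": 16.0,
    "flash_total": 1024.0,
}

def memory_gate(measured, budget, headroom=0.10):
    # Pass only when every metric fits within budget minus a safety headroom.
    return [k for k, limit in budget.items()
            if measured[k] > limit * (1 - headroom)]
```

With these numbers the stack high-water mark is the one metric inside the headroom margin, which is exactly the kind of quiet blocker this gate exists to surface before release.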
Startup Readiness
Record model load and first-inference times as part of startup performance budgets. Delayed readiness can break user expectations even when steady-state inference is fast.
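A startup budget check can be as small as the sketch below; the 300 ms readiness budget and the recorded timings are invented example values:

```python
# Hypothetical startup timeline (ms) recorded on target hardware.
startup = {"model_load_ms": 180.0, "first_inference_ms": 95.0}
READINESS_BUDGET_MS = 300.0  # assumed product-level budget

def time_to_ready(t):
    # The user-visible number: power-on until the first result is available.
    return t["model_load_ms"] + t["first_inference_ms"]

def startup_gate(t, budget=READINESS_BUDGET_MS):
    return time_to_ready(t) <= budget
```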
Governance Rule
Require benchmark evidence for every model-runtime promotion decision and keep results attached to release artifacts. Enforcing this consistently prevents scope drift between releases.
Key Point: Startup and memory metrics are first-class release gates in TinyML products.
Reporting and Governance
Benchmark outputs should be standardized for cross-version comparison.
Report Format
Use a fixed report schema that captures model version, runtime version, device build, workload profile, and measured metrics. Standardization makes trend analysis and regression detection reliable.
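One possible shape for such a schema, sketched as a frozen dataclass serialized to JSON; every field name and value here is an illustrative assumption:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical fixed report schema; field names are illustrative only.
@dataclass(frozen=True)
class BenchmarkReport:
    model_version: str
    runtime_version: str
    device_build: str
    workload_profile: str
    run_count: int
    p50_ms: float
    p95_ms: float
    energy_per_event_mj: float
    ram_peak_kb: float

report = BenchmarkReport(
    model_version="kws-2.4.1",
    runtime_version="tflm-1.3.0",
    device_build="fw-2025.06-rc2",
    workload_profile="keyword-spotting@1.2.0",
    run_count=10,
    p50_ms=8.2,
    p95_ms=12.5,
    energy_per_event_mj=2.2,
    ram_peak_kb=198.0,
)

# Sorted keys give deterministic serialization, which keeps report
# diffs reviewable across versions.
serialized = json.dumps(asdict(report), sort_keys=True)
```

Because the schema is fixed and the serialization deterministic, regression detection reduces to diffing two JSON documents.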
Governance Pattern
Adopt benchmark governance modeled on formal TinyML benchmark suites such as MLPerf Tiny: reproducibility, versioned workloads, and transparent methodology changes. This protects decision quality as teams scale.
Handoff Artifact
Document benchmark assumptions and known limitations so downstream teams interpret results correctly. Review it at each release checkpoint so assumptions remain current.
Key Point: Reproducibility discipline is the difference between benchmark theater and benchmark engineering.
Benchmark Anti-Patterns
Weak benchmark design creates false confidence and poor release choices.
Anti-Pattern Examples
Common anti-patterns include synthetic easy inputs, disabled background tasks, and ignoring cold-start behavior. These shortcuts make deployments look better than they actually are.
Correction Approach
Adopt production-like workloads, include stress scenarios, and report full metric sets with variance. Strong methodology is worth more than a large volume of unrealistic benchmark runs.
Key Point: Benchmark realism matters more than benchmark size.
Benchmark Governance Checklist
Standardize measurement practice across teams and releases.
Checklist Items
Confirm workload version, hardware configuration, firmware image version, run count, variance bounds, and metric definitions before comparing results. Consistency is mandatory for valid decisions.
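The pre-comparison check above can be automated as a sketch like this; the metadata field names are assumptions chosen to mirror the checklist:

```python
# Checklist fields that must match before two benchmark results
# may be compared; names are illustrative.
COMPARABILITY_KEYS = (
    "workload_version",
    "hardware_config",
    "firmware_image",
    "run_count",
    "metric_definitions_version",
)

def comparability_violations(report_a: dict, report_b: dict) -> list:
    """Return the checklist fields that differ; empty means comparable."""
    return [k for k in COMPARABILITY_KEYS
            if report_a.get(k) != report_b.get(k)]

baseline = {
    "workload_version": "1.2.0",
    "hardware_config": "rev-c",
    "firmware_image": "fw-2025.05",
    "run_count": 10,
    "metric_definitions_version": "3",
}
# Same setup except for a newer firmware image: not directly comparable.
candidate = dict(baseline, firmware_image="fw-2025.06")
```

Making the check executable turns the checklist from a review habit into a gate that comparisons cannot silently bypass.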
Decision Integration
Tie benchmark outcomes directly to release gates and rollout plans. Benchmarking should drive action, not just reporting.
Key Point: Benchmark governance turns performance data into reliable operational decisions.