Ch 8 — Edge Deployment: Phones, Browsers, IoT

Running AI on the device itself — no server, no cloud, no internet required
Why Edge Deployment?
Zero latency, zero cost, works offline — AI that lives on the device
Edge vs Local vs Cloud
Cloud: model runs on provider servers. Latency: 200ms-8s. Cost: per token. Internet: required. Privacy: low.

Local: model runs on your laptop/desktop. Latency: 50-200ms. Cost: hardware. Internet: not needed. Privacy: high.

Edge: model runs on phone/browser/IoT. Latency: 20-100ms. Cost: $0. Internet: not needed. Privacy: maximum. Size limit: 1-3B models (2-4GB RAM).
The Edge Advantage
Instant response: No network hop. The model is already on the device, loaded in memory. Response starts in milliseconds.

Works offline: Airplane mode, subway, rural areas, disaster zones. The AI works wherever the device works.

Maximum privacy: Data never leaves the device. Not even to a local server. The computation happens in the same process as your app.

Zero marginal cost: No server to maintain. No API to pay for. The user’s device does the work.
Key insight: Edge AI is the ultimate form of local AI. Instead of “your server,” it’s “the user’s device.” The trade-off: you’re limited to 1–3B models that fit in mobile RAM. But for the right tasks, these tiny models are more than enough.
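The "fits in mobile RAM" limit can be sanity-checked with simple arithmetic: weight memory is roughly parameter count × bits per weight ÷ 8, plus some fixed overhead for activations, KV cache, and the runtime. A rough sketch (the 0.5GB overhead is an illustrative assumption, not a measurement):

```python
def model_ram_gb(params_billions: float, bits_per_weight: int = 4,
                 overhead_gb: float = 0.5) -> float:
    """Rough RAM estimate: quantized weights plus a fixed
    overhead for activations, KV cache, and the runtime."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

print(model_ram_gb(1))  # 1.0 GB: fits easily in an app's ~3-4GB budget
print(model_ram_gb(3))  # 2.0 GB: still fits
print(model_ram_gb(7))  # 4.0 GB: at or past the limit on most phones
```

This is why the practical ceiling sits around 3B parameters: at 4-bit quantization, a 7B model alone consumes the entire RAM budget a mobile OS grants to one app.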
ExecuTorch: PyTorch to Mobile
Meta’s framework for deploying models to phones and edge devices
What ExecuTorch Does
PyTorch Model (.pt)
  ↓ torch.export()
Exported Model
  ↓ ExecuTorch compiler
Optimized .pte file
  ↓ deploy to device
Runs on phone/IoT

Runtime size: ~50KB base.
Platforms: iOS, Android, Linux, MCU.
Backends: XNNPack (CPU), CoreML (Apple), QNN (Qualcomm), Vulkan (GPU).
Models: Llama 3.2 1B/3B officially supported by Meta.
Llama 3.2 on Mobile
Llama 3.2 1B on iPhone 16 Pro: ~50 tok/s (ANE + GPU), ~1.5 GB RAM, ~2 second load.

Llama 3.2 3B on iPhone 16 Pro: ~25 tok/s, ~2.5 GB RAM, ~4 second load.

Llama 3.2 1B on Samsung Galaxy S24: ~30 tok/s (Qualcomm QNN), ~1.5 GB RAM.
Key insight: ExecuTorch is the official path from PyTorch to mobile. Meta uses it in production across their apps (Instagram, WhatsApp). The 50KB runtime means it adds almost nothing to your app size. The model file itself (1–2GB) is the main cost.
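The benchmark numbers above translate directly into user-perceived wait time: tokens to generate divided by sustained decode rate. A quick sketch, using the rates quoted above:

```python
def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to stream `tokens` at a sustained decode rate."""
    return tokens / tok_per_s

# A 100-token smart reply with Llama 3.2 1B on iPhone 16 Pro (~50 tok/s):
print(generation_seconds(100, 50))  # 2.0 seconds
# The same reply with the 3B model (~25 tok/s):
print(generation_seconds(100, 25))  # 4.0 seconds
```

This is why short outputs feel instant on-device while long ones do not: the decode rate is fixed by the hardware, so latency scales linearly with response length.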
Mobile Deployment Patterns
How to actually ship an LLM inside a mobile app
Deployment Options
Option 1: Bundle with app. Model included in the app binary. Pro: works immediately, no download. Con: app size balloons to 1-2GB.

Option 2: Download on first run. App downloads the model after install. Pro: small initial app size. Con: needs internet for first setup.

Option 3: On-demand download. Download the model when the feature is first used. Pro: most users never download it. Con: delay when first using the AI feature.

Option 2 is most common. Apple and Google both support "on-demand resources" for large assets.
Platform-Specific Acceleration
Apple (iPhone/iPad): Apple Neural Engine (ANE), 15.8 TOPS. CoreML backend via ExecuTorch. Best for 1-3B models.

Qualcomm (Android flagship): Hexagon NPU, up to 45 TOPS. QNN backend via ExecuTorch. Best for 1-3B models.

MediaTek (Android mid-range): APU, up to 36 TOPS. NeuroPilot SDK. Best for 1B models.

Samsung (Exynos): NPU, up to 34.7 TOPS. Samsung Neural SDK. Best for 1-3B models.
Key insight: Modern phones have dedicated AI hardware (NPUs) that are specifically designed for neural network inference. A 2024+ flagship phone can run a 1B model at 30–50 tok/s — fast enough for interactive use. The hardware is already there; the software ecosystem is catching up.
WebLLM: AI in the Browser
Run models directly in Chrome/Edge using WebGPU — no installation needed
How WebLLM Works
WebLLM (by MLC AI): Browser → WebGPU API → GPU
  ↓ Compiled model (Wasm + WebGPU shaders)
  ↓ Inference runs entirely in the browser

No server. No installation. No data leaves the browser tab.

Supported browsers:
✓ Chrome 113+ (WebGPU)
✓ Edge 113+
✓ Safari 18+ (partial)
✗ Firefox (WebGPU in development)
Example: WebLLM in 10 Lines
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine(
  "Llama-3.2-1B-Instruct-q4f16_1-MLC"
);

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "What is quantization?" }]
});

console.log(reply.choices[0].message.content);

// Model downloaded to browser cache on first use (~800MB for 1B Q4)
Key insight: WebLLM is remarkable: a user visits your website, the model downloads to their browser cache, and all inference happens on their GPU. No server costs, no privacy concerns, no installation. The catch: first load downloads 0.5–2GB, and performance depends on the user’s GPU.
ONNX Runtime: Cross-Platform
One model format, runs everywhere — Windows, Mac, Linux, mobile, web
What ONNX Runtime Does
ONNX (Open Neural Network Exchange) is a standard format for ML models. ONNX Runtime is Microsoft’s inference engine that runs ONNX models on any platform.

Unlike GGUF (llama.cpp-specific) or .pte (ExecuTorch-specific), ONNX is a universal format supported by virtually every ML framework and hardware vendor.
Execution Providers
ONNX Runtime backends:
CPU: default, works everywhere
CUDA: NVIDIA GPUs
DirectML: Windows GPU (any vendor)
CoreML: Apple devices
QNN: Qualcomm NPUs
WebGPU: browser-based
TensorRT: NVIDIA optimized
OpenVINO: Intel hardware
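In ONNX Runtime you pass a priority-ordered list of execution providers when creating a session, and the runtime falls back down the list until one works. A small sketch of building that order from whatever the installed build supports; the preference ranking here is an illustrative assumption, not an official recommendation:

```python
# Preferred execution providers, fastest first.
# CPU is the universal fallback at the end.
PREFERENCE = [
    "TensorrtExecutionProvider",  # NVIDIA, optimized
    "CUDAExecutionProvider",      # NVIDIA GPUs
    "CoreMLExecutionProvider",    # Apple devices
    "DmlExecutionProvider",       # Windows DirectML
    "QNNExecutionProvider",       # Qualcomm NPUs
    "CPUExecutionProvider",       # works everywhere
]

def provider_order(available: list[str]) -> list[str]:
    """Intersect the preference list with what this build
    supports, keeping preference order."""
    return [p for p in PREFERENCE if p in available]

# Typical use (requires onnxruntime installed):
# import onnxruntime as ort
# session = ort.InferenceSession(
#     "model.onnx",
#     providers=provider_order(ort.get_available_providers()),
# )
```

The same application binary then automatically uses CUDA on a workstation, CoreML on a Mac, and plain CPU anywhere else.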
ONNX Runtime GenAI
pip install onnxruntime-genai

import onnxruntime_genai as og

model = og.Model("phi-4-mini-onnx")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(
    max_length=200,
    temperature=0.7
)

prompt = "Explain quantization briefly."
tokens = tokenizer.encode(prompt)
params.input_ids = tokens

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

output = tokenizer.decode(
    generator.get_sequence(0)
)
Key insight: ONNX Runtime is the “write once, run anywhere” approach. Convert your model to ONNX once, deploy to Windows (DirectML), Mac (CoreML), Android (QNN), web (WebGPU), or server (CUDA). Microsoft uses it for Phi-4 deployment across all their platforms.
Real-World Edge AI Use Cases
What people are actually building with on-device models
Production Use Cases
Offline Assistants: smart reply suggestions in messaging; email draft assistance without internet; meeting note summarization on-device.

On-Device Translation: real-time translation without cloud; privacy for sensitive conversations; works in areas with no connectivity.

Smart Keyboards: next-word prediction (already in iOS/Android); grammar correction; tone adjustment suggestions.

Code Completion: IDE autocomplete running locally; no code sent to external servers; works offline (airplane, VPN issues).
Emerging Use Cases
Smart Home / IoT: voice commands processed locally; no "always listening" cloud service; works during internet outages.

Automotive: in-car voice assistant (no cell signal); real-time navigation instructions; passenger entertainment.

Healthcare: patient intake form assistance; symptom triage on tablets; clinical note summarization.

Education: personalized tutoring on tablets; offline learning in rural schools; language learning without internet.
Key insight: The common thread: edge AI shines when you need privacy (healthcare, messaging), offline capability (rural, automotive, IoT), or instant response (keyboards, autocomplete). If your use case has any of these requirements, edge deployment is worth the constraints.
Edge Constraints: The Hard Limits
Memory, battery, thermal throttling — the realities of on-device AI
The Constraints
Memory: iPhone 16 Pro has 8GB total RAM, and your app gets ~3-4GB max. Model + context must fit in that.

Battery: LLM inference is power-hungry. A 1B model draws ~2-5W during generation; continuous use causes noticeable drain, and background inference is not practical.

Thermal throttling: sustained inference heats the device. After ~2-3 minutes the phone throttles the CPU/GPU, and speed drops 20-40% when hot.

Storage: model files are 0.5-2GB each. Users may not want to download them, and app store size limits apply.
Practical Limits
Model size: Realistically 1B–3B on phones. Anything larger won’t fit in available RAM alongside the OS and other apps.

Context window: Keep it short (2K–4K tokens). Longer contexts consume more RAM and slow down generation.

Generation length: Short responses (50–200 tokens) work well. Long generation (1000+ tokens) causes thermal throttling and battery drain.

Frequency: Occasional use (user-triggered) is fine. Continuous background inference is not practical on mobile.
Key insight: Edge AI is not “laptop AI on a phone.” It’s a fundamentally different environment with hard constraints. Design for short, focused tasks: classify this, extract that, suggest a reply. Don’t try to run a full chatbot conversation on a phone — that’s what local (laptop) or cloud is for.
Edge Deployment Decision Map
Which framework for which platform and use case
Framework Selection
Target: iOS / Android native app → ExecuTorch (Meta). Models: Llama 3.2 1B/3B. Best performance, official support.

Target: Web application (browser) → WebLLM (MLC AI). Models: Llama 3.2 1B, Gemma 2B. No installation; WebGPU required.

Target: Cross-platform (all devices) → ONNX Runtime. Models: Phi-4-mini, any ONNX model. One format, many backends.

Target: Desktop app (Electron/Tauri) → llama.cpp (embedded). Models: any GGUF model. Maximum flexibility.
Quick Decision
Building a mobile app? → ExecuTorch + Llama 3.2 1B

Building a web app? → WebLLM + Llama 3.2 1B

Need cross-platform? → ONNX Runtime + Phi-4-mini

Building a desktop app? → Ollama/llama.cpp + any model

Just prototyping? → Ollama on your laptop (Ch 5)
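The quick decision above is literally a lookup table, which is how you might encode it in a project scaffolding script. A minimal sketch (target names are arbitrary labels chosen here, not part of any tool):

```python
# Framework choice by deployment target, summarizing the map above.
FRAMEWORK_BY_TARGET = {
    "mobile":  ("ExecuTorch", "Llama 3.2 1B"),
    "web":     ("WebLLM", "Llama 3.2 1B"),
    "cross":   ("ONNX Runtime", "Phi-4-mini"),
    "desktop": ("llama.cpp", "any GGUF model"),
}

def choose_framework(target: str) -> tuple[str, str]:
    """Return (framework, suggested model) for a deployment target."""
    try:
        return FRAMEWORK_BY_TARGET[target]
    except KeyError:
        raise ValueError(f"unknown target: {target!r}") from None

print(choose_framework("mobile"))  # ('ExecuTorch', 'Llama 3.2 1B')
```

The real decision has more inputs (model size, offline requirements, user hardware), but the first cut really is this mechanical: platform determines framework.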
Key insight: Edge deployment is the frontier of local AI. The tools are maturing fast — ExecuTorch, WebLLM, and ONNX Runtime all reached production quality in 2024–2025. The hardware (NPUs in every phone) is already there. Chapter 9 ties everything together with a decision framework for choosing between edge, local, and cloud.