Ch 8 — Edge Deployment: Phones, Browsers, IoT

Running AI on the device itself — no server, no cloud, no internet required
Why Edge Deployment?
Zero latency, zero cost, works offline — AI that lives on the device
Edge vs Local vs Cloud
Cloud: model runs on provider servers. Latency: 200ms-8s. Cost: per token. Internet: required. Privacy: low.

Local: model runs on your laptop/desktop. Latency: 50-200ms. Cost: hardware. Internet: not needed. Privacy: high.

Edge: model runs on phone/browser/IoT. Latency: 20-100ms. Cost: $0. Internet: not needed. Privacy: maximum. Size limit: 1-3B models (2-4GB RAM).
The Edge Advantage
Instant response: No network hop. The model is already on the device, loaded in memory. Response starts in milliseconds.

Works offline: Airplane mode, subway, rural areas, disaster zones. The AI works wherever the device works.

Maximum privacy: Data never leaves the device. Not even to a local server. The computation happens in the same process as your app.

Zero marginal cost: No server to maintain. No API to pay for. The user’s device does the work.
Key insight: Edge AI is the ultimate form of local AI. Instead of “your server,” it’s “the user’s device.” The trade-off: you’re limited to 1–3B models that fit in mobile RAM. But for the right tasks, these tiny models are more than enough.
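The "fits in mobile RAM" limit can be sanity-checked with simple arithmetic: weight memory is roughly parameter count × bits per weight ÷ 8, plus some fixed overhead for activations, KV cache, and the runtime. A rough sketch (the 0.5GB overhead is an illustrative assumption, not a measurement):

```python
def model_ram_gb(params_billions: float, bits_per_weight: int = 4,
                 overhead_gb: float = 0.5) -> float:
    """Rough RAM estimate: quantized weights plus a fixed
    overhead for activations, KV cache, and the runtime."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

print(model_ram_gb(1))  # 1.0 GB: fits easily in an app's ~3-4GB budget
print(model_ram_gb(3))  # 2.0 GB: still fits
print(model_ram_gb(7))  # 4.0 GB: at or past the limit on most phones
```

This is why the practical ceiling sits around 3B parameters: at 4-bit quantization, a 7B model alone consumes the entire RAM budget a mobile OS grants to one app.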
ExecuTorch: PyTorch to Mobile
Meta’s framework for deploying models to phones and edge devices
What ExecuTorch Does
PyTorch Model (.pt)
  ↓ torch.export()
Exported Model
  ↓ ExecuTorch compiler
Optimized .pte file
  ↓ deploy to device
Runs on phone/IoT

Runtime size: ~50KB base.
Platforms: iOS, Android, Linux, MCU.
Backends: XNNPack (CPU), CoreML (Apple), QNN (Qualcomm), Vulkan (GPU).
Models: Llama 3.2 1B/3B officially supported by Meta.
Llama 3.2 on Mobile
Llama 3.2 1B on iPhone 16 Pro: ~50 tok/s (ANE + GPU), ~1.5 GB RAM, ~2 second load.

Llama 3.2 3B on iPhone 16 Pro: ~25 tok/s, ~2.5 GB RAM, ~4 second load.

Llama 3.2 1B on Samsung Galaxy S24: ~30 tok/s (Qualcomm QNN), ~1.5 GB RAM.
Key insight: ExecuTorch is the official path from PyTorch to mobile. Meta uses it in production across their apps (Instagram, WhatsApp). The 50KB runtime means it adds almost nothing to your app size. The model file itself (1–2GB) is the main cost.
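The benchmark numbers above translate directly into user-perceived wait time: tokens to generate divided by sustained decode rate. A quick sketch, using the rates quoted above:

```python
def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to stream `tokens` at a sustained decode rate."""
    return tokens / tok_per_s

# A 100-token smart reply with Llama 3.2 1B on iPhone 16 Pro (~50 tok/s):
print(generation_seconds(100, 50))  # 2.0 seconds
# The same reply with the 3B model (~25 tok/s):
print(generation_seconds(100, 25))  # 4.0 seconds
```

This is why short outputs feel instant on-device while long ones do not: the decode rate is fixed by the hardware, so latency scales linearly with response length.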
Mobile Deployment Patterns
How to actually ship an LLM inside a mobile app
Deployment Options
Option 1: Bundle with app. Model included in the app binary. Pro: works immediately, no download. Con: app size balloons to 1-2GB.

Option 2: Download on first run. App downloads the model after install. Pro: small initial app size. Con: needs internet for first setup.

Option 3: On-demand download. Download the model when the feature is first used. Pro: most users never download it. Con: delay when first using the AI feature.

Option 2 is most common. Apple and Google both support "on-demand resources" for large assets.
Platform-Specific Acceleration
Apple (iPhone/iPad): Apple Neural Engine (ANE), 15.8 TOPS. CoreML backend via ExecuTorch. Best for 1-3B models.

Qualcomm (Android flagship): Hexagon NPU, up to 45 TOPS. QNN backend via ExecuTorch. Best for 1-3B models.

MediaTek (Android mid-range): APU, up to 36 TOPS. NeuroPilot SDK. Best for 1B models.

Samsung (Exynos): NPU, up to 34.7 TOPS. Samsung Neural SDK. Best for 1-3B models.
Key insight: Modern phones have dedicated AI hardware (NPUs) that are specifically designed for neural network inference. A 2024+ flagship phone can run a 1B model at 30–50 tok/s — fast enough for interactive use. The hardware is already there; the software ecosystem is catching up.
WebLLM: AI in the Browser
Run models directly in Chrome/Edge using WebGPU — no installation needed
How WebLLM Works
WebLLM (by MLC AI): Browser → WebGPU API → GPU
  ↓ Compiled model (Wasm + WebGPU shaders)
  ↓ Inference runs entirely in the browser

No server. No installation. No data leaves the browser tab.

Supported browsers:
✓ Chrome 113+ (WebGPU)
✓ Edge 113+
✓ Safari 18+ (partial)
✗ Firefox (WebGPU in development)
Example: WebLLM in 10 Lines
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine(
  "Llama-3.2-1B-Instruct-q4f16_1-MLC"
);

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "What is quantization?" }]
});

console.log(reply.choices[0].message.content);

// Model downloaded to browser cache on first use (~800MB for 1B Q4)
Key insight: WebLLM is remarkable: a user visits your website, the model downloads to their browser cache, and all inference happens on their GPU. No server costs, no privacy concerns, no installation. The catch: first load downloads 0.5–2GB, and performance depends on the user’s GPU.
ONNX Runtime: Cross-Platform
One model format, runs everywhere — Windows, Mac, Linux, mobile, web
What ONNX Runtime Does
ONNX (Open Neural Network Exchange) is a standard format for ML models. ONNX Runtime is Microsoft’s inference engine that runs ONNX models on any platform.

Unlike GGUF (llama.cpp-specific) or .pte (ExecuTorch-specific), ONNX is a universal format supported by virtually every ML framework and hardware vendor.
Execution Providers
ONNX Runtime backends:
CPU: default, works everywhere
CUDA: NVIDIA GPUs
DirectML: Windows GPU (any vendor)
CoreML: Apple devices
QNN: Qualcomm NPUs
WebGPU: browser-based
TensorRT: NVIDIA optimized
OpenVINO: Intel hardware
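In ONNX Runtime you pass a priority-ordered list of execution providers when creating a session, and the runtime falls back down the list until one works. A small sketch of building that order from whatever the installed build supports; the preference ranking here is an illustrative assumption, not an official recommendation:

```python
# Preferred execution providers, fastest first.
# CPU is the universal fallback at the end.
PREFERENCE = [
    "TensorrtExecutionProvider",  # NVIDIA, optimized
    "CUDAExecutionProvider",      # NVIDIA GPUs
    "CoreMLExecutionProvider",    # Apple devices
    "DmlExecutionProvider",       # Windows DirectML
    "QNNExecutionProvider",       # Qualcomm NPUs
    "CPUExecutionProvider",       # works everywhere
]

def provider_order(available: list[str]) -> list[str]:
    """Intersect the preference list with what this build
    supports, keeping preference order."""
    return [p for p in PREFERENCE if p in available]

# Typical use (requires onnxruntime installed):
# import onnxruntime as ort
# session = ort.InferenceSession(
#     "model.onnx",
#     providers=provider_order(ort.get_available_providers()),
# )
```

The same application binary then automatically uses CUDA on a workstation, CoreML on a Mac, and plain CPU anywhere else.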
ONNX Runtime GenAI
pip install onnxruntime-genai

import onnxruntime_genai as og

model = og.Model("phi-4-mini-onnx")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(
    max_length=200,
    temperature=0.7
)

prompt = "Explain quantization briefly."
tokens = tokenizer.encode(prompt)
params.input_ids = tokens

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

output = tokenizer.decode(
    generator.get_sequence(0)
)
Key insight: ONNX Runtime is the “write once, run anywhere” approach. Convert your model to ONNX once, deploy to Windows (DirectML), Mac (CoreML), Android (QNN), web (WebGPU), or server (CUDA). Microsoft uses it for Phi-4 deployment across all their platforms.
Real-World Edge AI Use Cases
What people are actually building with on-device models
Production Use Cases
Offline Assistants: smart reply suggestions in messaging; email draft assistance without internet; meeting note summarization on-device.

On-Device Translation: real-time translation without cloud; privacy for sensitive conversations; works in areas with no connectivity.

Smart Keyboards: next-word prediction (already in iOS/Android); grammar correction; tone adjustment suggestions.

Code Completion: IDE autocomplete running locally; no code sent to external servers; works offline (airplane, VPN issues).
Emerging Use Cases
Smart Home / IoT: voice commands processed locally; no "always listening" cloud service; works during internet outages.

Automotive: in-car voice assistant (no cell signal); real-time navigation instructions; passenger entertainment.

Healthcare: patient intake form assistance; symptom triage on tablets; clinical note summarization.

Education: personalized tutoring on tablets; offline learning in rural schools; language learning without internet.
Key insight: The common thread: edge AI shines when you need privacy (healthcare, messaging), offline capability (rural, automotive, IoT), or instant response (keyboards, autocomplete). If your use case has any of these requirements, edge deployment is worth the constraints.
Edge Constraints: The Hard Limits
Memory, battery, thermal throttling — the realities of on-device AI
The Constraints
Memory: iPhone 16 Pro has 8GB total RAM, and your app gets ~3-4GB max. Model + context must fit in that.

Battery: LLM inference is power-hungry. A 1B model draws ~2-5W during generation; continuous use causes noticeable drain, and background inference is not practical.

Thermal throttling: sustained inference heats the device. After ~2-3 minutes the phone throttles the CPU/GPU, and speed drops 20-40% when hot.

Storage: model files are 0.5-2GB each. Users may not want to download them, and app store size limits apply.
Practical Limits
Model size: Realistically 1B–3B on phones. Anything larger won’t fit in available RAM alongside the OS and other apps.

Context window: Keep it short (2K–4K tokens). Longer contexts consume more RAM and slow down generation.

Generation length: Short responses (50–200 tokens) work well. Long generation (1000+ tokens) causes thermal throttling and battery drain.

Frequency: Occasional use (user-triggered) is fine. Continuous background inference is not practical on mobile.
Key insight: Edge AI is not “laptop AI on a phone.” It’s a fundamentally different environment with hard constraints. Design for short, focused tasks: classify this, extract that, suggest a reply. Don’t try to run a full chatbot conversation on a phone — that’s what local (laptop) or cloud is for.
Edge Deployment Decision Map
Which framework for which platform and use case
Framework Selection
Target: iOS / Android native app → ExecuTorch (Meta). Models: Llama 3.2 1B/3B. Best performance, official support.

Target: Web application (browser) → WebLLM (MLC AI). Models: Llama 3.2 1B, Gemma 2B. No installation; WebGPU required.

Target: Cross-platform (all devices) → ONNX Runtime. Models: Phi-4-mini, any ONNX model. One format, many backends.

Target: Desktop app (Electron/Tauri) → llama.cpp (embedded). Models: any GGUF model. Maximum flexibility.
Quick Decision
Building a mobile app? → ExecuTorch + Llama 3.2 1B

Building a web app? → WebLLM + Llama 3.2 1B

Need cross-platform? → ONNX Runtime + Phi-4-mini

Building a desktop app? → Ollama/llama.cpp + any model

Just prototyping? → Ollama on your laptop (Ch 5)
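The quick decision above is literally a lookup table, which is how you might encode it in a project scaffolding script. A minimal sketch (target names are arbitrary labels chosen here, not part of any tool):

```python
# Framework choice by deployment target, summarizing the map above.
FRAMEWORK_BY_TARGET = {
    "mobile":  ("ExecuTorch", "Llama 3.2 1B"),
    "web":     ("WebLLM", "Llama 3.2 1B"),
    "cross":   ("ONNX Runtime", "Phi-4-mini"),
    "desktop": ("llama.cpp", "any GGUF model"),
}

def choose_framework(target: str) -> tuple[str, str]:
    """Return (framework, suggested model) for a deployment target."""
    try:
        return FRAMEWORK_BY_TARGET[target]
    except KeyError:
        raise ValueError(f"unknown target: {target!r}") from None

print(choose_framework("mobile"))  # ('ExecuTorch', 'Llama 3.2 1B')
```

The real decision has more inputs (model size, offline requirements, user hardware), but the first cut really is this mechanical: platform determines framework.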
Key insight: Edge deployment is the frontier of local AI. The tools are maturing fast — ExecuTorch, WebLLM, and ONNX Runtime all reached production quality in 2024–2025. The hardware (NPUs in every phone) is already there. Chapter 9 ties everything together with a decision framework for choosing between edge, local, and cloud.