The Trend
LLMs are moving from cloud to device. Apple Intelligence runs a ~3B-parameter model on iPhone. Google runs Gemini Nano on Pixel. Qualcomm has demonstrated 7B models on Snapdragon. The recipe: a small model (1-3B) + aggressive quantization (4-bit) + hardware-specific optimization. A 3B model at 4-bit needs only about 1.5 GB for its weights — small enough to fit comfortably in phone RAM, with room left for the KV cache and runtime overhead. Latency improves (no network round-trip) and privacy is preserved (data never leaves the device).
Key insight: The combination of overtraining (Ch 5), distillation, and quantization means a 3B model in 2025 can match a 13B model from 2023. Apple's on-device model handles autocomplete, summarization, and rewriting. The trade-off: on-device models are less capable than cloud models at complex reasoning, but for these common tasks they're fast, private, and free (no per-token API costs).
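The memory arithmetic above is worth making explicit. A minimal sketch (the helper name is my own, not from any framework): weight footprint in GB is roughly parameter count (in billions) times bits per weight, divided by 8.

```python
def model_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB: params x bits / 8.
    Ignores KV cache, activations, and runtime overhead, so the
    real memory budget should leave headroom above this number."""
    return params_billion * bits_per_weight / 8

# 3B at 4-bit: weights fit easily in a phone RAM budget
print(model_size_gb(3, 4))   # 1.5
# Same model at 16-bit would be 6 GB -- this is why quantization matters
print(model_size_gb(3, 16))  # 6.0
```

The same formula explains the 7B and 13B entries in the table below: halving the bit width halves the footprint, but no amount of quantization squeezes 13B parameters under a 4 GB phone budget at usable precision.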
On-Device Stack
# On-device LLM requirements:
# Memory: < 4 GB (phone RAM budget)
# Speed: > 10 tokens/sec (usable)
# Power: < 5W (battery-friendly)

# How to fit:
# 3B model × 4-bit = 1.5 GB ✓
# 7B model × 4-bit = 3.5 GB (tight)
# 13B model × 4-bit = 6.5 GB ✗ (too big)

# Frameworks:
# llama.cpp: CPU, cross-platform
# Apple MLX: Apple Silicon optimized
# MLC-LLM: mobile (Android/iOS)
# Ollama: desktop, easy to use

# Performance (Llama 3.2 3B, 4-bit):
# iPhone 15 Pro: ~15 tokens/sec
# M3 MacBook: ~40 tokens/sec
# RTX 4090: ~100 tokens/sec
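A back-of-envelope check ties these numbers together: does a quantized model fit the RAM budget, and how long does a reply take at a given decode speed? A minimal sketch using the thresholds and throughput figures listed above (function names are illustrative):

```python
def fits_on_device(params_billion: float, bits: int,
                   ram_budget_gb: float = 4.0) -> bool:
    """True if the quantized weights alone fit the RAM budget."""
    return params_billion * bits / 8 <= ram_budget_gb

def response_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to decode n_tokens at a steady rate."""
    return n_tokens / tokens_per_sec

print(fits_on_device(3, 4))    # True  (1.5 GB)
print(fits_on_device(13, 4))   # False (6.5 GB)

# A 256-token reply on iPhone 15 Pro at ~15 tokens/sec takes ~17 s,
# versus ~2.6 s on an RTX 4090 at ~100 tokens/sec.
print(round(response_seconds(256, 15), 1))
print(round(response_seconds(256, 100), 1))
```

This is why the ">10 tokens/sec" threshold matters: below it, even short replies take long enough to feel broken, regardless of how well the model fits in memory.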