Models
  Tiny (1-3B) — Phone/Edge:
    Llama 3.2 1B, 3B
    Gemma 2 2B
  Small (4-9B) — Laptop:
    Qwen 3 4B, 8B ← Best quality/size
    Gemma 3 4B
    Phi-4-mini 3.8B ← MIT license
  Medium (14-24B) — Desktop/Server:
    Qwen 3 14B
    Mistral Small 3.1 (24B)
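The tiers above map roughly to available RAM. A minimal sketch of that mapping, assuming the common rule of thumb that a Q4-quantized model needs roughly 0.6-0.7 GB per billion parameters plus headroom for context (the thresholds here are illustrative, not from the source):

```python
# Pick a model tier from free RAM. Thresholds are assumptions based on
# the rule of thumb that a Q4 model uses ~0.6-0.7 GB per billion params.

def pick_tier(ram_gb: float) -> str:
    """Return a model tier for the given free RAM (illustrative cutoffs)."""
    if ram_gb < 4:
        return "tiny"    # 1-3B: phone/edge class
    if ram_gb < 12:
        return "small"   # 4-9B: laptop class
    return "medium"      # 14-24B: desktop/server class

print(pick_tier(8))  # a typical laptop with 8 GB free lands in "small"
```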
Quantization
  Default: Q4_K_M (best balance)
  Quality: Q5_K_M (if RAM allows)
  Max: Q8_0 (near-lossless)
  Format: GGUF (universal)
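The trade-off between these quantization levels is mostly a size calculation. A quick sketch, using approximate average bits-per-weight figures for each scheme (the exact numbers vary because K-quants mix block formats, and real GGUF files carry some metadata overhead):

```python
# Estimate the size of a quantized model from parameter count.
# Bits-per-weight values are approximate averages (assumption).

BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,  # default: best balance
    "Q5_K_M": 5.69,  # higher quality, if RAM allows
    "Q8_0":   8.50,  # near-lossless
    "F16":    16.0,  # unquantized half precision, for comparison
}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Approximate file/RAM size in GB for a given quantization."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in ("Q4_K_M", "Q5_K_M", "Q8_0", "F16"):
    print(f"8B @ {quant}: ~{model_size_gb(8, quant):.1f} GB")
```

This is why Q4_K_M is the default: an 8B model drops from ~16 GB at F16 to under 5 GB, fitting comfortably on a laptop.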
Tools
  Daily driver: Ollama
  Power user: llama.cpp
  Mobile: ExecuTorch
  Browser: WebLLM
  Cross-platform: ONNX Runtime
  Vector store: ChromaDB
  Framework: LangChain / LlamaIndex
  Fine-tuning: QLoRA + PEFT
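Calling the daily driver from code is a one-request affair. A minimal sketch against Ollama's REST API using only the standard library, assuming Ollama is serving on its default port 11434 and a model (here "llama3.2") has been pulled:

```python
# Sketch: App → Ollama API → Model, via Ollama's /api/generate endpoint.
# Assumes a local Ollama server on the default port with "llama3.2" pulled.
import json
import urllib.request

def build_payload(prompt: str, model: str = "llama3.2") -> dict:
    """Request body for /api/generate; stream=False returns one JSON object."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3.2") -> str:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Why is the sky blue?")  # requires a running Ollama server
```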
Architecture
  Simple: App → Ollama API → Model
  RAG: App → ChromaDB → Ollama
  Hybrid: Router → Local / Cloud
  Edge: ExecuTorch / WebLLM
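The hybrid pattern hinges on the router's decision rule. A toy sketch, assuming two simple heuristics (privacy and prompt length); real routers often use a classifier or cost model instead:

```python
# Sketch of hybrid routing: keep private or easy prompts local,
# escalate long/complex ones to a cloud model. Heuristics are assumptions.

def route(prompt: str, contains_private_data: bool = False) -> str:
    """Return which backend should handle the prompt."""
    if contains_private_data:
        return "local"             # sensitive text never leaves the machine
    if len(prompt.split()) > 500:  # long-context work: escalate
        return "cloud"
    return "local"                 # default to the cheap local path

print(route("Summarize this meeting note"))  # short + public → local
```

The key design choice is that the router fails safe: privacy always wins over capability, and the local model is the default rather than the exception.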
Key insight: You now have a complete toolkit. The specific model names and version numbers will change — new models drop monthly. But the concepts (quantization, distillation, GGUF, Ollama, hybrid routing) are stable. You have the framework to evaluate and adopt whatever comes next.