Small Models & Local AI
Quantization, distillation, Ollama, edge deployment — run AI on your own hardware
Co-Created by Kiran Shirol and Claude
Topics
Quantization
Distillation
Ollama
Edge Deploy
Local Apps
10 chapters · 5 sections
Section 1
Foundation — Why Small?
The cost, latency, and privacy case for running models locally.
1. Why Small Models Matter
95% quality at 5% cost — when 3B parameters beats 70B.
2. The Small Model Landscape
Llama 3.2, Gemma 3, Phi-4, Qwen 3.5, Mistral Small — benchmarks and task matching.
Section 2
Core Techniques — Making Models Smaller
Quantization and distillation techniques to shrink without breaking.
3. Quantization: Shrinking Without Breaking
FP32 → INT4, GGUF format, Q4_K_M vs Q5_K_M, and RAM requirements.
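A minimal numpy sketch of what INT4 quantization does to a weight tensor, plus the back-of-envelope RAM estimate. The group size of 32 and the roughly 4.85 bits per weight for Q4_K_M are approximations, not exact GGUF internals.

    import numpy as np

    def quantize_int4(weights, group_size=32):
        # Symmetric 4-bit quantization: each group of weights shares one
        # scale factor; values are rounded to integers in [-8, 7].
        w = weights.reshape(-1, group_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return (q * scale).reshape(-1)

    w = np.random.randn(1024).astype(np.float32)
    q, s = quantize_int4(w)
    print("max abs error:", np.abs(dequantize(q, s) - w).max())

    # Back-of-envelope RAM for a 7B model at ~4.85 bits/weight (Q4_K_M):
    print("weights alone:", 7e9 * 4.85 / 8 / 1e9, "GB, plus KV cache and overhead")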
4. Distillation & Pruning
Teacher-student training, soft labels, structured pruning, and when to use each.
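The heart of teacher-student training is a soft-label loss. A minimal PyTorch sketch, with random logits standing in for real teacher and student models; the temperature of 2.0 and the 0.5 mixing weight are illustrative defaults, not the chapter's prescribed values.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft labels: match the teacher's temperature-smoothed distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # rescale so gradients stay comparable across temperatures
        # Hard labels: ordinary cross-entropy against the ground truth.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    student_logits = torch.randn(4, 32000)  # batch of 4, 32k-token vocab
    teacher_logits = torch.randn(4, 32000)
    labels = torch.randint(0, 32000, (4,))
    print(distillation_loss(student_logits, teacher_logits, labels))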
Section 3
Hands-On — Running Models Locally
Ollama, llama.cpp, and GGUF in practice.
5. Ollama: Your Local AI Runtime
Install, pull, run — your first local AI in 5 minutes.
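Once Ollama is installed, the whole loop fits in a few lines of the official Python client (pip install ollama); this assumes the Ollama server is already running locally and uses llama3.2 as an example model name.

    import ollama

    ollama.pull("llama3.2")  # fetch weights if they aren't already local
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    )
    print(response["message"]["content"])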
6. llama.cpp & GGUF Deep Dive
The C++ engine under the hood, converting models, and performance tuning.
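On the Python side, the llama-cpp-python bindings expose the main performance knobs directly. A sketch, assuming a hypothetical Q4_K_M file at ./model-q4_k_m.gguf produced by llama.cpp's convert and quantize tools:

    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(
        model_path="./model-q4_k_m.gguf",  # hypothetical local GGUF file
        n_ctx=4096,       # context window; larger means more KV-cache RAM
        n_threads=8,      # CPU threads used for generation
        n_gpu_layers=-1,  # offload all layers if a GPU backend is compiled in
    )
    out = llm("Q: What is GGUF?\nA:", max_tokens=64, stop=["\n"])
    print(out["choices"][0]["text"])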
Section 4
Real-World Applications
Building local apps and deploying to phones, browsers, and IoT.
7. Building Local AI Applications
Ollama + Python/JS, local RAG with ChromaDB, and document Q&A.
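A minimal local RAG loop: ChromaDB retrieves the closest chunk (using its built-in default embedder), and Ollama answers from it. The two documents and the model name are placeholders.

    import chromadb
    import ollama

    client = chromadb.Client()  # in-memory; PersistentClient(path=...) keeps data
    docs = client.create_collection("docs")
    docs.add(
        ids=["1", "2"],
        documents=[
            "GGUF is the file format used by llama.cpp.",
            "Ollama serves local models over an HTTP API on port 11434.",
        ],
    )

    question = "What port does Ollama use?"
    hits = docs.query(query_texts=[question], n_results=1)
    context = hits["documents"][0][0]

    answer = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user",
                   "content": f"Context: {context}\n\nQuestion: {question}"}],
    )
    print(answer["message"]["content"])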
8. Edge Deployment: Phones, Browsers, IoT
ExecuTorch, WebLLM, ONNX Runtime — Llama on iPhones and in browsers.
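Of the three runtimes named, ONNX Runtime has the most stable Python surface (ExecuTorch and WebLLM are driven from Swift/Kotlin and JavaScript). A minimal inference sketch, assuming a hypothetical exported model at ./model.onnx whose first input takes token IDs:

    import numpy as np
    import onnxruntime as ort  # pip install onnxruntime

    # Execution providers are how ONNX Runtime targets edge hardware:
    # CoreML on iOS, NNAPI on Android, WebGPU in browsers via onnxruntime-web.
    sess = ort.InferenceSession("./model.onnx", providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    tokens = np.array([[1, 15043, 3186]], dtype=np.int64)  # placeholder token IDs
    outputs = sess.run(None, {input_name: tokens})
    print(outputs[0].shape)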
Section 5
Strategy — Choosing Wisely
Local vs cloud decisions and the future of small models.
9. Local vs Cloud: The Decision Framework
Cost break-even, latency, privacy, hybrid architectures, and routing.
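The break-even calculation is plain amortization; every figure below is a made-up placeholder to swap for your own pricing and workload.

    cloud_price_per_mtok = 0.50   # $ per million tokens, blended in/out (assumed)
    tokens_per_day = 20_000_000   # daily workload (assumed)
    hardware_cost = 2_000.0       # one-time cost of the local machine (assumed)
    power_per_day = 1.20          # $ of electricity per day (assumed)

    cloud_per_day = tokens_per_day / 1e6 * cloud_price_per_mtok
    breakeven_days = hardware_cost / (cloud_per_day - power_per_day)
    print(f"cloud: ${cloud_per_day:.2f}/day; local pays off in {breakeven_days:.0f} days")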
10. The Future of Small Models
Speculative decoding, on-device fine-tuning, NPU trends, and your toolkit.
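Speculative decoding in one toy loop: a cheap draft model proposes k tokens, the large target model verifies them, and the longest agreed prefix is accepted. Both model functions here are hypothetical stand-ins; real implementations verify with a single batched forward pass and accept probabilistically rather than greedily.

    def speculative_step(prefix, draft_next, target_next, k=4):
        # Draft model proposes k tokens autoregressively (cheap calls).
        proposed, ctx = [], list(prefix)
        for _ in range(k):
            tok = draft_next(ctx)
            proposed.append(tok)
            ctx.append(tok)
        # Greedy acceptance: keep tokens while the target model agrees.
        accepted, ctx = [], list(prefix)
        for tok in proposed:
            best = target_next(ctx)
            if best != tok:
                accepted.append(best)  # target's correction ends the step
                break
            accepted.append(tok)
            ctx.append(tok)
        return accepted  # up to k tokens per target pass instead of one

    # Toy stand-ins that just count upward, so draft and target always agree.
    draft_next = target_next = lambda ctx: ctx[-1] + 1
    print(speculative_step([0], draft_next, target_next))  # -> [1, 2, 3, 4]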
Explore Related Courses
How LLMs Work: Transformers & Attention
Fine-Tuning: Adapting Models to Your Data
AI Infrastructure: GPUs, Serving & MLOps