Small Models & Local AI
Quantization, distillation, Ollama, edge deployment — run AI on your own hardware
Co-Created by Kiran Shirol and Claude
Topics
Quantization
Distillation
Ollama
Edge Deploy
Local Apps
10 chapters · 5 sections
Section 1
Foundation — Why Small?
The cost, latency, and privacy case for running models locally.
1. Why Small Models Matter
95% quality at 5% cost — when 3B parameters beats 70B.
2. The Small Model Landscape
Llama 3.2, Gemma 3, Phi-4, Qwen 3.5, Mistral Small — benchmarks and task matching.
Section 2
Core Techniques — Making Models Smaller
Quantization and distillation techniques to shrink without breaking.
3. Quantization: Shrinking Without Breaking
FP32 → INT4, GGUF format, Q4_K_M vs Q5_K_M, and RAM requirements.
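A minimal numpy sketch of what INT4 quantization does to a weight tensor, plus the back-of-envelope RAM estimate. The group size of 32 and the roughly 4.85 bits per weight for Q4_K_M are approximations, not exact GGUF internals.

    import numpy as np

    def quantize_int4(weights, group_size=32):
        # Symmetric 4-bit quantization: each group of weights shares one
        # scale factor; values are rounded to integers in [-8, 7].
        w = weights.reshape(-1, group_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return (q * scale).reshape(-1)

    w = np.random.randn(1024).astype(np.float32)
    q, s = quantize_int4(w)
    print("max abs error:", np.abs(dequantize(q, s) - w).max())

    # Back-of-envelope RAM for a 7B model at ~4.85 bits/weight (Q4_K_M):
    print("weights alone:", 7e9 * 4.85 / 8 / 1e9, "GB, plus KV cache and overhead")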
4. Distillation & Pruning
Teacher-student training, soft labels, structured pruning, and when to use each.
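The heart of teacher-student training is a soft-label loss. A minimal PyTorch sketch, with random logits standing in for real teacher and student models; the temperature of 2.0 and the 0.5 mixing weight are illustrative defaults, not the chapter's prescribed values.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft labels: match the teacher's temperature-smoothed distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # rescale so gradients stay comparable across temperatures
        # Hard labels: ordinary cross-entropy against the ground truth.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    student_logits = torch.randn(4, 32000)  # batch of 4, 32k-token vocab
    teacher_logits = torch.randn(4, 32000)
    labels = torch.randint(0, 32000, (4,))
    print(distillation_loss(student_logits, teacher_logits, labels))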
Section 3
Hands-On — Running Models Locally
Ollama, llama.cpp, and GGUF in practice.
5. Ollama: Your Local AI Runtime
Install, pull, run — your first local AI in 5 minutes.
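Once Ollama is installed, the whole loop fits in a few lines of the official Python client (pip install ollama); this assumes the Ollama server is already running locally and uses llama3.2 as an example model name.

    import ollama

    ollama.pull("llama3.2")  # fetch weights if they aren't already local
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    )
    print(response["message"]["content"])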
6. llama.cpp & GGUF Deep Dive
The C++ engine under the hood, converting models, and performance tuning.
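On the Python side, the llama-cpp-python bindings expose the main performance knobs directly. A sketch, assuming a hypothetical Q4_K_M file at ./model-q4_k_m.gguf produced by llama.cpp's convert and quantize tools:

    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(
        model_path="./model-q4_k_m.gguf",  # hypothetical local GGUF file
        n_ctx=4096,       # context window; larger means more KV-cache RAM
        n_threads=8,      # CPU threads used for generation
        n_gpu_layers=-1,  # offload all layers if a GPU backend is compiled in
    )
    out = llm("Q: What is GGUF?\nA:", max_tokens=64, stop=["\n"])
    print(out["choices"][0]["text"])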
Section 4
Real-World Applications
Building local apps and deploying to phones, browsers, and IoT.
7. Building Local AI Applications
Ollama + Python/JS, local RAG with ChromaDB, and document Q&A.
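A minimal local RAG loop: ChromaDB retrieves the closest chunk (using its built-in default embedder), and Ollama answers from it. The two documents and the model name are placeholders.

    import chromadb
    import ollama

    client = chromadb.Client()  # in-memory; PersistentClient(path=...) keeps data
    docs = client.create_collection("docs")
    docs.add(
        ids=["1", "2"],
        documents=[
            "GGUF is the file format used by llama.cpp.",
            "Ollama serves local models over an HTTP API on port 11434.",
        ],
    )

    question = "What port does Ollama use?"
    hits = docs.query(query_texts=[question], n_results=1)
    context = hits["documents"][0][0]

    answer = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user",
                   "content": f"Context: {context}\n\nQuestion: {question}"}],
    )
    print(answer["message"]["content"])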
8. Edge Deployment: Phones, Browsers, IoT
ExecuTorch, WebLLM, ONNX Runtime — Llama on iPhones and in browsers.
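Of the three runtimes named, ONNX Runtime has the most stable Python surface (ExecuTorch and WebLLM are driven from Swift/Kotlin and JavaScript). A minimal inference sketch, assuming a hypothetical exported model at ./model.onnx whose first input takes token IDs:

    import numpy as np
    import onnxruntime as ort  # pip install onnxruntime

    # Execution providers are how ONNX Runtime targets edge hardware:
    # CoreML on iOS, NNAPI on Android, WebGPU in browsers via onnxruntime-web.
    sess = ort.InferenceSession("./model.onnx", providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    tokens = np.array([[1, 15043, 3186]], dtype=np.int64)  # placeholder token IDs
    outputs = sess.run(None, {input_name: tokens})
    print(outputs[0].shape)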
Section 5
Strategy — Choosing Wisely
Local vs cloud decisions and the future of small models.
9. Local vs Cloud: The Decision Framework
Cost break-even, latency, privacy, hybrid architectures, and routing.
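The break-even calculation is plain amortization; every figure below is a made-up placeholder to swap for your own pricing and workload.

    cloud_price_per_mtok = 0.50   # $ per million tokens, blended in/out (assumed)
    tokens_per_day = 20_000_000   # daily workload (assumed)
    hardware_cost = 2_000.0       # one-time cost of the local machine (assumed)
    power_per_day = 1.20          # $ of electricity per day (assumed)

    cloud_per_day = tokens_per_day / 1e6 * cloud_price_per_mtok
    breakeven_days = hardware_cost / (cloud_per_day - power_per_day)
    print(f"cloud: ${cloud_per_day:.2f}/day; local pays off in {breakeven_days:.0f} days")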
10. The Future of Small Models
Speculative decoding, on-device fine-tuning, NPU trends, and your toolkit.
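Speculative decoding in one toy loop: a cheap draft model proposes k tokens, the large target model verifies them, and the longest agreed prefix is accepted. Both model functions here are hypothetical stand-ins; real implementations verify with a single batched forward pass and accept probabilistically rather than greedily.

    def speculative_step(prefix, draft_next, target_next, k=4):
        # Draft model proposes k tokens autoregressively (cheap calls).
        proposed, ctx = [], list(prefix)
        for _ in range(k):
            tok = draft_next(ctx)
            proposed.append(tok)
            ctx.append(tok)
        # Greedy acceptance: keep tokens while the target model agrees.
        accepted, ctx = [], list(prefix)
        for tok in proposed:
            best = target_next(ctx)
            if best != tok:
                accepted.append(best)  # target's correction ends the step
                break
            accepted.append(tok)
            ctx.append(tok)
        return accepted  # up to k tokens per target pass instead of one

    # Toy stand-ins that just count upward, so draft and target always agree.
    draft_next = target_next = lambda ctx: ctx[-1] + 1
    print(speculative_step([0], draft_next, target_next))  # -> [1, 2, 3, 4]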
Explore Related Courses
How LLMs Work: Transformers & Attention
Fine-Tuning: Adapting Models to Your Data
AI Infrastructure: GPUs, Serving & MLOps