Ch 2 — The Small Model Landscape

Who’s who in sub-10B models: families, benchmarks, and how to pick the right one
The Model Families
Five companies dominate the sub-10B open-weight model space
The Big Five
Meta — Llama 3.2
Sizes: 1B, 3B (text); 11B, 90B (vision)
License: Llama Community License
Strength: Edge deployment, ecosystem

Google — Gemma 3
Sizes: 1B, 4B, 12B, 27B
License: Gemma Terms of Use
Strength: Efficiency, reasoning per parameter

Microsoft — Phi-4-mini
Sizes: 3.8B
License: MIT
Strength: Math, code, data quality

Alibaba — Qwen 3.5
Sizes: 0.6B, 1.7B, 4B, 9B, 14B, 32B
License: Apache 2.0
Strength: Leaderboard king, multilingual

Mistral — Mistral Small 3.1
Sizes: 24B
License: Apache 2.0
Strength: Best "medium" model
What “Open-Weight” Means
Open-weight means the trained model weights are publicly available for download: you can run, fine-tune, and deploy them yourself. This is different from "open-source," which implies the training data and training code are also available.

Why it matters: Open weights let you run the model on your hardware, modify it, and deploy it without API keys or usage limits. Some licenses (like Llama’s) have restrictions for very large commercial use (>700M monthly users), but for most use cases, they’re effectively free.
Key insight: The open-weight model ecosystem has exploded since 2023. You now have multiple high-quality options at every size point from 1B to 30B. Competition between Meta, Google, Microsoft, Alibaba, and Mistral means rapid improvement — each new release pushes the others to do better.
Llama 3.2: The Edge Champion
Meta’s smallest models — designed specifically for on-device deployment
Model Specs
Llama 3.2 1B
Parameters: 1.24 billion
Context: 128K tokens
RAM (Q4): ~1.5 GB
Use case: On-device, mobile, IoT
Speed: Very fast, even on CPU

Llama 3.2 3B
Parameters: 3.21 billion
Context: 128K tokens
RAM (Q4): ~2.5 GB
Use case: Edge, laptop, simple tasks
Speed: Fast on any modern hardware
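The "RAM (Q4)" figures above can be roughly reproduced from first principles. Here is a minimal sketch; the ~4.5 effective bits per weight (typical of Q4_K-style quantizations) and the flat overhead for KV cache and runtime buffers are both assumptions, not figures from this chapter:

```python
def q4_ram_gb(params_billion: float,
              bits_per_weight: float = 4.5,  # assumed effective rate for Q4_K-style quants
              overhead_gb: float = 0.8):     # assumed KV cache + runtime buffers
    """Rough RAM estimate for running a Q4-quantized model."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# Llama 3.2 1B (1.24 billion params) comes out near the ~1.5 GB quoted above
print(round(q4_ram_gb(1.24), 1))
```

The same formula lands within a few hundred megabytes for the 3B model; real usage also grows with context length, which this sketch folds into the flat overhead.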
Benchmarks
Llama 3.2 3B:
GSM8K (math): 77.7%
ARC-C (reasoning): 78.6%
MMLU (knowledge): 63.4%
HumanEval (code): ~45%

Llama 3.2 1B:
GSM8K: 44.4%
ARC-C: 59.4%
MMLU: 49.3%
Best For
Llama 3.2 1B: On-device classification, simple extraction, text formatting, autocomplete suggestions. Think of it as a smart text processor, not a conversationalist.

Llama 3.2 3B: Simple chat, summarization, customer FAQ, basic code completion. The sweet spot for mobile and edge deployment.
Key insight: Llama 3.2’s 1B and 3B models were specifically designed for edge deployment — they’re not just scaled-down versions of the 70B. Meta optimized the architecture for inference efficiency on phones and laptops. ExecuTorch support is built-in.
Gemma 3 & Phi-4: The Efficiency Kings
Punching way above their weight class through data quality and architecture tricks
Gemma 3 (Google)
Gemma 3 4B
Parameters: 4 billion
Context: 128K tokens
RAM (Q4): ~3 GB
GSM8K: 89.2%
HumanEval: 71.3%
ARC-C: ~80%

Gemma 3n E4B
Parameters: ~8B total, but runs like a 4B ("E4B" = effective 4B)
RAM: ~3 GB
LMArena: >1300 (first sub-10B!)
Uses selective parameter activation: only the parameters needed for a given input are active
Phi-4-mini (Microsoft)
Phi-4-mini
Parameters: 3.8 billion
Context: 128K tokens
RAM (Q4): ~3 GB
MMLU-Pro: 52.8
GSM8K: 88.6%
ARC-C: 83.7%
License: MIT (most permissive!)

Microsoft's secret: trained on high-quality synthetic data generated by GPT-4. Quality of training data matters more than quantity.
Key insight: Gemma 3 and Phi-4 prove that architecture innovation and data quality can compensate for fewer parameters. Gemma 3 4B scores 89.2% on GSM8K math — better than many 13B models from 2023. Phi-4-mini’s MIT license makes it the most commercially friendly option.
Qwen 3.5 & Mistral Small: The Power Players
The current leaderboard champion and the best “medium” model
Qwen 3.5 (Alibaba)
Qwen 3.5 4B
MMLU-Pro: 79.1
RAM (Q4): ~3 GB

Qwen 3.5 9B (Leaderboard #1)
MMLU-Pro: 82.5
GPQA Diamond: 81.7
RAM (Q4): ~6 GB
RAM (Q5): ~7 GB

Qwen 3.5 9B beats models 3x its size on MMLU-Pro. Best quality-per-parameter ratio available today.
Mistral Small 3.1
Mistral Small 3.1
Parameters: 24 billion
Context: 128K tokens
RAM (Q4): ~14 GB
License: Apache 2.0

Not technically "small," but a Q4 quant fits in 16 GB of VRAM or unified memory (e.g., an RTX 4080 or an Apple M2 Pro Mac). Best option when you need more capability than a 9B model but can't run a 70B.
The Size Tiers
Tiny (1–3B): Mobile, IoT, edge. Llama 3.2 1B/3B.
Small (4–9B): Laptop, desktop. Gemma 3 4B, Qwen 3.5 9B, Phi-4-mini.
Medium (14–24B): Workstation, server. Qwen 3.5 14B, Mistral Small 3.1.
Key insight: Qwen 3.5 9B is the current sweet spot for local deployment — frontier-class reasoning in 6GB of RAM. If you need more, Mistral Small 3.1 at 24B fits on a 16GB GPU and competes with much larger models. The Apache 2.0 license on both means zero commercial restrictions.
Benchmarks That Actually Matter
What MMLU, HumanEval, and GSM8K measure — and what they don’t
The Key Benchmarks
MMLU-Pro (knowledge + reasoning)
14,000 questions across 14 domains
Multiple choice, harder than MMLU
Tests: broad knowledge, reasoning

HumanEval (code generation)
164 Python programming problems
Tests: function implementation from a docstring. Pass@1 metric.

GSM8K (math reasoning)
8,500 grade-school math problems
Tests: multi-step arithmetic reasoning

ARC-C (common-sense reasoning)
Challenge set of science questions
Tests: reasoning about everyday concepts

GPQA Diamond (expert reasoning)
Graduate-level science questions
Tests: deep domain expertise
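HumanEval's Pass@1 metric is the probability that a single sampled completion passes the unit tests. The standard unbiased pass@k estimator (from the original HumanEval paper), given n samples per problem of which c pass, can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = samples generated per problem, c = samples that passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 passed: pass@1 reduces to c/n
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

For k=1 this is just the fraction of passing samples; the combinatorial form matters when reporting pass@10 or pass@100 from a larger sample pool.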
What Benchmarks Don’t Tell You
1. Benchmark contamination: Some models may have seen benchmark questions during training. Scores can be inflated.

2. Real-world performance differs: A model scoring 90% on GSM8K might struggle with your specific math format. Benchmarks test general capability, not your use case.

3. Instruction following: Most benchmarks don’t test whether the model follows complex instructions, maintains format, or handles edge cases.

4. Speed isn’t measured: A model scoring 85% at 100 tokens/sec might be more useful than one scoring 90% at 20 tokens/sec.
Key insight: Benchmarks are useful for shortlisting models, not for final selection. Always test your top 2–3 candidates on YOUR actual data and tasks. A model that scores lower on MMLU but handles your specific extraction format perfectly is the better choice for you.
The SHIFT Framework
A 5-axis evaluation designed specifically for edge and local deployment
SHIFT (Smol AI WorldCup 2026)
S — Size: parameter count and memory footprint. Smaller = deployable on more devices.
H — Honesty: hallucination resistance and calibration. Does the model know what it doesn't know?
I — Intelligence: reasoning across 7 languages. Not just English — multilingual matters.
F — Fast: inference throughput (tokens/sec). How quickly does it generate output?
T — Thrift: resource consumption (RAM, CPU, power). Can it run on a phone without draining the battery in 10 minutes?
Why SHIFT Matters
Traditional benchmarks (MMLU, HumanEval) only measure Intelligence. But for local deployment, you also care about:

Will it fit? (Size, Thrift)
Is it fast enough? (Fast)
Can I trust it? (Honesty)

A model that scores 90% on MMLU but needs 32GB RAM and hallucinates frequently is worse for local deployment than one scoring 80% that fits in 4GB and rarely hallucinates.
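One way to operationalize this trade-off is a weighted score over the five axes, each normalized to 0..1. The weights and the two example score vectors below are purely illustrative assumptions, not part of SHIFT itself:

```python
def shift_score(metrics: dict, weights: dict) -> float:
    """Weighted average of per-axis scores, each pre-normalized to 0..1."""
    total = sum(weights.values())
    return sum(metrics[axis] * w for axis, w in weights.items()) / total

# Illustrative weights for an 8 GB-laptop deployment (hypothetical numbers)
weights = {"size": 0.25, "honesty": 0.20, "intelligence": 0.20, "fast": 0.15, "thrift": 0.20}
small = {"size": 0.9, "honesty": 0.80, "intelligence": 0.70, "fast": 0.9, "thrift": 0.9}
big   = {"size": 0.2, "honesty": 0.85, "intelligence": 0.95, "fast": 0.4, "thrift": 0.3}

print(shift_score(small, weights) > shift_score(big, weights))  # True: the small model wins here
```

Note how the bigger model's higher Intelligence score is outweighed once Size, Fast, and Thrift count; shifting the weights toward Intelligence would flip the result, which is exactly the point: the weights encode YOUR constraints.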
Key Finding
The SHIFT framework revealed that smaller models can achieve 95% of larger models’ quality at 36% of RAM requirements. The efficiency frontier is much closer than raw benchmark scores suggest.
Key insight: When evaluating models for local deployment, don’t just look at accuracy scores. Use a multi-axis framework: does it fit on my hardware? Is it fast enough for my use case? Does it hallucinate on my domain? The best model is the one that balances all five axes for YOUR constraints.
Matching Models to Tasks
A practical guide: which model for which job?
Task → Model Map
Classification / Routing → Llama 3.2 1B or Phi-4-mini
Why: fast, tiny, accuracy is high for binary/multi-class tasks

Data Extraction (JSON) → Qwen 3.5 4B or Gemma 3 4B
Why: good instruction following, structured output support

Summarization → Qwen 3.5 9B or Mistral Small
Why: needs more language understanding than classification

Code Completion → Qwen 3.5 9B or Phi-4-mini
Why: strong code benchmarks, fast enough for IDE integration

Conversational Chat → Mistral Small 3.1 or Qwen 3.5 9B
Why: needs nuance, context tracking, personality consistency

Mobile / On-Device → Llama 3.2 1B or 3B
Why: designed for edge, ExecuTorch support, minimal RAM
The Selection Process
1. Define the task: What exactly does the model need to do? Be specific.

2. Set hardware constraints: How much RAM? GPU or CPU only? Mobile or server?

3. Shortlist 2–3 models: Use the task map above and benchmark scores.

4. Test on your data: Run 50–100 examples from your actual use case through each model.

5. Measure what matters: Accuracy on YOUR task, speed, RAM usage. Not benchmark scores.
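Steps 4 and 5 can be sketched as a tiny harness. The `generate` callable here stands in for whatever inference call you actually use (llama.cpp, Ollama, an API client); it, the crude substring-match scoring, and the stub model are all placeholder assumptions to adapt:

```python
import time

def evaluate(generate, examples):
    """Run labeled (prompt, expected) pairs through a model; report accuracy
    and rough throughput. `generate` is any callable prompt -> text."""
    correct, chars = 0, 0
    start = time.perf_counter()
    for prompt, expected in examples:
        output = generate(prompt)
        chars += len(output)
        correct += expected.lower() in output.lower()  # crude match; adapt to your task
    elapsed = time.perf_counter() - start
    return {"accuracy": correct / len(examples),
            "chars_per_sec": chars / elapsed if elapsed > 0 else float("inf")}

# Stub "model" for illustration only
stub = lambda p: "positive" if "good" in p else "negative"
print(evaluate(stub, [("good movie", "positive"), ("bad movie", "negative")]))
```

Run the same 50-100 examples through each shortlisted model and compare the numbers side by side; with a local runtime you would measure tokens/sec rather than characters.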
Key insight: Don’t pick the “best” model — pick the best model for your task and constraints. A Llama 3.2 1B doing classification at 200 tokens/sec on a phone is better than a Qwen 3.5 9B doing the same task at 40 tokens/sec on a server, if classification is all you need.
The Small Model Cheat Sheet
Quick reference for the rest of this course
By Hardware
Phone (2-4GB RAM)
Llama 3.2 1B (Q4) → 1.5 GB
Llama 3.2 3B (Q4) → 2.5 GB

Laptop (8-16GB RAM)
Gemma 3 4B (Q4) → 3 GB
Phi-4-mini (Q4) → 3 GB
Qwen 3.5 9B (Q4) → 6 GB

Desktop / GPU (16-24GB)
Qwen 3.5 9B (Q5) → 7 GB
Qwen 3.5 14B (Q4) → 9 GB
Mistral Small (Q4) → 14 GB

Server (48GB+)
Llama 3.1 70B (Q4) → 40 GB
At this point, consider cloud
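The hardware reference above can be turned into a trivial lookup. The dictionary below just encodes those Q4 figures, and the fixed headroom for the OS and other apps is an assumption; treat this as a starting point, not an authority:

```python
# Approximate Q4 RAM needs in GB, taken from the hardware reference above
MODELS = {
    "Llama 3.2 1B": 1.5, "Llama 3.2 3B": 2.5,
    "Gemma 3 4B": 3.0, "Phi-4-mini": 3.0,
    "Qwen 3.5 9B": 6.0, "Qwen 3.5 14B": 9.0,
    "Mistral Small 3.1": 14.0,
}

def shortlist(ram_gb: float, headroom_gb: float = 0.5):
    """Models whose Q4 footprint fits in available RAM minus OS headroom."""
    budget = ram_gb - headroom_gb
    return [name for name, need in MODELS.items() if need <= budget]

print(shortlist(8))  # what fits on an 8 GB laptop
```

For example, an 8 GB laptop shortlists everything up to Qwen 3.5 9B, while Mistral Small 3.1 only appears once you have 16 GB or more.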
By License
Most permissive (MIT): Phi-4-mini — do anything
Very permissive (Apache 2.0): Qwen 3.5, Mistral Small
Permissive with limits:
Llama 3.2 — free under 700M MAU
Gemma 3 — Gemma Terms of Use
Coming Up Next
Now that you know the players, Chapter 3 dives into quantization — the technique that makes these models fit on consumer hardware. You’ll learn how FP32 becomes INT4, what GGUF files contain, and how to choose the right quantization level.
Key insight: The small model landscape changes fast — new models drop monthly. But the evaluation framework stays the same: define your task, set your constraints, shortlist by benchmarks, test on your data. The specific model names will change; the selection process won’t.