Ch 2 — The Small Model Landscape

Who’s who in sub-10B models: families, benchmarks, and how to pick the right one
The Model Families
Five companies dominate the sub-10B open-weight model space
The Big Five
Meta — Llama 3.2
Sizes: 1B, 3B (text); 11B, 90B (vision)
License: Llama Community License
Strength: Edge deployment, ecosystem

Google — Gemma 3
Sizes: 1B, 4B, 12B, 27B
License: Gemma Terms of Use
Strength: Efficiency, reasoning per parameter

Microsoft — Phi-4-mini
Sizes: 3.8B
License: MIT
Strength: Math, code, data quality

Alibaba — Qwen 3.5
Sizes: 0.6B, 1.7B, 4B, 9B, 14B, 32B
License: Apache 2.0
Strength: Leaderboard king, multilingual

Mistral — Mistral Small 3.1
Sizes: 24B
License: Apache 2.0
Strength: Best "medium" model
What “Open-Weight” Means
Open-weight means the trained model weights are publicly available for download: you can run, fine-tune, and deploy them yourself. This is different from "open-source," which implies the training data and training code are also available.

Why it matters: Open weights let you run the model on your hardware, modify it, and deploy it without API keys or usage limits. Some licenses (like Llama’s) have restrictions for very large commercial use (>700M monthly users), but for most use cases, they’re effectively free.
Key insight: The open-weight model ecosystem has exploded since 2023. You now have multiple high-quality options at every size point from 1B to 30B. Competition between Meta, Google, Microsoft, Alibaba, and Mistral means rapid improvement — each new release pushes the others to do better.
Llama 3.2: The Edge Champion
Meta’s smallest models — designed specifically for on-device deployment
Model Specs
Llama 3.2 1B
Parameters: 1.24 billion
Context: 128K tokens
RAM (Q4): ~1.5 GB
Use case: On-device, mobile, IoT
Speed: Very fast, even on CPU

Llama 3.2 3B
Parameters: 3.21 billion
Context: 128K tokens
RAM (Q4): ~2.5 GB
Use case: Edge, laptop, simple tasks
Speed: Fast on any modern hardware
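The "RAM (Q4)" figures above can be roughly reproduced from first principles. Here is a minimal sketch; the ~4.5 effective bits per weight (typical of Q4_K-style quantizations) and the flat overhead for KV cache and runtime buffers are both assumptions, not figures from this chapter:

```python
def q4_ram_gb(params_billion: float,
              bits_per_weight: float = 4.5,  # assumed effective rate for Q4_K-style quants
              overhead_gb: float = 0.8):     # assumed KV cache + runtime buffers
    """Rough RAM estimate for running a Q4-quantized model."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# Llama 3.2 1B (1.24 billion params) comes out near the ~1.5 GB quoted above
print(round(q4_ram_gb(1.24), 1))
```

The same formula lands within a few hundred megabytes for the 3B model; real usage also grows with context length, which this sketch folds into the flat overhead.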
Benchmarks
Llama 3.2 3B:
GSM8K (math): 77.7%
ARC-C (reasoning): 78.6%
MMLU (knowledge): 63.4%
HumanEval (code): ~45%

Llama 3.2 1B:
GSM8K: 44.4%
ARC-C: 59.4%
MMLU: 49.3%
Best For
Llama 3.2 1B: On-device classification, simple extraction, text formatting, autocomplete suggestions. Think of it as a smart text processor, not a conversationalist.

Llama 3.2 3B: Simple chat, summarization, customer FAQ, basic code completion. The sweet spot for mobile and edge deployment.
Key insight: Llama 3.2’s 1B and 3B models were specifically designed for edge deployment — they’re not just scaled-down versions of the 70B. Meta optimized the architecture for inference efficiency on phones and laptops. ExecuTorch support is built-in.
Gemma 3 & Phi-4: The Efficiency Kings
Punching way above their weight class through data quality and architecture tricks
Gemma 3 (Google)
Gemma 3 4B
Parameters: 4 billion
Context: 128K tokens
RAM (Q4): ~3 GB
GSM8K: 89.2%
HumanEval: 71.3%
ARC-C: ~80%

Gemma 3n E4B
Parameters: ~8B total, but runs like a 4B ("E4B" = effective 4B)
RAM: ~3 GB
LMArena: >1300 (first sub-10B!)
Uses selective parameter activation: only the parameters needed for a given input are active
Phi-4-mini (Microsoft)
Phi-4-mini
Parameters: 3.8 billion
Context: 128K tokens
RAM (Q4): ~3 GB
MMLU-Pro: 52.8
GSM8K: 88.6%
ARC-C: 83.7%
License: MIT (most permissive!)

Microsoft's secret: trained on high-quality synthetic data generated by GPT-4. Quality of training data matters more than quantity.
Key insight: Gemma 3 and Phi-4 prove that architecture innovation and data quality can compensate for fewer parameters. Gemma 3 4B scores 89.2% on GSM8K math — better than many 13B models from 2023. Phi-4-mini’s MIT license makes it the most commercially friendly option.
Qwen 3.5 & Mistral Small: The Power Players
The current leaderboard champion and the best “medium” model
Qwen 3.5 (Alibaba)
Qwen 3.5 4B
MMLU-Pro: 79.1
RAM (Q4): ~3 GB

Qwen 3.5 9B (Leaderboard #1)
MMLU-Pro: 82.5
GPQA Diamond: 81.7
RAM (Q4): ~6 GB
RAM (Q5): ~7 GB

Qwen 3.5 9B beats models 3x its size on MMLU-Pro. Best quality-per-parameter ratio available today.
Mistral Small 3.1
Mistral Small 3.1
Parameters: 24 billion
Context: 128K tokens
RAM (Q4): ~14 GB
License: Apache 2.0

Not technically "small," but a Q4 quant fits in 16 GB of VRAM or unified memory (e.g., an RTX 4080 or an Apple M2 Pro Mac). Best option when you need more capability than a 9B model but can't run a 70B.
The Size Tiers
Tiny (1–3B): Mobile, IoT, edge. Llama 3.2 1B/3B.
Small (4–9B): Laptop, desktop. Gemma 3 4B, Qwen 3.5 9B, Phi-4-mini.
Medium (14–24B): Workstation, server. Qwen 3.5 14B, Mistral Small 3.1.
Key insight: Qwen 3.5 9B is the current sweet spot for local deployment — frontier-class reasoning in 6GB of RAM. If you need more, Mistral Small 3.1 at 24B fits on a 16GB GPU and competes with much larger models. The Apache 2.0 license on both means zero commercial restrictions.
Benchmarks That Actually Matter
What MMLU, HumanEval, and GSM8K measure — and what they don’t
The Key Benchmarks
MMLU-Pro (knowledge + reasoning)
14,000 questions across 14 domains
Multiple choice, harder than MMLU
Tests: broad knowledge, reasoning

HumanEval (code generation)
164 Python programming problems
Tests: function implementation from a docstring. Pass@1 metric.

GSM8K (math reasoning)
8,500 grade-school math problems
Tests: multi-step arithmetic reasoning

ARC-C (common-sense reasoning)
Challenge set of science questions
Tests: reasoning about everyday concepts

GPQA Diamond (expert reasoning)
Graduate-level science questions
Tests: deep domain expertise
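HumanEval's Pass@1 metric is the probability that a single sampled completion passes the unit tests. The standard unbiased pass@k estimator (from the original HumanEval paper), given n samples per problem of which c pass, can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = samples generated per problem, c = samples that passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 passed: pass@1 reduces to c/n
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

For k=1 this is just the fraction of passing samples; the combinatorial form matters when reporting pass@10 or pass@100 from a larger sample pool.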
What Benchmarks Don’t Tell You
1. Benchmark contamination: Some models may have seen benchmark questions during training. Scores can be inflated.

2. Real-world performance differs: A model scoring 90% on GSM8K might struggle with your specific math format. Benchmarks test general capability, not your use case.

3. Instruction following: Most benchmarks don’t test whether the model follows complex instructions, maintains format, or handles edge cases.

4. Speed isn’t measured: A model scoring 85% at 100 tokens/sec might be more useful than one scoring 90% at 20 tokens/sec.
Key insight: Benchmarks are useful for shortlisting models, not for final selection. Always test your top 2–3 candidates on YOUR actual data and tasks. A model that scores lower on MMLU but handles your specific extraction format perfectly is the better choice for you.
The SHIFT Framework
A 5-axis evaluation designed specifically for edge and local deployment
SHIFT (Smol AI WorldCup 2026)
S — Size: parameter count and memory footprint. Smaller = deployable on more devices.
H — Honesty: hallucination resistance and calibration. Does the model know what it doesn't know?
I — Intelligence: reasoning across 7 languages. Not just English — multilingual matters.
F — Fast: inference throughput (tokens/sec). How quickly does it generate output?
T — Thrift: resource consumption (RAM, CPU, power). Can it run on a phone without draining the battery in 10 minutes?
Why SHIFT Matters
Traditional benchmarks (MMLU, HumanEval) only measure Intelligence. But for local deployment, you also care about:

Will it fit? (Size, Thrift)
Is it fast enough? (Fast)
Can I trust it? (Honesty)

A model that scores 90% on MMLU but needs 32GB RAM and hallucinates frequently is worse for local deployment than one scoring 80% that fits in 4GB and rarely hallucinates.
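One way to operationalize this trade-off is a weighted score over the five axes, each normalized to 0..1. The weights and the two example score vectors below are purely illustrative assumptions, not part of SHIFT itself:

```python
def shift_score(metrics: dict, weights: dict) -> float:
    """Weighted average of per-axis scores, each pre-normalized to 0..1."""
    total = sum(weights.values())
    return sum(metrics[axis] * w for axis, w in weights.items()) / total

# Illustrative weights for an 8 GB-laptop deployment (hypothetical numbers)
weights = {"size": 0.25, "honesty": 0.20, "intelligence": 0.20, "fast": 0.15, "thrift": 0.20}
small = {"size": 0.9, "honesty": 0.80, "intelligence": 0.70, "fast": 0.9, "thrift": 0.9}
big   = {"size": 0.2, "honesty": 0.85, "intelligence": 0.95, "fast": 0.4, "thrift": 0.3}

print(shift_score(small, weights) > shift_score(big, weights))  # True: the small model wins here
```

Note how the bigger model's higher Intelligence score is outweighed once Size, Fast, and Thrift count; shifting the weights toward Intelligence would flip the result, which is exactly the point: the weights encode YOUR constraints.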
Key Finding
The SHIFT framework revealed that smaller models can achieve 95% of larger models’ quality at 36% of RAM requirements. The efficiency frontier is much closer than raw benchmark scores suggest.
Key insight: When evaluating models for local deployment, don’t just look at accuracy scores. Use a multi-axis framework: does it fit on my hardware? Is it fast enough for my use case? Does it hallucinate on my domain? The best model is the one that balances all five axes for YOUR constraints.
Matching Models to Tasks
A practical guide: which model for which job?
Task → Model Map
Classification / Routing → Llama 3.2 1B or Phi-4-mini
Why: fast, tiny, accuracy is high for binary/multi-class tasks

Data Extraction (JSON) → Qwen 3.5 4B or Gemma 3 4B
Why: good instruction following, structured output support

Summarization → Qwen 3.5 9B or Mistral Small
Why: needs more language understanding than classification

Code Completion → Qwen 3.5 9B or Phi-4-mini
Why: strong code benchmarks, fast enough for IDE integration

Conversational Chat → Mistral Small 3.1 or Qwen 3.5 9B
Why: needs nuance, context tracking, personality consistency

Mobile / On-Device → Llama 3.2 1B or 3B
Why: designed for edge, ExecuTorch support, minimal RAM
The Selection Process
1. Define the task: What exactly does the model need to do? Be specific.

2. Set hardware constraints: How much RAM? GPU or CPU only? Mobile or server?

3. Shortlist 2–3 models: Use the task map above and benchmark scores.

4. Test on your data: Run 50–100 examples from your actual use case through each model.

5. Measure what matters: Accuracy on YOUR task, speed, RAM usage. Not benchmark scores.
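Steps 4 and 5 can be sketched as a tiny harness. The `generate` callable here stands in for whatever inference call you actually use (llama.cpp, Ollama, an API client); it, the crude substring-match scoring, and the stub model are all placeholder assumptions to adapt:

```python
import time

def evaluate(generate, examples):
    """Run labeled (prompt, expected) pairs through a model; report accuracy
    and rough throughput. `generate` is any callable prompt -> text."""
    correct, chars = 0, 0
    start = time.perf_counter()
    for prompt, expected in examples:
        output = generate(prompt)
        chars += len(output)
        correct += expected.lower() in output.lower()  # crude match; adapt to your task
    elapsed = time.perf_counter() - start
    return {"accuracy": correct / len(examples),
            "chars_per_sec": chars / elapsed if elapsed > 0 else float("inf")}

# Stub "model" for illustration only
stub = lambda p: "positive" if "good" in p else "negative"
print(evaluate(stub, [("good movie", "positive"), ("bad movie", "negative")]))
```

Run the same 50-100 examples through each shortlisted model and compare the numbers side by side; with a local runtime you would measure tokens/sec rather than characters.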
Key insight: Don’t pick the “best” model — pick the best model for your task and constraints. A Llama 3.2 1B doing classification at 200 tokens/sec on a phone is better than a Qwen 3.5 9B doing the same task at 40 tokens/sec on a server, if classification is all you need.
The Small Model Cheat Sheet
Quick reference for the rest of this course
By Hardware
Phone (2-4GB RAM)
Llama 3.2 1B (Q4) → 1.5 GB
Llama 3.2 3B (Q4) → 2.5 GB

Laptop (8-16GB RAM)
Gemma 3 4B (Q4) → 3 GB
Phi-4-mini (Q4) → 3 GB
Qwen 3.5 9B (Q4) → 6 GB

Desktop / GPU (16-24GB)
Qwen 3.5 9B (Q5) → 7 GB
Qwen 3.5 14B (Q4) → 9 GB
Mistral Small (Q4) → 14 GB

Server (48GB+)
Llama 3.1 70B (Q4) → 40 GB
At this point, consider cloud
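The hardware reference above can be turned into a trivial lookup. The dictionary below just encodes those Q4 figures, and the fixed headroom for the OS and other apps is an assumption; treat this as a starting point, not an authority:

```python
# Approximate Q4 RAM needs in GB, taken from the hardware reference above
MODELS = {
    "Llama 3.2 1B": 1.5, "Llama 3.2 3B": 2.5,
    "Gemma 3 4B": 3.0, "Phi-4-mini": 3.0,
    "Qwen 3.5 9B": 6.0, "Qwen 3.5 14B": 9.0,
    "Mistral Small 3.1": 14.0,
}

def shortlist(ram_gb: float, headroom_gb: float = 0.5):
    """Models whose Q4 footprint fits in available RAM minus OS headroom."""
    budget = ram_gb - headroom_gb
    return [name for name, need in MODELS.items() if need <= budget]

print(shortlist(8))  # what fits on an 8 GB laptop
```

For example, an 8 GB laptop shortlists everything up to Qwen 3.5 9B, while Mistral Small 3.1 only appears once you have 16 GB or more.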
By License
Most permissive (MIT): Phi-4-mini — do anything
Very permissive (Apache 2.0): Qwen 3.5, Mistral Small
Permissive with limits:
Llama 3.2 — free under 700M MAU
Gemma 3 — Gemma Terms of Use
Coming Up Next
Now that you know the players, Chapter 3 dives into quantization — the technique that makes these models fit on consumer hardware. You’ll learn how FP32 becomes INT4, what GGUF files contain, and how to choose the right quantization level.
Key insight: The small model landscape changes fast — new models drop monthly. But the evaluation framework stays the same: define your task, set your constraints, shortlist by benchmarks, test on your data. The specific model names will change; the selection process won’t.