Ch 3 — Zero-Shot vs Few-Shot

When to give examples, how many, and how to pick good ones — the most powerful formatting tool you have
Why Examples Work: In-Context Learning
The model doesn’t learn from your examples — it pattern-matches them
The Mechanism
When you give an LLM examples in your prompt, you’re not fine-tuning it. The model weights don’t change. Instead, you’re exploiting something called in-context learning: the model sees the pattern in your examples and continues it. Think of it like this — if you show someone three sentences where every answer is in the format “CATEGORY: reason”, the fourth answer will follow the same format. The model is a pattern-completion engine (recall Ch 1), and examples are the most direct way to set the pattern.
The Analogy
Imagine you’re training a new employee to categorize support tickets. You could describe the rules in a long document (“If the customer mentions billing, tag it as BILLING. If they mention a bug, tag it as TECHNICAL...”). Or you could show them 3 completed tickets and say “do it like these.” The second approach is faster, clearer, and less ambiguous. That’s exactly what few-shot prompting does.
Zero-Shot vs Few-Shot — Defined
Zero-shot: You give the model only instructions, no examples. Works great for simple, well-defined tasks.

One-shot: You provide 1 example. Often enough to lock in format.

Few-shot: You provide 2–5 examples. Needed for complex classification, extraction, or when format consistency matters.
Key insight: Few-shot examples don’t teach the model new knowledge — it already knows how to classify sentiment or extract data. What examples do is eliminate ambiguity about format and decision boundaries. You’re not teaching; you’re calibrating.
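The three regimes differ only in how many worked examples you splice into the prompt text. A minimal sketch of a prompt builder, assuming the "Email / Category" layout used throughout this chapter (the function name and layout are illustrative, not any real library's API):

```python
def build_prompt(instructions, examples, query):
    """Assemble a zero-, one-, or few-shot prompt from (email, category) pairs.

    With examples=[] this degenerates to a zero-shot prompt.
    """
    parts = [instructions, ""]
    for i, (email, category) in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(f'Email: "{email}"')
        parts.append(f"Category: {category}")
        parts.append("")
    parts.append("Now classify:")
    parts.append(f'Email: "{query}"')
    parts.append("Category:")  # trailing cue the model completes
    return "\n".join(parts)

prompt = build_prompt(
    "Classify customer emails into exactly one category. Reply with ONLY the "
    "category name.\nCategories: BILLING, TECHNICAL, FEATURE_REQUEST, ACCOUNT",
    examples=[("I can't log in. Password reset emails never arrive.", "ACCOUNT")],
    query="I was charged twice for my Pro plan this month.",
)
print(prompt)
```

Moving from zero-shot to three-shot is then just a change to the `examples` list, which keeps the format identical across experiments.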
The Zero-Shot Problem: Inconsistent Output
A real scenario — classifying customer support emails for a SaaS dashboard
The Scenario
You’re building an internal tool that auto-classifies incoming customer support emails into categories: BILLING, TECHNICAL, FEATURE_REQUEST, or ACCOUNT. Your backend calls the LLM API and parses the response. If the format is inconsistent, your parser breaks.
Zero-Shot Prompt
Classify this customer email into one of these categories: BILLING, TECHNICAL, FEATURE_REQUEST, ACCOUNT

Email: "Hi, I was charged twice for my Pro plan this month. My card shows two $49 charges on March 3rd. Can you refund the duplicate? My account email is sarah@example.com"
What You Get (Run 1)
This email is about a billing issue. The customer was charged twice and is requesting a refund. I would classify this as BILLING.
What You Get (Run 2)
Category: Billing
Reason: Duplicate charge on credit card
What You Get (Run 3)
BILLING
The problem: The model gets the answer right every time (it’s billing), but the format is different every time. Run 1 is a paragraph, Run 2 is “Category: Billing” (wrong case), Run 3 is what you actually want. Your JSON.parse() or regex will fail 2 out of 3 times.
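The failure is concrete once you write the downstream code. A sketch of a strict parser (hypothetical, fed the three runs above) shows that only one of the three zero-shot outputs survives:

```python
VALID = {"BILLING", "TECHNICAL", "FEATURE_REQUEST", "ACCOUNT"}

def parse_category(response: str):
    """Accept only a bare, upper-case category name; reject anything else."""
    candidate = response.strip()
    return candidate if candidate in VALID else None

runs = [
    "This email is about a billing issue. The customer was charged twice "
    "and is requesting a refund. I would classify this as BILLING.",
    "Category: Billing Reason: Duplicate charge on credit card",
    "BILLING",
]
results = [parse_category(r) for r in runs]
print(results)  # [None, None, 'BILLING'] -- only run 3 parses
```

You could write a more forgiving parser (regex-scan for any valid category), but that just moves the fragility around; the few-shot fix in the next section removes it at the source.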
One-Shot: A Single Example Changes Everything
One example locks in the format — the model mirrors what it sees
One-Shot Prompt
Classify customer emails into exactly one category. Reply with ONLY the category name.
Categories: BILLING, TECHNICAL, FEATURE_REQUEST, ACCOUNT

Example:
Email: "I can't log in to my account. I've tried resetting my password three times but the reset email never arrives."
Category: ACCOUNT

Now classify:
Email: "Hi, I was charged twice for my Pro plan this month. My card shows two $49 charges on March 3rd. Can you refund the duplicate? My account email is sarah@example.com"
Category:
Model Output (Consistent Across Runs)
BILLING
Why This Works
The example establishes three things simultaneously:

1. Format: Just the category name, nothing else
2. Case: ALL_CAPS, matching the category list
3. Structure: “Email: ... Category: ...” pattern

The model sees the pattern and continues it. You didn’t need to say “respond with only one word” or “use uppercase” — the example showed all of that implicitly.
Key insight: One example is often enough for format locking. The model is remarkably good at inferring “oh, I should respond like that” from a single demonstration. Save multi-shot for when the decision logic is ambiguous, not just the format.
Three-Shot: Handling Ambiguous Boundaries
When emails could belong to multiple categories, examples define the decision rules
The Ambiguity Problem
Consider this email: “Your API keeps returning 500 errors when I try to upgrade my plan.” Is that TECHNICAL (API error) or BILLING (plan upgrade)? Without examples showing how you want edge cases handled, the model guesses — and different runs give different answers.
Three-Shot Prompt (Edge Cases)
Classify customer emails. Reply with ONLY the category name. When an email touches multiple areas, classify by the ROOT CAUSE.
Categories: BILLING, TECHNICAL, FEATURE_REQUEST, ACCOUNT

Example 1:
Email: "I can't log in. Password reset emails never arrive."
Category: ACCOUNT

Example 2:
Email: "The export feature crashes when I select more than 1000 rows."
Category: TECHNICAL

Example 3:
Email: "I'm getting timeout errors when trying to change my subscription from monthly to annual."
Category: TECHNICAL

Now classify:
Email: "Your API keeps returning 500 errors when I try to upgrade my plan."
Category:
Model Output
TECHNICAL
What the Examples Taught
Example 3 is the critical one. It shows a billing-adjacent situation (subscription change) that’s classified as TECHNICAL because the root cause is a timeout error. The model now has a precedent: when a technical issue blocks a billing action, classify as TECHNICAL. Without Example 3, the model would flip between BILLING and TECHNICAL randomly.
Without Edge-Case Example
Run 1: BILLING
Run 2: TECHNICAL
Run 3: BILLING

Inconsistent — the model has no precedent for this boundary.
With Edge-Case Example
Run 1: TECHNICAL
Run 2: TECHNICAL
Run 3: TECHNICAL

Consistent — Example 3 set the decision rule.
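One way to make this comparison measurable is to score agreement across repeated runs. A small sketch, using the hypothetical run results shown above:

```python
from collections import Counter

def consistency(labels):
    """Return (majority label, fraction of runs agreeing with it)."""
    label, n = Counter(labels).most_common(1)[0]
    return label, n / len(labels)

# Run results from the with/without comparison above (illustrative data).
without_edge = ["BILLING", "TECHNICAL", "BILLING"]
with_edge = ["TECHNICAL", "TECHNICAL", "TECHNICAL"]

print(consistency(without_edge))  # majority BILLING, only 2/3 agreement
print(consistency(with_edge))     # unanimous TECHNICAL
```

Running a candidate prompt a handful of times and checking this agreement score is a cheap way to catch an undefined boundary before it reaches production.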
Domain Example: Extracting Data from Support Emails
Zero-shot misses fields and invents formats — two examples fix everything
The Task
Your support team receives hundreds of emails daily. You need to extract structured data: customer name, product, issue type, and urgency (low/medium/high). This feeds into a Jira ticket creation pipeline.
Zero-Shot Attempt
Prompt: “Extract the customer name, product, issue, and urgency from this email: [email text]”

Output:
The customer is John Martinez. He’s having trouble with the Analytics Dashboard — specifically, the real-time charts aren’t updating. This seems moderately urgent since it’s affecting their daily standups.

Problem: Prose format. No structured fields. “moderately urgent” instead of “medium”. Can’t parse this.
Two-Shot Prompt
Extract structured data from support emails. Follow the exact format shown.

Example 1:
Email: "Hi, this is Lisa Chen. Our team relies on the Reporting Module for weekly client updates, but PDF exports have been broken since Tuesday. This is blocking our Friday deliverable."
---
name: Lisa Chen
product: Reporting Module
issue: PDF export broken since Tuesday
urgency: high

Example 2:
Email: "Hey, Mike here from Acme Corp. Just noticed the dark mode toggle in Settings doesn't save the preference. Minor thing but it resets every time I log in."
---
name: Mike (Acme Corp)
product: Settings
issue: Dark mode preference not persisting
urgency: low

Now extract:
Email: "This is John Martinez. The real-time charts on our Analytics Dashboard stopped updating about 2 hours ago. Our team uses this for daily standups at 9am — we need this fixed before tomorrow morning."
---
Model Output
name: John Martinez
product: Analytics Dashboard
issue: Real-time charts stopped updating
urgency: high
Key insight: The examples taught the model five things without a single explicit instruction: (1) use the “---” separator, (2) use lowercase field names, (3) keep issue descriptions concise, (4) map urgency to exactly low/medium/high, (5) include company name in parentheses when available. Show, don’t tell.
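Because the examples lock the output into simple `key: value` lines, the code feeding the ticket pipeline stays trivial. A sketch of such a parser (hypothetical helper, using the model output above):

```python
def parse_fields(block: str) -> dict:
    """Parse 'key: value' lines (the format the two-shot examples lock in)."""
    fields = {}
    for line in block.strip().splitlines():
        key, _, value = line.partition(":")  # split on the first colon only
        fields[key.strip()] = value.strip()
    return fields

output = """name: John Martinez
product: Analytics Dashboard
issue: Real-time charts stopped updating
urgency: high"""

ticket = parse_fields(output)
# Validate the enum the examples implicitly taught, before creating a ticket.
assert ticket["urgency"] in {"low", "medium", "high"}
print(ticket)
```

Pairing a format-locking prompt with a validation step like the `urgency` check gives you a fast failure signal if the model ever drifts from the demonstrated format.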
How Many Examples Do You Actually Need?
Research says 2–5 is the sweet spot — here’s the decision framework
The Diminishing Returns Curve
Research on few-shot prompting consistently shows:

0 → 1 example: Biggest jump. Format consistency improves dramatically.
1 → 2 examples: Significant gain. Decision boundaries become clearer.
2 → 3 examples: Moderate gain. Edge cases get covered.
3 → 5 examples: Diminishing returns. Only helps for very complex classification.
5+ examples: Rarely worth it. Token cost increases linearly while accuracy plateaus.
The Decision Framework
# How many examples do I need?
Simple format locking (JSON, CSV, etc.) → 1 example is usually enough
Classification with clear categories → 2 examples (one per tricky category)
Classification with fuzzy boundaries → 3 examples (include edge cases)
Complex extraction or transformation → 3–5 examples (show variety)
Style/tone matching → 2–3 examples of the target style
Token Cost Reality Check
Each example adds roughly 50–200 tokens to your prompt. At ~$10 per million input tokens (GPT-4-class pricing), 3 examples on every API call across 10,000 daily requests adds about 1.5M–6M extra tokens, or roughly $15–$60/day. That is usually cheap compared to the cost of inconsistent outputs breaking your pipeline.

However, if you’re hitting context window limits (e.g., processing long documents), every token counts. In those cases, invest in 1 perfect example rather than 3 mediocre ones.
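The cost estimate is simple arithmetic; a sketch, under the stated assumptions (50–200 tokens per example, ~$10 per million input tokens, values you should swap for your own model's pricing):

```python
TOKENS_PER_EXAMPLE = (50, 200)   # rough range per example (assumption)
N_EXAMPLES = 3
REQUESTS_PER_DAY = 10_000
PRICE_PER_MILLION = 10.0         # USD per 1M input tokens (assumption)

def daily_cost(tokens_per_example: int) -> float:
    """Extra daily cost of carrying the examples on every request."""
    extra_tokens = tokens_per_example * N_EXAMPLES * REQUESTS_PER_DAY
    return extra_tokens / 1_000_000 * PRICE_PER_MILLION

low, high = (daily_cost(t) for t in TOKENS_PER_EXAMPLE)
print(f"${low:.2f} - ${high:.2f} per day")  # $15.00 - $60.00 per day
```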
Quality Over Quantity
One well-chosen example that covers an edge case is worth more than five examples of obvious cases. If all your examples are straightforward (“I want a refund” → BILLING), the model learns nothing it didn’t already know. Pick examples that show the hard decisions.
Key insight: The question isn’t “how many examples?” — it’s “what do my examples teach that my instructions alone can’t?” If the answer is “nothing,” you don’t need examples. If the answer is “format + edge cases,” you need 2–3.
Bad Examples: What Poisons Your Few-Shot Prompt
Common mistakes that make few-shot worse than zero-shot
Mistake 1: Inconsistent Format Across Examples
❌ BAD — format varies between examples
Example 1: Email: "Can't log in" → Category: ACCOUNT, Urgency: high
Example 2: Email: "Dark mode broken" → TECHNICAL (low urgency)
Example 3: Email: "Refund request" → This is a billing issue. BILLING. Medium.
The model sees three different output formats and has no idea which one to follow. It might pick any of them, or worse, invent a fourth format that blends all three. Every example must use the exact same structure.
Mistake 2: All Easy Cases, No Edge Cases
If every example is obvious (“I want a refund” → BILLING, “app crashes” → TECHNICAL), you’re wasting tokens. The model already knows these. Include at least one example where the answer isn’t obvious — that’s where few-shot actually adds value.
Mistake 3: Examples That Contradict Your Instructions
❌ BAD — instruction says "one word" but example gives a sentence
Reply with ONLY the category name.

Example:
Email: "Can't export data"
Category: This is a TECHNICAL issue because the export feature is broken.
When instructions and examples conflict, the model follows the examples. Examples are stronger than instructions because they’re concrete patterns, not abstract rules. Always make sure your examples perfectly demonstrate what your instructions describe.
Key insight: If your few-shot prompt gives worse results than zero-shot, the problem is almost always one of these three: inconsistent format, trivial examples, or examples that contradict your instructions. Fix the examples before adding more.
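Because examples silently override instructions, it is worth linting them before shipping a prompt. A hypothetical pre-flight check that catches both format drift (Mistake 1) and instruction contradictions (Mistake 3) for the category task in this chapter:

```python
import re

# The exact format the instructions demand: a bare, upper-case category name.
CATEGORY_RE = re.compile(r"^(BILLING|TECHNICAL|FEATURE_REQUEST|ACCOUNT)$")

def lint_examples(examples):
    """Return indices of (input, output) examples whose output breaks format."""
    return [
        i for i, (_, output) in enumerate(examples)
        if not CATEGORY_RE.match(output)
    ]

good = [("Can't log in", "ACCOUNT"), ("Export crashes", "TECHNICAL")]
bad = [("Can't export data",
        "This is a TECHNICAL issue because the export feature is broken.")]

print(lint_examples(good))        # [] -- every example matches the format
print(lint_examples(good + bad))  # [2] -- the contradicting example is flagged
```

The same idea generalizes: whatever format your instructions describe, encode it once as a check and run every example through it, so an example can never quietly teach the model something your instructions forbid.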
The Few-Shot Checklist
A practical decision tree for every prompt you write
When to Use Zero-Shot
Use zero-shot when:
• The task is simple and well-defined (summarize, translate, explain)
• Format doesn’t matter (human will read the output)
• You’re exploring / prototyping and don’t need consistency yet
• Context window is tight and you can’t afford extra tokens
When to Use Few-Shot
Use few-shot when:
• Output feeds into code (API responses, parsers, pipelines)
• Classification has ambiguous boundaries
• You need a specific tone, style, or format
• Zero-shot gives correct answers but inconsistent format
• The task involves domain-specific conventions the model might not default to
The Example Selection Checklist
# Before adding examples, verify:
□ Format consistency: every example uses identical structure
□ Diversity: examples cover different categories/cases
□ Edge cases included: at least 1 example shows a hard decision
□ No contradictions: examples match your written instructions
□ Minimal but sufficient: start with 2, add more only if needed
□ Representative: examples look like real inputs, not toy data
Key insight: Few-shot prompting is the bridge between “the model understands my task” and “the model does my task exactly how I need it done.” Instructions tell the model what to do. Examples show it how you want it done. Master this distinction and you’ll never fight with output format again.