Ch 3 — Zero-Shot vs Few-Shot

When to give examples, how many, and how to pick good ones — the most powerful formatting tool you have
Why Examples Work: In-Context Learning
The model doesn’t learn from your examples — it pattern-matches them
The Mechanism
When you give an LLM examples in your prompt, you’re not fine-tuning it. The model weights don’t change. Instead, you’re exploiting something called in-context learning: the model sees the pattern in your examples and continues it. Think of it like this — if you show someone three sentences where every answer is in the format “CATEGORY: reason”, the fourth answer will follow the same format. The model is a pattern-completion engine (recall Ch 1), and examples are the most direct way to set the pattern.
The Analogy
Imagine you’re training a new employee to categorize support tickets. You could describe the rules in a long document (“If the customer mentions billing, tag it as BILLING. If they mention a bug, tag it as TECHNICAL...”). Or you could show them 3 completed tickets and say “do it like these.” The second approach is faster, clearer, and less ambiguous. That’s exactly what few-shot prompting does.
Zero-Shot vs Few-Shot — Defined
Zero-shot: You give the model only instructions, no examples. Works great for simple, well-defined tasks.

One-shot: You provide 1 example. Often enough to lock in format.

Few-shot: You provide 2–5 examples. Needed for complex classification, extraction, or when format consistency matters.
Key insight: Few-shot examples don’t teach the model new knowledge — it already knows how to classify sentiment or extract data. What examples do is eliminate ambiguity about format and decision boundaries. You’re not teaching; you’re calibrating.
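The three regimes differ only in how many worked examples you splice into the prompt text. A minimal sketch of a prompt builder, assuming the "Email / Category" layout used throughout this chapter (the function name and layout are illustrative, not any real library's API):

```python
def build_prompt(instructions, examples, query):
    """Assemble a zero-, one-, or few-shot prompt from (email, category) pairs.

    With examples=[] this degenerates to a zero-shot prompt.
    """
    parts = [instructions, ""]
    for i, (email, category) in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(f'Email: "{email}"')
        parts.append(f"Category: {category}")
        parts.append("")
    parts.append("Now classify:")
    parts.append(f'Email: "{query}"')
    parts.append("Category:")  # trailing cue the model completes
    return "\n".join(parts)

prompt = build_prompt(
    "Classify customer emails into exactly one category. Reply with ONLY the "
    "category name.\nCategories: BILLING, TECHNICAL, FEATURE_REQUEST, ACCOUNT",
    examples=[("I can't log in. Password reset emails never arrive.", "ACCOUNT")],
    query="I was charged twice for my Pro plan this month.",
)
print(prompt)
```

Moving from zero-shot to three-shot is then just a change to the `examples` list, which keeps the format identical across experiments.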
The Zero-Shot Problem: Inconsistent Output
A real scenario — classifying customer support emails for a SaaS dashboard
The Scenario
You’re building an internal tool that auto-classifies incoming customer support emails into categories: BILLING, TECHNICAL, FEATURE_REQUEST, or ACCOUNT. Your backend calls the LLM API and parses the response. If the format is inconsistent, your parser breaks.
Zero-Shot Prompt
Classify this customer email into one of these categories: BILLING, TECHNICAL, FEATURE_REQUEST, ACCOUNT

Email: "Hi, I was charged twice for my Pro plan this month. My card shows two $49 charges on March 3rd. Can you refund the duplicate? My account email is sarah@example.com"
What You Get (Run 1)
This email is about a billing issue. The customer was charged twice and is requesting a refund. I would classify this as BILLING.
What You Get (Run 2)
Category: Billing
Reason: Duplicate charge on credit card
What You Get (Run 3)
BILLING
The problem: The model gets the answer right every time (it’s billing), but the format is different every time. Run 1 is a paragraph, Run 2 is “Category: Billing” (wrong case), Run 3 is what you actually want. Your JSON.parse() or regex will fail 2 out of 3 times.
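The failure is concrete once you write the downstream code. A sketch of a strict parser (hypothetical, fed the three runs above) shows that only one of the three zero-shot outputs survives:

```python
VALID = {"BILLING", "TECHNICAL", "FEATURE_REQUEST", "ACCOUNT"}

def parse_category(response: str):
    """Accept only a bare, upper-case category name; reject anything else."""
    candidate = response.strip()
    return candidate if candidate in VALID else None

runs = [
    "This email is about a billing issue. The customer was charged twice "
    "and is requesting a refund. I would classify this as BILLING.",
    "Category: Billing Reason: Duplicate charge on credit card",
    "BILLING",
]
results = [parse_category(r) for r in runs]
print(results)  # [None, None, 'BILLING'] -- only run 3 parses
```

You could write a more forgiving parser (regex-scan for any valid category), but that just moves the fragility around; the few-shot fix in the next section removes it at the source.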
One-Shot: A Single Example Changes Everything
One example locks in the format — the model mirrors what it sees
One-Shot Prompt
Classify customer emails into exactly one category. Reply with ONLY the category name.
Categories: BILLING, TECHNICAL, FEATURE_REQUEST, ACCOUNT

Example:
Email: "I can't log in to my account. I've tried resetting my password three times but the reset email never arrives."
Category: ACCOUNT

Now classify:
Email: "Hi, I was charged twice for my Pro plan this month. My card shows two $49 charges on March 3rd. Can you refund the duplicate? My account email is sarah@example.com"
Category:
Model Output (Consistent Across Runs)
BILLING
Why This Works
The example establishes three things simultaneously:

1. Format: Just the category name, nothing else
2. Case: ALL_CAPS, matching the category list
3. Structure: “Email: ... Category: ...” pattern

The model sees the pattern and continues it. You didn’t need to say “respond with only one word” or “use uppercase” — the example showed all of that implicitly.
Key insight: One example is often enough for format locking. The model is remarkably good at inferring “oh, I should respond like that” from a single demonstration. Save multi-shot for when the decision logic is ambiguous, not just the format.
Three-Shot: Handling Ambiguous Boundaries
When emails could belong to multiple categories, examples define the decision rules
The Ambiguity Problem
Consider this email: “Your API keeps returning 500 errors when I try to upgrade my plan.” Is that TECHNICAL (API error) or BILLING (plan upgrade)? Without examples showing how you want edge cases handled, the model guesses — and different runs give different answers.
Three-Shot Prompt (Edge Cases)
Classify customer emails. Reply with ONLY the category name. When an email touches multiple areas, classify by the ROOT CAUSE.
Categories: BILLING, TECHNICAL, FEATURE_REQUEST, ACCOUNT

Example 1:
Email: "I can't log in. Password reset emails never arrive."
Category: ACCOUNT

Example 2:
Email: "The export feature crashes when I select more than 1000 rows."
Category: TECHNICAL

Example 3:
Email: "I'm getting timeout errors when trying to change my subscription from monthly to annual."
Category: TECHNICAL

Now classify:
Email: "Your API keeps returning 500 errors when I try to upgrade my plan."
Category:
Model Output
TECHNICAL
What the Examples Taught
Example 3 is the critical one. It shows a billing-adjacent situation (subscription change) that’s classified as TECHNICAL because the root cause is a timeout error. The model now has a precedent: when a technical issue blocks a billing action, classify as TECHNICAL. Without Example 3, the model would flip between BILLING and TECHNICAL randomly.
Without Edge-Case Example
Run 1: BILLING
Run 2: TECHNICAL
Run 3: BILLING

Inconsistent — the model has no precedent for this boundary.
With Edge-Case Example
Run 1: TECHNICAL
Run 2: TECHNICAL
Run 3: TECHNICAL

Consistent — Example 3 set the decision rule.
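One way to make this comparison measurable is to score agreement across repeated runs. A small sketch, using the hypothetical run results shown above:

```python
from collections import Counter

def consistency(labels):
    """Return (majority label, fraction of runs agreeing with it)."""
    label, n = Counter(labels).most_common(1)[0]
    return label, n / len(labels)

# Run results from the with/without comparison above (illustrative data).
without_edge = ["BILLING", "TECHNICAL", "BILLING"]
with_edge = ["TECHNICAL", "TECHNICAL", "TECHNICAL"]

print(consistency(without_edge))  # majority BILLING, only 2/3 agreement
print(consistency(with_edge))     # unanimous TECHNICAL
```

Running a candidate prompt a handful of times and checking this agreement score is a cheap way to catch an undefined boundary before it reaches production.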
Domain Example: Extracting Data from Support Emails
Zero-shot misses fields and invents formats — two examples fix everything
The Task
Your support team receives hundreds of emails daily. You need to extract structured data: customer name, product, issue type, and urgency (low/medium/high). This feeds into a Jira ticket creation pipeline.
Zero-Shot Attempt
Prompt: “Extract the customer name, product, issue, and urgency from this email: [email text]”

Output:
The customer is John Martinez. He’s having trouble with the Analytics Dashboard — specifically, the real-time charts aren’t updating. This seems moderately urgent since it’s affecting their daily standups.

Problem: Prose format. No structured fields. “moderately urgent” instead of “medium”. Can’t parse this.
Two-Shot Prompt
Extract structured data from support emails. Follow the exact format shown.

Example 1:
Email: "Hi, this is Lisa Chen. Our team relies on the Reporting Module for weekly client updates, but PDF exports have been broken since Tuesday. This is blocking our Friday deliverable."
---
name: Lisa Chen
product: Reporting Module
issue: PDF export broken since Tuesday
urgency: high

Example 2:
Email: "Hey, Mike here from Acme Corp. Just noticed the dark mode toggle in Settings doesn't save the preference. Minor thing but it resets every time I log in."
---
name: Mike (Acme Corp)
product: Settings
issue: Dark mode preference not persisting
urgency: low

Now extract:
Email: "This is John Martinez. The real-time charts on our Analytics Dashboard stopped updating about 2 hours ago. Our team uses this for daily standups at 9am — we need this fixed before tomorrow morning."
---
Model Output
name: John Martinez
product: Analytics Dashboard
issue: Real-time charts stopped updating
urgency: high
Key insight: The examples taught the model five things without a single explicit instruction: (1) use the “---” separator, (2) use lowercase field names, (3) keep issue descriptions concise, (4) map urgency to exactly low/medium/high, (5) include company name in parentheses when available. Show, don’t tell.
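Because the examples lock the output into simple `key: value` lines, the code feeding the ticket pipeline stays trivial. A sketch of such a parser (hypothetical helper, using the model output above):

```python
def parse_fields(block: str) -> dict:
    """Parse 'key: value' lines (the format the two-shot examples lock in)."""
    fields = {}
    for line in block.strip().splitlines():
        key, _, value = line.partition(":")  # split on the first colon only
        fields[key.strip()] = value.strip()
    return fields

output = """name: John Martinez
product: Analytics Dashboard
issue: Real-time charts stopped updating
urgency: high"""

ticket = parse_fields(output)
# Validate the enum the examples implicitly taught, before creating a ticket.
assert ticket["urgency"] in {"low", "medium", "high"}
print(ticket)
```

Pairing a format-locking prompt with a validation step like the `urgency` check gives you a fast failure signal if the model ever drifts from the demonstrated format.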
How Many Examples Do You Actually Need?
Research says 2–5 is the sweet spot — here’s the decision framework
The Diminishing Returns Curve
Research on few-shot prompting consistently shows:

0 → 1 example: Biggest jump. Format consistency improves dramatically.
1 → 2 examples: Significant gain. Decision boundaries become clearer.
2 → 3 examples: Moderate gain. Edge cases get covered.
3 → 5 examples: Diminishing returns. Only helps for very complex classification.
5+ examples: Rarely worth it. Token cost increases linearly while accuracy plateaus.
The Decision Framework
# How many examples do I need?
Simple format locking (JSON, CSV, etc.) → 1 example is usually enough
Classification with clear categories → 2 examples (one per tricky category)
Classification with fuzzy boundaries → 3 examples (include edge cases)
Complex extraction or transformation → 3–5 examples (show variety)
Style/tone matching → 2–3 examples of the target style
Token Cost Reality Check
Each example adds roughly 50–200 tokens to your prompt. At ~$10 per million input tokens (GPT-4-class pricing), 3 examples on every API call across 10,000 daily requests adds about 1.5M–6M extra tokens, or roughly $15–$60/day. That is usually cheap compared to the cost of inconsistent outputs breaking your pipeline.

However, if you’re hitting context window limits (e.g., processing long documents), every token counts. In those cases, invest in 1 perfect example rather than 3 mediocre ones.
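The cost estimate is simple arithmetic; a sketch, under the stated assumptions (50–200 tokens per example, ~$10 per million input tokens, values you should swap for your own model's pricing):

```python
TOKENS_PER_EXAMPLE = (50, 200)   # rough range per example (assumption)
N_EXAMPLES = 3
REQUESTS_PER_DAY = 10_000
PRICE_PER_MILLION = 10.0         # USD per 1M input tokens (assumption)

def daily_cost(tokens_per_example: int) -> float:
    """Extra daily cost of carrying the examples on every request."""
    extra_tokens = tokens_per_example * N_EXAMPLES * REQUESTS_PER_DAY
    return extra_tokens / 1_000_000 * PRICE_PER_MILLION

low, high = (daily_cost(t) for t in TOKENS_PER_EXAMPLE)
print(f"${low:.2f} - ${high:.2f} per day")  # $15.00 - $60.00 per day
```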
Quality Over Quantity
One well-chosen example that covers an edge case is worth more than five examples of obvious cases. If all your examples are straightforward (“I want a refund” → BILLING), the model learns nothing it didn’t already know. Pick examples that show the hard decisions.
Key insight: The question isn’t “how many examples?” — it’s “what do my examples teach that my instructions alone can’t?” If the answer is “nothing,” you don’t need examples. If the answer is “format + edge cases,” you need 2–3.
Bad Examples: What Poisons Your Few-Shot Prompt
Common mistakes that make few-shot worse than zero-shot
Mistake 1: Inconsistent Format Across Examples
❌ BAD — format varies between examples
Example 1: Email: "Can't log in" → Category: ACCOUNT, Urgency: high
Example 2: Email: "Dark mode broken" → TECHNICAL (low urgency)
Example 3: Email: "Refund request" → This is a billing issue. BILLING. Medium.
The model sees three different output formats and has no idea which one to follow. It might pick any of them, or worse, invent a fourth format that blends all three. Every example must use the exact same structure.
Mistake 2: All Easy Cases, No Edge Cases
If every example is obvious (“I want a refund” → BILLING, “app crashes” → TECHNICAL), you’re wasting tokens. The model already knows these. Include at least one example where the answer isn’t obvious — that’s where few-shot actually adds value.
Mistake 3: Examples That Contradict Your Instructions
❌ BAD — instruction says "one word" but example gives a sentence
Reply with ONLY the category name.

Example:
Email: "Can't export data"
Category: This is a TECHNICAL issue because the export feature is broken.
When instructions and examples conflict, the model follows the examples. Examples are stronger than instructions because they’re concrete patterns, not abstract rules. Always make sure your examples perfectly demonstrate what your instructions describe.
Key insight: If your few-shot prompt gives worse results than zero-shot, the problem is almost always one of these three: inconsistent format, trivial examples, or examples that contradict your instructions. Fix the examples before adding more.
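Because examples silently override instructions, it is worth linting them before shipping a prompt. A hypothetical pre-flight check that catches both format drift (Mistake 1) and instruction contradictions (Mistake 3) for the category task in this chapter:

```python
import re

# The exact format the instructions demand: a bare, upper-case category name.
CATEGORY_RE = re.compile(r"^(BILLING|TECHNICAL|FEATURE_REQUEST|ACCOUNT)$")

def lint_examples(examples):
    """Return indices of (input, output) examples whose output breaks format."""
    return [
        i for i, (_, output) in enumerate(examples)
        if not CATEGORY_RE.match(output)
    ]

good = [("Can't log in", "ACCOUNT"), ("Export crashes", "TECHNICAL")]
bad = [("Can't export data",
        "This is a TECHNICAL issue because the export feature is broken.")]

print(lint_examples(good))        # [] -- every example matches the format
print(lint_examples(good + bad))  # [2] -- the contradicting example is flagged
```

The same idea generalizes: whatever format your instructions describe, encode it once as a check and run every example through it, so an example can never quietly teach the model something your instructions forbid.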
The Few-Shot Checklist
A practical decision tree for every prompt you write
When to Use Zero-Shot
Use zero-shot when:
• The task is simple and well-defined (summarize, translate, explain)
• Format doesn’t matter (human will read the output)
• You’re exploring / prototyping and don’t need consistency yet
• Context window is tight and you can’t afford extra tokens
When to Use Few-Shot
Use few-shot when:
• Output feeds into code (API responses, parsers, pipelines)
• Classification has ambiguous boundaries
• You need a specific tone, style, or format
• Zero-shot gives correct answers but inconsistent format
• The task involves domain-specific conventions the model might not default to
The Example Selection Checklist
# Before adding examples, verify:
□ Format consistency: every example uses identical structure
□ Diversity: examples cover different categories/cases
□ Edge cases included: at least 1 example shows a hard decision
□ No contradictions: examples match your written instructions
□ Minimal but sufficient: start with 2, add more only if needed
□ Representative: examples look like real inputs, not toy data
Key insight: Few-shot prompting is the bridge between “the model understands my task” and “the model does my task exactly how I need it done.” Instructions tell the model what to do. Examples show it how you want it done. Master this distinction and you’ll never fight with output format again.