Ch 14 — The Prompt Engineer’s Toolkit

Decision trees, prompt chaining, DSPy, and the complete mental model for any prompting challenge
Capstone
The Prompt Engineering Decision Tree
When you face a new prompting challenge, start here
The Decision Tree
Q1: What type of task?
• Classification / Extraction → Few-shot with examples (Ch 3) → Structured output (Ch 7) → Temperature = 0
• Reasoning / Analysis → Chain-of-thought (Ch 4) → Decomposition for complex tasks (Ch 5) → Temperature = 0
• Generation / Creative → System prompt with persona (Ch 6) → Critic pattern for quality (Ch 8) → Temperature = 0.7-1.0
• Conversation / Agent → System prompt + tools (Ch 6, 12) → Multi-turn management (Ch 11) → Temperature = 0.3-0.7
Q2: What quality level?
• Quick & cheap (internal tools, prototypes) → Zero-shot or simple few-shot → Smaller model (GPT-4o-mini, Haiku) → Minimal testing
• Reliable (customer-facing, moderate risk) → Few-shot + format constraints → Mid-tier model (GPT-4o, Sonnet) → Test suite of 20-30 cases
• Production-critical (financial, medical, legal) → Full prompt engineering (CoT, patterns) → Best model available → LLM-as-judge + human review → Comprehensive test suite
Key insight: Not every task needs the full toolkit. A simple classification might need just a few-shot prompt with temperature 0. A production chatbot needs system prompts, tools, multi-turn management, evaluation, and monitoring. Match the investment to the stakes.
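The decision tree can even live in code. A purely illustrative sketch (the table and function names are made up for this example; the task types, techniques, and temperatures come from the tree above):

```python
# Hypothetical lookup table encoding the decision tree above.
DECISION_TREE = {
    "classification": {"techniques": ["few-shot examples", "structured output"], "temperature": 0.0},
    "reasoning":      {"techniques": ["chain-of-thought", "decomposition"],      "temperature": 0.0},
    "generation":     {"techniques": ["persona system prompt", "critic pattern"], "temperature": 0.7},
    "conversation":   {"techniques": ["system prompt + tools", "multi-turn management"], "temperature": 0.5},
}

def recommend(task_type: str) -> dict:
    """Return the recommended starting techniques and temperature for a task type."""
    return DECISION_TREE[task_type]

print(recommend("classification")["temperature"])  # 0.0
```

Encoding the tree as data rather than prose makes the defaults testable and easy to override per project.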
Prompt Chaining: When One Prompt Isn’t Enough
Break complex tasks into a pipeline of focused prompts, each one’s output feeding the next
Why Chain?
Some tasks are too complex for a single prompt. Signs you need chaining:

• The prompt is over 500 words and still not specific enough
• The task has distinct phases (analyze → decide → generate)
• You need different models or temperatures for different steps
• Quality degrades when you add more instructions to one prompt
Chain Architecture
Example: Automated Code Review Pipeline

Prompt 1 — Analyze (GPT-4o, temp=0): "List all issues in this code: bugs, security, performance, style. Output as JSON array."
↓ issues_json
Prompt 2 — Prioritize (GPT-4o, temp=0): "Given these issues, rank by severity. Group into: must-fix, should-fix, nice-to-fix. Output as JSON."
↓ prioritized_json
Prompt 3 — Generate Fixes (GPT-4o, temp=0.2): "For each must-fix issue, generate the corrected code. Show the diff."
↓ fixes
Prompt 4 — Write Review (GPT-4o, temp=0.5): "Write a code review comment for the PR author. Tone: constructive, specific. Include the fixes as suggestions."
Benefits of Chaining
1. Each prompt is focused: One task per prompt = higher quality per step

2. Different settings per step: Analysis at temp=0, writing at temp=0.5

3. Intermediate validation: Check the output of each step before proceeding. If step 1 finds no issues, skip steps 2–4.

4. Debuggable: When the final output is wrong, you can inspect each intermediate result to find where it went wrong.

5. Reusable: The “Prioritize” prompt works for any list of issues, not just code review.
Key insight: Prompt chaining is the prompt engineering equivalent of the Unix philosophy: do one thing well, then pipe the output to the next tool. Each prompt in the chain is simpler, more testable, and more reliable than a single mega-prompt.
DSPy: Programmatic Prompt Optimization
Let the framework find the best prompt automatically — prompts as code, not strings
The Problem DSPy Solves
Manual prompt engineering is:

Brittle: Prompts break when you switch models
Tedious: Manually testing wording variations
Unscalable: Can’t optimize 50 prompts by hand

DSPy’s approach: Define what you want (input/output types, evaluation metric), and let the framework optimize the prompt automatically through compilation.
DSPy in 30 Seconds
import dspy

# 1. Define the signature
class ClassifyTicket(dspy.Signature):
    """Classify a support ticket."""
    ticket = dspy.InputField()
    category = dspy.OutputField(
        desc="BILLING, TECHNICAL, ACCOUNT, FEATURE_REQUEST, or NONE")

# 2. Create a module
classify = dspy.Predict(ClassifyTicket)

# 3. Compile with examples
optimizer = dspy.BootstrapFewShot(metric=exact_match)
compiled = optimizer.compile(classify, trainset=labeled_tickets)

# 4. Use it — DSPy found the best prompt + examples automatically
result = compiled(ticket="I was charged twice")
print(result.category)  # BILLING
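The snippet above references `exact_match` and `labeled_tickets` without defining them. A DSPy-style metric is just a function that compares a gold example to a prediction. A standalone sketch, using `SimpleNamespace` stand-ins instead of real DSPy example objects:

```python
from types import SimpleNamespace

def exact_match(example, pred, trace=None):
    """DSPy-style metric: True when the predicted category equals the gold label."""
    return example.category == pred.category

# Stand-ins for a labeled training example and a module prediction:
gold = SimpleNamespace(ticket="I was charged twice", category="BILLING")
pred = SimpleNamespace(category="BILLING")
print(exact_match(gold, pred))  # True
```

The optimizer calls this metric on each candidate prompt's predictions over the trainset, keeping whichever prompt-plus-examples combination scores highest.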
When to Use DSPy vs Manual
Use manual prompt engineering when:
• You have 1–5 prompts
• The task is well-understood
• You need full control over the prompt text
• You’re prototyping

Use DSPy when:
• You have 10+ prompts to optimize
• You have labeled evaluation data
• You need to switch between models
• You want reproducible optimization
• You’re building a production pipeline
Other Tools in the Space
LangChain / LangGraph: Orchestration framework for chains and agents. Good for building complex pipelines.

LlamaIndex: Specialized for RAG pipelines. Best when your primary task is querying documents.

Instructor: Structured output extraction with Pydantic validation. Lightweight, focused.

Guidance: Template-based prompt construction with constrained generation.
Key insight: DSPy represents the future of prompt engineering: defining what you want and letting the system figure out how to prompt for it. But you still need to understand the fundamentals (this course) to define good signatures, choose the right modules, and debug when things go wrong.
Case Study: End-to-End Code Review Pipeline
Applying every technique from this course to a real-world task
The Pipeline
Input: A GitHub PR diff

Step 1: Triage (Ch 3 — Classification). Few-shot classify: does this PR need a detailed review? (Trivial changes like typo fixes → auto-approve.)
Step 2: Analyze (Ch 4 — CoT). "Think step by step: identify bugs, security issues, performance problems, and style violations."
Step 3: Prioritize (Ch 5 — Decomposition). Decompose issues by severity. Filter out noise. Keep must-fix and should-fix.
Step 4: Generate Review (Ch 6 — Persona). System prompt: "You are a senior engineer. Tone: constructive, specific. For each issue, explain WHY it's a problem and provide the fix."
Step 5: Format (Ch 7 — Structured Output). Output as GitHub review comments JSON, mapped to specific lines in the diff.
Step 6: Safety Check (Ch 13 — Evaluation). LLM-as-judge: verify no hallucinated line numbers, no incorrect fixes.
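Part of the step-6 safety check is mechanical and needs no model at all: hallucinated line numbers can be caught with a plain function before (or instead of) an LLM-as-judge call. A sketch, with a hypothetical comment format:

```python
def hallucinated_lines(comments: list[dict], diff_lines: set[int]) -> list[dict]:
    """Return review comments that reference line numbers not present in the diff."""
    return [c for c in comments if c["line"] not in diff_lines]

comments = [
    {"line": 12, "body": "Possible SQL injection in this query."},
    {"line": 99, "body": "Unused import."},  # line 99 is not in the diff
]
print(hallucinated_lines(comments, diff_lines={10, 11, 12, 13}))
# → only the line-99 comment is flagged
```

Cheap deterministic checks like this should run first; the LLM-as-judge is reserved for what code cannot verify, such as whether a suggested fix is actually correct.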
Techniques Used
Ch 2 (Anatomy): Each prompt has clear role, context, task, format, constraints
Ch 3 (Few-shot): Triage step uses 3 examples of trivial vs non-trivial PRs
Ch 4 (CoT): Analysis step uses step-by-step reasoning
Ch 5 (Decomposition): Prioritization breaks down by category
Ch 6 (System prompt): Review generation uses a senior engineer persona
Ch 7 (Structured output): Final output is JSON for GitHub API
Ch 10 (RAG): Injects the team’s style guide as context
Ch 13 (Evaluation): LLM-as-judge validates the output
Key insight: Real-world prompt engineering is rarely a single prompt. It’s a pipeline where each step uses the right technique for that specific sub-task. The art is knowing which technique to apply where — and that’s what this course has been building toward.
Model Selection: Matching the Model to the Task
Not every task needs the most expensive model — and sometimes the most expensive isn’t the best
The Model Tiers (as of 2025)
Tier 1: Frontier ($10-30/M tokens). GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro. Best for: complex reasoning, code gen, nuanced analysis, agentic workflows.
Tier 2: Mid-range ($0.50-3/M tokens). GPT-4o-mini, Claude 3.5 Haiku, Gemini 1.5 Flash. Best for: classification, extraction, simple generation, high-volume tasks.
Tier 3: Open-source (self-hosted cost). Llama 3.1 70B, Mistral Large, Qwen 2.5. Best for: on-premise, data privacy, fine-tuning, cost optimization.
Tier 4: Reasoning ($15-60/M tokens). o1, o3, Claude 3.5 with extended thinking. Best for: math, logic, complex multi-step reasoning, scientific analysis.
Selection Strategy
Start small, scale up: Try GPT-4o-mini first. If quality is insufficient, move to GPT-4o. Don’t start with the most expensive model — you might not need it.

Different models for different steps: In a chain, use a cheap model for classification (step 1) and a frontier model for generation (step 4).

Test across models: A prompt optimized for GPT-4o might not work well on Claude. Test your prompts on 2–3 models to avoid vendor lock-in.
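The "start small, scale up" strategy can also be wired directly into code as an escalation loop. A sketch with a stubbed `call_model` and a deliberately crude quality check (both hypothetical; a real check might validate JSON structure or run an LLM-as-judge):

```python
def call_model(model: str, prompt: str) -> str:
    """Stub for a provider API call; rigged so the cheap model fails in this demo."""
    return "" if model == "gpt-4o-mini" else "Detailed answer"

def answer_with_escalation(prompt: str,
                           tiers: tuple = ("gpt-4o-mini", "gpt-4o")) -> tuple:
    """Try the cheapest model first; escalate only when the quality check fails."""
    for model in tiers:
        answer = call_model(model, prompt)
        if answer:  # stand-in for a real quality check
            return model, answer
    return tiers[-1], answer  # fall back to the best tier's answer regardless

print(answer_with_escalation("Summarize this PR"))  # ('gpt-4o', 'Detailed answer')
```

In production you would log how often escalation happens: if the cheap model succeeds 95% of the time, most of your traffic runs at the low tier's price.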
Cost Optimization
Caching: If the same input appears often, cache the output. Most providers offer prompt caching.

Batching: Process multiple inputs in one API call where possible.

Prompt length: Shorter prompts = lower cost. Remove unnecessary instructions once the model consistently gets it right.
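In-process caching takes a few lines with the standard library. Note this is client-side memoization of identical calls, distinct from the provider-side prompt caching mentioned above; the model call is stubbed here:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_llm(prompt: str, temperature: float = 0.0) -> str:
    """Cache identical (prompt, temperature) calls so repeats cost nothing.
    The body is a stub; a real version would call a provider API."""
    return f"response to: {prompt}"

cached_llm("Classify: 'I was charged twice'")
cached_llm("Classify: 'I was charged twice'")  # served from cache, no API cost
print(cached_llm.cache_info().hits)  # 1
```

This only pays off when inputs repeat exactly (temperature 0 helps, since cached outputs are deterministic anyway); for near-duplicate inputs you would need normalization before the cache key.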
Key insight: Model selection is a prompt engineering decision. A well-crafted prompt on a mid-tier model often outperforms a lazy prompt on a frontier model — and costs 10x less. Invest in prompt quality before upgrading the model.
The Future: Will Prompt Engineering Still Matter?
As models get smarter, prompt engineering evolves from “tricks” to “system design”
What’s Changing
Models are getting better at understanding vague prompts. GPT-4 handles ambiguity better than GPT-3.5. Future models will handle it even better. Some “tricks” (like “take a deep breath”) will become unnecessary.

But the fundamentals remain:
• Clear instructions will always outperform vague ones
• Structured output will always need format specifications
• Complex tasks will always benefit from decomposition
• Safety constraints will always need explicit rules
• Evaluation will always need systematic testing
The Evolution
2022: Prompt engineering = "tricks". "Let's think step by step", "You are an expert...", "Take a deep breath".
2024: Prompt engineering = "craft". System prompts, few-shot, CoT, structured output, tool use, evaluation, testing.
2026+: Prompt engineering = "system design". Pipeline architecture, model selection, orchestration, monitoring, cost optimization, safety frameworks, programmatic optimization (DSPy).
Key insight: Prompt engineering isn’t going away — it’s growing up. The “tricks” era is ending. The “system design” era is beginning. The engineers who understand both the fundamentals (how models process prompts) and the systems (how to build reliable AI pipelines) will be the most valuable.
The Complete Course Map
All 14 chapters connected into a mental model you can use for any prompting challenge
Foundation (Ch 1-2)
Ch 1: How prompts work (tokens, probability, temperature)
Ch 2: Anatomy of a great prompt (role, context, task, format, constraints)

These are the building blocks. Every technique in the course builds on them.
Core Techniques (Ch 3-7)
Ch 3: Zero-shot vs few-shot (when to add examples)
Ch 4: Chain-of-thought (make the model reason)
Ch 5: Advanced reasoning (decomposition, ToT, self-reflection)
Ch 6: System prompts & personas (shape behavior)
Ch 7: Output formatting (get structured data)
Applied Skills (Ch 8-11)
Ch 8: Prompt patterns (Critic, Persona Chain, Flip)
Ch 9: Prompting for code (specs, debugging, refactoring)
Ch 10: RAG & context injection (ground in documents)
Ch 11: Multi-turn conversations (manage context)
Mastery (Ch 12-14)
Ch 12: Tool use & function calling (agent engineering)
Ch 13: Evaluation & debugging (prompts as software)
Ch 14: The toolkit (decision trees, chaining, DSPy, the future)
Key insight: This course is not a list of tricks — it’s a progression. Foundation → Techniques → Application → Mastery. When you face a new challenge, start at the decision tree (this chapter), pick the relevant techniques, and combine them. The whole is greater than the sum of its parts.
What’s Next: Your Prompt Engineering Journey
You have the toolkit — now go build something
Immediate Next Steps
1. Pick a real project. Not a toy example. Something you actually need: a support bot, a content pipeline, a code review tool, a data extraction system.
2. Start with the decision tree. What type of task? What quality level? Which techniques apply?
3. Build iteratively. Start with a simple prompt. Test. Add techniques one at a time. Test after each addition. Don't over-engineer from the start.
4. Build a test suite early. Even 10 test cases will save you hours of debugging later.
5. Share what you learn. Prompt engineering is still a young field. Your discoveries help everyone.
The One Rule
Be explicit.

Every technique in this course — few-shot examples, chain-of-thought, system prompts, structured output, tool descriptions, evaluation rubrics — is a form of being explicit.

When a prompt fails, it’s almost always because something was left implicit. The model filled in the blanks differently than you expected.

The cure is always the same: make the implicit explicit.
Key insight: You now have a complete mental model for prompt engineering. Not a collection of tricks, but a systematic approach: understand the task, choose the technique, write the prompt, test it, debug it, ship it, monitor it. The best prompt engineers aren’t the ones who know the most tricks — they’re the ones who are the most systematic. Go build something great.