Ch 14 — The Prompt Engineer’s Toolkit

Decision trees, prompt chaining, DSPy, and the complete mental model for any prompting challenge
Capstone
The Prompt Engineering Decision Tree
When you face a new prompting challenge, start here
The Decision Tree
Q1: What type of task?
• Classification / Extraction → Few-shot with examples (Ch 3) → Structured output (Ch 7) → Temperature = 0
• Reasoning / Analysis → Chain-of-thought (Ch 4) → Decomposition for complex tasks (Ch 5) → Temperature = 0
• Generation / Creative → System prompt with persona (Ch 6) → Critic pattern for quality (Ch 8) → Temperature = 0.7-1.0
• Conversation / Agent → System prompt + tools (Ch 6, 12) → Multi-turn management (Ch 11) → Temperature = 0.3-0.7
Q2: What quality level?
• Quick & cheap (internal tools, prototypes) → Zero-shot or simple few-shot → Smaller model (GPT-4o-mini, Haiku) → Minimal testing
• Reliable (customer-facing, moderate risk) → Few-shot + format constraints → Mid-tier model (GPT-4o, Sonnet) → Test suite of 20-30 cases
• Production-critical (financial, medical, legal) → Full prompt engineering (CoT, patterns) → Best model available → LLM-as-judge + human review → Comprehensive test suite
Key insight: Not every task needs the full toolkit. A simple classification might need just a few-shot prompt with temperature 0. A production chatbot needs system prompts, tools, multi-turn management, evaluation, and monitoring. Match the investment to the stakes.
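The decision tree can even live in code. A purely illustrative sketch (the table and function names are made up for this example; the task types, techniques, and temperatures come from the tree above):

```python
# Hypothetical lookup table encoding the decision tree above.
DECISION_TREE = {
    "classification": {"techniques": ["few-shot examples", "structured output"], "temperature": 0.0},
    "reasoning":      {"techniques": ["chain-of-thought", "decomposition"],      "temperature": 0.0},
    "generation":     {"techniques": ["persona system prompt", "critic pattern"], "temperature": 0.7},
    "conversation":   {"techniques": ["system prompt + tools", "multi-turn management"], "temperature": 0.5},
}

def recommend(task_type: str) -> dict:
    """Return the recommended starting techniques and temperature for a task type."""
    return DECISION_TREE[task_type]

print(recommend("classification")["temperature"])  # 0.0
```

Encoding the tree as data rather than prose makes the defaults testable and easy to override per project.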
Prompt Chaining: When One Prompt Isn’t Enough
Break complex tasks into a pipeline of focused prompts, each one’s output feeding the next
Why Chain?
Some tasks are too complex for a single prompt. Signs you need chaining:

• The prompt is over 500 words and still not specific enough
• The task has distinct phases (analyze → decide → generate)
• You need different models or temperatures for different steps
• Quality degrades when you add more instructions to one prompt
Chain Architecture
Example: Automated Code Review Pipeline

Prompt 1 — Analyze (GPT-4o, temp=0): "List all issues in this code: bugs, security, performance, style. Output as JSON array."
↓ issues_json
Prompt 2 — Prioritize (GPT-4o, temp=0): "Given these issues, rank by severity. Group into: must-fix, should-fix, nice-to-fix. Output as JSON."
↓ prioritized_json
Prompt 3 — Generate Fixes (GPT-4o, temp=0.2): "For each must-fix issue, generate the corrected code. Show the diff."
↓ fixes
Prompt 4 — Write Review (GPT-4o, temp=0.5): "Write a code review comment for the PR author. Tone: constructive, specific. Include the fixes as suggestions."
Benefits of Chaining
1. Each prompt is focused: One task per prompt = higher quality per step

2. Different settings per step: Analysis at temp=0, writing at temp=0.5

3. Intermediate validation: Check the output of each step before proceeding. If step 1 finds no issues, skip steps 2–4.

4. Debuggable: When the final output is wrong, you can inspect each intermediate result to find where it went wrong.

5. Reusable: The “Prioritize” prompt works for any list of issues, not just code review.
Key insight: Prompt chaining is the prompt engineering equivalent of the Unix philosophy: do one thing well, then pipe the output to the next tool. Each prompt in the chain is simpler, more testable, and more reliable than a single mega-prompt.
DSPy: Programmatic Prompt Optimization
Let the framework find the best prompt automatically — prompts as code, not strings
The Problem DSPy Solves
Manual prompt engineering is:

Brittle: Prompts break when you switch models
Tedious: Manually testing wording variations
Unscalable: Can’t optimize 50 prompts by hand

DSPy’s approach: Define what you want (input/output types, evaluation metric), and let the framework optimize the prompt automatically through compilation.
DSPy in 30 Seconds
import dspy

# 1. Define the signature
class ClassifyTicket(dspy.Signature):
    """Classify a support ticket."""
    ticket = dspy.InputField()
    category = dspy.OutputField(
        desc="BILLING, TECHNICAL, ACCOUNT, FEATURE_REQUEST, or NONE")

# 2. Create a module
classify = dspy.Predict(ClassifyTicket)

# 3. Compile with examples
optimizer = dspy.BootstrapFewShot(metric=exact_match)
compiled = optimizer.compile(classify, trainset=labeled_tickets)

# 4. Use it — DSPy found the best prompt + examples automatically
result = compiled(ticket="I was charged twice")
print(result.category)  # BILLING
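The snippet above references `exact_match` and `labeled_tickets` without defining them. A DSPy-style metric is just a function that compares a gold example to a prediction. A standalone sketch, using `SimpleNamespace` stand-ins instead of real DSPy example objects:

```python
from types import SimpleNamespace

def exact_match(example, pred, trace=None):
    """DSPy-style metric: True when the predicted category equals the gold label."""
    return example.category == pred.category

# Stand-ins for a labeled training example and a module prediction:
gold = SimpleNamespace(ticket="I was charged twice", category="BILLING")
pred = SimpleNamespace(category="BILLING")
print(exact_match(gold, pred))  # True
```

The optimizer calls this metric on each candidate prompt's predictions over the trainset, keeping whichever prompt-plus-examples combination scores highest.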
When to Use DSPy vs Manual
Use manual prompt engineering when:
• You have 1–5 prompts
• The task is well-understood
• You need full control over the prompt text
• You’re prototyping

Use DSPy when:
• You have 10+ prompts to optimize
• You have labeled evaluation data
• You need to switch between models
• You want reproducible optimization
• You’re building a production pipeline
Other Tools in the Space
LangChain / LangGraph: Orchestration framework for chains and agents. Good for building complex pipelines.

LlamaIndex: Specialized for RAG pipelines. Best when your primary task is querying documents.

Instructor: Structured output extraction with Pydantic validation. Lightweight, focused.

Guidance: Template-based prompt construction with constrained generation.
Key insight: DSPy represents the future of prompt engineering: defining what you want and letting the system figure out how to prompt for it. But you still need to understand the fundamentals (this course) to define good signatures, choose the right modules, and debug when things go wrong.
Case Study: End-to-End Code Review Pipeline
Applying every technique from this course to a real-world task
The Pipeline
Input: A GitHub PR diff

Step 1: Triage (Ch 3 — Classification). Few-shot classify: does this PR need a detailed review? (Trivial changes like typo fixes → auto-approve.)
Step 2: Analyze (Ch 4 — CoT). "Think step by step: identify bugs, security issues, performance problems, and style violations."
Step 3: Prioritize (Ch 5 — Decomposition). Decompose issues by severity. Filter out noise. Keep must-fix and should-fix.
Step 4: Generate Review (Ch 6 — Persona). System prompt: "You are a senior engineer. Tone: constructive, specific. For each issue, explain WHY it's a problem and provide the fix."
Step 5: Format (Ch 7 — Structured Output). Output as GitHub review comments JSON, mapped to specific lines in the diff.
Step 6: Safety Check (Ch 13 — Evaluation). LLM-as-judge: verify no hallucinated line numbers, no incorrect fixes.
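Part of the step-6 safety check is mechanical and needs no model at all: hallucinated line numbers can be caught with a plain function before (or instead of) an LLM-as-judge call. A sketch, with a hypothetical comment format:

```python
def hallucinated_lines(comments: list[dict], diff_lines: set[int]) -> list[dict]:
    """Return review comments that reference line numbers not present in the diff."""
    return [c for c in comments if c["line"] not in diff_lines]

comments = [
    {"line": 12, "body": "Possible SQL injection in this query."},
    {"line": 99, "body": "Unused import."},  # line 99 is not in the diff
]
print(hallucinated_lines(comments, diff_lines={10, 11, 12, 13}))
# → only the line-99 comment is flagged
```

Cheap deterministic checks like this should run first; the LLM-as-judge is reserved for what code cannot verify, such as whether a suggested fix is actually correct.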
Techniques Used
Ch 2 (Anatomy): Each prompt has clear role, context, task, format, constraints
Ch 3 (Few-shot): Triage step uses 3 examples of trivial vs non-trivial PRs
Ch 4 (CoT): Analysis step uses step-by-step reasoning
Ch 5 (Decomposition): Prioritization breaks down by category
Ch 6 (System prompt): Review generation uses a senior engineer persona
Ch 7 (Structured output): Final output is JSON for GitHub API
Ch 10 (RAG): Injects the team’s style guide as context
Ch 13 (Evaluation): LLM-as-judge validates the output
Key insight: Real-world prompt engineering is rarely a single prompt. It’s a pipeline where each step uses the right technique for that specific sub-task. The art is knowing which technique to apply where — and that’s what this course has been building toward.
Model Selection: Matching the Model to the Task
Not every task needs the most expensive model — and sometimes the most expensive isn’t the best
The Model Tiers (as of 2025)
Tier 1: Frontier ($10-30/M tokens). GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro. Best for: complex reasoning, code gen, nuanced analysis, agentic workflows.
Tier 2: Mid-range ($0.50-3/M tokens). GPT-4o-mini, Claude 3.5 Haiku, Gemini 1.5 Flash. Best for: classification, extraction, simple generation, high-volume tasks.
Tier 3: Open-source (self-hosted cost). Llama 3.1 70B, Mistral Large, Qwen 2.5. Best for: on-premise, data privacy, fine-tuning, cost optimization.
Tier 4: Reasoning ($15-60/M tokens). o1, o3, Claude 3.5 with extended thinking. Best for: math, logic, complex multi-step reasoning, scientific analysis.
Selection Strategy
Start small, scale up: Try GPT-4o-mini first. If quality is insufficient, move to GPT-4o. Don’t start with the most expensive model — you might not need it.

Different models for different steps: In a chain, use a cheap model for classification (step 1) and a frontier model for generation (step 4).

Test across models: A prompt optimized for GPT-4o might not work well on Claude. Test your prompts on 2–3 models to avoid vendor lock-in.
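The "start small, scale up" strategy can also be wired directly into code as an escalation loop. A sketch with a stubbed `call_model` and a deliberately crude quality check (both hypothetical; a real check might validate JSON structure or run an LLM-as-judge):

```python
def call_model(model: str, prompt: str) -> str:
    """Stub for a provider API call; rigged so the cheap model fails in this demo."""
    return "" if model == "gpt-4o-mini" else "Detailed answer"

def answer_with_escalation(prompt: str,
                           tiers: tuple = ("gpt-4o-mini", "gpt-4o")) -> tuple:
    """Try the cheapest model first; escalate only when the quality check fails."""
    for model in tiers:
        answer = call_model(model, prompt)
        if answer:  # stand-in for a real quality check
            return model, answer
    return tiers[-1], answer  # fall back to the best tier's answer regardless

print(answer_with_escalation("Summarize this PR"))  # ('gpt-4o', 'Detailed answer')
```

In production you would log how often escalation happens: if the cheap model succeeds 95% of the time, most of your traffic runs at the low tier's price.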
Cost Optimization
Caching: If the same input appears often, cache the output. Most providers offer prompt caching.

Batching: Process multiple inputs in one API call where possible.

Prompt length: Shorter prompts = lower cost. Remove unnecessary instructions once the model consistently gets it right.
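In-process caching takes a few lines with the standard library. Note this is client-side memoization of identical calls, distinct from the provider-side prompt caching mentioned above; the model call is stubbed here:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_llm(prompt: str, temperature: float = 0.0) -> str:
    """Cache identical (prompt, temperature) calls so repeats cost nothing.
    The body is a stub; a real version would call a provider API."""
    return f"response to: {prompt}"

cached_llm("Classify: 'I was charged twice'")
cached_llm("Classify: 'I was charged twice'")  # served from cache, no API cost
print(cached_llm.cache_info().hits)  # 1
```

This only pays off when inputs repeat exactly (temperature 0 helps, since cached outputs are deterministic anyway); for near-duplicate inputs you would need normalization before the cache key.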
Key insight: Model selection is a prompt engineering decision. A well-crafted prompt on a mid-tier model often outperforms a lazy prompt on a frontier model — and costs 10x less. Invest in prompt quality before upgrading the model.
The Future: Will Prompt Engineering Still Matter?
As models get smarter, prompt engineering evolves from “tricks” to “system design”
What’s Changing
Models are getting better at understanding vague prompts. GPT-4 handles ambiguity better than GPT-3.5. Future models will handle it even better. Some “tricks” (like “take a deep breath”) will become unnecessary.

But the fundamentals remain:
• Clear instructions will always outperform vague ones
• Structured output will always need format specifications
• Complex tasks will always benefit from decomposition
• Safety constraints will always need explicit rules
• Evaluation will always need systematic testing
The Evolution
2022: Prompt engineering = "tricks". "Let's think step by step", "You are an expert...", "Take a deep breath".
2024: Prompt engineering = "craft". System prompts, few-shot, CoT, structured output, tool use, evaluation, testing.
2026+: Prompt engineering = "system design". Pipeline architecture, model selection, orchestration, monitoring, cost optimization, safety frameworks, programmatic optimization (DSPy).
Key insight: Prompt engineering isn’t going away — it’s growing up. The “tricks” era is ending. The “system design” era is beginning. The engineers who understand both the fundamentals (how models process prompts) and the systems (how to build reliable AI pipelines) will be the most valuable.
The Complete Course Map
All 14 chapters connected into a mental model you can use for any prompting challenge
Foundation (Ch 1-2)
Ch 1: How prompts work (tokens, probability, temperature)
Ch 2: Anatomy of a great prompt (role, context, task, format, constraints)

These are the building blocks. Every technique in the course builds on them.
Core Techniques (Ch 3-7)
Ch 3: Zero-shot vs few-shot (when to add examples)
Ch 4: Chain-of-thought (make the model reason)
Ch 5: Advanced reasoning (decomposition, ToT, self-reflection)
Ch 6: System prompts & personas (shape behavior)
Ch 7: Output formatting (get structured data)
Applied Skills (Ch 8-11)
Ch 8: Prompt patterns (Critic, Persona Chain, Flip)
Ch 9: Prompting for code (specs, debugging, refactoring)
Ch 10: RAG & context injection (ground in documents)
Ch 11: Multi-turn conversations (manage context)
Mastery (Ch 12-14)
Ch 12: Tool use & function calling (agent engineering)
Ch 13: Evaluation & debugging (prompts as software)
Ch 14: The toolkit (decision trees, chaining, DSPy, the future)
Key insight: This course is not a list of tricks — it’s a progression. Foundation → Techniques → Application → Mastery. When you face a new challenge, start at the decision tree (this chapter), pick the relevant techniques, and combine them. The whole is greater than the sum of its parts.
What’s Next: Your Prompt Engineering Journey
You have the toolkit — now go build something
Immediate Next Steps
1. Pick a real project. Not a toy example. Something you actually need: a support bot, a content pipeline, a code review tool, a data extraction system.
2. Start with the decision tree. What type of task? What quality level? Which techniques apply?
3. Build iteratively. Start with a simple prompt. Test. Add techniques one at a time. Test after each addition. Don't over-engineer from the start.
4. Build a test suite early. Even 10 test cases will save you hours of debugging later.
5. Share what you learn. Prompt engineering is still a young field. Your discoveries help everyone.
The One Rule
Be explicit.

Every technique in this course — few-shot examples, chain-of-thought, system prompts, structured output, tool descriptions, evaluation rubrics — is a form of being explicit.

When a prompt fails, it’s almost always because something was left implicit. The model filled in the blanks differently than you expected.

The cure is always the same: make the implicit explicit.
Key insight: You now have a complete mental model for prompt engineering. Not a collection of tricks, but a systematic approach: understand the task, choose the technique, write the prompt, test it, debug it, ship it, monitor it. The best prompt engineers aren’t the ones who know the most tricks — they’re the ones who are the most systematic. Go build something great.