Ch 11: Prompt Engineering as Product Design

Ch 11 — Prompt Engineering as Product Design

For LLM products: prompts are your product logic. How to version, test, and iterate.

Index

High Level

code

Prompts = Code

arrow_forward

architecture

Anatomy

arrow_forward

psychology

Techniques

arrow_forward

history

Versioning

arrow_forward

science

Testing

arrow_forward

deployed_code

Production

Click play or press Space to begin...

Step- / 8

code

Prompts Are Product Logic

In LLM products, the prompt IS the specification — treat it like code, not like a message

The Paradigm Shift

In traditional software, product logic lives in code. The PM writes requirements, engineers write code, and the code determines behavior.

In LLM products, the prompt IS the product logic. The system prompt defines what the AI does, how it responds, what it refuses, what tone it uses, and how it handles edge cases. Change the prompt and you change the product.

This means prompt engineering is not a technical task delegated to engineers. It’s a product design discipline that the PM must understand and often directly participate in. The prompt is the most important artifact in an LLM product — more important than the UI, the API, or the model selection.

Why PMs Must Care

Prompts encode product decisions:
• “Always respond in a professional, empathetic tone” — brand voice decision
• “If you’re not sure, say so rather than guessing” — accuracy vs. coverage trade-off
• “Never provide medical, legal, or financial advice” — scope boundary
• “Keep responses under 200 words unless asked for detail” — UX decision
• “If the user seems frustrated, offer to connect them with a human agent” — escalation logic

Every line in a system prompt is a product decision. If the PM doesn’t review and own the prompt, they don’t own the product behavior.

The $2M lesson: An e-commerce company lost $2M in revenue from a single untested prompt change that altered how the AI handled product recommendations. The prompt is production code. Treat it with the same rigor: version control, testing, staged rollouts, and rollback capability.

architecture

Anatomy of a Production Prompt

The five components that make up a well-structured system prompt

1. Role & Identity

Define who the AI is and how it should behave:

“You are a customer support specialist for Acme Corp. You are knowledgeable about our products, policies, and billing processes. You are professional, empathetic, and concise.”

This grounds the model’s behavior in a specific persona and prevents it from acting as a generic chatbot.

2. Task Definition

Define what the AI does:

“Your job is to answer customer questions about billing, shipping, and returns. For each question, provide a clear, accurate answer based on the company knowledge base provided. If you need to look up order details, use the order_lookup function.”

Be specific about the task scope. Vague task definitions produce vague outputs.

3. Constraints & Boundaries

Define what the AI must not do:

“Never provide medical, legal, or financial advice. Never share internal company information. Never make promises about refunds without checking the order status. If asked about topics outside your scope, politely redirect to the appropriate channel.”

4. Output Format

Define how the AI responds:

“Respond in 1–3 paragraphs. Use bullet points for lists of steps. Include relevant order numbers or policy references when applicable. End with a question to confirm the issue is resolved.”

Format instructions dramatically improve consistency. Without them, the model’s output length, structure, and style vary unpredictably.

5. Few-Shot Examples

Provide examples of ideal behavior:

“Here are examples of good responses:

User: Where is my order?
Assistant: I’d be happy to help you track your order. Could you please share your order number? You can find it in your confirmation email. I’ll look up the current status right away.”

Few-shot examples are the most powerful tool for controlling output quality. 3–5 well-chosen examples often improve quality more than pages of instructions.

The 80/20 of prompting: Role + constraints + 3 examples gets you 80% of the way. The remaining 20% comes from iterating on edge cases, refining format instructions, and adding specific handling for failure modes. Start simple, then add complexity based on evaluation results.

psychology

Prompting Techniques

Five techniques that every PM building LLM products should understand

1. Zero-Shot

Give the model a task with no examples. Just instructions.

“Classify this email as spam or not spam.”

Fastest to implement. Works well for simple, well-defined tasks. Quality varies significantly across inputs.

2. Few-Shot

Provide 3–10 examples of input-output pairs before the actual task.

“Here are examples of how to classify emails: [examples]. Now classify this email.”

Dramatically improves consistency and quality. The examples teach the model the expected pattern, format, and quality bar. This is the most commonly used technique in production.

3. Chain-of-Thought (CoT)

Ask the model to reason step by step before giving its answer.

“Think through this step by step before providing your answer.”

Significantly improves accuracy on reasoning tasks (math, logic, multi-step analysis). The model “shows its work,” which also makes outputs more explainable and debuggable.

4. Structured Output

Instruct the model to respond in a specific format (JSON, XML, markdown, table).

“Respond with a JSON object containing: category (string), confidence (number 0–1), reasoning (string).”

Essential for programmatic consumption of model outputs. Many APIs now support forced JSON output, ensuring the response is always parseable.

5. System + User Separation

Use the system message for persistent instructions and the user message for per-request context.

System: “You are a legal document analyzer. Extract key terms, obligations, and deadlines.”
User: “[paste document here]”

This separation keeps the product logic (system) separate from the user input (user), making it easier to version, test, and maintain the product behavior independently of user inputs.

Technique selection: Start with zero-shot. If quality is insufficient, add few-shot examples. If reasoning is needed, add chain-of-thought. If you need parseable output, add structured output. Layer techniques incrementally — each adds tokens (cost) and latency. Only add what measurably improves quality.

history

Prompt Versioning & Management

Treating prompts with the same rigor as production code

Why Versioning Matters

Without version control, prompt management becomes chaos:

• Lost work: Someone overwrites a prompt that was working well. No way to recover it.
• No comparison: “The AI was better last week.” Which version was running last week? Nobody knows.
• No audit trail: Who changed the prompt? When? Why? Critical for compliance and debugging.
• Wasted compute: Teams re-run experiments they’ve already run because results weren’t tracked. Studies show 30–40% of prompt engineering time is wasted on re-work.

Every prompt change should create an immutable version with metadata: author, timestamp, changelog, and evaluation results.

Prompt Management in Practice

Version every change. Even a single-word edit creates a new version. Diff viewers show exactly what changed.

Track performance per version. Each version has associated metrics: evaluation scores, token usage, latency, cost, user feedback. You can compare any two versions side by side.

Rollback is non-destructive. Restoring a previous version creates a new version with the old content. The full history is preserved.

Separate prompts from code deploys. Prompt changes should be deployable independently of code changes. This allows product and content teams to iterate on prompts without engineering releases — fetching prompts at runtime with caching and fallback support.

Environment stages. Development → Staging → Production. A prompt must pass evaluation in staging before reaching production users.

Scale thresholds: Prompt management becomes urgent when you exceed: 10,000+ queries daily, 2+ people modifying prompts, or 5+ distinct prompts in the product. Below these thresholds, a shared document might suffice. Above them, you need proper tooling (Promptfoo, Humanloop, PostHog, or custom solutions).

science

Testing Prompts Systematically

Moving from “it looks good” to “it passes the eval suite”

The Evaluation Suite

Every production prompt needs an evaluation suite — a set of test cases that the prompt must pass before deployment:

Happy path tests (40%):
Common, straightforward inputs. “What’s your return policy?” The AI should handle these perfectly.

Edge case tests (30%):
Unusual but legitimate inputs. Misspellings, multiple questions in one message, very long inputs, inputs in unexpected languages.

Adversarial tests (20%):
Inputs designed to break the AI. Prompt injection attempts, jailbreaking, requests for out-of-scope content, attempts to extract the system prompt.

Regression tests (10%):
Cases that previously failed and were fixed. Ensure old bugs don’t return when the prompt changes.

The Testing Workflow

Step 1: Write the prompt change.
Modify the system prompt, few-shot examples, or output format.

Step 2: Run the eval suite.
Execute all test cases against the new prompt. Compare results to the previous version.

Step 3: Review failures.
Did any previously passing tests fail? (Regression.) Did any new tests pass? (Improvement.) Are there new failure modes?

Step 4: Human review.
For subjective quality (tone, helpfulness), have 2–3 people review a sample of outputs. Automated metrics catch factual errors; humans catch quality issues.

Step 5: Deploy to staging.
Run the new prompt on a small percentage of real traffic. Monitor metrics for 24–48 hours.

Step 6: Promote to production.
If staging metrics hold, roll out to all users. Keep the previous version ready for instant rollback.

The non-determinism challenge: LLMs produce different outputs for the same input. Run each test case 3–5 times and evaluate the distribution of responses, not a single response. A prompt that works 4 out of 5 times is different from one that works 5 out of 5 times — and both are different from one that works 2 out of 5.

token

Token Economics

Every word in your prompt costs money — the PM must manage the token budget

How Tokens Work

LLMs process text in tokens — roughly 3/4 of a word. You pay for both input tokens (your prompt + context) and output tokens (the model’s response).

Cost example (GPT-4o, 2026 pricing):
• Input: ~$2.50 per million tokens
• Output: ~$10 per million tokens
• A 2,000-token system prompt + 500-token user input + 500-token response = ~$0.008 per request
• At 100K requests/day = ~$800/day = ~$24K/month

The system prompt is sent with every single request. A 2,000-token system prompt at 100K requests/day costs ~$500/month just for the prompt alone. Every word matters.

Managing the Token Budget

Context window allocation:
Modern models have large context windows (128K–1M tokens), but using more tokens costs more and increases latency. Allocate your budget:

• System prompt: 500–2,000 tokens (fixed cost per request)
• Few-shot examples: 200–1,000 tokens (high impact per token)
• RAG context: 500–4,000 tokens (variable, based on retrieval)
• User input: Variable (set a max length)
• Response: Set a max output length

Optimization techniques:
• Compress instructions (remove redundancy, use concise language)
• Use dynamic few-shot (select relevant examples per query, not all examples every time)
• Cache system prompts where the API supports it
• Use smaller models for simpler tasks (GPT-4o-mini vs. GPT-4o)

The cost-quality trade-off: Longer prompts with more examples generally produce better outputs but cost more. The PM must find the sweet spot: the minimum prompt that achieves the quality threshold. Track cost per query alongside quality metrics. A 10% quality improvement that doubles cost may not be worth it.

warning

Prompt Pitfalls

Common mistakes that cause production failures in LLM products

Pitfalls 1–4

1. Prompt injection vulnerability.
Users can include instructions in their input that override your system prompt: “Ignore all previous instructions and...” Without defenses, your carefully crafted prompt is bypassed. Mitigation: input sanitization, instruction hierarchy, and adversarial testing.

2. Prompt-model coupling.
A prompt optimized for GPT-4 may perform poorly on Claude or Gemini. When the provider updates the model, your prompt may break. Mitigation: test prompts against multiple models regularly.

3. Over-engineering the prompt.
A 5,000-token system prompt with 50 rules and 20 examples. The model gets confused by contradictory instructions. Longer prompts don’t always mean better outputs — they can degrade quality by overwhelming the model’s attention.

4. No fallback for prompt failures.
The prompt assumes the model always follows instructions. It doesn’t. What happens when the model ignores a constraint? Without a programmatic fallback (output validation, content filtering), failures reach users.

Pitfalls 5–7

5. Testing on cherry-picked examples.
“The prompt works great on these 5 examples!” But fails on the 500 examples in the eval suite. Always test on a representative, diverse set — not the examples you used to write the prompt.

6. Ignoring temperature and parameters.
Temperature controls randomness. Temperature 0 = deterministic (good for classification). Temperature 0.7 = creative (good for writing). Using the wrong temperature for the task produces inconsistent or bland outputs.

7. No monitoring after deployment.
The prompt works today. But user inputs change, model behavior drifts with updates, and edge cases accumulate. Without monitoring, quality degrades silently. Track output quality metrics continuously, not just at deployment.

Defense in depth: Never rely on the prompt alone for safety or correctness. Layer defenses: prompt instructions (first line), output validation (check format, length, content), content filtering (block harmful outputs), and programmatic guardrails (never execute actions without confirmation). The prompt is one layer in a multi-layer safety system.

deployed_code

The Prompt Engineering Workflow

A production-grade process for managing prompts as product artifacts

The Workflow

1. Define the behavior (PM)
What should the AI do? What should it refuse? What tone? What format? Write this as product requirements, not as a prompt.

2. Draft the prompt (PM + Prompt Engineer)
Translate requirements into a system prompt with role, task, constraints, format, and examples. Start simple.

3. Evaluate against the test suite (Automated)
Run the prompt against 100–500 test cases. Measure quality metrics. Compare to the previous version.

4. Iterate (PM + Prompt Engineer)
Review failures. Adjust the prompt. Re-evaluate. Repeat until the quality threshold is met. Typically 5–15 iterations.

5. Human review (Domain experts)
Have qualified reviewers assess a sample of outputs for quality, accuracy, and safety.

6. Staged rollout (Engineering)
Deploy to 5% of traffic. Monitor for 24–48 hours. If metrics hold, expand to 100%.

7. Monitor continuously (Automated + PM)
Track quality, cost, latency, and user feedback. Alert on regressions.

PM Ownership

In this workflow, the PM owns:

• The behavior definition (step 1) — What the AI should do
• The evaluation criteria (step 3) — What “good” looks like
• The quality threshold (step 4) — When it’s good enough to ship
• The rollout decision (step 6) — When to go live
• The monitoring review (step 7) — Whether quality is holding

The prompt engineer owns the technical implementation (steps 2, 4). Engineering owns the infrastructure (steps 3, 6, 7). But the PM is the decision-maker at every gate.

The bottom line: Prompt engineering is the new product design for LLM products. The prompt defines behavior, the evaluation suite defines quality, and the versioning system provides control. PMs who master this workflow ship better LLM products faster, with fewer production incidents and more predictable quality. It’s not optional — it’s the core competency of AI product management in the LLM era.

arrow_back Ch 10: Evaluation & Metrics That Matter Ch 12: RAG & Knowledge Integration arrow_forward