The Debugging Workflow
LLM debugging is fundamentally different from traditional software debugging. There’s no stack trace, no line number, and often no deterministic reproduction. Instead, use this systematic approach:
1. Reproduce: Can you trigger the same failure with the same input?
2. Isolate: Which component is failing? Retrieval? Generation? Guardrails?
3. Compare: What’s different between working and failing cases?
4. Hypothesize: Form a theory about the root cause.
5. Test: Modify one variable and re-run to confirm the hypothesis.
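The loop above can be sketched as a small harness. This is a minimal illustration, not a real pipeline: `run_pipeline` is a hypothetical stand-in for your retrieval-plus-generation stack, and the helper names (`reproduce`, `compare`) are invented for this example.

```python
def run_pipeline(query, retriever="default"):
    # Placeholder pipeline: a real version would call your retriever,
    # assemble the prompt, and invoke the model. It returns one entry
    # per component so each stage can be inspected in isolation.
    docs = ["doc-a"] if retriever == "default" else []
    return {"query": query, "docs": docs,
            "response": "ok" if docs else "I don't know"}

def reproduce(query, runs=3, **kwargs):
    """Step 1: re-run the same input and check whether the failure is stable."""
    outputs = [run_pipeline(query, **kwargs)["response"] for _ in range(runs)]
    return len(set(outputs)) == 1, outputs

def compare(working_query, failing_query, **kwargs):
    """Step 3: diff per-component outputs of a working and a failing case."""
    good = run_pipeline(working_query, **kwargs)
    bad = run_pipeline(failing_query, **kwargs)
    return {k: (good[k], bad[k]) for k in good if good[k] != bad[k]}
```

Step 5 then becomes a single call with one variable changed, e.g. `run_pipeline(query, retriever="experimental")`, so any behavior change can be attributed to that variable alone.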
Trace-Based Debugging
Traces are your primary debugging tool. For a failing request, examine:
• Input: Was the user query unusual or ambiguous?
• Retrieval: Were the right documents retrieved? Were they relevant?
• Prompt: What did the full prompt look like with context injected?
• Model response: What exactly did the model output?
• Post-processing: Did guardrails modify or block the response?
Most failures become obvious once you can see the full trace.
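A trace only helps if it captures every stage in the checklist above. One way to enforce that is a record type with one field per stage; this is a sketch, and the field names are illustrative rather than taken from any particular tracing library.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    user_input: str        # Input: the raw user query
    retrieved_docs: list   # Retrieval: documents (and ideally their scores)
    full_prompt: str       # Prompt: exactly what the model saw, context injected
    model_response: str    # Model response: raw output, before guardrails
    final_response: str    # Post-processing: what the user actually received

    def guardrails_intervened(self) -> bool:
        # Quick check for the last bullet: did post-processing
        # modify or block the raw model output?
        return self.final_response != self.model_response
```

Making the raw and final responses separate fields means the guardrail question can be answered mechanically instead of by eyeballing logs.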
Pro tip: Build a “debug mode” that lets you replay any production trace locally with full visibility. This means storing the complete input, retrieved documents, and model output for every request (or a sample). The storage cost is minimal compared to the debugging time saved.