Why Trajectories Matter
Two agents can both complete a task, yet one needs 5 steps and $0.10 while the other needs 50 steps and $5.00. Trajectory evaluation captures efficiency, reasoning quality, and cost, which are dimensions that task completion alone misses.
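To make the comparison concrete, here is a minimal sketch of summarizing a trajectory by step count and cost. The Step record, its fields, and the per-step costs are illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str      # hypothetical label for what the agent did
    cost_usd: float  # hypothetical per-step cost (e.g. model tokens)


def summarize(steps, succeeded):
    """Summarize one trajectory: outcome, number of steps, total cost."""
    total_cost = sum(step.cost_usd for step in steps)
    return {
        "succeeded": succeeded,
        "num_steps": len(steps),
        "total_cost_usd": round(total_cost, 2),
    }


# Two made-up agents that both finish the task, at very different cost.
agent_a = [Step("tool_call", 0.02) for _ in range(5)]
agent_b = [Step("tool_call", 0.10) for _ in range(50)]

summary_a = summarize(agent_a, succeeded=True)
summary_b = summarize(agent_b, succeeded=True)
print(summary_a)
print(summary_b)
```

Both summaries report success, so only the step and cost fields separate the two agents.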
Shepherd’s Failure Patterns
Research analyzing 3,908 agent trajectories across 18 models identified three distinct failure patterns:
• Failure-to-Act: The agent fails to interact with the environment
• Out-of-Order Actions: The agent issues interdependent actions simultaneously instead of in dependency order
• False Termination: The agent prematurely assumes the task is complete
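The three patterns above can be approximated with simple heuristics over a logged trajectory. The sketch below is an assumption-laden illustration, not Shepherd's method: the Action schema (action_id, turn, depends_on) and the declared_done / task_passed flags are hypothetical fields you would need to log yourself.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailurePattern(Enum):
    FAILURE_TO_ACT = "failure-to-act"
    OUT_OF_ORDER = "out-of-order actions"
    FALSE_TERMINATION = "false termination"


@dataclass
class Action:
    action_id: str                                  # hypothetical unique id
    turn: int                                       # agent turn that issued the action
    depends_on: list = field(default_factory=list)  # ids whose output this action needs


def tag_failures(actions, declared_done, task_passed):
    """Return the set of failure patterns that these heuristics detect."""
    tags = set()
    # Failure-to-Act: the agent never touched the environment.
    if not actions:
        tags.add(FailurePattern.FAILURE_TO_ACT)
    # Out-of-Order: an action was issued in the same turn as one it depends on.
    turn_of = {a.action_id: a.turn for a in actions}
    for a in actions:
        if any(turn_of.get(dep) == a.turn for dep in a.depends_on):
            tags.add(FailurePattern.OUT_OF_ORDER)
    # False Termination: the agent said it was done, but the task check failed.
    if declared_done and not task_passed:
        tags.add(FailurePattern.FALSE_TERMINATION)
    return tags


# "write" needs the result of "read" but was issued in the same turn.
actions = [Action("read", turn=1), Action("write", turn=1, depends_on=["read"])]
tags = tag_failures(actions, declared_done=True, task_passed=False)
print(sorted(t.value for t in tags))
```

Real trajectories rarely carry explicit dependency metadata, so in practice the out-of-order check would need to be inferred from tool arguments or assigned by a judge.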
LLM-as-Judge for Trajectories
Use an LLM judge to evaluate trajectory quality by asking:
1. Was the planning phase adequate?
2. Were tool calls appropriate and efficient?
3. Did the agent recover from errors?
4. Was the final answer correct?
Shepherd used this approach to improve agent performance from 21% to 31% while cutting costs by 57%.
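A minimal sketch of wiring the four questions above into a judge follows. It assumes a caller-supplied call_llm function (hypothetical: any function that takes a prompt string and returns the model's text reply) and assumes the model can be instructed to answer with a JSON list of booleans. The fake_llm stub stands in for a real model so the example runs offline.

```python
import json

RUBRIC = [
    "Was the planning phase adequate?",
    "Were tool calls appropriate and efficient?",
    "Did the agent recover from errors?",
    "Was the final answer correct?",
]


def build_judge_prompt(trajectory_text):
    """Render the rubric and the trajectory into a single grading prompt."""
    questions = "\n".join(f"{i}. {q}" for i, q in enumerate(RUBRIC, start=1))
    return (
        "You are grading an agent trajectory.\n"
        "Answer each question with true or false.\n"
        f"{questions}\n"
        "Reply with a JSON list of four booleans, e.g. [true, false, true, true].\n\n"
        f"Trajectory:\n{trajectory_text}"
    )


def judge_trajectory(trajectory_text, call_llm):
    """Ask the judge model and map each rubric question to its boolean answer."""
    reply = call_llm(build_judge_prompt(trajectory_text))
    answers = json.loads(reply)
    valid = (
        isinstance(answers, list)
        and len(answers) == len(RUBRIC)
        and all(isinstance(a, bool) for a in answers)
    )
    if not valid:
        raise ValueError(f"Unexpected judge reply: {reply!r}")
    return dict(zip(RUBRIC, answers))


def fake_llm(prompt):
    # Stand-in for a real model call; always returns the same verdict.
    return "[true, true, false, true]"


verdict = judge_trajectory("step 1: search docs\nstep 2: answer", fake_llm)
for question, passed in verdict.items():
    print(f"{'PASS' if passed else 'FAIL'}  {question}")
```

Validating the reply before using it matters because judge models do not always follow the requested output format; a production version would likely retry or log the malformed reply rather than raise.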
Practical tip: Log every agent trajectory in production. When failures occur, trajectory logs are your debugging tool. Pattern-match failures to identify systematic issues (e.g., “always fails on multi-file edits”).
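One way to follow this tip is to write each trajectory as a JSON line and then count failures by task type. The sketch below uses only the standard library. The record fields (task_id, task_type, steps, succeeded) are assumptions chosen for illustration, and an in-memory buffer stands in for a real log file.

```python
import io
import json
from collections import Counter


def log_trajectory(stream, task_id, task_type, steps, succeeded):
    """Append one trajectory to the log as a single JSON line."""
    record = {
        "task_id": task_id,
        "task_type": task_type,  # hypothetical label used for grouping
        "steps": steps,
        "succeeded": succeeded,
    }
    stream.write(json.dumps(record) + "\n")


def failure_counts_by_task_type(lines):
    """Count failed trajectories per task type to surface systematic issues."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if not record["succeeded"]:
            counts[record["task_type"]] += 1
    return counts


# In production this would be a file opened in append mode.
log = io.StringIO()
log_trajectory(log, "t1", "multi-file edit", ["open a.py", "edit b.py"], False)
log_trajectory(log, "t2", "single-file edit", ["edit a.py"], True)
log_trajectory(log, "t3", "multi-file edit", ["open c.py"], False)

counts = failure_counts_by_task_type(log.getvalue().splitlines())
print(counts.most_common())
```

With these made-up records, the multi-file edit type accounts for every failure, which is the kind of systematic signal the logs are meant to expose.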