Method
Multi-agent conversations span many turns. Evaluate them on four dimensions: conversation coherence (do agents stay on topic?), information flow (does relevant information reach the right agent?), turn efficiency (how many turns does it take to reach a decision?), and termination quality (did the conversation end at the right time?). Use LLM-as-judge on sampled transcripts, with a rubric for each dimension, and compare the results against golden transcripts built from expert-annotated examples.
Pattern
Coherence: on-topic?
Info flow: right agent got it?
Efficiency: turns to decision
Termination: timely end?
// LLM-as-judge + golden transcripts
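The pattern above can be sketched in Python as a per-dimension rubric scored by a judge on a random sample of transcripts. This is a minimal sketch under stated assumptions, not a definitive implementation: the 1-5 scale, the prompt wording, the transcript shape, and the `judge` callable (prompt in, raw reply out) are all hypothetical and stand in for whatever model client is actually used.

```python
import random
import re
from statistics import mean
from typing import Callable, Dict, List

# One rubric question per dimension. The 1-5 scale is an assumption.
RUBRICS: Dict[str, str] = {
    "coherence": "Do the agents stay on topic throughout the conversation?",
    "info_flow": "Does relevant information reach the agent that needs it?",
    "efficiency": "Is the number of turns taken to reach a decision reasonable?",
    "termination": "Did the conversation end at the right time?",
}

Transcript = List[Dict[str, str]]  # assumed shape: [{"agent": ..., "text": ...}, ...]
Judge = Callable[[str], str]       # hypothetical: takes a prompt, returns the raw reply


def build_prompt(transcript: Transcript, question: str) -> str:
    """Render the numbered transcript together with one rubric question."""
    turns = "\n".join(
        f"[{i + 1}] {t['agent']}: {t['text']}" for i, t in enumerate(transcript)
    )
    return (
        "Rate the conversation below from 1 (poor) to 5 (excellent).\n"
        f"Question: {question}\n\n{turns}\n\nReply with a single integer."
    )


def parse_score(reply: str) -> int:
    """Extract the first standalone digit in 1..5 from the judge's reply."""
    match = re.search(r"\b[1-5]\b", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return int(match.group())


def score_transcript(transcript: Transcript, judge: Judge) -> Dict[str, int]:
    """Ask the judge one question per dimension and collect the scores."""
    return {
        dim: parse_score(judge(build_prompt(transcript, question)))
        for dim, question in RUBRICS.items()
    }


def evaluate_sample(
    transcripts: List[Transcript], judge: Judge, sample_size: int, seed: int = 0
) -> Dict[str, float]:
    """Score a random sample of transcripts and average each dimension."""
    if not transcripts:
        raise ValueError("no transcripts to evaluate")
    rng = random.Random(seed)
    sample = rng.sample(transcripts, min(sample_size, len(transcripts)))
    scores = [score_transcript(t, judge) for t in sample]
    return {dim: mean(s[dim] for s in scores) for dim in RUBRICS}
```

Running the same `score_transcript` on the golden transcripts gives a baseline that the sampled averages can be compared against. Keeping the judge as a plain callable also makes it easy to swap in a stub for testing.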
Key insight: A conversation that takes 20 turns to reach the right answer when 5 would do has a quality problem, not just a cost problem.
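One way to make turn efficiency measurable is to express it as a ratio against the turn count of a golden transcript for the same task. The sketch below assumes such a golden turn count is available. The names `turn_overhead` and `flag_inefficient` and the 2.0 threshold are illustrative choices, not part of the method described above.

```python
from typing import Dict, List


def turn_overhead(observed_turns: int, golden_turns: int) -> float:
    """Ratio of turns actually used to turns in the golden transcript."""
    if golden_turns <= 0:
        raise ValueError("golden_turns must be positive")
    return observed_turns / golden_turns


def flag_inefficient(
    runs: Dict[str, int], golden_turns: int, max_overhead: float = 2.0
) -> List[str]:
    """Return ids of runs whose overhead exceeds the (assumed) threshold."""
    return [
        run_id
        for run_id, turns in runs.items()
        if turn_overhead(turns, golden_turns) > max_overhead
    ]


# The 20-turn vs 5-turn case from the key insight:
print(turn_overhead(20, 5))  # 4.0
```

A run flagged this way is a candidate for closer review on the other dimensions, since extra turns often come from drifting off topic or from information not reaching the right agent.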