Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
New framework tackles two of the hardest problems in production AI: evaluating multi-turn conversations and optimizing complex multi-agent systems.
A research team has published a blueprint titled 'Build, Judge, Optimize' that addresses underexplored challenges in moving multi-agent AI systems from prototype to production, focusing on conversational shopping assistants (CSAs). The paper argues that CSAs are a compelling use case for agentic AI, but that production deployment exposes two significant hurdles: evaluating multi-turn interactions and optimizing tightly coupled multi-agent systems. Both are amplified in domains like grocery shopping, where requests are underspecified and constrained by budget and inventory.
The researchers' solution is a two-part framework. First, they develop a structured, multi-faceted evaluation rubric that decomposes end-to-end shopping quality into specific dimensions, paired with a calibrated LLM-as-judge pipeline aligned with human annotations. Second, they investigate two complementary prompt-optimization strategies built on GEPA, a state-of-the-art prompt optimizer: Sub-agent GEPA, which optimizes individual agent nodes, and a novel system-level approach called MAMuT GEPA, which jointly optimizes prompts across agents using multi-turn simulation. The team is releasing rubric templates and design guidance to help practitioners build more reliable, production-scale CSAs.
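To make the first part concrete, here is a minimal sketch of how a multi-dimension rubric score and a judge-versus-human agreement check might be computed. The dimension names, weights, 1-5 scale, and the agreement metric are all assumptions chosen for illustration; the paper's actual rubric and calibration procedure are not reproduced here.

```python
# Hypothetical rubric dimensions and weights (illustrative only, not the paper's rubric).
RUBRIC_WEIGHTS = {
    "intent_understanding": 0.4,
    "constraint_adherence": 0.4,  # e.g. budget and inventory limits
    "conversation_flow": 0.2,
}


def overall_score(dimension_scores: dict[str, int]) -> float:
    """Weighted average of per-dimension scores, each assumed to be on a 1-5 scale."""
    missing = set(RUBRIC_WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[d] * dimension_scores[d] for d in RUBRIC_WEIGHTS)


def agreement_rate(judge_labels: list[int], human_labels: list[int], tolerance: int = 0) -> float:
    """Fraction of items where the judge's score is within `tolerance` of the human score."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_labels, human_labels))
    return hits / len(judge_labels)
```

In a setup like this, a low agreement rate on a given dimension would be a signal to revise that dimension's judge instructions before trusting the judge's scores at scale.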
- Introduces a structured evaluation rubric and LLM-judge pipeline to measure multi-turn conversational AI quality, a major hurdle for production systems.
- Investigates two complementary prompt-optimization strategies built on GEPA: Sub-agent GEPA for individual agent nodes, and the novel system-level MAMuT GEPA for jointly optimizing multi-agent prompts via multi-turn simulation.
- Provides a practical blueprint and releases templates to help developers build robust consumer assistants for complex, constrained tasks like grocery shopping.
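The difference between per-agent and system-level optimization can be illustrated with a toy example. The sketch below is not GEPA: it swaps in a plain exhaustive search over two hypothetical prompt variants per agent, and `simulate_score` is a stand-in for a judged multi-turn simulation with invented numbers. It only shows why tuning agents in isolation can miss a better joint configuration when agents are tightly coupled.

```python
from itertools import product

# Hypothetical prompt variants for two hypothetical agents.
CANDIDATES = {
    "planner": ["terse", "detailed"],
    "recommender": ["terse", "detailed"],
}

# Invented scores standing in for a judged multi-turn simulation.
# The agents are coupled: mismatched styles score worse than matched ones.
_SCORES = {
    ("terse", "terse"): 0.60,
    ("detailed", "terse"): 0.55,
    ("terse", "detailed"): 0.50,
    ("detailed", "detailed"): 0.80,
}


def simulate_score(prompts: dict[str, str]) -> float:
    """Return the toy end-to-end score for one prompt assignment."""
    return _SCORES[(prompts["planner"], prompts["recommender"])]


def optimize_per_agent(baseline: dict[str, str]) -> dict[str, str]:
    """Tune each agent alone, holding the others at the baseline, then combine the picks."""
    chosen = {}
    for agent, options in CANDIDATES.items():
        chosen[agent] = max(options, key=lambda p: simulate_score({**baseline, agent: p}))
    return chosen


def optimize_jointly() -> dict[str, str]:
    """Search over all prompt combinations and score the whole system each time."""
    agents = list(CANDIDATES)
    best = max(
        product(*(CANDIDATES[a] for a in agents)),
        key=lambda combo: simulate_score(dict(zip(agents, combo))),
    )
    return dict(zip(agents, best))


baseline = {"planner": "terse", "recommender": "terse"}
per_agent = optimize_per_agent(baseline)
joint = optimize_jointly()
```

With these invented scores, the per-agent search keeps both prompts at "terse" (score 0.60), because changing either agent alone makes things worse, while the joint search finds "detailed" for both (score 0.80). A real system cannot enumerate prompts this way, which is where a prompt optimizer and simulated conversations come in.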
Why It Matters
The blueprint gives companies a systematic, reproducible method for building and improving AI assistants that must handle real-world complexity and user preferences reliably.