Introducing agent quality optimization in AgentCore, now in preview
Replace manual prompt tweaks with data-backed recommendations from production traces.
Amazon Bedrock AgentCore has launched a suite of agent quality optimization features, now in preview. These capabilities complete the observe-evaluate-improve loop for AI agents that reason, plan, and act across complex workflows. The service analyzes production traces and evaluation outputs to generate recommendations for improving system prompts or tool descriptions, targeting a specific reward signal defined by built-in or custom evaluators. Validation comes in two forms: batch evaluation against predefined test datasets or LLM-simulated scenarios, and A/B testing via AgentCore Gateway, which splits live production traffic at configurable percentages and reports results with confidence intervals and statistical significance. This replaces the traditional manual cycle of reading traces, guessing fixes, and deploying blindly.
In practice, the loop works via AgentCore's end-to-end traceability, which captures every model call, tool invocation, and reasoning step as OpenTelemetry-compatible data. Developers point the Recommendations API at their CloudWatch log group, specify the evaluator to optimize against (e.g., goal success rate, tool selection accuracy), and choose whether to optimize system prompts or tool descriptions. The service reflects on the traces and proposes changes without touching tool implementations. Batch evaluation catches regressions on known cases, while A/B testing validates against live user traffic. Organizations like NTT Data report that what once required weeks of manual prompt tuning now happens in rapid, repeatable cycles. This data-driven approach enables continuous agent improvement at scale, addressing the silent degradation of agent quality as models evolve and user behavior shifts.
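To make that flow concrete, here is a minimal sketch of kicking off a recommendation job. The article does not document the preview API surface, so the client name, the methods (`start_recommendation_job`, `get_recommendation_job`), and every parameter shape below are illustrative assumptions, not confirmed calls; check the AgentCore documentation for the actual interface.

```python
import time
import boto3

# Hypothetical client and method names: the preview API surface is not
# shown in this article, so treat every identifier below as illustrative.
client = boto3.client("bedrock-agentcore")

# Point the job at the CloudWatch log group holding the agent's
# OpenTelemetry-compatible traces, name the reward signal (evaluator)
# to optimize against, and pick what to optimize.
job = client.start_recommendation_job(
    traceSource={"cloudWatchLogGroup": "/aws/agentcore/my-agent-traces"},
    evaluator="GOAL_SUCCESS_RATE",        # or a custom evaluator
    optimizationTarget="SYSTEM_PROMPT",   # or "TOOL_DESCRIPTIONS"
)

# Poll until the service has reflected on the traces and produced
# candidate prompt or tool-description changes.
while True:
    status = client.get_recommendation_job(jobId=job["jobId"])
    if status["status"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(30)

# Each recommendation proposes new text without touching tool code.
for rec in status.get("recommendations", []):
    print(rec["target"], "->", rec["proposedText"])
```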
- Recommendations use production traces to optimize system prompts or tool descriptions for a specified reward signal
- Batch evaluation supports LLM-simulated datasets to catch regressions beyond hand-authored scenarios
- A/B testing via Gateway splits live traffic at configurable percentages and reports statistical significance (both validation paths are sketched below)
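Both validation paths can be scripted end to end: a batch evaluation run against a stored or LLM-simulated dataset, then a Gateway A/B experiment on live traffic, promoting the candidate only on a significant win. As above, the client, method, and field names (`start_batch_evaluation`, `create_experiment`, `get_experiment_results`) are assumptions for illustration, not the documented preview API.

```python
import boto3

# All identifiers below are illustrative assumptions; the article describes
# these capabilities but not the concrete preview API surface.
client = boto3.client("bedrock-agentcore")

# 1. Batch evaluation: replay LLM-simulated scenarios (or a predefined test
#    dataset) against the candidate prompt to catch known regressions.
batch = client.start_batch_evaluation(
    agentId="agent-1234",
    promptVersion="v2-candidate",
    dataset={"type": "LLM_SIMULATED", "scenarioCount": 200},
    evaluator="GOAL_SUCCESS_RATE",
)

# 2. A/B test: if the batch run is clean, route a slice of live traffic
#    through the candidate via AgentCore Gateway at a configurable split.
experiment = client.create_experiment(
    gatewayId="gw-5678",
    variants=[
        {"name": "control", "promptVersion": "v1", "trafficPercent": 90},
        {"name": "candidate", "promptVersion": "v2-candidate", "trafficPercent": 10},
    ],
    evaluator="GOAL_SUCCESS_RATE",
)

# 3. Read back per-variant scores with confidence intervals and a
#    statistical-significance verdict before promoting the candidate.
results = client.get_experiment_results(experimentId=experiment["experimentId"])
for variant in results["variants"]:
    low, high = variant["confidenceInterval"]
    print(f'{variant["name"]}: {variant["meanScore"]:.3f} (95% CI {low:.3f}-{high:.3f})')
print("significant:", results["statisticallySignificant"])
```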
Why It Matters
Eliminates manual, intuition-based agent debugging, enabling continuous, data-driven optimization at scale for production AI systems.