Amazon Bedrock AgentCore launches custom code-based evaluators for deterministic agent quality checks
New Lambda evaluators check JSON structure, numerical accuracy, workflow compliance, and PII without consuming costly LLM tokens.
Amazon Bedrock AgentCore Evaluations now supports custom code-based evaluators powered by AWS Lambda, giving developers deterministic, cost-effective quality gates for production agents. Unlike LLM-as-a-Judge approaches, which rely on language models and can be expensive for objective checks, code-based evaluators execute exact logic: regex and structural validation, external data lookups, business rule enforcement, and calls to other AWS services. For a financial market-intelligence agent, this means validating that quoted stock prices fall within a configurable band around the live price (for example, 0.1%), ensuring that a mandatory broker-identification workflow runs before any profile is accessed, enforcing strict JSON schemas on tool outputs, and detecting personally identifiable information (PII) in responses. These evaluators run in two modes: on-demand, for use in CI/CD pipelines as a gate before deployment, and online, for scoring live production traffic. Because they consume no foundation model tokens, they cut evaluation costs for high-volume or repetitive checks.
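To make the first two checks concrete, here is a minimal sketch of the kind of exact logic such an evaluator might run. The function names, the tool names `identify_broker` and `get_profile`, and the idea that a trace can be reduced to an ordered list of tool names are all assumptions for illustration, not part of the AgentCore API.

```python
from typing import Sequence


def price_within_band(quoted: float, reference: float, tolerance: float = 0.001) -> bool:
    """Return True when `quoted` is within `tolerance` (a fraction) of `reference`.

    The default of 0.001 corresponds to the 0.1% band mentioned in the summary.
    """
    if reference <= 0:
        raise ValueError("reference price must be positive")
    return abs(quoted - reference) / reference <= tolerance


def workflow_compliant(tool_calls: Sequence[str], required: str, guarded: str) -> bool:
    """Return True when every `guarded` call is preceded by a `required` call.

    `tool_calls` is a hypothetical, already-extracted list of tool names
    in the order the agent invoked them.
    """
    seen_required = False
    for name in tool_calls:
        if name == required:
            seen_required = True
        elif name == guarded and not seen_required:
            return False
    return True
```

Both functions are pure and give the same answer on every run, which is the property that separates them from a model-based judge.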
The evaluator lifecycle starts with registering a Lambda function with the AgentCore control plane. Developers define scoring logic in Python or Node.js, handle trace inputs from different agent frameworks, and return structured results. The same evaluator can be reused across multiple evaluation scenarios. Built-in LLM-as-a-Judge evaluators remain valuable for judging language quality, helpfulness, and clarity, while code-based evaluators enforce contractual, numerical, and structural requirements. Together, they move agent reliability from 'sounds right' to 'contract-verified.' The post demonstrates four custom evaluators for a financial intelligence agent and explains how to combine them with built-in checks, call services like Amazon Comprehend for PII detection, and set up real-time alerting. This release addresses a key gap in agent evaluation for regulated industries where deterministic correctness is non-negotiable.
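As a rough sketch of what the scoring side of that lifecycle could look like, the handler below validates the structure of one tool output and returns a structured result. The event key `tool_output`, the result keys `score`, `label`, and `explanation`, and the `REQUIRED_FIELDS` contract are hypothetical; the real input and output contract is defined by AgentCore Evaluations and is not given in this summary. A hand-rolled field check stands in for a full JSON Schema validator to keep the example dependency-free.

```python
import json
from typing import Any

# Hypothetical contract for one tool's output: field name -> accepted type(s).
REQUIRED_FIELDS = {"ticker": str, "price": (int, float), "currency": str}


def validate_tool_output(raw: Any) -> list[str]:
    """Return a list of problems found in a JSON tool output (empty means valid)."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError) as exc:  # JSONDecodeError is a ValueError
        return [f"invalid JSON: {exc}"]
    if not isinstance(payload, dict):
        return ["top-level value must be an object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(payload[field], bool) or not isinstance(payload[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Score one tool output; the event and result shapes are assumptions."""
    problems = validate_tool_output(event.get("tool_output", ""))
    return {
        "score": 0.0 if problems else 1.0,
        "label": "FAIL" if problems else "PASS",
        "explanation": "; ".join(problems) or "tool output matches the expected structure",
    }
```

Keeping the validation in a plain function, separate from the handler, lets the same logic be unit-tested locally and reused across evaluation scenarios.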
- Custom evaluators use AWS Lambda for deterministic checks: JSON schema validation, numerical accuracy (e.g., stock price within 0.1%), workflow compliance, and PII detection.
- Evaluators run in on-demand (CI/CD pipeline gate) and online (live production traffic scoring) modes, and cut cost because no foundation model (FM) tokens are consumed per request.
- Lambda functions hold the scoring logic, handle trace inputs from different agent frameworks, and can be reused across evaluation scenarios; they complement built-in LLM-as-a-Judge evaluators for holistic agent quality assessment.
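The PII check and the CI/CD gate from the points above can be sketched as follows. This example swaps in simple regular expressions for PII detection, whereas the post describes calling Amazon Comprehend; the two patterns, the `find_pii` and `gate` names, and the per-evaluator score dictionary are illustrative assumptions, and the patterns are far from complete coverage.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text: str) -> dict[str, int]:
    """Count matches per PII category; categories with no matches are omitted."""
    counts = {name: len(pattern.findall(text)) for name, pattern in PII_PATTERNS.items()}
    return {name: n for name, n in counts.items() if n}


def gate(scores: dict[str, float], threshold: float = 1.0) -> list[str]:
    """Return the names of evaluators scoring below `threshold` (empty means pass)."""
    return sorted(name for name, score in scores.items() if score < threshold)
```

In an on-demand run, a pipeline step could collect one score per evaluator, call `gate`, and exit with a non-zero status when the returned list is not empty, which blocks the deployment.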
Why It Matters
Enables contract-verified agent reliability in regulated domains, moving beyond heuristic scoring to deterministic, production-grade quality gates.