Strands Evals launches multimodal judges to catch image hallucinations
New MLLM-as-a-Judge evaluators verify model outputs against source images automatically.
Strands today unveiled four new multimodal evaluators in its Strands Evals SDK, designed to verify image-to-text outputs against the source image. Text-only judges cannot ground a response in visual data, so they might approve a hallucinated chart trend or an invented product label, or miss an instruction the model ignored. With Gartner predicting that 80% of enterprise software will be multimodal by 2030 (up from under 10% in 2024), automated multimodal evaluation is critical. The new evaluators (Overall Quality, Correctness, Faithfulness, and Instruction Following) each send the image, query, response, and optional reference answer to an MLLM (multimodal large language model) judge on Amazon Bedrock. The judge returns a score (Likert 1–5 or binary) along with a reasoning string, which helps pinpoint why a given output failed.
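To make the judge output concrete, here is a minimal Python sketch of how a verdict with a score and a reasoning string might be represented. The `JudgeVerdict` class, its field names, and the normalization to a 0-to-1 range are illustrative assumptions, not the Strands Evals SDK's own types or behavior.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical container for one judge result (not an SDK type)."""

    evaluator: str                       # e.g. "faithfulness"
    scale: Literal["likert", "binary"]   # which scoring scale the judge used
    raw_score: int                       # 1-5 for Likert, 0 or 1 for binary
    reasoning: str                       # judge's explanation, kept for debugging

    def normalized(self) -> float:
        """Map the raw score onto [0, 1] so both scales can be compared.

        This mapping is an assumption made for illustration.
        """
        if self.scale == "likert":
            if not 1 <= self.raw_score <= 5:
                raise ValueError(f"Likert score out of range: {self.raw_score}")
            return (self.raw_score - 1) / 4
        if self.raw_score not in (0, 1):
            raise ValueError(f"Binary score must be 0 or 1: {self.raw_score}")
        return float(self.raw_score)


verdict = JudgeVerdict(
    evaluator="faithfulness",
    scale="likert",
    raw_score=2,
    reasoning="The response describes a rising trend; the chart shows a decline.",
)
print(verdict.normalized(), "-", verdict.reasoning)
```

Keeping the reasoning string next to the score means a low result can be traced to a specific visual mismatch rather than just flagged.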
These evaluators serve as drop-in replacements for text-only judges in Strands Evals' Case → Experiment → Report workflow and can be plugged into continuous integration to catch visual hallucinations automatically. The same evaluator handles both reference-based and reference-free evaluation, and users can write custom multimodal rubrics for domain-specific criteria. The SDK also provides guidance on the prompt-design choices that improved judge-to-human alignment in experiments, and lets developers choose a judge model on Amazon Bedrock that balances accuracy, cost, and latency. Use cases include image captioning, visual question answering, chart interpretation, document field extraction, OCR, and screenshot summarization. The evaluators are available now; they require Python 3.10+, installation via `pip install strands-agents-evals`, and an AWS account with Bedrock access.
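As a rough illustration of the continuous-integration use, the sketch below gates a build on judge scores. `CaseResult`, `failing_cases`, `ci_exit_code`, and the 0.75 threshold are hypothetical names and values chosen for this example; the SDK's actual report objects may look different.

```python
from typing import Iterable, List, NamedTuple


class CaseResult(NamedTuple):
    """Hypothetical per-case result; scores assumed normalized to [0, 1]."""

    case_name: str
    score: float
    reasoning: str


def failing_cases(results: Iterable[CaseResult], threshold: float = 0.75) -> List[CaseResult]:
    """Return every case whose score falls below the threshold."""
    return [r for r in results if r.score < threshold]


def ci_exit_code(results: Iterable[CaseResult], threshold: float = 0.75) -> int:
    """Print each failure with the judge's reasoning; return 1 if any case failed."""
    failures = failing_cases(results, threshold)
    for r in failures:
        print(f"FAIL {r.case_name}: score={r.score:.2f} reason={r.reasoning}")
    return 1 if failures else 0


results = [
    CaseResult("chart-trend", 0.25, "Response claims growth; chart shows decline."),
    CaseResult("invoice-total", 1.00, "Extracted total matches the document."),
]
exit_code = ci_exit_code(results)
```

In a pipeline, a script like this would typically end with `sys.exit(exit_code)` so that a nonzero code fails the job.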
- Four evaluators target distinct failure modes: Overall Quality, Correctness, Faithfulness, and Instruction Following.
- MLLM judge models on Amazon Bedrock score outputs against the source image, returning a score and reasoning string.
- Supports both reference-based and reference-free evaluation, plus custom rubrics for domain-specific criteria.
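One way to picture a single evaluator covering both reference-based and reference-free modes is a prompt builder that includes the reference answer only when one is supplied. The function name `build_judge_text` and the prompt wording are assumptions for illustration, not the prompts the SDK ships; the image itself would be attached to the judge request separately.

```python
from typing import Optional


def build_judge_text(query: str, response: str, reference: Optional[str] = None) -> str:
    """Assemble the text part of a judge prompt (hypothetical wording).

    With `reference` set, the judge can compare against it (reference-based);
    without it, the judge relies on the image alone (reference-free).
    """
    parts = [
        "You are grading a response to a question about the attached image.",
        f"Question: {query}",
        f"Response: {response}",
    ]
    if reference is not None:
        parts.append(f"Reference answer: {reference}")
    parts.append("Return a score from 1 to 5 and a short justification.")
    return "\n".join(parts)


# Reference-free: no ground-truth answer supplied.
print(build_judge_text("What trend does the chart show?", "Sales rose steadily."))

# Reference-based: the same function, with a reference answer added.
print(build_judge_text(
    "What trend does the chart show?",
    "Sales rose steadily.",
    reference="Sales declined from Q1 to Q4.",
))
```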
Why It Matters
Automates detection of visual hallucinations and factual errors in multimodal apps, replacing costly human review and unreliable text-only proxies.