Research & Papers

Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction

New method flags errors in GPT-5 and Gemini 3 outputs without needing labeled training data.

Deep Dive

Researchers Hui Wen Goh and Jonas Mueller have introduced CONSTRUCT, a novel method for assigning real-time trustworthiness scores to structured outputs from large language models (LLMs). The tool targets the sporadic errors that plague LLM-generated JSON, XML, and other structured data, a reliability problem that remains a major bottleneck for enterprise AI adoption. CONSTRUCT's key innovation is that it works with any LLM, including black-box APIs such as Anthropic's models and reasoning-focused systems that expose no log probabilities, without requiring labeled training data or custom model deployment. It can score both entire outputs and individual fields within complex, nested schemas, pinpointing exactly where errors are most likely.
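To make the per-field scoring idea concrete, here is a minimal sketch of how field-level trust scores over a nested schema might be consumed downstream. The `flag_low_trust_fields` helper, the dotted-path score mapping, and the threshold are all illustrative assumptions, not CONSTRUCT's actual API.

```python
# Hypothetical sketch: surfacing low-trust fields in a nested structured output.
# The dotted-path score dict is an assumed format, not CONSTRUCT's real interface.

def flag_low_trust_fields(output, scores, threshold=0.7, path=""):
    """Recursively walk a nested dict/list output and return (path, score)
    pairs for leaf fields whose trust score falls below the threshold."""
    flagged = []
    if isinstance(output, dict):
        for key, value in output.items():
            child = f"{path}.{key}" if path else key
            flagged += flag_low_trust_fields(value, scores, threshold, child)
    elif isinstance(output, list):
        for i, value in enumerate(output):
            flagged += flag_low_trust_fields(value, scores, threshold, f"{path}[{i}]")
    else:
        # Leaf field: look up its score; unscored fields default to trusted.
        if scores.get(path, 1.0) < threshold:
            flagged.append((path, scores[path]))
    return flagged

# Example extraction with per-field scores (values are made up for illustration).
extraction = {"invoice": {"total": "4,210.00", "vendor": "Acme"},
              "line_items": [{"sku": "A-1"}]}
field_scores = {"invoice.total": 0.41, "invoice.vendor": 0.93,
                "line_items[0].sku": 0.88}
print(flag_low_trust_fields(extraction, field_scores))
# -> [('invoice.total', 0.41)]
```

The payoff of field-level scores is exactly this kind of pinpointing: a reviewer is sent to one suspect field rather than re-checking the whole document.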

The researchers also released one of the first public benchmarks for LLM structured outputs with reliably verified ground-truth data. Across this four-dataset benchmark, CONSTRUCT achieved significantly higher precision and recall in error detection than existing scoring methods when evaluating outputs from top models such as Google's Gemini 3 and OpenAI's GPT-5. That performance makes it practical for production pipelines: development teams can triage their limited human-review resources by focusing only on low-scoring, high-risk outputs. The method offers a scalable answer to a core reliability problem, enabling more confident automation of data extraction and API-calling agent workflows.
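The triage workflow described above reduces to a simple routing rule once each output carries a trust score. The sketch below assumes only (id, score) pairs; the threshold, IDs, and score values are illustrative, and in practice the cutoff would be tuned against the review budget and the error rates reported on the benchmark.

```python
# Hypothetical triage sketch: auto-accept high-trust outputs, route the rest
# to human review. Threshold and batch contents are illustrative only.

def triage(scored_outputs, threshold=0.6):
    """Split (output_id, trust_score) pairs into auto-accepted outputs
    and outputs queued for human review."""
    review = [(oid, s) for oid, s in scored_outputs if s < threshold]
    accept = [(oid, s) for oid, s in scored_outputs if s >= threshold]
    return accept, review

batch = [("doc-001", 0.95), ("doc-002", 0.34), ("doc-003", 0.81), ("doc-004", 0.52)]
accepted, needs_review = triage(batch)
print(f"auto-accepted: {len(accepted)}, sent to review: {len(needs_review)}")
# -> auto-accepted: 2, sent to review: 2
```

The precision/recall trade-off reported in the paper maps directly onto this threshold: raising it catches more errors (higher recall) at the cost of sending more correct outputs to reviewers.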

Key Points
  • CONSTRUCT provides real-time trust scores for LLM outputs like JSON, flagging low-scoring results likely to contain errors.
  • The method is model-agnostic, working with black-box APIs and models (e.g., Anthropic's models, OpenAI's GPT-5) without needing labeled data or custom deployments.
  • Tested on a new 4-dataset benchmark, it outperformed other methods in precision/recall for detecting errors from models like Gemini 3 and GPT-5.

Why It Matters

Enables enterprises to deploy LLMs for critical data tasks by efficiently targeting human review, reducing risk and cost.