Benchmarking Real-Time Question Answering via Executable Code Workflows
A new benchmark reveals even top models like GPT-5.2 fail at real-time questions, scoring just 46%.
A research team led by Wenjie Zhou has unveiled RT-QA, a groundbreaking benchmark designed to test AI's ability to answer questions with real-time, up-to-date information. Unlike static datasets, RT-QA uses an agent-driven pipeline that autonomously generates executable code for web crawling and DOM-based answer extraction, creating a dynamic ground truth that evolves with the web. To ensure robustness, the system includes a self-repair mechanism to adapt to changing website structures. The benchmark spans 12 domains—including Finance and Sports—with 320 Chinese questions categorized by difficulty.
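The paper's pipeline code is not reproduced here, but a minimal sketch conveys the idea of agent-generated crawl-and-extract steps with a self-repair fallback. This example assumes a Python stack with `requests` and `beautifulsoup4`; every name (`extract_answer`, the selectors, the placeholder URL) is hypothetical, not RT-QA's actual implementation.

```python
# Hypothetical sketch of a crawl-and-extract step with self-repair.
# Illustrative only; not RT-QA's actual code.
import requests
from bs4 import BeautifulSoup

PRIMARY_SELECTOR = "span.price"  # selector the agent generated originally
FALLBACK_SELECTORS = [           # repair candidates if the site's DOM changed
    "div.quote .last",
    "td[data-field='price']",
]

def extract_answer(url: str) -> str | None:
    """Fetch the page and pull the ground-truth value out of the DOM."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Try the original selector first; on failure, fall back to alternates,
    # standing in for the agent regenerating its extraction code.
    for selector in [PRIMARY_SELECTOR, *FALLBACK_SELECTORS]:
        node = soup.select_one(selector)
        if node and node.get_text(strip=True):
            return node.get_text(strip=True)
    return None  # signal the pipeline to trigger a full self-repair pass

if __name__ == "__main__":
    print(extract_answer("https://example.com/quote"))  # placeholder URL
```

Because the extracted value is recomputed at query time, the ground truth tracks the live web rather than a frozen snapshot.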
Extensive evaluations of state-of-the-art models like OpenAI's GPT-5.2 and GLM-4.7 exposed significant limitations in real-time adaptability. The top-performing model managed only 46% accuracy. The analysis identified two critical failure modes: 'Lazy Retrieval,' where agents rely on superficial search snippets instead of deeply scanning specific sites (20% of failures), and 'Temporal Confusion,' where agents retrieve historical dates and fail to re-anchor their reasoning to the present (2026). These findings suggest that future AI agents need more than better retrieval; they require robust temporal state management to function reliably in the real world.
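The paper does not spell out a fix in this summary, but one way to picture "temporal state management" is a freshness guard that re-anchors retrieved facts against the agent's current date before they are used. The sketch below assumes snippets carry ISO dates; the function name and the 7-day threshold are illustrative assumptions.

```python
# Hypothetical guard against 'Temporal Confusion': reject retrieved facts
# whose timestamps are stale relative to the agent's anchored 'present'.
# Illustrative only; not the paper's method.
from datetime import date, timedelta

def is_fresh(snippet_date: date, now: date, max_age_days: int = 7) -> bool:
    """A fact is usable only if it is recent and not dated in the future."""
    age = now - snippet_date
    return timedelta(0) <= age <= timedelta(days=max_age_days)

# Example: a 2024 snippet should be discarded when answering in 2026.
now = date(2026, 3, 1)                   # the agent's anchored 'present'
print(is_fresh(date(2024, 5, 10), now))  # False -> trigger a fresh crawl
print(is_fresh(date(2026, 2, 27), now))  # True  -> safe to cite
```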
- RT-QA is a dynamic benchmark using executable code workflows to test real-time QA, spanning 12 domains with 320 questions.
- Even top models like GPT-5.2 and GLM-4.7 struggled, with the best managing only 46% accuracy; failures trace to 'Lazy Retrieval' and 'Temporal Confusion'.
- The benchmark's self-repairing pipeline autonomously generates and runs code for web crawling, creating evolving ground truth data.
Why It Matters
RT-QA exposes a critical gap in current AI agents: without more robust temporal reasoning, they cannot reliably answer time-sensitive questions.