Benchmarking Real-Time Question Answering via Executable Code Workflows
A new benchmark reveals even top models like GPT-5.2 fail at real-time questions, scoring just 46%.
A research team led by Wenjie Zhou has unveiled RT-QA, a groundbreaking benchmark designed to test AI's ability to answer questions with real-time, up-to-date information. Unlike static datasets, RT-QA uses an agent-driven pipeline that autonomously generates executable code for web crawling and DOM-based answer extraction, creating a dynamic ground truth that evolves with the web. To ensure robustness, the system includes a self-repair mechanism to adapt to changing website structures. The benchmark spans 12 domains—including Finance and Sports—with 320 Chinese questions categorized by difficulty.
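The paper's pipeline code is not reproduced here, but a minimal sketch conveys the idea of agent-generated crawl-and-extract steps with a self-repair fallback. This example assumes a Python stack with `requests` and `beautifulsoup4`; every name (`extract_answer`, the selectors, the placeholder URL) is hypothetical, not RT-QA's actual implementation.

```python
# Hypothetical sketch of a crawl-and-extract step with self-repair.
# Illustrative only; not RT-QA's actual code.
import requests
from bs4 import BeautifulSoup

PRIMARY_SELECTOR = "span.price"  # selector the agent generated originally
FALLBACK_SELECTORS = [           # repair candidates if the site's DOM changed
    "div.quote .last",
    "td[data-field='price']",
]

def extract_answer(url: str) -> str | None:
    """Fetch the page and pull the ground-truth value out of the DOM."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Try the original selector first; on failure, fall back to alternates,
    # standing in for the agent regenerating its extraction code.
    for selector in [PRIMARY_SELECTOR, *FALLBACK_SELECTORS]:
        node = soup.select_one(selector)
        if node and node.get_text(strip=True):
            return node.get_text(strip=True)
    return None  # signal the pipeline to trigger a full self-repair pass

if __name__ == "__main__":
    print(extract_answer("https://example.com/quote"))  # placeholder URL
```

Because the extracted value is recomputed at query time, the ground truth tracks the live web rather than a frozen snapshot.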
Extensive evaluations of state-of-the-art models like OpenAI's GPT-5.2 and GLM-4.7 exposed significant limitations in real-time adaptability. The top-performing model managed only 46% accuracy. The analysis identified two critical failure modes: 'Lazy Retrieval,' where agents rely on superficial search snippets instead of deeply scanning specific sites (20% of failures), and 'Temporal Confusion,' where agents retrieve historical dates and fail to re-anchor their reasoning to the present (2026). These findings suggest that future AI agents need more than better retrieval; they require robust temporal state management to function reliably in the real world.
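The paper does not spell out a fix in this summary, but one way to picture "temporal state management" is a freshness guard that re-anchors retrieved facts against the agent's current date before they are used. The sketch below assumes snippets carry ISO dates; the function name and the 7-day threshold are illustrative assumptions.

```python
# Hypothetical guard against 'Temporal Confusion': reject retrieved facts
# whose timestamps are stale relative to the agent's anchored 'present'.
# Illustrative only; not the paper's method.
from datetime import date, timedelta

def is_fresh(snippet_date: date, now: date, max_age_days: int = 7) -> bool:
    """A fact is usable only if it is recent and not dated in the future."""
    age = now - snippet_date
    return timedelta(0) <= age <= timedelta(days=max_age_days)

# Example: a 2024 snippet should be discarded when answering in 2026.
now = date(2026, 3, 1)                   # the agent's anchored 'present'
print(is_fresh(date(2024, 5, 10), now))  # False -> trigger a fresh crawl
print(is_fresh(date(2026, 2, 27), now))  # True  -> safe to cite
```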
- RT-QA is a dynamic benchmark using executable code workflows to test real-time QA, spanning 12 domains with 320 questions.
- Even top models like GPT-5.2 and GLM-4.7 struggled, with the best managing only 46% accuracy; failures trace to 'Lazy Retrieval' and 'Temporal Confusion'.
- The benchmark's self-repairing pipeline autonomously generates and runs code for web crawling, creating evolving ground truth data.
Why It Matters
RT-QA exposes a critical gap in current AI agents: without more robust temporal reasoning, they cannot reliably answer time-sensitive questions.