Research & Papers

Open-source 9-task benchmark for coding-agent retrieval augmentation. Per-task deltas +0.010 to +0.320, all evals reproducible [P]

Coding-agent performance improves by as much as +0.32 on standard task metrics when retrieval-augmented technique selection is added, across 9 tasks.

Deep Dive

A new open-source benchmark suite, paper-lantern-challenges, provides a rigorous evaluation of how retrieval-augmented technique selection affects coding-agent performance. The suite measures the delta between a baseline agent and one with access to a retrieval tool across 9 diverse software tasks: test generation (mutation score), text-to-SQL (execution accuracy), PDF extraction, contract extraction, PR review, text classification, few-shot prompt selection, LLM routing, and summarization evaluation. The same coding agent (Claude Opus 4.6 as planner, Gemini Flash 3 as task model) runs on identical input data and evaluation scripts in both conditions. The independent variable is whether the agent can call a retrieval tool over computer science literature before writing its solution. Each task uses a standard quantitative metric, and the entire setup is reproducible with a free Gemini API key in roughly 10 minutes per task.
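As a rough illustration of that protocol, the sketch below computes a per-task delta between the two conditions. The file layout, metric function, and helper names are assumptions for illustration, not the repository's actual scripts.

```python
# Minimal sketch of the delta computation, assuming each task exposes a scalar
# metric in [0, 1]; paths, file format, and metric are hypothetical.
import json


def score_run(predictions_path: str, metric) -> float:
    """Load a run's predictions and score them with the task's metric."""
    with open(predictions_path) as f:
        predictions = json.load(f)
    return metric(predictions)


def task_delta(task: str, metric) -> float:
    """Delta = retrieval-augmented score minus baseline score for one task."""
    baseline = score_run(f"runs/{task}/baseline.json", metric)
    augmented = score_run(f"runs/{task}/retrieval.json", metric)
    return augmented - baseline


def exact_match(preds) -> float:
    """Dummy metric: fraction of predictions matching the gold answer."""
    return sum(p["pred"] == p["gold"] for p in preds) / len(preds)


# Example usage (assuming the run files above exist):
# print(task_delta("text_to_sql", exact_match))  # e.g. +0.040
```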

The results show consistent improvements across all tasks, with the largest gains in extraction_contracts (+0.320), extraction_schemas (+0.254), and test_generation (+0.245). The test-generation improvement came from the agent discovering mutation-aware prompting techniques (MuTAP and MUTGEN) through retrieval. Smaller gains were observed in text_to_sql (+0.040), routing (+0.017), and summeval (+0.010). The retrieval-augmented agent has access to three tool calls: explore_approaches(problem) returns ranked candidate techniques, deep_dive(technique) provides implementation steps and failure modes, and compare_approaches(candidates) enables side-by-side comparisons. The baseline agent has no such tools but uses otherwise identical scaffolding. All prompts, predictions, and evaluation scripts are diffable in the repository, so every result can be reproduced and inspected.
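The three retrieval calls could be exposed to the agent as ordinary tool functions. The sketch below shows one plausible shape for that interface; the function names come from the description above, but the signatures, data model, and tiny in-memory "literature index" are illustrative assumptions rather than the repository's actual code.

```python
# Hypothetical shape of the retrieval tool interface; only the three function
# names are taken from the benchmark description, everything else is assumed.
from dataclasses import dataclass


@dataclass
class Technique:
    name: str          # e.g. "MuTAP"
    summary: str       # one-line description drawn from the literature
    steps: list[str]   # implementation steps surfaced by deep_dive


# Stand-in for the real index over CS literature.
_INDEX = {
    "MuTAP": Technique(
        "MuTAP",
        "mutation-aware test-generation prompting",
        ["generate tests", "run mutants", "re-prompt on surviving mutants"],
    ),
}


def explore_approaches(problem: str) -> list[Technique]:
    """Return ranked candidate techniques for the stated problem."""
    # A real system would rank by retrieval relevance to `problem`.
    return sorted(_INDEX.values(), key=lambda t: t.name)


def deep_dive(technique: str) -> dict:
    """Return implementation steps and known failure modes for one technique."""
    t = _INDEX[technique]
    return {"steps": t.steps, "failure_modes": ["illustrative placeholder"]}


def compare_approaches(candidates: list[str]) -> str:
    """Return a side-by-side summary of the named candidates."""
    return "\n".join(f"{c}: {_INDEX[c].summary}" for c in candidates)
```

In the setup described above, the agent would typically call explore_approaches first, then deep_dive on a top-ranked result (e.g. mutation-aware prompting for test generation) before writing its solution.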

Key Points
  • The benchmark suite covers 9 everyday software tasks with standard quantitative metrics like mutation score and execution accuracy.
  • Retrieval-augmented coding agents show gains from +0.010 to +0.320, with the largest improvements in extraction and test generation tasks.
  • The retrieval system provides three tool calls (explore_approaches, deep_dive, compare_approaches) that the agent can use to find techniques from CS literature.

Why It Matters

This benchmark provides a reproducible way to measure how retrieval augmentation improves coding agent performance across real-world tasks.