Open-source 9-task benchmark for coding-agent retrieval augmentation. Per-task deltas +0.010 to +0.320, all evals reproducible [P]
Coding-agent performance improves by as much as 0.32 in absolute metric score with retrieval-augmented technique selection across 9 tasks.
A new open-source benchmark suite, paper-lantern-challenges, rigorously evaluates how retrieval-augmented technique selection changes coding-agent performance. The suite measures the delta between a baseline agent and one with access to a retrieval tool across 9 diverse software tasks: test generation (mutation score), text-to-SQL (execution accuracy), PDF extraction, contract extraction, PR review, text classification, few-shot prompt selection, LLM routing, and summarization evaluation. The same coding agent (Claude Opus 4.6 as planner, Gemini Flash 3 as task model) runs against identical input data and evaluation scripts in both conditions; the only independent variable is whether the agent can call a retrieval tool over the computer-science literature before writing its solution. Each task uses a standard quantitative metric, and the entire setup is reproducible with a free Gemini API key in roughly 10 minutes per task.
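To make the A/B design concrete, the sketch below shows how a per-task delta could be computed: the same agent harness runs twice per task on identical inputs, once without and once with the retrieval tools, and the delta is the difference in the task's metric. All names here (TASKS, run_agent, per_task_delta) are illustrative stand-ins, not the suite's actual scripts.

```python
# Minimal sketch of the baseline vs. retrieval-augmented comparison.
# run_agent is a placeholder for the real planner/task-model harness.

TASKS = {
    "test_generation": "mutation_score",
    "text_to_sql": "execution_accuracy",
}

def run_agent(task: str, retrieval_enabled: bool) -> float:
    """Stand-in for an agent run; returns the task metric in [0, 1]."""
    # The real suite would invoke the agent and the task's evaluation
    # script here; this stub just returns a placeholder score.
    return 0.0

def per_task_delta(task: str) -> float:
    baseline = run_agent(task, retrieval_enabled=False)
    augmented = run_agent(task, retrieval_enabled=True)
    return augmented - baseline  # the reported per-task delta

if __name__ == "__main__":
    for task, metric in TASKS.items():
        print(f"{task} ({metric}): delta = {per_task_delta(task):+.3f}")
```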
The results show improvements on every task, with the largest gains in extraction_contracts (+0.320), extraction_schemas (+0.254), and test_generation (+0.245). The test-generation gain came from the agent discovering mutation-aware prompting techniques (MuTAP and MUTGEN) through retrieval. Smaller gains appeared in text_to_sql (+0.040), routing (+0.017), and summeval (+0.010). The retrieval-augmented agent has access to three tool calls: explore_approaches(problem) returns ranked candidate techniques, deep_dive(technique) provides implementation steps and failure modes, and compare_approaches(candidates) enables side-by-side comparison of candidates. The baseline agent has no such tools but otherwise uses identical scaffolding. All prompts, predictions, and evaluation scripts are diffable in the repository, so every result can be reproduced and audited.
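Below is a rough sketch of how the three retrieval tools might look from the agent's side. The call names come from the description above; the return types, the TechniqueNote dataclass, and the placeholder contents are assumptions for illustration, not the repository's actual interfaces.

```python
# Illustrative interface for the retrieval tools available to the
# augmented agent. Bodies are placeholders; a real implementation
# would query a literature index.
from dataclasses import dataclass

@dataclass
class TechniqueNote:
    name: str
    implementation_steps: list[str]
    failure_modes: list[str]

def explore_approaches(problem: str) -> list[str]:
    """Return ranked candidate techniques for the given problem."""
    return ["MuTAP", "MUTGEN"]  # placeholder ranking

def deep_dive(technique: str) -> TechniqueNote:
    """Return implementation steps and failure modes for one technique."""
    return TechniqueNote(
        name=technique,
        implementation_steps=["<step 1 from the literature>", "<step 2>"],
        failure_modes=["<known failure mode>"],
    )

def compare_approaches(candidates: list[str]) -> str:
    """Return a side-by-side comparison the agent can reason over."""
    return "\n".join(f"{c}: see deep_dive({c!r})" for c in candidates)

# Typical flow: survey candidates, drill into the top one, then compare.
ranked = explore_approaches("maximize mutation score of generated tests")
note = deep_dive(ranked[0])
print(compare_approaches(ranked))
```

In this flow the agent first surveys candidate techniques, drills into the most promising one, and only then writes its solution.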
- The benchmark suite covers 9 everyday software tasks with standard quantitative metrics like mutation score and execution accuracy.
- Retrieval-augmented coding agents show gains from +0.010 to +0.320, with the largest improvements in extraction and test generation tasks.
- The retrieval system provides three tool calls (explore_approaches, deep_dive, compare_approaches) that the agent can use to find techniques from CS literature.
Why It Matters
This benchmark provides a reproducible way to measure how much retrieval augmentation improves coding-agent performance on real-world software tasks.