[R] Controlled experiment: giving an LLM agent access to CS papers during automated hyperparameter search improves results by 3.2%
An AI agent with access to 2M+ CS papers found new techniques, including a 2025 method, to optimize a model.
A controlled experiment demonstrated that augmenting an LLM coding agent with access to academic literature can measurably improve automated machine learning tasks. Using Karpathy's 'autoresearch' framework, researchers ran two identical Claude Code agents optimizing a 7M-parameter GPT-2 model on TinyStories. The only difference: one agent had access to 'Paper Lantern,' an MCP server providing full-text search and synthesis across over 2 million computer science papers. After 100 experiments each, the paper-augmented agent achieved a validation loss 3.2% lower than its counterpart's, with the gap still widening at the two-hour mark.
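For context, here is a minimal sketch of how a coding agent might query such an MCP server from Python, using the official `mcp` SDK. The launch command `paper-lantern` and the tool name `search_papers` are illustrative assumptions; the write-up does not document Paper Lantern's actual interface.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical launch command for the Paper Lantern MCP server.
server_params = StdioServerParameters(command="paper-lantern", args=[])

async def search_literature(query: str) -> str:
    """Ask the MCP server for papers matching a free-text query."""
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # "search_papers" is an assumed tool name, for illustration only.
            result = await session.call_tool("search_papers", {"query": query})
            return "\n".join(c.text for c in result.content if hasattr(c, "text"))

if __name__ == "__main__":
    print(asyncio.run(search_literature("learning rate scaling when batch size changes")))
```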
The key advantage was the paper-augmented agent's ability to retrieve and apply techniques outside the standard ML playbook encoded in its training data. It successfully implemented methods such as AdaGC (adaptive gradient clipping, February 2025) and a square-root batch-size scaling rule (June 2022). Crucially, when both agents tried halving the batch size, only the paper-augmented agent retrieved the relevant scaling rule and reduced the learning rate accordingly (by a factor of 1/√2 under that rule), preventing a divergent training run; the agent without literature access was limited to familiar techniques and failed to make this adjustment.
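To make the clipping idea concrete, below is a minimal sketch of adaptive gradient clipping in the spirit of AdaGC, assuming each parameter is clipped against an exponential moving average of its own gradient norms; the `beta` and `ratio` values are illustrative, and this is not the paper's exact formulation.

```python
import torch

def adaptive_clip_(params, state, beta=0.98, ratio=1.1, eps=1e-8):
    """Clip each parameter's gradient against an EMA of its own gradient norm.

    Instead of one global clip threshold, each parameter tracks a smoothed
    history of its gradient norms and is clipped in place whenever the
    current norm exceeds `ratio` times that history.
    """
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        norm = p.grad.norm().item()
        ema = state.get(i, norm)          # first step: EMA starts at current norm
        threshold = ratio * ema
        if norm > threshold:
            p.grad.mul_(threshold / (norm + eps))
            norm = threshold              # EMA tracks the clipped norm
        state[i] = beta * ema + (1 - beta) * norm

# Usage inside a training step, after loss.backward() and before optimizer.step():
#   clip_state = {}
#   adaptive_clip_(list(model.parameters()), clip_state)
```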
This experiment was run on the well-explored TinyStories dataset, where standard training tricks are already encoded in the agent's weights, deliberately making it a harder test; the benefit could be even larger for less-studied problems. While limitations include a single run and a small model, the results point to a significant role for retrieval-augmented generation (RAG) in automating research and optimization. The findings highlight how bridging the gap between an AI's training cutoff and the latest research can enhance its problem-solving in technical domains.
- The paper-augmented agent reached a 3.2% lower validation loss (0.4475 vs. 0.4624) on the 7M-parameter GPT-2 model after two hours.
- It applied techniques published after its training cutoff, such as AdaGC (Feb 2025), trying 25 paper-sourced methods in total.
- A critical success was applying the square-root batch scaling rule to the learning rate when halving the batch size, preventing a divergent training run (see the sketch after this list).
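Here is a minimal sketch of that adjustment, assuming the square-root heuristic lr ∝ √(batch size); the baseline learning rate and batch size are illustrative, not the experiment's actual hyperparameters.

```python
import math

def scale_lr_for_batch_size(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Square-root scaling rule: keep lr / sqrt(batch_size) constant."""
    return base_lr * math.sqrt(new_bs / base_bs)

# Illustrative values: halving the batch size multiplies the learning
# rate by 1/sqrt(2) ~= 0.707, rather than leaving it unchanged.
base_lr, base_bs = 3e-4, 64
print(scale_lr_for_batch_size(base_lr, base_bs, base_bs // 2))  # ~2.12e-4
```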
Why It Matters
This shows RAG can effectively bridge an AI's knowledge cutoff, automating the discovery and application of cutting-edge research to improve results.