Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
A new orchestration method keeps LLM research projects from drifting apart, roughly doubling F1 over the next-best baseline
A new paper from Halley Young and Nikolaj Björner (affiliated with Microsoft Research) tackles a growing problem in AI-assisted research software: LLMs can generate code and drafts, but the math, executable system, benchmarks, and claims often drift apart. They identify two specific failure modes: hallucination accumulation (claims exceeding what code supports, carrying errors across sessions) and desynchronization (code, theory, and the model's internal world model falling out of alignment). To solve this, they propose Comet-H, an iterative prompt automaton that treats ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state.
Comet-H works by having a controller select the next prompt based on what the workspace currently lacks, using a contextual bandit over prompt families with a hand-weighted linear scorer. It carries unfinished follow-up work forward, decaying its priority with a half-life, and re-checks the paper and README against the code and benchmarks whenever documentation changes. The researchers built a portfolio of 46 research-software repositories spanning more than two dozen domains. In a detailed case study of A3, a Python static-analysis tool built entirely within the loop, Comet-H achieved F1=0.768 on a 90-case benchmark, compared to 0.364 for the next-best baseline. Across roughly 400 commits, audit-and-contraction passes dominated the later phases, suggesting the system self-corrects by repeatedly verifying and tightening claims.
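To make that mechanism concrete, here is a minimal Python sketch of the selection-and-decay loop. The paper does not publish its scorer, so the feature names, weights, and prompt families below are hypothetical stand-ins chosen for illustration.

```python
# Illustrative sketch of a Comet-H-style controller step. The deficit
# features, weights, and family names are assumptions, not the paper's.

PROMPT_FAMILIES = ["ideate", "implement", "evaluate", "ground", "write"]

# Hand-weighted linear scorer: each prompt family is scored against
# features describing what the workspace currently lacks.
WEIGHTS = {
    "ideate":    {"missing_spec": 1.0, "failing_tests": 0.1, "stale_docs": 0.0},
    "implement": {"missing_spec": 0.2, "failing_tests": 1.0, "stale_docs": 0.0},
    "evaluate":  {"missing_spec": 0.0, "failing_tests": 0.6, "stale_docs": 0.2},
    "ground":    {"missing_spec": 0.1, "failing_tests": 0.3, "stale_docs": 1.0},
    "write":     {"missing_spec": 0.0, "failing_tests": 0.0, "stale_docs": 0.8},
}

def score(family: str, deficits: dict[str, float]) -> float:
    """Linear score of a prompt family against current workspace deficits."""
    return sum(WEIGHTS[family][k] * v for k, v in deficits.items())

def select_prompt(deficits: dict[str, float]) -> str:
    """Greedy arm of the contextual bandit: pick the highest-scoring family."""
    return max(PROMPT_FAMILIES, key=lambda f: score(f, deficits))

def decay_followups(followups: dict[str, float],
                    steps: int, half_life: float) -> dict[str, float]:
    """Carry unfinished follow-ups forward, halving their weight every
    `half_life` steps so stale items fade rather than vanish abruptly."""
    factor = 0.5 ** (steps / half_life)
    return {task: w * factor for task, w in followups.items()}

# One controller step: measure deficits, pick the next prompt family,
# then age the backlog of carried-forward follow-up work.
deficits = {"missing_spec": 0.2, "failing_tests": 0.7, "stale_docs": 0.4}
print(select_prompt(deficits))  # -> "implement"
backlog = decay_followups({"tighten claim 3": 1.0}, steps=4, half_life=8)
```

The half-life decay is the part that distinguishes this from a plain task queue: old follow-ups lose priority smoothly instead of either blocking progress or being dropped outright.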
- Comet-H orchestrates LLM actions as a prompt automaton across ideation, coding, benchmarking, and writing to prevent misalignment.
- F1 of 0.768 vs. 0.364 for the next-best baseline on A3, a static-analysis tool built entirely within the loop over ~400 commits.
- Identifies two LLM failure modes, hallucination accumulation and desynchronization, and remedies them via automatic re-grounding (see the sketch after this list).
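As a rough illustration of what an automatic re-grounding pass could look like, the sketch below re-checks quantitative claims in documentation against freshly measured benchmark results. The claim format and the `extract_claims`/`verify_claim` helpers are hypothetical, not the paper's pipeline.

```python
# Hypothetical re-grounding pass: flag documented claims that the
# code and benchmarks no longer support, so the controller can
# schedule an audit-and-contraction step.

def extract_claims(doc_text: str) -> list[str]:
    """Pull quantitative claims (here: lines mentioning 'F1=') from a doc."""
    return [line.strip() for line in doc_text.splitlines() if "F1=" in line]

def verify_claim(claim: str, benchmark_results: dict[str, float]) -> bool:
    """Accept a claim only if it matches a freshly measured benchmark value."""
    return any(f"{metric}={value:.3f}" in claim
               for metric, value in benchmark_results.items())

def reground(readme: str, results: dict[str, float]) -> list[str]:
    """Return the unsupported claims; an empty list means docs and code agree."""
    return [c for c in extract_claims(readme) if not verify_claim(c, results)]

stale = reground("A3 achieves F1=0.768 on the 90-case benchmark.", {"F1": 0.768})
assert stale == []  # claims and measurements agree; nothing to contract
```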
Why It Matters
Enables reliable end-to-end research-software generation where specifications evolve, reducing costly manual realignment of code and claims.