Sibyl-AutoResearch: Self-Evolving Trial-and-Error Moves Beyond Paper Generators
New framework turns trial outcomes, including failures, into later behavior changes, with a median conversion latency of one iteration in a retrospective audit.
A team of researchers (Chengcheng Wang, Qinhua Xie, Wei He, Jianyuan Guo, Shiqi Wang, Chang Xu) has released Sibyl-AutoResearch, an autonomous research framework that rethinks how AI systems conduct scientific work. Current autonomous research agents can propose ideas, run code, inspect results, and draft papers, but they lack real research judgment. The authors' key observation is that these systems lose trial experience: weak evidence becomes prose, pilot signals become broad claims, memory remains purely textual, and recurring process failures never change future behavior. Sibyl-AutoResearch addresses this with what the authors call Scientific Trial-and-Error Harnesses. These harnesses let agents run bounded trials, preserve both positive and negative outcomes, and route those lessons into later planning, validation, claim scoping, scheduling, critique, writing, and even repair of the harness itself.
The framework formalizes this through two auditable conversion units: trial-to-behavior conversion (linking trial signals to later research actions) and trial-to-harness-behavior conversion (linking recurring process failures to system updates). The team implemented both in SIBYL, a file-backed autonomous research system that exposes state, roles, memory, gates, and artifact traces. A retrospective audit identified eight high-confidence conversion events with a median latency of one iteration (maximum three iterations). A recovered-failure registry documents five naturally occurring failure classes, including duplicate results, stale numbers, and unsupported statistics, that were blocked, downgraded, or routed into repair workflows. The authors emphasize that they do not claim superiority over other methods; the audit instead demonstrates that the proposed conversion units are recoverable from realistic autonomous research workspaces. The SIBYL framework is open source.
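To make the idea of conversion latency concrete, here is a minimal Python sketch of how a trial-to-behavior conversion event could be recorded and summarized. It is not SIBYL's actual schema: the `ConversionEvent` class, its field names, the `summarize_latency` helper, and the three example events are all hypothetical, and the iteration numbers are invented for illustration rather than taken from the audit.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ConversionEvent:
    """Hypothetical record linking a trial signal to a later research action."""
    signal: str            # what the trial showed
    action: str            # the later behavior it changed
    trial_iteration: int   # iteration in which the signal was observed
    action_iteration: int  # iteration in which the behavior changed

    @property
    def latency(self) -> int:
        # Conversion latency: iterations elapsed between signal and action.
        return self.action_iteration - self.trial_iteration


def summarize_latency(events):
    """Return the count, median, and maximum latency over a list of events."""
    latencies = [e.latency for e in events]
    return {
        "count": len(latencies),
        "median": median(latencies),
        "max": max(latencies),
    }


# Invented example events (not data from the paper).
events = [
    ConversionEvent("pilot effect vanished on rerun", "narrowed claim scope", 2, 3),
    ConversionEvent("duplicate result rows detected", "added validation gate", 4, 5),
    ConversionEvent("baseline underperformed", "rescheduled ablation", 5, 8),
]

summary = summarize_latency(events)
print(summary)
```

Storing the two iteration indices rather than a precomputed latency keeps each event checkable against the workspace's artifact trace, which is the sense in which such a unit is auditable.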
- Sibyl-AutoResearch uses Scientific Trial-and-Error Harnesses to preserve both positive and negative outcomes from bounded trials.
- A retrospective audit found 8 high-confidence conversion events with a median latency of 1 iteration (max 3 iterations).
- Five naturally occurring failure classes (including duplicate results, stale numbers, and unsupported statistics) were blocked, downgraded, or routed into repair workflows.
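A recovered-failure registry of the kind described above might look like the following Python sketch. Everything here is an assumption made for illustration: the `Disposition` enum, the `FailureRegistry` class, the `check_reported_number` gate, and the choice to block a stale number (rather than downgrade or repair it) are hypothetical and are not drawn from SIBYL's code.

```python
from dataclasses import dataclass, field
from enum import Enum


class Disposition(Enum):
    """The three outcomes named in the summary; the identifiers are hypothetical."""
    BLOCKED = "blocked"
    DOWNGRADED = "downgraded"
    REPAIR = "routed_to_repair"


@dataclass
class FailureRegistry:
    """Hypothetical in-memory log of detected failures and how each was handled."""
    records: list = field(default_factory=list)

    def record(self, failure_class: str, detail: str, disposition: Disposition) -> None:
        self.records.append((failure_class, detail, disposition))

    def by_disposition(self, disposition: Disposition) -> list:
        return [r for r in self.records if r[2] is disposition]


def check_reported_number(claimed: float, recomputed: float,
                          registry: FailureRegistry, tol: float = 1e-6) -> bool:
    """Gate: reject a draft number that no longer matches the latest result.

    Assumption: a mismatch is treated as a 'stale number' and blocked.
    """
    if abs(claimed - recomputed) > tol:
        registry.record(
            "stale_number",
            f"draft={claimed}, latest={recomputed}",
            Disposition.BLOCKED,
        )
        return False
    return True


registry = FailureRegistry()
stale_ok = check_reported_number(0.81, 0.78, registry)  # mismatch: recorded and blocked
fresh_ok = check_reported_number(0.78, 0.78, registry)  # match: passes the gate
print(stale_ok, fresh_ok, len(registry.records))
```

Recording the disposition next to the failure class is what lets a later audit count how many failures were blocked, downgraded, or sent to repair.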
Why It Matters
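The framework gives AI research agents an auditable mechanism for self-correction, a step beyond paper generators toward scientific judgment. The evidence so far is a retrospective audit showing that the conversion units can be recovered from real workspaces, not a comparison against other systems.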