Research & Papers

[P] Built confidence scoring for autoresearch because keeps that don't reproduce are worse than discards

New CLI tools tackle the 15% keep rate problem in automated AI experimentation, filtering noise from real breakthroughs.

Deep Dive

A developer known as dean0x has released a suite of command-line tools designed to solve a pervasive and costly problem in automated AI model research: false positive 'keeps.' Running ~100 experiments nightly on an H100 GPU, he observed a 15% keep rate, aligning with figures shared by AI researcher Andrej Karpathy. However, many of these 'improvements' proved to be statistical noise or GPU nondeterminism upon retesting, leading to wasted cycles and the risk of compounding errors by building new experiments on flawed foundations.

The core solution is `autojudge`, a tool that estimates the noise floor from recent experiments, checks if a result sits on the Pareto front of validation bits-per-byte (val_bpb) versus memory, and returns a script-friendly, confidence-scored verdict: STRONG_KEEP, KEEP, MARGINAL, RETEST, DISCARD, or CRASH. This replaces guesswork with data-driven decisions. Complementing it, `autosteer` analyzes which experiment categories (architecture, hyperparameters) historically yield real gains to suggest the next best step, switching between 'exploit' and 'explore' modes. The more experimental `autoevolve` pits multiple AI agent strategies against each other in separate git worktrees, cross-pollinating winning ideas.
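The verdict logic described above can be sketched in a few lines. This is an illustrative reconstruction, not `autojudge`'s actual implementation: the 1x/2x/3x noise-floor thresholds, the `Run` record, and the function names are assumptions for exposition (the CRASH verdict, which needs process-level failure detection, is omitted).

```python
# Hypothetical sketch of noise-aware verdict scoring in the style of
# autojudge. Thresholds and field names are assumptions, not the tool's API.
from dataclasses import dataclass
from statistics import stdev

@dataclass
class Run:
    name: str
    val_bpb: float      # validation bits-per-byte (lower is better)
    memory_gb: float    # peak memory (lower is better)

def noise_floor(baseline_bpbs: list[float]) -> float:
    """Estimate run-to-run noise as the std-dev of repeated baseline runs."""
    return stdev(baseline_bpbs)

def on_pareto_front(candidate: Run, history: list[Run]) -> bool:
    """True if no prior run is at least as good on both axes and better on one."""
    return not any(
        r.val_bpb <= candidate.val_bpb and r.memory_gb <= candidate.memory_gb
        and (r.val_bpb < candidate.val_bpb or r.memory_gb < candidate.memory_gb)
        for r in history
    )

def verdict(candidate: Run, baseline_bpb: float, noise: float,
            history: list[Run]) -> str:
    delta = baseline_bpb - candidate.val_bpb   # positive = improvement
    if delta > 3 * noise and on_pareto_front(candidate, history):
        return "STRONG_KEEP"
    if delta > 2 * noise:
        return "KEEP"
    if delta > noise:
        return "MARGINAL"
    if delta > -noise:
        return "RETEST"                        # within noise: rerun before trusting
    return "DISCARD"
```

The key design point is that a gain only counts once it clears a multiple of the measured noise floor, so a 0.002 bpb "improvement" against a ~0.008 bpb noise floor gets RETEST rather than KEEP.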

In practice, this transforms the researcher's workflow from sifting through ambiguous tab-separated value files to reviewing ranked, confidence-scored results with a clear directive for the next experiment. The tools, available via pip install, address a fundamental scalability bottleneck in AI research, moving automation beyond simple execution toward intelligent, self-correcting experimentation.

Key Points
  • Autojudge assigns confidence scores (STRONG_KEEP to DISCARD) to experiment results by analyzing noise floors and Pareto fronts, preventing new experiments from being built on false positives.
  • Autosteer analyzes historical experiment categories to suggest the next optimal step, stopping the 'random walk' in hyperparameter and architecture search.
  • The suite addresses the problem Karpathy observed, where a result kept as an apparent 5% improvement actually hurt performance upon retest, turning a 15% keep rate into reliable insights.

Why It Matters

This moves AI research automation from simple batch execution to intelligent, noise-aware experimentation, saving massive GPU compute and researcher time.