ACE Framework Lets LLMs Self-Improve by Generating Adversarial Tests
No ground-truth code needed: ACE lifts LLM coding pass@1 by 3-7% absolute over solver-verifier baselines using adversarial tests.
Current LLM code generation relies heavily on large-scale annotated solutions and verification-based supervision, which limits scalability and sustained self-improvement. Existing solver-verifier frameworks use program execution as automatic supervision, but their effectiveness plateaus as solvers improve: verifier-generated tests increasingly confirm semantic correctness instead of exposing failure modes. ACE introduces a solver-adversary architecture in which a single LLM alternates between generating candidate programs and producing adversarial unit test inputs optimized to cause execution failures (runtime errors, exceptions, non-termination). Supervision is derived purely from execution outcomes: programs that prove robust are selected for supervised fine-tuning, while adversarial test generation is optimized via Kahneman-Tversky Optimization (KTO) using execution-derived preferences. No ground-truth code or external reward models are required, making the entire loop self-supervised.
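To make the execution-derived supervision concrete, here is a minimal sketch of one labeling round. It is an illustration under stated assumptions, not ACE's actual implementation: the candidate programs are plain Python functions standing in for LLM-generated code, `label_round` and `fails` are hypothetical helper names, only raised exceptions count as failures (timeouts for non-termination are omitted), and the rule "a test input is desirable if it breaks at least one candidate" is an assumed way of turning outcomes into the binary labels that KTO consumes.

```python
from typing import Any, Callable, Dict, List, Tuple


def fails(program: Callable[[Any], Any], test_input: Any) -> bool:
    """Return True if running the program on the input raises an exception."""
    try:
        program(test_input)
        return False
    except Exception:
        return True


def label_round(
    programs: Dict[str, Callable[[Any], Any]],
    tests: List[Any],
) -> Tuple[List[str], List[bool]]:
    """Derive training signals from execution outcomes alone (hypothetical helper).

    Returns the names of programs that survived every test (candidates for
    supervised fine-tuning) and one boolean per test input: True if the input
    made at least one program fail (assumed "desirable" label for KTO).
    """
    failures = {
        name: [fails(program, t) for t in tests]
        for name, program in programs.items()
    }
    robust = [name for name, row in failures.items() if not any(row)]
    test_labels = [
        any(failures[name][i] for name in programs) for i in range(len(tests))
    ]
    return robust, test_labels


# Hypothetical solver outputs for the task "return the mean of a list".
def mean_naive(xs):
    return sum(xs) / len(xs)  # raises ZeroDivisionError on an empty list


def mean_guarded(xs):
    return sum(xs) / len(xs) if xs else 0.0


programs = {"mean_naive": mean_naive, "mean_guarded": mean_guarded}
tests = [[1, 2, 3], []]  # hypothetical adversary outputs

robust, test_labels = label_round(programs, tests)
print(robust)       # programs that never failed
print(test_labels)  # which inputs exposed a failure
```

In this toy round the empty list exposes the unguarded division, so it is the only input labeled as a successful attack, and only the guarded program survives. No reference solution is consulted at any point, which is the property the paragraph above describes.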
The framework's self-evolving loop enables continued improvement without human annotation. On benchmarks such as CodeContests, MBPP, and LiveCodeBench, ACE consistently outperforms strong solver-verifier baselines with 3-7% absolute gains in pass@1, and achieves even larger improvements on out-of-distribution benchmarks, all while maintaining competitive inference efficiency. The adversarial approach shifts the emphasis from passive test confirmation to active failure discovery, which could make code generation more robust and autonomous. For practitioners, ACE is a step toward self-improving coding agents that learn from their own mistakes without expensive human feedback or curated datasets.
- ACE uses a solver-adversary architecture where the LLM both writes code and generates adversarial unit tests to actively induce failures.
- Training requires no ground-truth code or external reward models; supervision is derived purely from execution outcomes.
- Achieves 3-7% absolute gains in pass@1 on CodeContests, MBPP, and LiveCodeBench, with larger improvements on out-of-distribution tasks.
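For readers unfamiliar with the metric behind these numbers, the sketch below computes pass@k with the standard unbiased estimator commonly used for code benchmarks, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The source does not say how ACE's evaluation computes pass@1, so treat this as background on the metric; the sample counts in the usage lines are made up purely for illustration and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass all tests,
    k: evaluation budget. For k = 1 this reduces to c / n.
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only (not from the paper): 10 samples per problem.
baseline = pass_at_k(n=10, c=3, k=1)
improved = pass_at_k(n=10, c=4, k=1)

# "Absolute gain" means the difference in percentage points,
# not the ratio between the two scores.
absolute_gain_points = (improved - baseline) * 100
print(round(baseline, 2), round(improved, 2), round(absolute_gain_points, 1))
```

The distinction matters when reading the headline figure: a 3-7% absolute gain means pass@1 moves by 3 to 7 percentage points, which is a larger relative improvement when the baseline score is low.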
Why It Matters
Self-evolving LLMs that learn from their own mistakes could reduce reliance on expensive human-annotated coding data.