Developer Tools

AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

Multi-agent system generates executable tests for bug reports, finding 122 new bugs in major software like Firefox and Chromium.

Deep Dive

A multi-institution research team has developed AnyPoC, a general framework that targets a key bottleneck in LLM-based bug detection. Agents such as Claude Code can generate many bug reports, but each report remains an unverified hypothesis until someone validates it by hand. AnyPoC automates that step by generating executable proof-of-concept (PoC) tests (scripts, commands, or crafted inputs) that trigger the suspected defect and so provide concrete evidence that the bug is real.

The system uses a multi-agent architecture with three stages: it fact-checks each bug report, iteratively synthesizes and executes PoCs while collecting execution traces, and independently re-executes the resulting tests to guard against hallucination and reward hacking. This design addresses the tendency of naive LLM agents to produce plausible but non-functional PoCs. AnyPoC also maintains a continuously evolving knowledge base that helps it handle diverse tasks across different languages and domains.
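The three stages described above can be pictured with a minimal Python sketch. This is an illustration only, not AnyPoC's actual implementation: every name here (`BugReport`, `Attempt`, `run_poc`, `validate_report`, and the `demo_*` stubs) is hypothetical, the LLM agents are replaced by plain callables, and a "PoC" is assumed to be a short Python snippet run in a fresh interpreter.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BugReport:          # hypothetical container for one bug report
    title: str
    claim: str


@dataclass
class Attempt:            # one executed PoC plus its captured trace
    poc_source: str
    returncode: int
    output: str


def run_poc(poc_source: str, timeout: float = 10.0) -> Attempt:
    """Run a candidate PoC in a fresh interpreter and capture its output."""
    proc = subprocess.run(
        [sys.executable, "-c", poc_source],
        capture_output=True, text=True, timeout=timeout,
    )
    return Attempt(poc_source, proc.returncode, proc.stdout + proc.stderr)


def validate_report(
    report: BugReport,
    fact_check: Callable[[BugReport], bool],
    synthesize: Callable[[BugReport, List[Attempt]], str],
    triggers_bug: Callable[[Attempt], bool],
    max_rounds: int = 3,
) -> Optional[Attempt]:
    """Return a re-verified PoC attempt, or None if the report is rejected."""
    # Stage 1: reject reports whose claims do not hold up.
    if not fact_check(report):
        return None
    history: List[Attempt] = []
    for _ in range(max_rounds):
        # Stage 2: synthesize a PoC using earlier traces, then execute it.
        poc = synthesize(report, history)
        attempt = run_poc(poc)
        history.append(attempt)
        if triggers_bug(attempt):
            # Stage 3: re-execute independently so that success is judged
            # from a fresh run rather than from the synthesizer's own claim.
            recheck = run_poc(poc)
            if triggers_bug(recheck):
                return recheck
    return None


# --- Toy stand-ins for the agents (assumptions, not real agent logic) ---
def demo_fact_check(report: BugReport) -> bool:
    return "division" in report.claim


def demo_synthesize(report: BugReport, history: List[Attempt]) -> str:
    candidates = ["print(1 / 1)", "print(1 / 0)"]  # second one crashes
    return candidates[min(len(history), len(candidates) - 1)]


def demo_triggers_bug(attempt: Attempt) -> bool:
    return attempt.returncode != 0 and "ZeroDivisionError" in attempt.output


if __name__ == "__main__":
    report = BugReport("toy crash", "division by zero in the toy target")
    result = validate_report(
        report, demo_fact_check, demo_synthesize, demo_triggers_bug
    )
    print("accepted" if result else "rejected")
```

In this toy run the first candidate executes cleanly, so its trace is fed back to the synthesizer; the second candidate crashes and is accepted only after a separate re-run reproduces the crash. Keeping the re-execution step apart from synthesis is the part of the sketch meant to mirror the guard against hallucinated or reward-hacked PoCs.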

In an evaluation across 12 critical software systems, including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis, many with millions of lines of code, AnyPoC outperformed state-of-the-art coding agents: it produced 1.3x more valid PoCs for true-positive bug reports and rejected 9.8x more false positives. The framework has so far discovered 122 new bugs, of which 105 have been confirmed by developers and 86 have already been fixed, and 45 of its PoCs have been adopted as official tests, evidence of its real-world utility on widely used software.

Key Points
  • Generates 1.3x more valid PoCs for true-positive bug reports than coding agents such as Claude Code and Codex
  • Rejects 9.8x more false-positive bug reports through its multi-agent validation architecture
  • Discovered 122 new bugs (105 confirmed, 86 fixed) in major systems such as Firefox and Chromium, with 45 PoCs adopted as official tests

Why It Matters

Turns LLM-generated bug reports into verified, reproducible findings, moving AI bug detection closer to running autonomously at scale and reducing manual validation work for developers of critical software systems.