Developer Tools

EigenData: A Self-Evolving Multi-Agent Platform for Function-Calling Data Synthesis, Auditing, and Repair

The system automatically found and fixed systematic errors in a major AI benchmark, producing model rankings that track human judgment far more closely.

Deep Dive

A research team has unveiled EigenData, a novel platform designed to solve a critical bottleneck in AI development: creating high-quality, domain-specific training data for function-calling agents. These agents are large language models (LLMs) that can invoke tools and APIs to perform tasks, but training them requires complex, realistic data that spans executable code environments, backing databases, and multi-step conversational trajectories. EigenData automates the entire data lifecycle through a multi-agent architecture. A central orchestrator, called EigenCore, coordinates three specialized sub-systems: a DatabaseAgent for constructing realistic domain databases, a CodingAgent for generating verified executable environments through iterative test-debug loops, and a DataAgent for synthesizing multi-turn conversational data with self-optimizing prompts. Cross-component feedback keeps all generated artifacts consistent with one another.
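The article does not show EigenData's actual interfaces, so the following is only a minimal Python sketch of how such an orchestrator might coordinate the three sub-agents under a feedback loop; every class and method name here (EigenCoreSketch, build_database, generate_and_test, and so on) is an illustrative assumption, not the platform's real API.

```python
# Hypothetical sketch of EigenCore-style orchestration. All class and
# method names are illustrative assumptions, not EigenData's API.
from dataclasses import dataclass, field

@dataclass
class Artifacts:
    schema: dict = field(default_factory=dict)        # domain database schema
    env_code: str = ""                                # executable environment code
    trajectories: list = field(default_factory=list)  # multi-turn conversations

class EigenCoreSketch:
    """Coordinates the three sub-agents and routes cross-component feedback."""

    def __init__(self, db_agent, coding_agent, data_agent, max_rounds: int = 3):
        self.db_agent = db_agent          # builds realistic domain databases
        self.coding_agent = coding_agent  # generates and test-debugs env code
        self.data_agent = data_agent      # synthesizes conversational data
        self.max_rounds = max_rounds

    def run(self, domain_spec: str) -> Artifacts:
        art = Artifacts()
        # Step 1: DatabaseAgent constructs the backing database.
        art.schema = self.db_agent.build_database(domain_spec)
        for _ in range(self.max_rounds):
            # Step 2: CodingAgent produces the executable environment and
            # verifies it through an iterative test-debug loop.
            art.env_code, tests_pass = self.coding_agent.generate_and_test(art.schema)
            # Step 3: DataAgent synthesizes multi-turn trajectories with
            # self-optimizing prompts against the verified environment.
            art.trajectories = self.data_agent.synthesize(art.schema, art.env_code)
            # Cross-component feedback: inconsistencies discovered during
            # data synthesis are fed back to refine the schema.
            feedback = self.data_agent.report_inconsistencies(art)
            if tests_pass and not feedback:
                break  # all artifacts are mutually consistent
            art.schema = self.db_agent.refine(art.schema, feedback)
        return art
```

In this sketch the orchestrator simply loops until the environment's tests pass and data synthesis reports no inconsistencies, mirroring the cross-component feedback the article describes.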

To demonstrate the platform in practice, the researchers applied EigenData to audit and repair the Berkeley Function-Calling Leaderboard (BFCL-V3), a major benchmark for evaluating AI agents. The platform systematically identified errors in function schemas, code implementations, and reference answer trajectories, then automatically corrected them through coordinated schema refinement, code-level bug fixes, and trajectory modifications. Crucially, the team also introduced a new "outcome-aware" evaluation protocol that judges an agent's success by the correctness of the final database state rather than by matching its steps to a predefined path. The repaired benchmark produces model rankings that correlate substantially better with human judgments of functional correctness, exposing flaws in previous evaluation methods.
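To make the distinction concrete, here is a small, self-contained Python illustration of the two evaluation styles; the function and field names are hypothetical, and the real protocol compares full benchmark trajectories and database snapshots rather than toy dictionaries.

```python
# Toy contrast between step matching and outcome-aware evaluation.
# Names are hypothetical, chosen only for illustration.

def trajectory_match(agent_calls: list, reference_calls: list) -> bool:
    """Step-matching check: the agent must reproduce the reference path exactly."""
    return agent_calls == reference_calls

def outcome_aware_match(final_state: dict, reference_state: dict) -> bool:
    """Outcome-aware check: only the final database state matters."""
    return final_state == reference_state

# Two valid call orders that reach the same final state.
agent_calls = [("add_contact", "alice"), ("add_contact", "bob")]
reference_calls = [("add_contact", "bob"), ("add_contact", "alice")]
final_state = {"contacts": {"alice", "bob"}}

assert not trajectory_match(agent_calls, reference_calls)                # rejected: order differs
assert outcome_aware_match(final_state, {"contacts": {"alice", "bob"}})  # accepted
```

An outcome-aware check of this kind tolerates any valid sequence of tool calls, which is why it avoids penalizing correct solutions that merely deviate from the reference path.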

Key Points
  • Automates the full data lifecycle for training function-calling AI agents through a coordinated multi-agent system (DatabaseAgent, CodingAgent, DataAgent).
  • Successfully audited and repaired the Berkeley Function-Calling Leaderboard (BFCL-V3), fixing systematic errors in schemas, code, and reference trajectories.
  • Introduced an outcome-aware evaluation protocol that judges success by the final database state, substantially improving the correlation between benchmark rankings and human judgment.

Why It Matters

Provides a scalable, automated solution for creating reliable training data and benchmarks, which is essential for developing robust, real-world AI agents.