DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval
New embedding model achieves 93.47% NDCG@10 by incorporating data distribution into R function retrieval.
A research team led by Maojun Sun has introduced DARE (Distribution-Aware Retrieval Embedding), a novel approach to bridge the gap between large language model (LLM) agents and the sophisticated R statistical ecosystem. The core problem addressed is that current AI agents struggle to reliably retrieve and use the thousands of specialized statistical functions in R's CRAN repository because standard retrieval methods ignore crucial data distribution context. DARE solves this by creating a curated R Package Knowledge Base (RPKB) from 8,191 high-quality packages and developing a plug-and-play embedding model that fuses function metadata with distributional features, enabling more context-aware tool selection for automated data science workflows.
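To make the idea of distribution-aware retrieval concrete, here is a toy sketch in Python. It is not the DARE model: it swaps the learned embedding for a bag-of-words vector and simply appends a data-profile string to both the query and each function's metadata. The two R functions are real (`stats::t.test`, `stats::wilcox.test`), but the descriptions, the profile tags, and the `embed`, `cosine`, and `rank` helpers are assumptions made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Hypothetical knowledge-base entries: a short description plus the kind of
# data each function is suited to (illustrative tags, not taken from RPKB).
FUNCTIONS = {
    "stats::t.test": {"doc": "compare means of two groups", "data": "continuous normal"},
    "stats::wilcox.test": {"doc": "compare two groups by rank", "data": "continuous skewed"},
}

def rank(query, data_profile=""):
    """Rank functions by similarity to the query fused with a data profile."""
    q = embed(f"{query} {data_profile}")
    scores = {
        name: cosine(q, embed(f"{meta['doc']} {meta['data']}"))
        for name, meta in FUNCTIONS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# The same task description retrieves different tools once the data's shape is known.
print(rank("compare two groups", "normal"))
print(rank("compare two groups", "skewed"))
```

Without a data profile, the two candidates score identically for this query in the toy setup, which is the kind of ambiguity that purely semantic matching leaves unresolved.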
DARE's key design choice is distribution awareness: rather than relying on semantic matching alone, it considers how data characteristics influence which function is relevant. The lightweight model achieves 93.47% NDCG@10 on package retrieval, outperforming state-of-the-art open-source embedding models by up to 17% while using fewer parameters. Integrated into the team's RCodingAgent system, DARE delivers significant improvements on downstream statistical analysis tasks, making LLM agents more reliable for real-world R programming. The work is a step toward fully automated, rigorous statistical analysis through closer alignment of AI capabilities with mature, domain-specific software ecosystems.
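For readers unfamiliar with the metric, NDCG@10 rewards rankings that place relevant items near the top of the first ten results. The sketch below shows the standard computation in Python. It assumes graded relevance labels listed in ranked order and, for simplicity, derives the ideal ordering from that same list; the function names are illustrative and this is not the paper's evaluation code.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results.

    relevances[i] is the relevance of the item at rank i + 1; the gain at
    each rank is discounted by log2(rank + 1).
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the best possible ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items: define the score as 0
    return dcg_at_k(relevances, k) / ideal_dcg

# Correct package ranked first versus ranked second.
print(ndcg_at_k([1, 0, 0]))  # 1.0
print(ndcg_at_k([0, 1, 0]))  # 1 / log2(3), roughly 0.631
```

With a single relevant item, the score depends only on that item's rank, so a reported average near 93% suggests the correct package usually appears at or very near the top of the list.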
- DARE embedding model achieves 93.47% NDCG@10 on R package retrieval, beating state-of-the-art open-source embedding models by up to 17%
- Model incorporates data distribution features alongside function metadata from 8,191 CRAN packages
- Enables RCodingAgent to generate more reliable statistical code for automated data science workflows
Why It Matters
By bridging the gap between LLMs and R's extensive package ecosystem, DARE makes AI automation of statistical analysis more reliable.