An Empirical Study of Proactive Coding Assistants in Real-World Software Development
1,246 developers tracked; simulated traces vastly overestimate AI performance.
Researchers from academia and industry conducted an empirical study on proactive coding assistants—AI tools that infer developer intent from IDE actions rather than waiting for explicit prompts. Using a custom VS Code extension, they collected real interaction traces from 1,246 professional developers over three consecutive days. For comparison, they also generated paired LLM-simulated traces using GPT-4o and other models. The analysis reveals a significant 'simulation-to-reality gap': simulated traces lack behavioral diversity, have artificial temporal structure, and miss the exploratory patterns seen in real coding sessions.
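The summary does not include the extension's source, but the collection mechanism is easy to picture: a VS Code extension can subscribe to editor events and append them to a timestamped trace. The sketch below is illustrative only; it uses the public VS Code API (onDidChangeTextDocument, onDidChangeTextEditorSelection, onDidChangeActiveTextEditor), while the TraceEvent schema and its field names are assumptions, not the study's actual logging format.

```typescript
import * as vscode from 'vscode';

// Hypothetical shape of one logged interaction event (not the study's schema).
interface TraceEvent {
  kind: 'edit' | 'selection' | 'fileSwitch';
  file: string;
  timestamp: number;
  detail?: string;
}

const trace: TraceEvent[] = [];

export function activate(context: vscode.ExtensionContext) {
  // Record every text edit the developer makes.
  context.subscriptions.push(
    vscode.workspace.onDidChangeTextDocument((e) => {
      trace.push({
        kind: 'edit',
        file: e.document.uri.fsPath,
        timestamp: Date.now(),
        detail: e.contentChanges.map((c) => c.text).join(''),
      });
    })
  );

  // Record cursor/selection movement, one signal of exploratory navigation.
  context.subscriptions.push(
    vscode.window.onDidChangeTextEditorSelection((e) => {
      trace.push({
        kind: 'selection',
        file: e.textEditor.document.uri.fsPath,
        timestamp: Date.now(),
      });
    })
  );

  // Record switches between open files.
  context.subscriptions.push(
    vscode.window.onDidChangeActiveTextEditor((editor) => {
      if (editor) {
        trace.push({
          kind: 'fileSwitch',
          file: editor.document.uri.fsPath,
          timestamp: Date.now(),
        });
      }
    })
  );
}

export function deactivate() {
  // A real study would persist or upload the trace here; omitted in this sketch.
}
```

Capturing edits, selections, and file switches at this granularity is what gives real traces the temporal and exploratory structure that the paper finds missing from LLM-simulated sessions.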
Based on the real-world data, the team introduced ProCodeBench, a benchmark for proactive intent prediction. Testing state-of-the-art LLMs, retrieval-augmented generation (RAG) methods, and agentic baselines showed that performance on real traces is far below what simulation-based evaluations suggest. The study also found that while simulated data alone is insufficient for training, it can serve as a useful pre-training step before fine-tuning on real data. These results underscore the critical need for real developer behavior data in both evaluating and training next-generation coding assistants—a wake-up call for the AI-assisted software engineering community.
- Collected real IDE traces from 1,246 experienced developers over 3 days using a VS Code extension
- Found simulated traces lack behavioral diversity, exhibit artificial temporal structure, and miss exploratory patterns
- Current LLMs and agentic models perform significantly worse on real data than on simulated data
Why It Matters
Proactive coding assistants are promising, but current evaluations overestimate real-world performance; real user data is essential.