Agile Story-Point Estimation: Is RAG a Better Way to Go?
New study applies the bge-large-en-v1.5 and all-mpnet-base-v2 embedding models to estimate story points across 23 software projects.
A team of researchers from the University of Saskatchewan—Lamyea Maha, Tajmilur Rahman, and Chanchal Roy—has published a study exploring whether RAG (retrieval-augmented generation) can automate the time-consuming manual process of Agile story point estimation. The research, accepted for ICPC 2026, tested two leading embedding models—bge-large-en-v1.5 and Sentence-Transformers' all-mpnet-base-v2—on 23 open-source software projects of varying sizes. The study examined four critical aspects: how retrieval hyperparameters influence performance, whether estimation accuracy differs across project sizes, whether embedding model choice affects accuracy, and how RAG compares to existing baselines.
While the RAG-based approach showed promise by occasionally outperforming baseline models, the results revealed no statistically significant differences in performance across different project sizes or between the two embedding models tested. This finding is particularly important because it suggests that current RAG implementations may not yet be ready for reliable, automated story point estimation in real-world Agile environments. The researchers concluded that further studies are needed to refine RAG techniques and develop better model adaptation strategies for achieving consistent accuracy in automated user story estimation.
The study's methodology involved applying RAG's two-component architecture—a 'Retriever' to find relevant historical data and a 'Generator' to produce estimates—to replace traditional consensus-based techniques like Planning Poker. Despite analyzing thousands of story points across diverse projects, the lack of statistical significance highlights the complexity of translating software task complexity into numerical estimates. This research represents an important step toward understanding the limitations and potential of AI in software project management, particularly for automating one of Agile's most subjective yet critical processes.
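To make the two-component pipeline concrete, here is a minimal, self-contained sketch of retrieval-augmented estimation. It is not the paper's implementation: the study embeds stories with models like bge-large-en-v1.5, whereas this toy uses bag-of-words vectors, and it replaces the LLM Generator with a similarity-weighted average of the retrieved stories' points purely for illustration. The `HISTORY` corpus is invented.

```python
import math
from collections import Counter

# Hypothetical historical user stories with known story points.
# The study uses dense embeddings (e.g., bge-large-en-v1.5); a
# bag-of-words Counter stands in here so the sketch is self-contained.
HISTORY = [
    ("add login button to home page", 2),
    ("implement OAuth login flow with token refresh", 8),
    ("fix typo in settings page", 1),
    ("migrate database schema to new version", 5),
    ("add unit tests for payment module", 3),
]

def embed(text):
    """Stand-in embedding: lowercase bag-of-words token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=3):
    """Retriever: the k historical stories most similar to the query."""
    q = embed(query)
    ranked = sorted(HISTORY, key=lambda s: cosine(q, embed(s[0])), reverse=True)
    return ranked[:k]

def estimate(query, k=3):
    """Generator stand-in: similarity-weighted average of retrieved
    points (the actual study generates the estimate with an LLM)."""
    q = embed(query)
    neighbors = retrieve(query, k)
    weights = [cosine(q, embed(text)) for text, _ in neighbors]
    if sum(weights) == 0:
        return sum(p for _, p in neighbors) / len(neighbors)
    return sum(w * p for w, (_, p) in zip(weights, neighbors)) / sum(weights)
```

The number of retrieved neighbors `k` corresponds to the kind of retrieval hyperparameter whose influence on performance the study investigates.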
- Tested RAG on 23 open-source projects using bge-large-en-v1.5 and all-mpnet-base-v2 embedding models
- Found no statistically significant performance differences across project sizes or between embedding models
- RAG occasionally outperformed baselines but requires further refinement for reliable automation
Why It Matters
Could automate time-consuming Agile planning but needs more work before replacing human estimation.