New survey exposes reproducibility crisis in financial AI systems
Financial AI models fail reproducibility tests across credit scoring, fraud detection, and LLM workflows.
A new survey by Ruizhe Zhou and six co-authors (arXiv:2605.23955) systematically examines the reproducibility crisis in financial AI systems. Early reproducibility concerns in financial machine learning centered on statistical issues such as backtest overfitting; the rise of deep neural networks and generative AI adds mechanical nondeterminism rooted in hardware and architecture. The paper focuses on three modalities now dominant in financial AI: tabular models (with post-hoc explanation variance), graph neural networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). The authors argue that this nondeterminism poses serious risks in regulated settings such as credit risk, fraud detection, and anti-money laundering, where auditability and consistency are legally required.
The researchers supplement their literature review with first-party experiments on public financial datasets. They quantify explanation rank instability in credit scoring models using RBO (Rank-Biased Overlap), measure prediction flip rates in GNN-based fraud detection with D_cos (cosine distance between representations), and document tensor-parallel-induced output divergence in LLM entity extraction tasks. The study proposes a layered evaluation framework that links modality-specific metrics, including TDI (trajectory drift index) and PSD (prediction stability divergence), to overall audit readiness. Their experiments also show that logit-level and semantic-level determinism measures are complementary, not redundant. The work offers both a diagnostic toolkit and a call to action for financial institutions deploying AI under regulatory scrutiny.
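To make the RBO measurement concrete, here is a minimal sketch of how Rank-Biased Overlap could compare two feature-importance rankings from repeated runs of the same credit scoring explanation. It uses the standard extrapolated form for equal-length, tie-free rankings. The function name `rbo_ext`, the persistence parameter `p=0.9`, and the feature names are illustrative assumptions; the article does not describe the survey's exact configuration.

```python
def rbo_ext(ranking_a, ranking_b, p=0.9):
    """Extrapolated Rank-Biased Overlap for two equal-length rankings.

    Assumes no duplicate items within a ranking. `p` (0 < p < 1) controls
    how strongly the top of the list is weighted; 0.9 is an illustrative
    default, not a value taken from the survey.
    """
    if len(ranking_a) != len(ranking_b):
        raise ValueError("this sketch only handles equal-length rankings")
    k = len(ranking_a)
    if k == 0:
        raise ValueError("rankings must be non-empty")
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")

    seen_a, seen_b = set(), set()
    overlap = 0          # size of the intersection of the two prefixes
    weighted_sum = 0.0   # running sum of (overlap / d) * p**d

    for d in range(1, k + 1):
        a, b = ranking_a[d - 1], ranking_b[d - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item may match something seen earlier in the other list.
            if a in seen_b:
                overlap += 1
            if b in seen_a:
                overlap += 1
        seen_a.add(a)
        seen_b.add(b)
        weighted_sum += (overlap / d) * p ** d

    # Extrapolate the agreement at depth k to the unseen tail.
    return (overlap / k) * p ** k + ((1 - p) / p) * weighted_sum


# Hypothetical feature rankings from two runs of the same explainer.
run_1 = ["income", "debt_ratio", "age", "tenure"]
run_2 = ["debt_ratio", "income", "age", "tenure"]
print(round(rbo_ext(run_1, run_2), 4))
```

A score of 1.0 means the two runs rank features identically and 0.0 means the rankings share no items. In this example the top two features swap places, so only the depth-1 agreement is lost and the score comes out at about 0.9.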
- Quantified explanation rank instability in credit scoring using RBO metric
- Measured prediction flip rates up to 15% in GNN-based fraud detection
- Proposed layered framework with metrics RBO, D_cos, TDI, and PSD for audit readiness
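The flip-rate and D_cos measurements can be illustrated in a few lines. This is a sketch using the textbook definitions: the flip rate is the share of samples whose predicted label differs between two runs, and the cosine distance is one minus the cosine similarity of two representation vectors. The helper names `flip_rate` and `cosine_distance` and the toy data are assumptions, and the survey's own definitions may differ in detail.

```python
import math


def flip_rate(preds_run_a, preds_run_b):
    """Fraction of samples whose predicted label differs between two runs."""
    if len(preds_run_a) != len(preds_run_b):
        raise ValueError("both runs must score the same samples")
    if not preds_run_a:
        raise ValueError("need at least one prediction")
    flips = sum(1 for a, b in zip(preds_run_a, preds_run_b) if a != b)
    return flips / len(preds_run_a)


def cosine_distance(u, v):
    """1 - cosine similarity between two representation vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Toy fraud labels (1 = flagged) from two runs with different random seeds.
labels_seed_0 = [0, 1, 0, 0, 1]
labels_seed_1 = [0, 1, 1, 0, 1]
print(flip_rate(labels_seed_0, labels_seed_1))  # one of five labels changed

# Toy node embeddings for the same account under the two runs.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

The two numbers answer different questions: a flip rate of zero can coexist with a nonzero representation distance when embeddings shift without crossing the decision threshold, which is one reason to report both.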
Why It Matters
For regulated finance, AI determinism is now a compliance necessity, not just a technical nicety.