New survey exposes reproducibility crisis in financial AI systems

arXiv cs.SI May 26, 2026

⚡ Financial AI models fail reproducibility tests across credit scoring, fraud detection, and LLM workflows.

Deep Dive

A new survey by Ruizhe Zhou and six co-authors (arXiv:2605.23955) systematically examines the reproducibility crisis in financial AI systems. While early machine learning in finance primarily tackled statistical issues like backtest overfitting, the rise of deep neural networks and generative AI introduces mechanical nondeterminism rooted in hardware and architecture. The paper focuses on three modalities now dominant in financial AI: tabular models (with post-hoc explanation variance), graph neural networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). The authors argue that this nondeterminism poses serious risks for regulated environments such as credit risk, fraud detection, and anti-money laundering, where auditability and consistency are legally required.
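One well-known way this kind of hardware-level nondeterminism can arise is that floating-point addition is not associative, so the order in which a reduction runs can change the result by a tiny amount. The sketch below is illustrative only and is not taken from the survey: the contribution values, the `THRESHOLD` cutoff, and the `approve` helper are hypothetical, chosen to show how a last-bit difference could flip a decision that sits exactly on a boundary.

```python
from typing import Sequence

# Hypothetical decision cutoff, chosen to sit exactly on the boundary.
THRESHOLD = 0.6


def reduce_in_order(values: Sequence[float]) -> float:
    """Accumulate strictly left to right: ((v0 + v1) + v2) + ..."""
    total = values[0]
    for v in values[1:]:
        total = total + v
    return total


def approve(score: float) -> bool:
    """Hypothetical rule: approve when the score exceeds the cutoff."""
    return score > THRESHOLD


# Hypothetical per-feature contributions to a score.
contributions = [0.1, 0.2, 0.3]

# Same numbers, two reduction orders, as might happen when work is split
# differently across batches or devices.
score_a = reduce_in_order(contributions)        # (0.1 + 0.2) + 0.3
score_b = reduce_in_order(contributions[::-1])  # (0.3 + 0.2) + 0.1

print(score_a, approve(score_a))
print(score_b, approve(score_b))
```

In IEEE 754 double precision the two sums differ in the last bit, which is enough to put them on opposite sides of the cutoff. Real systems rarely sit this close to a threshold for every input, but across millions of scored cases some inputs will.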

The researchers supplement their literature review with first-party experiments on public financial datasets. They quantify explanation rank instability in credit scoring models using RBO (Rank-Biased Overlap), measure prediction flip rates in GNN-based fraud detection with D_cos (cosine distance between representations), and document tensor-parallel-induced output divergence in LLM entity extraction tasks. The study proposes a layered evaluation framework that links modality-specific metrics, including TDI (trajectory drift index) and PSD (prediction stability divergence), to overall audit readiness. Their experiments indicate that logit-level and semantic-level determinism measures are complementary, not redundant. This work provides both a diagnostic toolkit and a call to action for financial institutions deploying AI under regulatory scrutiny.
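To make the RBO measurement concrete, here is a minimal sketch of Rank-Biased Overlap applied to two feature-importance rankings of the same model. It assumes a truncated form of the standard definition, rescaled so that identical rankings score 1.0; the variant the survey actually uses is not stated in this summary. The feature names and the persistence parameter `p=0.9` are hypothetical.

```python
from typing import Hashable, Sequence


def rbo_truncated(rank_a: Sequence[Hashable],
                  rank_b: Sequence[Hashable],
                  p: float = 0.9) -> float:
    """Truncated Rank-Biased Overlap between two rankings without duplicates.

    At each depth d the agreement A_d is the share of items common to both
    top-d prefixes. Agreements are weighted by p**(d-1), so disagreements
    near the top cost more. The sum is divided by (1 - p**k) so identical
    rankings of length k score 1.0 (a simplification for this sketch).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    k = min(len(rank_a), len(rank_b))
    if k == 0:
        raise ValueError("rankings must be non-empty")

    weighted_sum = 0.0
    for d in range(1, k + 1):
        overlap = len(set(rank_a[:d]) & set(rank_b[:d]))
        weighted_sum += (p ** (d - 1)) * overlap / d
    return (1.0 - p) * weighted_sum / (1.0 - p ** k)


# Hypothetical explanation rankings from repeated runs on the same applicant.
run_1 = ["income", "debt_ratio", "age", "tenure"]
run_2_top_swap = ["debt_ratio", "income", "age", "tenure"]
run_2_bottom_swap = ["income", "debt_ratio", "tenure", "age"]

print(rbo_truncated(run_1, run_1))
print(rbo_truncated(run_1, run_2_top_swap))
print(rbo_truncated(run_1, run_2_bottom_swap))
```

A swap between the two most important features lowers the score more than a swap between the two least important ones. That top-weighting is the property that makes RBO a natural fit for explanations, where reviewers mostly read the first few features.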

Key Points

Quantified explanation rank instability in credit scoring using the RBO metric
Measured prediction flip rates of up to 15% in GNN-based fraud detection
Proposed a layered framework with the metrics RBO, D_cos, TDI, and PSD for audit readiness
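As a rough illustration of the second key point, a prediction flip rate can be read as the share of inputs whose predicted label changes between two runs of the same model, and D_cos as one minus the cosine similarity of two representation vectors. The sketch below assumes those plain definitions; how the survey combines the two is not specified in this summary, and all sample values are hypothetical.

```python
import math
from typing import Sequence


def flip_rate(labels_run_a: Sequence[int], labels_run_b: Sequence[int]) -> float:
    """Fraction of inputs whose predicted label differs between two runs."""
    if len(labels_run_a) != len(labels_run_b):
        raise ValueError("both runs must score the same inputs")
    if not labels_run_a:
        raise ValueError("no predictions to compare")
    flips = sum(a != b for a, b in zip(labels_run_a, labels_run_b))
    return flips / len(labels_run_a)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus cosine similarity; 0.0 means the vectors point the same way."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical fraud labels (1 = flagged) for the same four transactions,
# produced by two runs that differ only in sampling seed.
run_a = [1, 0, 0, 1]
run_b = [1, 1, 0, 1]
print(flip_rate(run_a, run_b))

# Hypothetical node embeddings for one account from the two runs.
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

The two measures look at different layers: the embedding distance can be nonzero while the label stays the same, and a very small distance can still accompany a flip when the score is near the decision boundary.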

Why It Matters

For regulated finance, AI determinism is now a compliance necessity, not just a technical nicety.
