Research & Papers

SCARV: Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets

Multi-seed aggregation and structure-aware clustering boost ranking stability by up to 40%.

Deep Dive

Sample-level rankings are widely used in data-centric NLP for filtering and debugging, but existing pipelines treat examples as independent, ignoring duplicates and near-duplicates. This leads to unstable relative orderings across random seeds. A new paper by Xu Zheng, Feiyu Wu, and colleagues introduces SCARV (Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets). SCARV operates on top of any scoring proxy, combining robust multi-seed aggregation with a structure-aware aggregation/allocation step over redundancy clusters. The framework aims to improve stability rather than act as a universal data selector.
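The two-stage idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, median-rank aggregation stands in for "robust multi-seed aggregation", and cluster-mean pooling stands in for the structure-aware aggregation/allocation step.

```python
import numpy as np

def aggregate_seeds(score_matrix):
    """Robust multi-seed aggregation: convert each seed's proxy scores to
    ranks, then take the per-example median rank across seeds.
    score_matrix: (n_seeds, n_examples) array of proxy scores.
    (Median-rank aggregation is an illustrative choice, not the paper's.)"""
    ranks = score_matrix.argsort(axis=1).argsort(axis=1)  # rank per seed
    return np.median(ranks, axis=0)

def structure_constrained_ranking(agg_ranks, clusters):
    """Structure-aware step: pool the aggregated ranks within each
    redundancy cluster and allocate the pooled value back to every member,
    so near-duplicates receive a consistent ordering.
    clusters: list of index arrays, one per redundancy cluster."""
    final = agg_ranks.astype(float).copy()
    for members in clusters:
        final[members] = agg_ranks[members].mean()  # one value per cluster
    # Re-rank to obtain the final ordering (ties broken by index).
    return final.argsort(kind="stable").argsort(kind="stable")

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 8))                     # 5 seeds, 8 examples
clusters = [np.array([0, 1]), np.array([4, 5, 6])]   # near-duplicate groups
ranking = structure_constrained_ranking(aggregate_seeds(scores), clusters)
print(ranking)
```

Because each cluster's members share one pooled value, near-duplicates end up adjacent in the final ranking, which is what damps the seed-to-seed reshuffling the paper targets.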

Across synthetic redundancy, naturally mined QQP redundancy, multiple proxy families, and end-to-end DistilBERT fine-tuning, SCARV substantially improves global and local ranking stability compared to bare proxy rankings. A decomposition analysis and a compute-aware frontier show that robust multi-seed aggregation is the dominant stabilizer, while the structure-aware component adds value under low aggregation budgets or when redundancy clusters are informative. This positions SCARV as a stability-oriented aggregation layer on top of proxy-induced rankings, making ranking-based decisions such as subset selection and suspicious-example retrieval more reproducible.
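The global/local stability distinction can be made concrete with two simple metrics. The choices below are illustrative assumptions, not necessarily the paper's exact definitions: Spearman rank correlation as a "global" measure over the whole ordering, and top-k Jaccard overlap as a "local" measure relevant to decisions like suspicious-example retrieval.

```python
import numpy as np

def global_stability(rank_a, rank_b):
    """Spearman correlation between two rankings of the same examples
    (Pearson correlation on the rank values) -- an illustrative proxy
    for 'global' stability across seeds."""
    a = np.asarray(rank_a, float) - np.mean(rank_a)
    b = np.asarray(rank_b, float) - np.mean(rank_b)
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def local_stability(rank_a, rank_b, k):
    """Jaccard overlap of the top-k sets under each ranking -- an
    illustrative proxy for 'local' stability of the list head."""
    top_a = set(np.argsort(rank_a)[:k])
    top_b = set(np.argsort(rank_b)[:k])
    return len(top_a & top_b) / len(top_a | top_b)

r = np.arange(10)
print(global_stability(r, r))        # identical rankings -> 1.0
print(global_stability(r, r[::-1]))  # reversed ranking  -> -1.0
print(local_stability(r, r, 3))      # same top-3        -> 1.0
```

Comparing such metrics between bare proxy rankings and SCARV output, seed pair by seed pair, is one straightforward way to reproduce the kind of stability comparison the paper reports.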

Key Points
  • SCARV combines multi-seed aggregation with a structure-aware step over redundancy clusters.
  • Tested on synthetic redundancy and naturally mined QQP with DistilBERT fine-tuning.
  • Improves stability in subset selection and suspicious-example retrieval for NLP datasets.

Why It Matters

More reliable data filtering and debugging in redundant NLP pipelines for reproducible AI development.