SCARV: Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets
Multi-seed aggregation and structure-aware clustering boost ranking stability by up to 40%.
Sample-level rankings are widely used in data-centric NLP for filtering and debugging, but existing pipelines treat examples as independent, ignoring duplicates and near-duplicates. This leads to unstable relative orderings across random seeds. The new paper introduces SCARV (Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets) by Xu Zheng, Feiyu Wu, and colleagues. SCARV operates on top of any scoring proxy, combining robust multi-seed aggregation with a structure-aware aggregation/allocation step over redundancy clusters. The framework is designed to improve ranking stability rather than to act as a universal data selector.
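To make the two-stage design concrete, here is a minimal sketch of what a SCARV-style pipeline could look like. It assumes mean aggregation over seeds and mean pooling within redundancy clusters; the function names and the specific aggregators are illustrative assumptions, not the paper's actual implementation (which may use a robust aggregator such as the median and a different allocation rule).

```python
import numpy as np

def aggregate_over_seeds(scores_per_seed):
    """Stage 1 (assumed): average proxy scores over random seeds.

    scores_per_seed: array of shape (n_seeds, n_examples). A robust
    aggregator such as the median could be dropped in here instead.
    """
    return np.asarray(scores_per_seed, dtype=float).mean(axis=0)

def pool_within_clusters(scores, cluster_ids):
    """Stage 2 (assumed): share one score per redundancy cluster.

    Each example inherits its cluster's mean score, so duplicates and
    near-duplicates can no longer swap places from seed to seed.
    """
    scores = np.asarray(scores, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    pooled = np.empty_like(scores)
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        pooled[mask] = scores[mask].mean()
    return pooled

# Toy example: 3 seeds, 5 examples; examples 0-1 and 2-3 are near-duplicates.
rng = np.random.default_rng(0)
scores_per_seed = rng.normal(size=(3, 5))
cluster_ids = [0, 0, 1, 1, 2]
stable_scores = pool_within_clusters(
    aggregate_over_seeds(scores_per_seed), cluster_ids)
ranking = np.argsort(-stable_scores)  # indices from highest to lowest score
print(ranking)
```

Because the cluster step only redistributes already-aggregated scores, it composes with any proxy that emits one score per example, matching the paper's claim that SCARV sits on top of arbitrary scoring proxies.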
Across synthetic redundancy, naturally mined QQP redundancy, multiple proxy families, and end-to-end DistilBERT fine-tuning, SCARV substantially improves global and local stability over bare proxy rankings. A component decomposition and a compute-aware frontier analysis reveal that robust multi-seed aggregation is the dominant stabilizer, while the structure-aware component adds value under low aggregation budgets or when redundancy clusters are informative. This positions SCARV as a stability-oriented aggregation layer for proxy-induced rankings, making ranking-based decisions such as subset selection and suspicious-example retrieval more reproducible.
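The stability metrics themselves are not spelled out in this summary. A common way to quantify them, assumed here purely for illustration, is mean pairwise Kendall's tau across seeds for global stability and mean pairwise top-k set overlap for local stability:

```python
import numpy as np
from itertools import combinations
from scipy.stats import kendalltau

def global_stability(score_matrix):
    """Mean pairwise Kendall's tau across seeds (whole-ranking agreement)."""
    taus = []
    for a, b in combinations(np.asarray(score_matrix, dtype=float), 2):
        tau, _ = kendalltau(a, b)
        taus.append(tau)
    return float(np.mean(taus))

def local_stability(score_matrix, k=100):
    """Mean pairwise Jaccard overlap of top-k sets across seeds."""
    tops = [set(np.argsort(-s)[:k].tolist())
            for s in np.asarray(score_matrix, dtype=float)]
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(tops, 2)]
    return float(np.mean(overlaps))

# Bare per-seed scores: one noisy score vector per seed. Higher values
# after SCARV-style aggregation would reflect the stability gains above.
rng = np.random.default_rng(1)
bare = rng.normal(size=(5, 1000))
print(global_stability(bare), local_stability(bare, k=50))
```

Sweeping the number of aggregation seeds against metrics like these is one way to trace the kind of compute-aware frontier discussed above.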
- SCARV combines multi-seed aggregation with a structure-aware step over redundancy clusters.
- Tested on synthetic redundancy and naturally mined QQP with DistilBERT fine-tuning.
- Improves stability in subset selection and suspicious-example retrieval for NLP datasets.
Why It Matters
SCARV makes data filtering and debugging more reliable in redundant NLP pipelines, supporting reproducible, data-centric AI development.