Local Shapley: Model-Induced Locality and Optimal Reuse in Data Valuation
New algorithm exploits model locality to reduce exponential Shapley computations to linear time.
A research team led by Xuan Yang has published a paper on arXiv titled 'Local Shapley: Model-Induced Locality and Optimal Reuse in Data Valuation', introducing a framework that dramatically accelerates the computationally intensive process of data valuation. The work addresses a fundamental obstacle: computing exact Shapley values, which quantify each data point's contribution to model performance, is #P-hard because the number of coalitions grows exponentially with dataset size. The researchers' key insight is that modern machine learning models exhibit 'model-induced locality': only small subsets of the training data influence any given prediction. This allows them to reframe Shapley computation as a structured data processing problem over these influential subsets rather than exhaustive coalition enumeration.
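To make the locality idea concrete, here is a minimal sketch (not the paper's code): brute-force Shapley values for a single 1-NN prediction, where a distant training point never changes the prediction and so carries zero Shapley value. Restricting the computation to an assumed three-point support then reproduces the full answer at 2^3 instead of 2^4 coalition evaluations. The toy data, labels, and support choice are all illustrative assumptions.

```python
from itertools import combinations
from math import factorial

def exact_shapley(players, utility):
    """Brute-force Shapley values: O(2^n) coalition evaluations."""
    n = len(players)
    values = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(len(others) + 1):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for coalition in combinations(others, r):
                S = set(coalition)
                values[p] += weight * (utility(S | {p}) - utility(S))
    return values

# Toy 1-NN setup (illustrative, not from the paper): one test query,
# four 1-D training points with class labels.
train = {0: (0.10, 'a'), 1: (0.30, 'a'), 2: (0.90, 'b'), 3: (5.00, 'b')}
test_x, test_y = 0.15, 'a'

def knn_utility(S):
    """1 if the nearest retained training point carries the test label."""
    if not S:
        return 0.0
    nearest = min(S, key=lambda i: abs(train[i][0] - test_x))
    return 1.0 if train[nearest][1] == test_y else 0.0

# Full computation enumerates 2^4 coalitions. The distant point 3 never
# becomes the nearest neighbor and never helps the empty coalition, so
# its Shapley value is exactly zero; valuing only the assumed 3-point
# support gives identical values for the remaining points at 2^3 cost.
full = exact_shapley(list(train), knn_utility)
local = exact_shapley([0, 1, 2], knn_utility)
```

The same projection argument is what the paper formalizes: when locality is exact, points outside a prediction's support are dummy players, so their Shapley values vanish and the computation can be confined to the support.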
The team formalized this through support sets defined by each model's computational pathways: the neighbors in KNN, the leaves in decision trees, the receptive fields in GNNs. They proved that Shapley computation can be projected onto these supports without loss when locality is exact. Building on this, they developed LSMR (Local Shapley via Model Reuse), an optimal algorithm that trains on each influential subset exactly once via support mapping and pivot scheduling, establishing an information-theoretic lower bound on retraining operations. For larger supports, they created LSMR-A, a reuse-aware Monte Carlo estimator that remains unbiased and enjoys exponential concentration. Experiments across multiple model families demonstrate substantial reductions in retraining operations and significant speedups while preserving high valuation fidelity, making data valuation practical at enterprise scale.
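The reuse idea can be sketched with a generic permutation-sampling Shapley estimator plus a coalition cache; this is a stand-in for LSMR-A, not the paper's algorithm. Caching means each distinct coalition ("retraining") is evaluated at most once no matter how often it recurs across sampled permutations, while the permutation sampler itself stays unbiased. The function name, the toy utility, and the three-point support are all assumptions for illustration.

```python
import random

def mc_shapley_with_reuse(players, utility, n_perms, seed=0):
    """Unbiased permutation-sampling Shapley estimate with coalition reuse.

    Returns (estimates, number of distinct coalitions evaluated). The
    cache plays the role of reuse: each coalition is evaluated once,
    however many sampled permutations pass through it.
    """
    rng = random.Random(seed)
    cache = {}
    def u(S):                       # memoized utility on frozensets
        if S not in cache:
            cache[S] = utility(S)
        return cache[S]
    est = {p: 0.0 for p in players}
    for _ in range(n_perms):
        perm = players[:]
        rng.shuffle(perm)
        S = frozenset()
        for p in perm:              # accumulate marginal contributions
            T = S | {p}
            est[p] += u(T) - u(S)
            S = T
    return {p: v / n_perms for p, v in est.items()}, len(cache)

# Toy utility on an assumed 3-point support: a coalition is useful iff
# it contains point 0 or point 1 (point 2 is a dummy player).
support = [0, 1, 2]
utility = lambda S: 1.0 if S & {0, 1} else 0.0
est, n_evals = mc_shapley_with_reuse(support, utility, n_perms=2000)
# Without reuse: ~4 evaluations per permutation, thousands in total;
# with reuse, at most 2^3 = 8 distinct coalitions are ever evaluated.
```

The cache bound is the point: over the small support, the number of distinct coalitions is tiny even when many permutations are sampled, which is what makes reuse-aware estimation cheap on large datasets with local models.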
- Exploits model-induced locality to reduce Shapley computation from exponential to linear complexity
- LSMR algorithm trains each influential data subset exactly once, achieving 90%+ retraining reduction
- Framework works across KNN, decision trees, and graph neural networks while maintaining valuation accuracy
Why It Matters
Enables practical data valuation for large datasets, helping organizations identify high-value training data and optimize AI development costs.