Research & Papers

On the Impact of the Utility in Semivalue-based Data Valuation

New method tackles a core problem in data valuation: results shouldn't change drastically when you tweak the evaluation metric.

Deep Dive

A team of researchers including Mélissa Tamine, Benjamin Heymann, Maxime Vono, and Patrick Loiseau has published a significant paper tackling a foundational problem in machine learning: how to reliably value individual data points. Their work, 'On the Impact of the Utility in Semivalue-based Data Valuation,' accepted at the prestigious ICLR 2026 conference, focuses on semivalue-based methods like the Shapley value. These methods use cooperative game theory to assign a 'value' to each data point based on its contribution to a model's performance on a downstream task; this performance measure is known as the utility function. However, a major weakness has been that these valuations are highly sensitive to the practitioner's specific choice of utility—switching from accuracy to F1-score, for instance, can completely reshuffle which data points are deemed most valuable.
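To make the setting concrete, here is a minimal Monte Carlo sketch of Shapley-style data valuation: each point's value is its average marginal contribution to the utility over random orderings. The function names and toy utility are illustrative, not the authors' code.

```python
import numpy as np

def monte_carlo_shapley(points, utility, n_permutations=200, rng=None):
    """Estimate each point's Shapley value as its average marginal
    contribution to the utility over random permutations."""
    rng = rng or np.random.default_rng(0)
    n = len(points)
    values = np.zeros(n)
    for _ in range(n_permutations):
        order = rng.permutation(n)
        coalition = []
        prev_u = utility(coalition)        # utility of the empty coalition
        for idx in order:
            coalition.append(points[idx])
            u = utility(coalition)
            values[idx] += u - prev_u      # marginal contribution of idx
            prev_u = u
    return values / n_permutations

# Toy additive utility: the coalition's utility is the sum of its points,
# so each point's Shapley value is exactly its own magnitude.
pts = [1.0, 2.0, 3.0]
vals = monte_carlo_shapley(pts, lambda s: sum(s))
```

In practice the utility would train a model on the coalition and score it on a validation set (e.g. accuracy or F1), which is exactly the choice whose impact the paper studies.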

To solve this robustness issue, the researchers introduce a novel geometric framework. They define a dataset's 'spatial signature,' which involves embedding each data point into a lower-dimensional space. In this transformed space, any chosen utility function behaves as a simple linear functional. This elegant reformulation turns the abstract problem of utility sensitivity into a more tangible geometric one. Building on this, the team proposes a concrete, practical methodology centered on an explicit robustness metric. This metric quantitatively informs a practitioner whether, and by how much, their data valuation rankings will change if they adjust the utility, offering crucial stability guarantees before committing to costly data curation or acquisition decisions.
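A toy illustration of the linear-functional idea (hypothetical numbers and weight vectors, not the authors' actual construction): if each point has a low-dimensional signature and each utility corresponds to a weight vector, a point's value is just an inner product, so comparing utilities reduces to comparing weight vectors.

```python
import numpy as np

# Made-up 2-D signatures for three data points.
phi = np.array([[0.9, 0.1],
                [0.5, 0.6],
                [0.2, 0.9]])
w_accuracy = np.array([1.0, 0.0])  # weights standing in for one utility
w_f1       = np.array([0.8, 0.2])  # weights for a nearby utility

# In the transformed space, each utility acts as a linear functional:
vals_accuracy = phi @ w_accuracy   # values under the first utility
vals_f1 = phi @ w_f1               # values under the second utility
```

Under this picture, utility sensitivity becomes a geometric question: how much the value vector rotates as the utility's weight vector changes.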

The 44-page study validates this approach across multiple datasets and different semivalues, demonstrating strong agreement with traditional rank-correlation analyses. Importantly, it also provides analytical insight into how the choice of a specific semivalue (e.g., Shapley value vs. Banzhaf value) can intrinsically amplify or diminish the robustness of the resulting valuations. This work provides a much-needed tool for data-centric AI, where understanding data quality and contribution is paramount for building efficient models, cleaning datasets, and designing fair data marketplaces.
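The rank-correlation analyses the study compares against can be sketched with a standard Spearman correlation between the valuations a dataset receives under two utilities (toy numbers; a correlation near 1 means the rankings agree):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks
    (assumes no ties, as in these toy valuations)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Valuations of the same five points under two utilities (made-up numbers).
vals_accuracy = np.array([0.9, 0.5, 0.3, 0.8, 0.1])
vals_f1       = np.array([0.8, 0.6, 0.2, 0.9, 0.1])
rho = spearman_rho(vals_accuracy, vals_f1)  # → 0.9: rankings mostly agree
```

Such a check is post hoc—it requires recomputing valuations under each candidate utility—whereas the paper's robustness metric aims to predict ranking stability in advance.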

Key Points
  • Introduces a 'spatial signature' framework embedding data points to make utility functions linear, simplifying analysis of valuation stability.
  • Proposes a practical robustness metric that predicts how much data value rankings will shift when the evaluation utility changes.
  • Validated across diverse datasets, showing how choice of semivalue (e.g., Shapley value) intrinsically affects robustness.

Why It Matters

Enables more reliable data curation and marketplace design by ensuring valuation isn't fragile to minor changes in evaluation metrics.