Research & Papers

New Quotient Semivalue Mechanism Blocks Data Sybil Attacks

Shapley-style data valuation hardened against duplicate and pseudonymous submissions, with exact guarantees under stated structural conditions.

Deep Dive

A new paper on arXiv (2605.07663) tackles a critical vulnerability in data valuation: contributors can inflate their payments by splitting datasets across pseudonymous identities, duplicating high-value examples, or laundering near-duplicates. Authors Florian A. D. Burnat and Brittany I. Davidson formalize this as false-name manipulation and propose a quotient semivalue mechanism. Instead of attributing value to raw identities, the mechanism operates over evidence-backed attribution clusters, using a canonical-representative operator to absorb within-cluster duplication. The authors also prove an impossibility result: on a fixed monotone data-value game, exact Shapley fairness over reported identities is incompatible with unrestricted false-name-proofness, even for binary-valued instances. They characterize the split-gain of a general semivalue through a unanimity counter-example.
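To make the split-gain idea concrete, here is a minimal sketch of a toy unanimity game in which one provider splits its dataset across two pseudonyms. It is an illustration under stated assumptions, not the paper's construction: the function names (shapley, unanimity), the two-provider instance, and the definition of split gain as the ratio of manipulated to honest payoff are all chosen here for the example.

```python
from fractions import Fraction
from itertools import permutations


def shapley(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: Fraction(0) for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += Fraction(value(with_p) - value(coalition))
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def unanimity(required):
    """Binary-valued monotone game: worth 1 only if every required identity is present."""
    required = frozenset(required)
    return lambda coalition: 1 if required <= coalition else 0


# Honest report: providers A and B are both needed to realize the value.
honest = shapley(["A", "B"], unanimity({"A", "B"}))

# False-name report: A splits its data across pseudonyms A1 and A2,
# so all three reported identities are now needed.
split = shapley(["A1", "A2", "B"], unanimity({"A1", "A2", "B"}))

honest_payoff = honest["A"]               # A's payoff when reporting honestly
split_payoff = split["A1"] + split["A2"]  # A's combined payoff across pseudonyms
split_gain = split_payoff / honest_payoff # ratio above 1 means splitting pays

print(honest_payoff, split_payoff, split_gain)
```

In this toy instance the symmetric Shapley allocation pays each reported identity equally, so adding a pseudonym raises the splitter's share. That is the tension the impossibility result formalizes: fairness defined over reported identities rewards creating more of them.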

The mechanism achieves exact false-name-proofness under two structural conditions (false-name-neutral within-cluster allocation and quotient-stable manipulations) and bounds both manipulation gain and fairness loss when those conditions hold only approximately. Three measurable quantities quantify the trade-off: escaped-cluster mass, value-estimation error, and clustering distance. The authors instantiate the approach in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from 1.74 under baseline Shapley to 0.96, near the honest level. Cosine-threshold sweeps and false-merge/false-split rates map the fairness–Sybil frontier, offering practical guidance for deploying attribution systems in decentralized ML marketplaces.
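The quotient idea can be sketched on a toy duplication attack. The example below is a simplified illustration, not the authors' implementation: the submissions, the value function (number of distinct examples), the cluster assignment, and the equal within-cluster split are all assumptions made here. In particular, the clusters are supplied by hand, whereas the paper derives them from evidence.

```python
from fractions import Fraction
from itertools import permutations


def shapley(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: Fraction(0) for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += Fraction(value(with_p) - value(coalition))
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical submissions: pseudonym A2 re-submits example "x" already held by A1.
submissions = {"A1": {"x", "a"}, "A2": {"x"}, "B": {"x", "b"}}


def data_value(identities):
    """Toy monotone value: number of distinct examples in the pooled data."""
    pooled = set().union(*(submissions[i] for i in identities))
    return len(pooled)


def quotient_shapley(clusters, value):
    """Shapley over clusters, then an equal split inside each cluster.

    The cluster total does not depend on how many identities the cluster
    contains, which is the sense in which this split is false-name-neutral.
    """
    def cluster_value(chosen):
        members = [i for c in chosen for i in clusters[c]]
        return value(members)

    per_cluster = shapley(list(clusters), cluster_value)
    return {i: per_cluster[c] / len(members)
            for c, members in clusters.items() for i in members}


# Honest baseline: A reports a single identity.
honest_payoff = shapley(["A1", "B"], data_value)["A1"]

# Identity-level Shapley under the duplication attack.
baseline = shapley(list(submissions), data_value)
baseline_payoff = baseline["A1"] + baseline["A2"]

# Quotient version: assume evidence has placed A1 and A2 in one cluster.
clusters = {"cluster_A": ["A1", "A2"], "cluster_B": ["B"]}
quotient = quotient_shapley(clusters, data_value)
quotient_payoff = quotient["A1"] + quotient["A2"]

print(baseline_payoff / honest_payoff, quotient_payoff / honest_payoff)
```

In this sketch, pooling data with a set union plays the role of absorbing within-cluster duplication, and the attacker's payoff returns to the honest level once both pseudonyms land in the same cluster. If the clustering misses a pseudonym, the gain reappears, which is the kind of leakage the escaped-cluster mass is meant to measure.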

Key Points
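  • Quotient semivalues stop data contributors from inflating payments via pseudonym-splitting or duplication, with exact guarantees when two structural conditions hold.
  • An impossibility proof shows that exact Shapley fairness over reported identities cannot coexist with unrestricted false-name-proofness.
  • On synthetic classification tasks, the mechanism cuts manipulation gain on duplicate and near-duplicate Sybil attacks from 1.74 (baseline Shapley) to 0.96, near the honest level.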

Why It Matters

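The work offers a route to fairer, more attack-resistant data valuation for decentralized ML marketplaces and collaborative pipelines, and it makes the trade-off between fairness and Sybil resistance measurable.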