Research & Papers

Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark

New Spark-native primitives replace row-wise Python inference, reaching up to 7.23M rows/s throughput.

Deep Dive

Zeyu Bai's Spark Policy Toolkit, detailed in arXiv:2604.25061, tackles two fundamental failures in custom policy-learning pipelines on Spark: impractical row-wise Python inference and fragile driver-side candidate materialization during split search. The toolkit introduces two Spark-native primitives—partition-initialized vectorized inference through mapInPandas and mapInArrow, and collect-less split search that scores candidates directly on executors. These are governed by a fixed-input semantic contract ensuring that identical rows, feature order, treatment vocabulary, preprocessing manifest, and split boundaries produce consistent per-row score vectors, best-split decisions, and end-to-end learned policy outputs. On a 40-worker Databricks cluster, mapInArrow achieves 4.72M rows/s at 10M matched rows and 7.23M rows/s at 50M rows, while collect-less split search remains valid from F=10 to F=1000 with 124,000 candidate rows, deliberately avoiding the driver-side materialization that the collect baseline requires.
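To make the first primitive concrete, here is a minimal pure-Python sketch of partition-initialized vectorized inference: the model is loaded once per partition, and each batch is then scored in bulk rather than row by row. All names here (_get_model, score_partition, the toy linear "model") are hypothetical stand-ins, not the toolkit's code; in Spark, a function with this shape would be passed to DataFrame.mapInPandas or DataFrame.mapInArrow, which feed it an iterator of pandas/Arrow batches per partition.

```python
from typing import Iterator, List

_MODEL = None  # per-process cache; hypothetical stand-in for a real model


def _get_model():
    """Load the model once per partition/worker process.

    In a real pipeline this is the expensive step (deserializing weights,
    building a predictor), which is exactly why it must not run per row.
    """
    global _MODEL
    if _MODEL is None:
        _MODEL = {"weights": [0.5, -0.2]}  # toy linear "model"
    return _MODEL


def score_partition(batches: Iterator[List[List[float]]]) -> Iterator[List[float]]:
    """Partition-initialized inference: initialize once, then score every
    batch in bulk rather than invoking Python once per row."""
    w = _get_model()["weights"]
    for batch in batches:
        # Vectorized-in-spirit scoring: one pass per batch (dot product per row).
        yield [sum(x * wi for x, wi in zip(row, w)) for row in batch]


# Simulate one partition delivered as two batches.
partition = iter([[[1.0, 2.0], [0.0, 1.0]], [[2.0, 2.0]]])
scores = [s for out in score_partition(partition) for s in out]
```

The same structure carries over to mapInArrow, where the batches are pyarrow.RecordBatch objects instead of nested lists; the partition-initialized cache pattern is unchanged.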

Across 24 backend-ablation settings, mapInArrow wins 18 while mapInPandas wins 6, emphasizing that backend choice is workload-dependent. The fixed-input semantic contract proves robust: all six tested repartition/coalesce/shuffle perturbations preserve identical signatures when the contract is enforced, and drift when it is not. The paper's central insight is that throughput and collect-less execution are the mechanisms allowing policy semantics to survive at Spark scale. Evaluation includes practical baseline ladders, backend parity checks, measured split-search scale results, synthetic and Hillstrom end-to-end policy preservation, missingness stress, partition and order perturbation tests, quantile-boundary sensitivity, and a concrete adversarial failure catalog. This work directly benefits ML engineers and data scientists running policy learning on large-scale Spark clusters, enabling deterministic, high-throughput pipelines without fragile driver-side operations.
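One way to make such a contract checkable is an order-invariant signature over the contract's inputs. The sketch below is a hypothetical illustration, not the paper's implementation: rows are hashed into an XOR accumulator, so repartition, coalesce, or shuffle perturbations, which only reorder rows, cannot change the signature, while any change to the rows themselves, the feature order, the treatment vocabulary, or the split boundaries does.

```python
import hashlib
import json


def contract_signature(rows, feature_order, treatment_vocab, split_boundaries):
    """Hypothetical fixed-input signature over the contract's inputs.

    Row digests are combined with XOR, which is commutative and associative,
    so the result is invariant to row order (a proxy for repartition/shuffle).
    The metadata (feature order, vocabulary, boundaries) is hashed as an
    ordered structure, so reordering it DOES change the signature.
    """
    meta = json.dumps(
        {
            "features": feature_order,
            "treatments": treatment_vocab,
            "splits": split_boundaries,
        },
        sort_keys=True,
    )
    acc = 0
    for row in rows:
        digest = hashlib.sha256(json.dumps(row, sort_keys=True).encode()).digest()
        acc ^= int.from_bytes(digest, "big")
    meta_digest = hashlib.sha256(meta.encode()).hexdigest()
    return hashlib.sha256((meta_digest + format(acc, "064x")).encode()).hexdigest()


rows = [{"x": 1, "t": "a"}, {"x": 2, "t": "b"}, {"x": 3, "t": "a"}]
sig1 = contract_signature(rows, ["x"], ["a", "b"], [1.5, 2.5])
# Reversing the rows simulates a shuffle: the signature must not move.
sig2 = contract_signature(list(reversed(rows)), ["x"], ["a", "b"], [1.5, 2.5])
```

A check of this shape is what lets a pipeline assert, cheaply, that a perturbed physical plan still satisfies the fixed-input contract before trusting its outputs.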

Key Points
  • mapInArrow achieves 7.23M rows/s on 50M rows with a 40-worker Databricks cluster.
  • Collect-less split search scores candidates on executors, skipping driver-collect for F=10 to F=1000.
  • Fixed-input semantic contract ensures identical outputs across all six repartition/shuffle perturbations when enforced.
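The collect-less pattern behind the second bullet can be sketched in plain Python: each partition reduces its rows to per-candidate sufficient statistics, the statistics are merged associatively (in Spark this would be a mapPartitions followed by a treeReduce), and only the small merged table, never the candidate rows, reaches the driver. The split criterion and all names below are toy stand-ins under that assumption, not the paper's implementation.

```python
from functools import reduce


def partial_stats(rows, candidates):
    """Executor-side step: for each candidate split, accumulate
    partition-local sufficient statistics (count and outcome sum per side).
    Only this small list, not the rows themselves, leaves the partition."""
    stats = []
    for s in candidates:
        nl = sl = nr = sr = 0.0
        for r in rows:
            if r["x"] <= s:
                nl += 1; sl += r["y"]
            else:
                nr += 1; sr += r["y"]
        stats.append((nl, sl, nr, sr))
    return stats


def merge(a, b):
    """Associative merge of two statistics tables (elementwise sums),
    suitable for a treeReduce-style aggregation."""
    return [tuple(x + y for x, y in zip(sa, sb)) for sa, sb in zip(a, b)]


def best_split(partitions, candidates):
    """Driver-side step: pick the best candidate from merged statistics
    alone; no candidate rows are ever collected to the driver."""
    agg = reduce(merge, (partial_stats(p, candidates) for p in partitions))

    def gain(st):
        nl, sl, nr, sr = st
        ml = sl / nl if nl else 0.0
        mr = sr / nr if nr else 0.0
        return abs(ml - mr)  # toy split criterion: separation of side means

    return max(zip(candidates, agg), key=lambda cs: gain(cs[1]))[0]


# Two simulated partitions with a clean break in outcome y at x = 2.5.
parts = [
    [{"x": 1, "y": 0.0}, {"x": 2, "y": 0.0}],
    [{"x": 3, "y": 1.0}, {"x": 4, "y": 1.0}],
]
chosen = best_split(parts, [1.5, 2.5, 3.5])
```

The key property is that merge is associative and the statistics are small (one tuple per candidate), so the driver's memory footprint depends on the number of candidates, not on the number of rows.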

Why It Matters

Makes policy learning on Spark 10x faster and deterministic, enabling scalable ML pipelines without driver bottlenecks.